nrnb / GoogleSummerOfCode

Main documentation site for NRNB GSoC project ideas and resources

Utilize the T5 Text-to-Text Transformer to Generate Descriptions of Interactions from Pathway Commons using WebNLG #196

Closed cannin closed 1 year ago

cannin commented 2 years ago

Background

Data structured in databases is very useful for many applications, but text is a primary way that humans communicate on the internet. Being able to present query results as sentences can help humans understand rich datasets.

Data

Pathway Commons (http://pathwaycommons.org/) is an aggregated database containing millions of molecular interactions, collected from approximately 20 source databases. The data is stored in the XML-based BioPAX (http://biopax.org/) format and is accessible via the Java Paxtools (https://biopax.github.io/Paxtools/) library.
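As a rough illustration of the first data step, the sketch below assumes the interactions have already been exported from BioPAX (for example with Paxtools) into a three-column, tab-separated SIF-style text of source, interaction type, and target. The `Interaction` type, the `parse_sif` helper, and the sample rows are hypothetical and only show the shape of the data a later step might consume.

```python
import csv
import io
from typing import List, NamedTuple


class Interaction(NamedTuple):
    """One binary interaction: source entity, interaction type, target entity."""
    source: str
    relation: str
    target: str


# Hypothetical sample rows; a real export would be read from a file instead.
SAMPLE_SIF = (
    "MDM2\tcontrols-state-change-of\tTP53\n"
    "TP53\tcontrols-expression-of\tCDKN1A\n"
)


def parse_sif(text: str) -> List[Interaction]:
    """Parse tab-separated rows, skipping any row with fewer than three columns."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [Interaction(*row[:3]) for row in reader if len(row) >= 3]


interactions = parse_sif(SAMPLE_SIF)
for item in interactions:
    print(item.source, item.relation, item.target)
```

Keeping each interaction as a plain (source, relation, target) triple makes it straightforward to hand the data to whatever input representation the model ends up using.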

Algorithm

Transformer models such as T5 (https://huggingface.co/docs/transformers/model_doc/t5) treat tasks as text-to-text problems: they take text as input and generate new text, such as sentences, as output. Challenges such as WebNLG (https://webnlg-challenge.loria.fr/), which focuses on generating text from sets of subject-predicate-object triples, have sought to standardize the input representations used with transformers for text generation.
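One possible way to turn triples into a single input string for a text-to-text model is sketched below. The `<H>`, `<R>`, `<T>` markers and the task prefix are assumptions borrowed from conventions seen in some WebNLG-style setups, not a format prescribed by this project; the `linearize` helper and the example triple are hypothetical.

```python
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]


def linearize(
    triples: Iterable[Triple],
    prefix: str = "translate Graph to English: ",
) -> str:
    """Flatten (subject, relation, object) triples into one model input string.

    The marker tokens and the prefix are illustrative assumptions.
    """
    parts = []
    for subj, rel, obj in triples:
        # Replace hyphens so the relation reads as ordinary words.
        readable = rel.replace("-", " ")
        parts.append(f"<H> {subj} <R> {readable} <T> {obj}")
    return prefix + " ".join(parts)


example = linearize([("MDM2", "controls-state-change-of", "TP53")])
print(example)
# translate Graph to English: <H> MDM2 <R> controls state change of <T> TP53
```

The resulting string would then be tokenized and passed to the model, paired with a reference sentence during fine-tuning.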

Goal

The goal will be to generate code to:

  1. Structure content from Pathway Commons into representations for use with transformers and generate text (as sentences) using the transformers.
  2. Run needed model fine-tuning
  3. Run text generation
  4. Produce metrics to evaluate the generated output (e.g., BLEU score; https://en.wikipedia.org/wiki/BLEU)
  5. Develop an interface to explore results and try out models
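To make the evaluation step (item 4) more concrete, here is a simplified sentence-level BLEU sketch in plain Python. It assumes a single reference, whitespace tokenization, and no smoothing, so it is only meant to show how the metric works; an actual evaluation would more likely rely on an established implementation. The `sentence_bleu` and `ngrams` names are hypothetical helpers.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Unsmoothed BLEU for one candidate against one reference sentence."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clipped overlap: each n-gram is credited at most as often as it
        # appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or overlap == 0:
            return 0.0  # without smoothing, any zero precision gives zero
        log_precisions.append(math.log(overlap / total))

    # Brevity penalty discourages candidates shorter than the reference.
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)


reference = "MDM2 controls the state change of TP53"
print(sentence_bleu("MDM2 controls the state change of TP53", reference))
print(sentence_bleu("MDM2 changes the state of TP53", reference, max_n=2))
```

Because the score drops to zero whenever any n-gram order has no match, short generated sentences are often scored with a smaller `max_n` or with smoothing.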

Difficulty Level: Medium

Size and Length of Project

Size: 175 hours
Length: 12 weeks

Skills


Essential skills: Python, PyTorch
Nice to have skills: Java (basics)

Public Repository

Potential Mentors

Augustin Luna

Kartikkp07 commented 2 years ago

Hi @cannin, I am Kartik Kumar Pawar, a CSE sophomore at BITS Pilani. I have about 6 years of experience with Python. I am also adept in Java, with knowledge of OOP and basic design patterns, and I have worked with both SQL and NoSQL database systems. I am familiar with JavaScript and have worked with React and Node.js as well. I am really excited to learn more about this project and contribute to it, with the aim of becoming a GSoC 2022 contributor. Could you guide me on how to get started, so I can begin as soon as possible?

GunjanDhanuka commented 2 years ago

Hi @cannin! I am interested in this project, having used PyTorch and TensorFlow extensively in my projects. I also have strong knowledge of Python, as I have done a research internship that involved using Python for data analysis. I have built websites with Django as well, and I know the basics of Java with good knowledge of OOP.

Should we directly submit a proposal using the Google Doc template or do we have to pass evaluation tests as well?

khanspers commented 2 years ago

NRNB has officially been accepted as a mentoring organization for GSoC 2022! Here are some useful links:

cannin commented 2 years ago

@GunjanDhanuka @Kartikkp07 Apologies for the delay. If you are still interested, the next step is to work on a proposal. The proposal should be as detailed as possible: what data transformations will be needed, example code for using the transformer, plus the various points listed under the "Goal" section.

Kartikkp07 commented 2 years ago

Sure!! Thanks for the update, will start working on it ASAP.

khanhcodes commented 2 years ago

My name is Kaitlyn and I am a first-year undergraduate student majoring in Computer Science and Mathematics at the University of Georgia. I am very interested in joining NRNB for the summer and making contributions to the code. In terms of programming experience, I am familiar with Java, Python, and JavaScript. For bioinformatics data analysis, my skills and tools include data cleaning, data visualization, NumPy, pandas, scikit-learn, PyTorch, and TensorFlow. I have experience working in a research lab, where I analyze genome data in crop plants and build a classifier for single-cell genomes for better gene expression regulation using scikit-learn. I am interested in helping utilize the T5 Text-to-Text Transformer to generate descriptions of interactions from Pathway Commons, as I think my skill set would fit this perfectly. I am very excited about the applications of computer science in biology, and I am looking forward to being a part of your team for the summer. Please let me know if I can work with you!

khanspers commented 2 years ago

A reminder that the application period opens on Monday April 4. Proposals to NRNB must be submitted on the official GSoC Site (https://summerofcode.withgoogle.com/) before April 19, 18:00 UTC to be considered, and contributors are encouraged to submit proposals in draft format early, so that mentors can give feedback directly at the GSoC site.

cannin commented 2 years ago

@khanhcodes and anyone else: if you are interested and have worked on a proposal for this project, I'm willing to give comments before the deadline. Create a Google Doc and send it via email (see the Potential Mentors section).

AlexanderPico commented 2 years ago

IMPORTANT REMINDER: GSoC 2022 is for new “beginners” to open source.

Applicants are expected to review the eligibility requirements prior to applying. We cannot accept applications from contributors with prior open source development experience. From the GSoC FAQ (https://developers.google.com/open-source/gsoc/faq):

Can someone already participating in open source be a GSoC Contributor?

The goal of GSoC is to bring new contributors into open source organizations. GSoC can also help beginner contributors learn the ins and outs of open source while being mentored by experienced community members. GSoC is for new and beginner contributors to open source, it is not for experienced contributors to open source.

khanspers commented 1 year ago

Closing in preparation for GSoC 2023.