nrnb / GoogleSummerOfCode

Main documentation site for NRNB GSoC project ideas and resources

Large Language Model-based creation of knowledge graph from a publication #235

Closed dexterpratt closed 2 months ago

dexterpratt commented 5 months ago

Background

This is a new project, with the goal of creating a Python application that creates a network of interactions based on LLM analysis of a published paper. Statements of relationships in the paper are identified and then expressed using standardized interaction and entity types. The LLM analysis also creates text explaining each interaction and distinguishes (1) background information presented in the paper from (2) the original findings of the paper.

Goal

Create a Python application:

  • Input: an academic paper that presents findings that include molecular interactions
  • Output: a network in which the nodes are biological entities and the edges represent interactions and other relationships between the entities
  • Format: CX2, https://cytoscape.org/cx/cx2/specification/cytoscape-exchange-format-specification-(version-2)/
  • Action: upload to NDEx, https://www.ndexbio.org/index.html#/ (the NDEx API is accessed via https://pypi.org/project/ndex2/)

(note that the network in NDEx will be editable in Cytoscape www.cytoscape.org)

Example interactions:

  • AKT1 binds GSK3B
  • Activation of AKT1 can increase Cell Proliferation

The interactions in the network are created by LLM-based analysis in which statements of relationships in the paper are identified and then expressed using standardized interaction and entity types. The LLM analysis also creates text explaining each interaction and distinguishes (1) background information presented in the paper from (2) the original findings of the paper.
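
One possible way to picture the output described above is a small intermediate data model: each extracted statement becomes a record with a source entity, a relation, a target entity, an explanation, and a flag for background versus original finding. The sketch below is only an illustration under assumptions. The `Interaction` class, the `provenance` field, the `build_network` helper, and the explanation strings are all hypothetical names and values, and the second example is simplified from "Activation of AKT1 can increase Cell Proliferation". Converting the result to CX2 and uploading it to NDEx with ndex2 is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    """One extracted statement (hypothetical schema, not from the project)."""
    source: str       # entity name, e.g. "AKT1"
    relation: str     # standardized interaction type, e.g. "binds"
    target: str       # entity name, e.g. "GSK3B"
    explanation: str  # text an LLM might write to explain the interaction
    provenance: str   # "background" or "finding"


def build_network(interactions):
    """Turn interactions into node and edge lists with integer ids."""
    node_ids = {}
    edges = []
    for item in interactions:
        for name in (item.source, item.target):
            # Ids follow order of first appearance, so output is stable.
            node_ids.setdefault(name, len(node_ids))
        edges.append({
            "id": len(edges),
            "source": node_ids[item.source],
            "target": node_ids[item.target],
            "interaction": item.relation,
            "explanation": item.explanation,
            "provenance": item.provenance,
        })
    nodes = [{"id": node_id, "name": name} for name, node_id in node_ids.items()]
    return nodes, edges


# Placeholder explanations and provenance labels, made up for illustration.
examples = [
    Interaction("AKT1", "binds", "GSK3B",
                "Placeholder explanation.", "background"),
    Interaction("AKT1", "increases", "Cell Proliferation",
                "Placeholder explanation.", "finding"),
]
nodes, edges = build_network(examples)
```

Keeping the background/finding flag on each edge means it could later be carried over as an edge attribute in whatever network format is used.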

Difficulty Level: Medium

Medium; Hard for "stretch goals". The programming for this problem is medium-level. The biological knowledge required can be basic.

However, there will be many opportunities to go beyond the minimal requirements. For example, if participants have more advanced biological knowledge, they might generate graphs expressing much more complex interactions.

Size and Length of Project

Large: 350 hours, 12+ weeks

Skills

Essential skills: Python

Nice to have skills:

Public Repository

TBD, will be created in the Cytoscape organization

Potential Mentors

Dexter Pratt, Jing Chen

Foxtrot-14 commented 5 months ago

Hello, I'm Noaman, a CS undergrad from the class of 2025. I've worked on issue 223, and I find this project impactful. I'd love to continue contributing to it.

  1. To kick off, I plan to delve into sample papers that feature graphs. This will help me understand what prompts to provide to the LLM. (If you have any suggestions for reference papers, please share.)

  2. The next step will be configuring the LLM. I have some exposure to the GPT-3.5 API, but feeding the document to the LLM presents a challenge. The first obstacle that comes to mind is the word limit for prompts (15,000 characters). If we provide the document in chunks, we'll be making multiple API requests per document, increasing the cost.

Let's discuss these points.

I'm eager to work on this project for GSOC 2024. I've also read about your work on NDEx, and it's fascinating. Looking forward to hearing from you.

khanspers commented 4 months ago

NRNB has been accepted as a mentoring organization for GSoC 2024. The contributor application period is March 18 – April 2. Here are some useful links:

  • GSoC contributor guide
  • NRNB project proposal template
  • Eligibility requirements
  • Full program timeline

dexterpratt commented 4 months ago

The biology is probably the steepest slope for coming up to speed. The problem is that you would need to be able to have some idea of whether the extraction is producing sense or nonsense as you develop.

From: Abdul Mateen Mulla
Date: Sunday, February 25, 2024 at 2:12 AM
To: nrnb/GoogleSummerOfCode
Cc: Dexter Pratt, Mention
Subject: Re: [nrnb/GoogleSummerOfCode] Large Language Model-based creation of knowledge graph from a publication (Issue #235)

Hi @dexterpratt @jingjingbic, I'm interested in the "Large Language Model-based creation of knowledge graph from a publication" project. My skills with regard to the project:

  • Hands-on: Python, Langchain/LLM
  • Areas to explore: Cytoscape, NDEx, Graphs, Biology, MiTAB

Do you think this will be enough for me to grasp the project's working over the next two months before the submission? I'm open to diving deeper into these skills if needed.


Galvanized-Heart commented 4 months ago

Hey there, @dexterpratt, my name is Maxim Kirby. I did my B.Sc. in Biochemistry at the University of Waterloo and graduated in 2023. I've been coding in Python for about a year, and I've been getting really into deep learning since I am hoping to do graduate work in deep learning for protein engineering. I saw this project in Google's Summer of Code and I thought I could really apply my skills well here because I do have a strong biology and chemistry background. I'd still need to brush up on Cytoscape, NDEx, knowledge graphs for this application, and MiTAB. I think I’ll still dig around and try to make some smaller contributions first though!

Yayi0117 commented 3 months ago

Hello, @dexterpratt @jingjingbic, I'm Yayi Wang. With a BSc in Biotechnology and programming experience, I have a suitable biological background and the programming skills to contribute to this program. After going through related materials, I think that the main challenge of this program is to write proper prompts so that the LLM generates the right entities and relations, and my preliminary idea is:

By integrating iterative experimentation, refined design, and the RLHF approach, we can precisely guide LLMs to accomplish specific tasks, continuously optimizing prompts through human feedback and model output, thereby enhancing the accuracy and efficiency of information extraction and knowledge graph construction.

How feasible is this proposal? I would appreciate any guidance from you.

Favourj-bit commented 3 months ago

Hello @dexterpratt @jingjingbic I hope this message finds you well. My name is Favour James, and I recently came across your project on the list of ideas for GSOC. I am particularly interested in this issue and I would like to contribute to this project during the GSOC'24 program. I participated under NRNB last year and worked on this project: https://github.com/nrnb/GoogleSummerOfCode/issues/217

Before I begin my application, I have a few questions I hope you could clarify:

  1. Are there specific areas within the project’s domain you recommend I focus on to better prepare myself before the project starts?
  2. Are there some getting started tasks you would suggest I do to get familiar with the project before the contribution phase starts?
  3. Is there a preferred development environment or setup for working on this project?
  4. Do I need domain expertise in knowledge graphs before I can contribute to this project?
  5. Could you suggest some academic papers that you would be considering during the project phase so I can start looking at them and experimenting with the LLM?

    Thank you for your time and attention, and I look forward to hearing back from you soon.

dexterpratt commented 3 months ago

"To kick off, I plan to delve into sample papers that feature graphs. This will help me understand what prompts to provide to the LLM. (If you have any suggestions for reference papers, please share.)"

It's not the papers that are about graphs. Rather, the idea is that we use an LLM to understand a paper that reports interactions between biological entities and then express that understanding as a knowledge graph.

dexterpratt commented 3 months ago

"The next step will be configuring the LLM. I have some exposure to the GPT-3.5 API, but feeding the document to the LLM presents a challenge. The first obstacle that comes to mind is the word limit for prompts (15,000 characters). If we provide the document in chunks, we'll be making multiple API requests per document, increasing the cost."

We will handle setting up a service to use for the project which will in turn provide access to multiple LLMs. The word limits are much larger than 15K now, but it is an open question as to whether you get better results chunking or not.
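
To make the chunking trade-off concrete, here is a minimal sketch of splitting a paper's text into overlapping character-based chunks. It is an illustration only: `chunk_text`, `max_chars`, and `overlap` are hypothetical names, the default sizes are arbitrary, and a real pipeline might split on sections or sentences instead. The number of chunks returned is also the number of LLM requests a chunked approach would need for one document.

```python
def chunk_text(text, max_chars=15000, overlap=500):
    """Split text into chunks of at most max_chars characters.

    Consecutive chunks share `overlap` characters so that a statement cut
    at a boundary still appears whole in at least one chunk (as long as it
    is shorter than the overlap).
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")

    chunks = []
    start = 0
    step = max_chars - overlap
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


# Tiny demonstration with small sizes so the behavior is easy to see.
demo_chunks = chunk_text("abcdefghij", max_chars=4, overlap=1)
```

Running the same extraction once on the whole text and once per chunk, then comparing the resulting interactions, would be one way to explore the open question of whether chunking gives better results.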

dexterpratt commented 3 months ago

We don't have a specific area of biology in mind. But, now that you ask, we are doing some LLM work in the host response to viral infection. So that might be a nice overlap. But any paper that discusses interactions such as described in the original post is a reasonable input.

We typically would be using VS Code with GitHub Copilot in Anaconda environments.

Prior experience with knowledge graphs isn't a requirement. Experience with simpler biological networks, as described in the original post, is important.

Getting started tasks - simply try prompting one of the LLMs in a public interface (e.g. ChatGPT) with text from a paper with instructions to extract the important relationships as a knowledge graph. See how far the naive experiment gets you.
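
A naive version of this experiment could look roughly like the sketch below: build a prompt that asks for one relationship per line, paste it into a chat interface, and parse whatever comes back. Everything here is an assumption for illustration. The prompt wording, the `subject | relation | object` line format, and the helper names `build_prompt` and `parse_relationships` are hypothetical, and no LLM API is called.

```python
PROMPT_TEMPLATE = (
    "Extract the important relationships between biological entities from "
    "the text below.\n"
    "Return one relationship per line in the form: "
    "subject | relation | object\n"
    "\n"
    "Text:\n"
    "{text}\n"
)


def build_prompt(paper_text):
    """Fill the (hypothetical) prompt template with text from a paper."""
    return PROMPT_TEMPLATE.format(text=paper_text.strip())


def parse_relationships(reply):
    """Parse 'subject | relation | object' lines from an LLM reply.

    Lines that do not have exactly three non-empty fields are skipped,
    since a chat model may add commentary around the requested output.
    """
    triples = []
    for line in reply.splitlines():
        parts = [part.strip() for part in line.split("|")]
        if len(parts) == 3 and all(parts):
            triples.append(tuple(parts))
    return triples


# A made-up reply, standing in for text copied back from a chat interface.
sample_reply = (
    "Here are the relationships:\n"
    "AKT1 | binds | GSK3B\n"
    "AKT1 | increases | Cell Proliferation\n"
)
triples = parse_relationships(sample_reply)
```

Checking the parsed triples against the paper by hand is a quick way to see whether the extraction is producing sense or nonsense.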

Favourj-bit commented 3 months ago

Thank you for your response. Another thing I am trying to understand is what the Format and Action items under the Goal section refer to. Finally, may I add you to my proposal draft once I start working on it?

dexterpratt commented 3 months ago

This is not a specific proposal.

Yayi0117 commented 3 months ago

Thanks for your reply. Apart from iteratively refining prompts through human feedback, I haven't figured out a more specific plan to guide the LLM to generate correct entities and relations. Could you provide some guidance on this?

Foxtrot-14 commented 3 months ago

@dexterpratt, I have shared my GSoC proposal for early feedback (to your UCSD email address, which I found on the Ideker Lab website). Kindly take a look and let me know if any changes are needed.

khanspers commented 2 months ago

This is an active GSoC 2024 project. Closing this project idea as it is no longer available to other contributors.