ndexbio / gsoc_llm

GSOC 2024 LLM Project
MIT License

Test Extraction on SIRT1 Publication (Has Many Interactions) #2

Closed cannin closed 3 months ago

cannin commented 4 months ago

This review is a good test case because it has many interactions (~90).

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3898398/

"Additional file 2" (at the bottom of the page) lists the interactions included in the publication. Work to retrieve as many as possible and report how many you retrieve. I expect you to try to get at least ~70 interactions before reporting.

Favourj-bit commented 4 months ago

Another document to try out that came up during the kickoff meeting

https://pubmed.ncbi.nlm.nih.gov/28404643/

cannin commented 4 months ago

@Favourj-bit That article happened to be in the examples; it is not important for this project. If you need more articles, use the ones in the zip file (PMIDs provided); this is related to #7.

indra_adrenocortical_carcinoma_v2.zip

Favourj-bit commented 4 months ago

@cannin Alright, noted. Thank you

Favourj-bit commented 4 months ago

Hello @cannin, I tested LangChain on the paper and was able to parse the whole document. There is one thing I need clarity on: I computed both the total number of interactions and the number of unique interactions, but I don't know whether I should be counting unique interactions, since every interaction is supposed to be distinct.

Screenshot 2024-05-13 at 19 10 42

The first result is what I got without specifying anything about uniqueness; it reports 158 interactions. The other is what I got when I specified uniqueness.

I have attached the JSON files that show the different outputs here: https://github.com/ndexbio/gsoc_llm/tree/094a972ab9bb84009db73b1cc84721fc6337a6ff/results/SIRT1_PARP1
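One way to make the total and unique counts comparable is to deduplicate in Python after extraction rather than asking the model to do it. The following is only a minimal sketch: the field names `subject`, `interaction_type`, and `object` are assumptions about the JSON layout, and the sample records are placeholders, not results from the paper.

```python
def interaction_key(item):
    """Build a case-insensitive (subject, type, object) key for one interaction.

    The field names are assumed; adjust them to the actual JSON schema.
    """
    return (
        item["subject"].strip().upper(),
        item["interaction_type"].strip().lower(),
        item["object"].strip().upper(),
    )


def count_interactions(interactions):
    """Return (total, unique) counts for a list of interaction dicts."""
    unique_keys = {interaction_key(item) for item in interactions}
    return len(interactions), len(unique_keys)


# Placeholder records for illustration only.
sample = [
    {"subject": "GENE1", "interaction_type": "activates", "object": "GENE2"},
    {"subject": "gene1", "interaction_type": "Activates", "object": "gene2 "},
    {"subject": "GENE3", "interaction_type": "inhibits", "object": "GENE1"},
]

total, unique = count_interactions(sample)
print(f"{total} extracted, {unique} unique")
```

Reporting both numbers shows how many of the extracted records are repeats of the same interaction mentioned in different sentences.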

cannin commented 4 months ago

@Favourj-bit As discussed, the review article is very dense with interactions; try working with PMC6044858 first.

Favourj-bit commented 4 months ago

@cannin, I have been able to test the other paper you suggested using GPT. However, I am having trouble writing code to compare the two outputs because they are represented a little differently. Do you have any suggestions for writing the comparison code? I have attached some screenshots that show what I am talking about.

Screenshot 2024-05-18 at 04 17 39 Screenshot 2024-05-18 at 04 05 50 Screenshot 2024-05-18 at 04 06 03

cannin commented 4 months ago

No easy fix for you, but that's why I said to use or make a format compatible with both NDEx and INDRA (see #7). Remember, it is important to know which interactions match and which do not, not just the overall counts (X from INDRA, Y from GPT).
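A rough sketch of what such a comparison could look like once both outputs are reduced to a shared (subject, relation, object) form. Everything here is an assumption for illustration: the `RELATION_MAP` vocabulary, the triple layout, and the placeholder records. It does not use the INDRA or NDEx APIs and is not the format from #7.

```python
# Hypothetical mapping from each tool's relation wording to one shared label.
RELATION_MAP = {
    "phosphorylation": "phosphorylates",
    "phosphorylates": "phosphorylates",
    "activation": "activates",
    "activates": "activates",
    "inhibition": "inhibits",
    "inhibits": "inhibits",
}


def to_triple(subject, relation, obj):
    """Normalize one interaction into a comparable (subject, relation, object) tuple."""
    rel = relation.strip().lower()
    return (subject.strip().upper(), RELATION_MAP.get(rel, rel), obj.strip().upper())


def compare_triples(indra_triples, gpt_triples):
    """Report which normalized triples match and which appear in only one source."""
    indra, gpt = set(indra_triples), set(gpt_triples)
    return {
        "matched": sorted(indra & gpt),
        "indra_only": sorted(indra - gpt),
        "gpt_only": sorted(gpt - indra),
    }


# Placeholder records for illustration only.
indra_raw = [("GENE1", "Inhibition", "GENE2"), ("GENE3", "Activation", "GENE1")]
gpt_raw = [("gene1", "inhibits", "gene2"), ("GENE1", "activates", "GENE4")]

report = compare_triples(
    [to_triple(*t) for t in indra_raw],
    [to_triple(*t) for t in gpt_raw],
)
for label, triples in report.items():
    print(label, len(triples), triples)
```

Keeping the three lists separate makes it possible to inspect exactly which interactions were missed or added, instead of comparing two totals.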

Favourj-bit commented 4 months ago

@cannin I have been able to extract interactions from this document sentence by sentence. I noticed that the extraction chain still does not return the sentence as part of the interaction details, even when I specify it directly in the prompt and in the schema. This is the result from the code: https://github.com/ndexbio/gsoc_llm/blob/main/results/pmc6044858/sentence_output.json

cannin commented 4 months ago

Don't trust or ask GPT to return the sentence. Get the result; if it is JSON, load it into a Python dictionary and add the sentence yourself.
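A minimal sketch of that approach, assuming the model's reply for each sentence arrives as a JSON string holding either one object or a list of objects. The function name `attach_sentence` and the field names in the example reply are hypothetical.

```python
import json


def attach_sentence(sentence, llm_response):
    """Parse the model's JSON reply and tag each interaction with its source sentence.

    Returns an empty list when the reply is not valid JSON or has an unexpected shape.
    """
    try:
        parsed = json.loads(llm_response)
    except json.JSONDecodeError:
        return []
    if isinstance(parsed, dict):
        parsed = [parsed]  # treat a single object like a one-item list
    if not isinstance(parsed, list):
        return []
    records = []
    for item in parsed:
        if isinstance(item, dict):
            record = dict(item)            # copy so the parsed data is not mutated
            record["sentence"] = sentence  # taken from the input, not from the model
            records.append(record)
    return records


# Placeholder sentence and reply for illustration only.
sentence = "GENE1 activates GENE2."
reply = '{"subject": "GENE1", "interaction_type": "activates", "object": "GENE2"}'
print(attach_sentence(sentence, reply))
```

Because the sentence comes from the input loop rather than from the model, it is guaranteed to match the text that was actually sent.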