rahuln / lm-bio-kgc

Using pretrained language models for biomedical knowledge graph completion.

Preprocess datasets and create triplets with (head, relation, tail) #4

Closed giuliacassara closed 4 months ago

giuliacassara commented 2 years ago

Hi Rahul, many thanks for your quick responses! I would like to recreate the processed files myself using the scripts in data/script. I also want to create a triplet file for msi and hetionet with an explicit reference to the relationship (I know you don't support this feature; it's my own initiative). When I launch get_descriptions_msi.py, the script expects the arguments msi_file, go_file, and entrez_file, which I don't have. The same goes for preprocess_msi.py, which expects a directory containing the msi files. Looking more closely at your code:

files = {
    ('drug', 'protein'): '1_drug_to_protein.tsv',
    ('disease', 'protein'): '2_indication_to_protein.tsv',
    ('protein', 'protein'): '3_protein_to_protein.tsv',
    ('protein', 'function'): '4_protein_to_biological_function.tsv',
    ('function', 'function'): '5_biological_function_to_biological_function.tsv',
    ('drug', 'disease'): '6_drug_indication_df.tsv',
}

I saw that these files are what I really need to build my triplet files, although I don't know how you created them. Could you please send me these files or tell me how I can reproduce them?

rahuln commented 2 years ago

The .tsv files you mentioned above are part of the multiscale interactome (MSI) data, which you can download from this GitHub repo. That download provides the directory you specify when running preprocess_msi.py. For the get_descriptions_msi.py script, msi_file is the output of running preprocess_msi.py, go_file is the Gene Ontology OBO file, which you can download from this link, and entrez_file is a JSON file containing a dictionary that maps Entrez protein IDs to their descriptions for all the proteins in MSI. You can scrape these descriptions from Entrez using BioPython, but I've attached the file used to construct the MSI dataset here.
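
For the original goal of a triplet file with explicit relations, a minimal sketch might look like the following. It is not part of the repository's scripts. It reuses the file mapping quoted earlier in the thread and assumes, without confirmation from the MSI data itself, that each file is tab-separated with one header row and that the first two columns hold the head and tail node IDs. The function names, the column positions, and the `<head type>_<tail type>` relation labels are all hypothetical choices for illustration.

```python
import csv
from pathlib import Path

# (head type, tail type) -> file name, copied from the mapping quoted in the thread.
FILES = {
    ('drug', 'protein'): '1_drug_to_protein.tsv',
    ('disease', 'protein'): '2_indication_to_protein.tsv',
    ('protein', 'protein'): '3_protein_to_protein.tsv',
    ('protein', 'function'): '4_protein_to_biological_function.tsv',
    ('function', 'function'): '5_biological_function_to_biological_function.tsv',
    ('drug', 'disease'): '6_drug_indication_df.tsv',
}


def build_triplets(msi_dir, head_col=0, tail_col=1, has_header=True):
    """Yield (head, relation, tail) tuples from the MSI edge files.

    Assumptions (check against the real files): tab-separated values,
    one header row, node IDs in columns `head_col` and `tail_col`.
    The relation label is a made-up '<head type>_<tail type>' string.
    """
    for (head_type, tail_type), fname in FILES.items():
        relation = f'{head_type}_{tail_type}'
        with open(Path(msi_dir) / fname, newline='') as f:
            reader = csv.reader(f, delimiter='\t')
            if has_header:
                next(reader, None)  # skip the header row
            for row in reader:
                if len(row) <= max(head_col, tail_col):
                    continue  # skip blank or truncated lines
                yield row[head_col], relation, row[tail_col]


def write_triplets(msi_dir, out_file):
    """Write all triplets to a tab-separated file and return how many were written."""
    count = 0
    with open(out_file, 'w', newline='') as f:
        writer = csv.writer(f, delimiter='\t', lineterminator='\n')
        for triplet in build_triplets(msi_dir):
            writer.writerow(triplet)
            count += 1
    return count
```

Calling `write_triplets('path/to/msi', 'msi_triplets.tsv')` with the downloaded MSI directory would then produce one `head<TAB>relation<TAB>tail` line per edge. If the real files store IDs in different columns, `head_col` and `tail_col` in `build_triplets` need adjusting.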