Get a literature dataset

ireneisdoomed / phenomena

Inspect what phenotypes are associated with a disease

https://ireneisdoomed-phenomena-app-pu8o79.streamlit.app/

MIT License


Get a literature dataset #1

Open ireneisdoomed opened 1 year ago

ireneisdoomed commented 1 year ago

To mine the literature, we want to avoid working with the raw XML data and instead get:

PubMed abstract text
PubMed open-access full-text articles

Both should come in a form that lets us start prototyping models quickly, without having to parse the text ourselves.

How?

To mine the literature, the idea is to use the PubMed dataset as a base. It contains over 30 million scientific articles and is available on the Hugging Face Hub: https://huggingface.co/datasets/pubmed.

As an alternative, if too much data turns out to be missing from it, we can use The Pile dataset.

ireneisdoomed commented 1 year ago

The code to load the literature data from the Hub is

from datasets import load_dataset

dataset = load_dataset("pubmed")

However, this was taking hours to download. I thought it was due to issues on my end, but TIL that the repo on the Hub does not actually host the data; it is a downloader and parser for it. The literature it extracts comes from the XML files that the NIH makes available on its FTP server: https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/

And this is the code that downloads and processes all these data into a dictionary format: https://huggingface.co/datasets/pubmed/blob/main/pubmed.py
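To give a rough idea of what "processing the XML into a dictionary format" involves, here is a minimal sketch using only the Python standard library. It is not the code in pubmed.py: the helper name `parse_articles`, the output keys and the sample record are all made up for illustration, and only the element names (PubmedArticle, MedlineCitation, PMID, ArticleTitle, AbstractText) are assumed to follow the PubMed baseline layout.

```python
import xml.etree.ElementTree as ET

# Made-up record that mimics the layout of a PubMed baseline file; not real data.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>0</PMID>
      <Article>
        <ArticleTitle>Placeholder title</ArticleTitle>
        <Abstract>
          <AbstractText>Placeholder abstract text.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def parse_articles(xml_text):
    """Return one dict per PubmedArticle with its PMID, title and abstract."""
    root = ET.fromstring(xml_text)
    records = []
    for article in root.iter("PubmedArticle"):
        # An abstract can be split into several AbstractText sections.
        sections = [
            "".join(node.itertext()).strip()
            for node in article.iter("AbstractText")
        ]
        records.append(
            {
                "pmid": article.findtext("MedlineCitation/PMID"),
                "title": article.findtext("MedlineCitation/Article/ArticleTitle"),
                "abstract": " ".join(s for s in sections if s),
            }
        )
    return records


records = parse_articles(SAMPLE_XML)
print(records)
```

Doing this for every baseline file is the slow part, which is why having the text already extracted would save so much time.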

Currently trying to find what is the quickest way to inspect the data.

ireneisdoomed commented 1 year ago

This is currently running.

Until this step is sorted, we will use the Open Targets matches dataset, which lists the recognised entities and their IDs per publication. This way we will be able to jump directly to #5.

(Link to matches: http://ftp.ebi.ac.uk/pub/databases/opentargets/platform/latest/output/etl/parquet/literature/matches/)

ireneisdoomed commented 1 year ago

I have filtered the matches dataset to only include the disease matches:

136,814,572 mapped disease references across all literature
10,835 unique diseases - which means that our maximum coverage will be less than 50% of the whole ontology (25k terms)

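# code to reproduce

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
matches = spark.read.parquet("gs://open-targets-pre-data-releases/23.02/output/etl/parquet/literature/matches")
# keep only disease matches (type "DS") and rename keywordId to efo_id
matched_diseases = matches.filter(f.col("type") == "DS").selectExpr("pmid", "pmcid", "text", "label", "keywordId as efo_id")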
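
The "less than 50%" coverage estimate can be checked with simple arithmetic. This is only a sketch: it takes the 25k ontology size quoted in this comment as an approximation, and the variable names are just for illustration.

```python
unique_diseases = 10_835   # unique disease IDs found in the matches
ontology_terms = 25_000    # approximate ontology size ("25k terms")

# Upper bound on the fraction of the ontology we can cover from the literature
coverage = unique_diseases / ontology_terms
print(f"Maximum coverage: {coverage:.1%}")  # Maximum coverage: 43.3%
```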