Open Alx-Kouris opened 1 year ago
Hi @Alx-Kouris, Thank you for submitting a question to the Monarch HelpDesk! We sincerely appreciate you taking the time to reach out and are working on an answer to share with you soon.
Hi @Alx-Kouris,
Unfortunately, I think what you're seeing is just the age of the data on our production site and in our production graph.
We're working to rebuild our stack, starting from the graph and moving up through the API and website, so I can see that the data is present, but we've got a few months before we'll be showing it.
Pulling the development sqlite database artifact from https://data.monarchinitiative.org/monarch-kg-dev/latest/index.html I can see:
sqlite3 -markdown monarch-kg.db "select subject, predicate, object from edges where subject = 'HGNC:555' and predicate = 'biolink:has_phenotype'"
subject | predicate | object |
---|---|---|
HGNC:555 | biolink:has_phenotype | HP:0001252 |
HGNC:555 | biolink:has_phenotype | HP:0000286 |
HGNC:555 | biolink:has_phenotype | HP:0000316 |
HGNC:555 | biolink:has_phenotype | HP:0030953 |
HGNC:555 | biolink:has_phenotype | HP:0001274 |
HGNC:555 | biolink:has_phenotype | HP:0000007 |
HGNC:555 | biolink:has_phenotype | HP:0000767 |
HGNC:555 | biolink:has_phenotype | HP:0001263 |
HGNC:555 | biolink:has_phenotype | HP:0000718 |
HGNC:555 | biolink:has_phenotype | HP:0001249 |
HGNC:555 | biolink:has_phenotype | HP:0001250 |
HGNC:555 | biolink:has_phenotype | HP:0000358 |
HGNC:555 | biolink:has_phenotype | HP:0001257 |
HGNC:555 | biolink:has_phenotype | HP:0000369 |
HGNC:555 | biolink:has_phenotype | HP:0001388 |
HGNC:555 | biolink:has_phenotype | HP:0000336 |
HGNC:555 | biolink:has_phenotype | HP:0000218 |
HGNC:555 | biolink:has_phenotype | HP:0004626 |
HGNC:555 | biolink:has_phenotype | HP:0000750 |
HGNC:555 | biolink:has_phenotype | HP:0001250 |
HGNC:555 | biolink:has_phenotype | HP:0004209 |
HGNC:555 | biolink:has_phenotype | HP:0000716 |
HGNC:555 | biolink:has_phenotype | HP:0001252 |
HGNC:555 | biolink:has_phenotype | HP:0002007 |
HGNC:555 | biolink:has_phenotype | HP:0001249 |
HGNC:555 | biolink:has_phenotype | HP:0000262 |
HGNC:555 | biolink:has_phenotype | HP:0000739 |
HGNC:555 | biolink:has_phenotype | HP:0001263 |
HGNC:555 | biolink:has_phenotype | HP:0011220 |
HGNC:555 | biolink:has_phenotype | HP:0000343 |
HGNC:555 | biolink:has_phenotype | HP:0000565 |
HGNC:555 | biolink:has_phenotype | HP:0030820 |
HGNC:555 | biolink:has_phenotype | HP:0000722 |
HGNC:555 | biolink:has_phenotype | HP:0002942 |
HGNC:555 | biolink:has_phenotype | HP:0000750 |
HGNC:555 | biolink:has_phenotype | HP:0000768 |
HGNC:555 | biolink:has_phenotype | HP:0012803 |
HGNC:555 | biolink:has_phenotype | HP:0000718 |
HGNC:555 | biolink:has_phenotype | HP:0002938 |
HGNC:555 | biolink:has_phenotype | HP:0000006 |
HGNC:555 | biolink:has_phenotype | HP:0009381 |
HGNC:555 | biolink:has_phenotype | HP:0004691 |
HGNC:555 | biolink:has_phenotype | HP:0001257 |
HGNC:555 | biolink:has_phenotype | HP:0001763 |
HGNC:555 | biolink:has_phenotype | HP:0100716 |
HGNC:555 | biolink:has_phenotype | HP:0000752 |
HGNC:555 | biolink:has_phenotype | HP:0000646 |
HGNC:555 | biolink:has_phenotype | HP:0000729 |
Hopefully that at least looks good.
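For anyone who needs the same lookup for many genes at once, the query above could be wrapped in a small script. This is only a minimal sketch: it assumes the `edges` table and the `subject`, `predicate`, and `object` columns shown in the query, and the function name `phenotypes_for_genes` is made up for illustration. It uses `select distinct` because the output above lists some phenotypes more than once (for example HP:0001252).

```python
import sqlite3


def phenotypes_for_genes(conn, gene_ids):
    """Return {gene_id: set of phenotype ids} for biolink:has_phenotype edges.

    Assumes an `edges` table with `subject`, `predicate` and `object`
    columns, as in the sqlite3 query shown in this thread.
    """
    gene_ids = list(gene_ids)
    if not gene_ids:
        return {}
    # One "?" placeholder per gene id keeps the query parameterized.
    # SQLite caps the number of bound parameters per statement, so a very
    # long gene list would need to be split into chunks first.
    placeholders = ",".join("?" for _ in gene_ids)
    sql = (
        "select distinct subject, object from edges "
        "where predicate = 'biolink:has_phenotype' "
        f"and subject in ({placeholders})"
    )
    result = {}
    for subject, obj in conn.execute(sql, gene_ids):
        result.setdefault(subject, set()).add(obj)
    return result


if __name__ == "__main__":
    # Path is an assumption: the unpacked development database artifact.
    connection = sqlite3.connect("monarch-kg.db")
    print(phenotypes_for_genes(connection, ["HGNC:555"]))
```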
Thank you for pointing out the discrepancy and submitting an issue, and hopefully we'll at least have a beta for the new API & site to look at soon.
Thank you for the quick response @kevinschaper. However, we (I work with Alex who asked the question here) deal with thousands of genes in a high throughput way. Is there a way for us to download the correct data? We have been downloading from https://data.monarchinitiative.org/latest/tsv/gene_associations/index.html but if those data are not reliable, do we have another option?
Hi @chapplec,
We aren't producing those nice association subset files from the new pipeline yet, but we do plan to.
You can get all associations in tsv format from monarch-kg_edges.tsv within https://data.monarchinitiative.org/monarch-kg-dev/latest/monarch-kg.tar.gz, and then subset on the category field for biolink:GeneToDiseaseAssociation. You may also want to subset on the predicate field.
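If you would rather stay with the delimited file, the subsetting described above might look roughly like the following. It is a sketch under assumptions: that monarch-kg_edges.tsv is tab-separated with a header row containing `category` and `predicate` columns (mirroring the sqlite `edges` table), and `filter_edges` is a hypothetical helper name.

```python
import csv


def filter_edges(handle, category, predicates=None):
    """Yield edge rows (as dicts) matching a category and, optionally, predicates.

    Assumes a tab-separated file with a header row that includes
    `category` and `predicate` columns.
    """
    # QUOTE_NONE so stray quote characters in free-text columns are kept as-is.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row.get("category") != category:
            continue
        if predicates is not None and row.get("predicate") not in predicates:
            continue
        yield row


if __name__ == "__main__":
    # File name taken from the tar.gz mentioned above; path is an assumption.
    with open("monarch-kg_edges.tsv", newline="", encoding="utf-8") as fh:
        for edge in filter_edges(fh, "biolink:GeneToDiseaseAssociation"):
            print(edge["subject"], edge["predicate"], edge["object"], sep="\t")
```

Reading row by row keeps memory use flat, which matters for a file that holds every edge in the graph.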
The new graph is intentionally more limited in gene to disease associations (currently only data from OMIM) and has predicates (in biolink, which is equivalent to relation in the OBAN model) that are more accurate / cautious, particularly with respect to claims of causation.
I know that I prefer to use delimited files for pipelines, but I'm going to go back to the sqlite database again for quick subsetting:
Quickly, these are the two predicates we're using. biolink:risk_affected_by is the stronger assertion.
sqlite3 -markdown monarch-kg.db "select distinct predicate from edges where category = 'biolink:GeneToDiseaseAssociation'"
predicate |
---|
biolink:gene_associated_with_condition |
biolink:risk_affected_by |
You can get them together with
sqlite3 -markdown monarch-kg.db "select subject, predicate, object from edges where category = 'biolink:GeneToDiseaseAssociation' limit 10"
subject | predicate | object |
---|---|---|
HGNC:2593 | biolink:gene_associated_with_condition | MONDO:0008730 |
HGNC:2593 | biolink:gene_associated_with_condition | MONDO:0008730 |
HGNC:26404 | biolink:gene_associated_with_condition | MONDO:0014464 |
HGNC:91 | biolink:gene_associated_with_condition | MONDO:0012392 |
HGNC:21024 | biolink:gene_associated_with_condition | MONDO:0010117 |
HGNC:29092 | biolink:gene_associated_with_condition | MONDO:0013039 |
HGNC:25367 | biolink:gene_associated_with_condition | MONDO:0013627 |
HGNC:6936 | biolink:gene_associated_with_condition | MONDO:0008861 |
HGNC:6937 | biolink:gene_associated_with_condition | MONDO:0008862 |
HGNC:4799 | biolink:gene_associated_with_condition | MONDO:0017715 |
You likely have a way that you'd prefer to subset the tsv files as part of a pipeline, but just to show it quickly as an sqlite3 one-liner (I'll attach the resulting file):
sqlite3 monarch-kg.db -cmd ".mode tabs" -cmd ".headers on" "select * from edges where category = 'biolink:GeneToDiseaseAssociation'" > gene_disease.tsv
Finally, we are still in an awkward position between the old system, which is becoming outdated, and the new one, which is under development and naturally still has bugs to be discovered. Unfortunately, I noticed that within the file I attached, there are 127 rows where there is an HGNC curie in the disease column (object, for these associations). I created an issue for this bug, and we'll get it fixed ASAP.
Sorry, @kevinschaper I just saw this! We'll have a look and see if we can work with what you've given us. Thanks for responding!
I can give a little bit of an update on the odd G2D associations too: we have both MONDO to MONDO and HGNC to HGNC associations getting created as gene-to-disease. That investigation is happening in https://github.com/monarch-initiative/monarch-app/issues/721.
One thing that you can do with those records is look at the original_subject & original_object columns, which show the IDs from the step before they go through our ID mapping process, but probably the safest thing to do is exclude them.
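Excluding those records could be done with a simple prefix check. This sketch assumes that a well-formed gene-to-disease edge in this graph has an HGNC subject and a MONDO object, as in the sample rows earlier in the thread; if other prefixes are legitimate, the check would need to be loosened. Both function names are made up for illustration.

```python
def is_expected_gene_to_disease(row):
    """True when the edge looks like gene (HGNC) -> disease (MONDO).

    Assumption: valid gene-to-disease edges use exactly these two prefixes.
    """
    return row["subject"].startswith("HGNC:") and row["object"].startswith("MONDO:")


def split_edges(rows):
    """Separate rows into (kept, rejected) so rejected edges can be reviewed."""
    kept, rejected = [], []
    for row in rows:
        if is_expected_gene_to_disease(row):
            kept.append(row)
        else:
            rejected.append(row)
    return kept, rejected
```

Keeping the rejected rows, rather than silently dropping them, makes it easy to count them and compare against later graph builds.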
Hello, I visited the Monarch website for gene AP1G1 https://monarchinitiative.org/gene/HGNC:555#phenotype and I see only some EFO phenotypes being reported.
I would expect to find related HPO phenotypes, based on this page https://hpo.jax.org/app/browse/gene/164.
Is this expected? Or some kind of bug?