monarch-initiative / helpdesk

The Monarch Initiative Helpdesk
BSD 3-Clause "New" or "Revised" License

Missing phenotypes for gene AP1G1 #71

Open Alx-Kouris opened 1 year ago

Alx-Kouris commented 1 year ago

Hello, I visited the Monarch website for gene AP1G1 https://monarchinitiative.org/gene/HGNC:555#phenotype and I see only some EFO phenotypes being reported.

I would expect to find related HPO phenotypes, based on this page: https://hpo.jax.org/app/browse/gene/164.

Is this expected? Or some kind of bug?

sagehrke commented 1 year ago

Hi @Alx-Kouris, Thank you for submitting a question to the Monarch HelpDesk! We sincerely appreciate you taking the time to reach out and are working on an answer to share with you soon.

kevinschaper commented 1 year ago

Hi @Alx-Kouris,

Unfortunately, I think what you're seeing is just the age of the data on our production site and in our production graph.

We're working to rebuild our stack, starting from the graph and moving up through the API and website, so I can see that the data is present, but we've got a few months before we'll be showing it.

Pulling the development sqlite database artifact from https://data.monarchinitiative.org/monarch-kg-dev/latest/index.html I can see:

sqlite3 -markdown monarch-kg.db "select subject, predicate, object from edges where subject = 'HGNC:555' and predicate = 'biolink:has_phenotype'"
| subject | predicate | object |
|---|---|---|
| HGNC:555 | biolink:has_phenotype | HP:0001252 |
| HGNC:555 | biolink:has_phenotype | HP:0000286 |
| HGNC:555 | biolink:has_phenotype | HP:0000316 |
| HGNC:555 | biolink:has_phenotype | HP:0030953 |
| HGNC:555 | biolink:has_phenotype | HP:0001274 |
| HGNC:555 | biolink:has_phenotype | HP:0000007 |
| HGNC:555 | biolink:has_phenotype | HP:0000767 |
| HGNC:555 | biolink:has_phenotype | HP:0001263 |
| HGNC:555 | biolink:has_phenotype | HP:0000718 |
| HGNC:555 | biolink:has_phenotype | HP:0001249 |
| HGNC:555 | biolink:has_phenotype | HP:0001250 |
| HGNC:555 | biolink:has_phenotype | HP:0000358 |
| HGNC:555 | biolink:has_phenotype | HP:0001257 |
| HGNC:555 | biolink:has_phenotype | HP:0000369 |
| HGNC:555 | biolink:has_phenotype | HP:0001388 |
| HGNC:555 | biolink:has_phenotype | HP:0000336 |
| HGNC:555 | biolink:has_phenotype | HP:0000218 |
| HGNC:555 | biolink:has_phenotype | HP:0004626 |
| HGNC:555 | biolink:has_phenotype | HP:0000750 |
| HGNC:555 | biolink:has_phenotype | HP:0001250 |
| HGNC:555 | biolink:has_phenotype | HP:0004209 |
| HGNC:555 | biolink:has_phenotype | HP:0000716 |
| HGNC:555 | biolink:has_phenotype | HP:0001252 |
| HGNC:555 | biolink:has_phenotype | HP:0002007 |
| HGNC:555 | biolink:has_phenotype | HP:0001249 |
| HGNC:555 | biolink:has_phenotype | HP:0000262 |
| HGNC:555 | biolink:has_phenotype | HP:0000739 |
| HGNC:555 | biolink:has_phenotype | HP:0001263 |
| HGNC:555 | biolink:has_phenotype | HP:0011220 |
| HGNC:555 | biolink:has_phenotype | HP:0000343 |
| HGNC:555 | biolink:has_phenotype | HP:0000565 |
| HGNC:555 | biolink:has_phenotype | HP:0030820 |
| HGNC:555 | biolink:has_phenotype | HP:0000722 |
| HGNC:555 | biolink:has_phenotype | HP:0002942 |
| HGNC:555 | biolink:has_phenotype | HP:0000750 |
| HGNC:555 | biolink:has_phenotype | HP:0000768 |
| HGNC:555 | biolink:has_phenotype | HP:0012803 |
| HGNC:555 | biolink:has_phenotype | HP:0000718 |
| HGNC:555 | biolink:has_phenotype | HP:0002938 |
| HGNC:555 | biolink:has_phenotype | HP:0000006 |
| HGNC:555 | biolink:has_phenotype | HP:0009381 |
| HGNC:555 | biolink:has_phenotype | HP:0004691 |
| HGNC:555 | biolink:has_phenotype | HP:0001257 |
| HGNC:555 | biolink:has_phenotype | HP:0001763 |
| HGNC:555 | biolink:has_phenotype | HP:0100716 |
| HGNC:555 | biolink:has_phenotype | HP:0000752 |
| HGNC:555 | biolink:has_phenotype | HP:0000646 |
| HGNC:555 | biolink:has_phenotype | HP:0000729 |

Hopefully that at least looks good.
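
For anyone who needs this lookup for many genes at once, here is a minimal Python sketch of the same query. It assumes only what the command above shows: an `edges` table with `subject`, `predicate` and `object` columns in `monarch-kg.db`. The function name `phenotypes_for_genes` is made up for illustration. Because the output above contains repeated rows (HP:0001252 appears twice, for example), the sketch collects results into sets.

```python
import sqlite3
from collections import defaultdict
from typing import Dict, Iterable, Set


def phenotypes_for_genes(
    conn: sqlite3.Connection, gene_ids: Iterable[str]
) -> Dict[str, Set[str]]:
    """Map each gene ID to the set of phenotype IDs linked by biolink:has_phenotype.

    Hypothetical helper: assumes an `edges` table with subject/predicate/object
    columns, as in the sqlite3 command shown in this thread.
    Genes with no matching edges are simply absent from the result.
    """
    query = (
        "SELECT DISTINCT subject, object FROM edges "
        "WHERE subject = ? AND predicate = 'biolink:has_phenotype'"
    )
    result: Dict[str, Set[str]] = defaultdict(set)
    for gene_id in gene_ids:
        for subject, phenotype in conn.execute(query, (gene_id,)):
            result[subject].add(phenotype)
    return dict(result)


# Usage (path is an assumption; point it at the downloaded artifact):
# conn = sqlite3.connect("monarch-kg.db")
# print(phenotypes_for_genes(conn, ["HGNC:555"]))
```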

Thank you for pointing out the discrepancy and submitting an issue, and hopefully we'll at least have a beta for the new API & site to look at soon.

chapplec commented 1 year ago

Thank you for the quick response @kevinschaper. However, we (I work with Alex who asked the question here) deal with thousands of genes in a high throughput way. Is there a way for us to download the correct data? We have been downloading from https://data.monarchinitiative.org/latest/tsv/gene_associations/index.html but if those data are not reliable, do we have another option?

  1. Can we download correct gene-disease associations from somewhere, or do we need to wait for your fix?
  2. If we cannot get correct data, is there a way for us to identify which genes have this issue so we can at least flag the bad data in our system?

kevinschaper commented 1 year ago

Hi @chapplec,

We aren't producing those nice association subset files from the new pipeline yet, but we do plan to.

You can get all associations in TSV format from monarch-kg_edges.tsv within https://data.monarchinitiative.org/monarch-kg-dev/latest/monarch-kg.tar.gz, and then subset on the category field for biolink:GeneToDiseaseAssociation. You may also want to subset on the predicate field.
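
As a rough illustration of that subsetting step, here is a minimal Python sketch that streams the edges file and keeps only rows of one category, optionally narrowed by predicate. It assumes the TSV has a header row containing `category` and `predicate` columns, as the field names above suggest. The function name `subset_edges` is hypothetical.

```python
import csv
from typing import Dict, Iterable, Iterator, Optional, Set

G2D_CATEGORY = "biolink:GeneToDiseaseAssociation"


def subset_edges(
    lines: Iterable[str],
    category: str = G2D_CATEGORY,
    predicates: Optional[Set[str]] = None,
) -> Iterator[Dict[str, str]]:
    """Yield edge rows matching `category` (and `predicates`, if given).

    Hypothetical helper: assumes a tab-delimited file whose header row
    includes `category` and `predicate` columns.
    """
    # QUOTE_NONE treats quote characters as ordinary data.
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row.get("category") != category:
            continue
        if predicates is not None and row.get("predicate") not in predicates:
            continue
        yield row


# Usage (after extracting the tarball; the file is read one row at a time):
# with open("monarch-kg_edges.tsv", newline="") as fh:
#     for row in subset_edges(fh, predicates={"biolink:risk_affected_by"}):
#         print(row["subject"], row["predicate"], row["object"])
```

Streaming row by row avoids loading the full edges file into memory, which matters when the same filter runs over the whole graph.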

The new graph is intentionally more limited in gene to disease associations (currently only data from OMIM) and has predicates (in biolink, which is equivalent to relation in the OBAN model) that are more accurate / cautious, in particular with respect to claims of causation.

I know that I prefer to use delimited files for pipelines, but I'm going to go back to the sqlite database again for quick subsetting:

Quickly, these are the two predicates we're using. biolink:risk_affected_by is the stronger assertion.

sqlite3 -markdown monarch-kg.db "select distinct predicate from edges where category = 'biolink:GeneToDiseaseAssociation'"
| predicate |
|---|
| biolink:gene_associated_with_condition |
| biolink:risk_affected_by |

You can get them together with

sqlite3 -markdown monarch-kg.db "select subject, predicate, object from edges where category = 'biolink:GeneToDiseaseAssociation' limit 10"
| subject | predicate | object |
|---|---|---|
| HGNC:2593 | biolink:gene_associated_with_condition | MONDO:0008730 |
| HGNC:2593 | biolink:gene_associated_with_condition | MONDO:0008730 |
| HGNC:26404 | biolink:gene_associated_with_condition | MONDO:0014464 |
| HGNC:91 | biolink:gene_associated_with_condition | MONDO:0012392 |
| HGNC:21024 | biolink:gene_associated_with_condition | MONDO:0010117 |
| HGNC:29092 | biolink:gene_associated_with_condition | MONDO:0013039 |
| HGNC:25367 | biolink:gene_associated_with_condition | MONDO:0013627 |
| HGNC:6936 | biolink:gene_associated_with_condition | MONDO:0008861 |
| HGNC:6937 | biolink:gene_associated_with_condition | MONDO:0008862 |
| HGNC:4799 | biolink:gene_associated_with_condition | MONDO:0017715 |

You likely have a way that you'd prefer to subset the tsv files as part of a pipeline, but just to show it quickly, here it is as an sqlite3 one-liner, and I'll attach the file:

sqlite3 monarch-kg.db -cmd ".mode tabs" -cmd ".headers on" "select * from edges where category = 'biolink:GeneToDiseaseAssociation'" > gene_disease.tsv

gene_disease.tsv.gz
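
If the export needs to run from inside a Python pipeline rather than the shell, a sketch along these lines does roughly the same job as the one-liner. It assumes the `edges` table and `category` column used in the command above. The function name `export_edges_tsv` is hypothetical. The output is not guaranteed to be byte-identical to `.mode tabs`, since `csv.writer` quotes any field that itself contains a tab or quote character.

```python
import csv
import sqlite3
from typing import TextIO


def export_edges_tsv(
    conn: sqlite3.Connection,
    out: TextIO,
    category: str = "biolink:GeneToDiseaseAssociation",
) -> int:
    """Write all edges of `category` to `out` as TSV with a header row.

    Hypothetical helper; returns the number of data rows written.
    """
    cursor = conn.execute("SELECT * FROM edges WHERE category = ?", (category,))
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    # Column names come from the cursor, so no schema is hard-coded here.
    writer.writerow([column[0] for column in cursor.description])
    count = 0
    for row in cursor:
        writer.writerow(row)  # NULL values are written as empty strings
        count += 1
    return count


# Usage (paths are assumptions):
# conn = sqlite3.connect("monarch-kg.db")
# with open("gene_disease.tsv", "w", newline="") as fh:
#     print(export_edges_tsv(conn, fh), "rows written")
```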

Finally, we are still in an awkward position between the old system, which is becoming outdated, and the new, which is under development and naturally still has bugs to be discovered. Unfortunately, I noticed that within the file I attached, there are 127 rows where there is an HGNC CURIE in the disease column (object, for these associations). I created an issue for this bug, and we'll get it fixed ASAP.

chapplec commented 1 year ago

Sorry, @kevinschaper I just saw this! We'll have a look and see if we can work with what you've given us. Thanks for responding!

kevinschaper commented 1 year ago

I can give a little bit of an update on the odd G2D associations too: we have both MONDO to MONDO and HGNC to HGNC associations getting created as gene-to-disease. That investigation is happening in https://github.com/monarch-initiative/monarch-app/issues/721.

One thing that you can do with those records is look at the original_subject & original_object columns, which show the IDs as they were before going through our ID mapping process, but probably the safest thing to do is exclude them.
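
One way to exclude (or at least flag) those records is a simple prefix check. This is a minimal sketch, assuming that well-formed gene-to-disease rows have an `HGNC:` subject and a `MONDO:` object, as in the sample output earlier in this thread. That assumption may be too strict if other gene or disease ID prefixes are legitimately present, so the prefixes are parameters. The function name `split_suspect_rows` is hypothetical.

```python
from typing import Iterable, List, Mapping, Tuple

Row = Mapping[str, str]


def split_suspect_rows(
    rows: Iterable[Row],
    subject_prefix: str = "HGNC:",
    object_prefix: str = "MONDO:",
) -> Tuple[List[Row], List[Row]]:
    """Split gene-to-disease rows into (clean, suspect) lists.

    A row counts as clean only if its subject looks like a gene ID and its
    object looks like a disease ID. The default prefixes are an assumption
    based on the sample rows in this thread.
    """
    clean: List[Row] = []
    suspect: List[Row] = []
    for row in rows:
        is_clean = row["subject"].startswith(subject_prefix) and row[
            "object"
        ].startswith(object_prefix)
        (clean if is_clean else suspect).append(row)
    return clean, suspect


# Usage with rows read from gene_disease.tsv via csv.DictReader:
# clean, suspect = split_suspect_rows(rows)
# print(len(suspect), "rows flagged for review")
```

Keeping the suspect rows in a separate list, rather than dropping them silently, leaves room to inspect their original_subject and original_object values later.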