openphacts / GLOBAL

Global project issues [private for now. owner lee harland]

Missing data for diseases not linked to proteins #45

Open leeharland opened 10 years ago

leeharland commented 10 years ago

[from daniela] In addition to the non-matching counts for diseases and associations, it seems we are also missing disease info for genes that are not connected to proteins. E.g. DisGeNET returns "http://rdf.imim.es/disgenet-gene-disease-association.ttl#disgenet_resource:association_geneid:387578_umls:C0920350" "http://linkedlifedata.com/resource/umls/id/C0920350" "Thyroiditis, Autoimmune"

while the corresponding API call https://beta.openphacts.org/1.4/disease/byTarget?uri=http%3A%2F%2Fidentifiers.org%2Fncbigene%2F387578&app_id=15a18100&app_key=528a8272f1cd961d215f318a0315dd3d&_pageSize=all&_format=json gives a Page Not Found.
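For reference, the failing request can be rebuilt from its parts. This is a minimal sketch that assumes only the endpoint and query parameters visible in the URL above; the `build_by_target_url` helper and the placeholder credentials are hypothetical, and no request is actually sent:

```python
from urllib.parse import urlencode

# Endpoint taken from the URL quoted in this issue.
BASE = "https://beta.openphacts.org/1.4/disease/byTarget"


def build_by_target_url(gene_id, app_id, app_key):
    """Build the disease/byTarget URL for an NCBI gene ID (hypothetical helper)."""
    params = {
        "uri": f"http://identifiers.org/ncbigene/{gene_id}",
        "app_id": app_id,
        "app_key": app_key,
        "_pageSize": "all",
        "_format": "json",
    }
    # urlencode percent-encodes the gene URI so it can be passed as a parameter.
    return f"{BASE}?{urlencode(params)}"


url = build_by_target_url(387578, "YOUR_APP_ID", "YOUR_APP_KEY")
print(url)
```

Swapping the gene ID makes it easy to try the same call for other genes that have no protein attached.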

Chris-Evelo commented 10 years ago

Strange… I don’t really understand Disgenet’s behaviour here.

Disgenet seems to return a lot of things associated with other UMLS terms than the one requested. Is that normal?

Also for UMLS C0920350 it returns: http://www.ncbi.nlm.nih.gov/gene/7276 (via identifiers.org) transthyretin, which is a thyroid hormone transporting albumin. The disease, however, is an autoimmune disease that destroys the gland. The relationship with the transporter seems far-fetched to me.


NuriaQueralt commented 10 years ago

Hi,

On 04/23/2014 10:29 AM, Chris Evelo wrote:

> Strange… I don’t really understand Disgenet’s behaviour here.
>
> Disgenet seems to return a lot of things associated with other UMLS terms than the one requested. Is that normal?

Could you post an example, so that we can analyse the problem more precisely?

> Also for UMLS C0920350 it returns: http://www.ncbi.nlm.nih.gov/gene/7276 (via identifiers.org) transthyretin, which is a thyroid hormone transporting albumin. The disease however is an autoimmune disease that destroys the gland. The relationship with the transporter seems far fetched to me.

This association, umls:C0920350-geneid:7276, has only one DisGeNET entry, which comes originally from GAD and has only one supporting publication. See the data at:

http://rdf.imim.es/describe/?url=http%3A%2F%2Frdf.imim.es%2Fdisgenet-gene-disease-association.ttl%23disgenet_resource%3Aassociation_geneid%3A7276_umls%3AC0920350

A comment on the GAD data: while testing the DisGeNET gene-disease data, we detected some inconsistencies coming from GAD. To overcome them, we have changed the phenotype mapping methodology applied to GAD diseases. I have checked whether this association appears in the new release of DisGeNET, and it does not.

best, n



Núria Queralt Rosinach Research Programme on Biomedical Informatics (GRIB) Department of Experimental and Health Sciences Universitat Pompeu Fabra IMIM (Hospital del Mar Medical Research Institute) C/Dr. Aiguader 88, 08003 Barcelona, Spain Tel.: +34 93 316 0536 (1536) E-mail: nqueralt@imim.es Skype IM: nuriaqr76 http://ibi.imim.es/


Chris-Evelo commented 10 years ago

Yes, but the request at the top was for a different gene (387578), which returns a lot of results when I execute it as given there.

NuriaQueralt commented 8 years ago

Yes, I confirm that this issue remains in the 1.5 API.

On the one hand, all the disease counts I checked are 0 in the API for disease genes without annotated proteins. This should be fixed, since every gene in DisGeNET has at least one associated disease.

Bear in mind that not all genes have annotated proteins. In this regard, would it be more intuitive to require a gene URI as input rather than a protein URI, at least for the DisGeNET calls?

On the other hand, the disease counts I checked in the API for disease genes with annotated proteins are correct.

cheers, n

On 10/25/2015 11:35 PM, Nick Lynch wrote:

Assigned #45 https://github.com/openphacts/GLOBAL/issues/45 to @NuriaQueralt https://github.com/NuriaQueralt.



danidi commented 8 years ago

I agree. It would be better to have void:inDataset http://purl.uniprot.org optional for the disease calls.

NuriaQueralt commented 8 years ago

In DisGeNET for OPS there are 16854 genes associated with diseases in total, of which 2827 have no associated UniProt ID. The following are the top 10 disease genes without a UniProt ID, with their total number of associated diseases in DisGeNET:

+-----------+------------+
| geneid    | # diseases |
+-----------+------------+
| 3342      | 513        |
| 4397      | 226        |
| 50818     | 194        |
| 283120    | 153        |
| 8163      | 131        |
| 2477      | 125        |
| 101669765 | 120        |
| 406991    | 110        |
| 450095    | 109        |
| 3492      | 109        |
+-----------+------------+

Currently the API call 'diseases4Target::count' returns 0 for these genes, which it shouldn't.
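For scale, the share of disease genes affected can be worked out from the two totals quoted above. A quick sketch using only those figures (the variable names are just illustrative):

```python
total_genes = 16854           # genes associated with diseases in DisGeNET for OPS
genes_without_uniprot = 2827  # of those, genes with no UniProt ID

share = genes_without_uniprot / total_genes
print(f"{share:.1%} of disease genes have no UniProt ID")
```

That comes to about 16.8% of the disease genes, assuming the zero count affects every gene without a UniProt ID and not only the ones checked here.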

Regarding genes with an associated UniProt ID in DisGeNET, I checked some, and the DisGeNET and OPS API disease counts always agree:

+--------+------------------------+-------------------+
| geneid | # diseases in DisGeNET | # diseases in OPS |
+--------+------------------------+-------------------+
| 7157   | 1540                   | 1540              |
| 7124   | 1353                   | 1353              |
| 3569   | 1075                   | 1075              |
| 7422   | 894                    | 894               |
| 920    | 809                    | 809               |
| 3586   | 795                    | 795               |
| 1437   | 783                    | 783               |
| 1029   | 762                    | 762               |
| 3630   | 738                    | 738               |
| 3458   | 731                    | 731               |
+--------+------------------------+-------------------+

I also checked some cases for the 'targets4Disease::count' API call, and the total number of genes per disease is correct, even when some of the genes involved have no UniProt ID:

+-----------+---------+-------------------------+
| diseaseid | # genes | # genes without uniprot |
+-----------+---------+-------------------------+
| C0006826  | 5102    | 254                     |
| C0006142  | 3325    | 142                     |
| C0040336  | 3112    | 41                      |
| C0678222  | 3072    | 130                     |
+-----------+---------+-------------------------+

Finally, I think the recommendation from @danidi has still not been included: it would be better to have void:inDataset http://purl.uniprot.org optional for the disease calls.

NuriaQueralt commented 8 years ago

The same happens for the 'associations4target:count' API call: it returns zero in cases where it shouldn't. For example, the following counts for genes without UniProt in DisGeNET came back as 0 from the API, even when the gene itself is the input (http://identifiers.org/ncbigene/geneid):

+--------+----------------+
| geneid | # associations |
+--------+----------------+
| 3      | 1              |
| 17     | 3              |
| 46     | 4              |
| 62     | 3              |
| 68     | 1              |
+--------+----------------+

In contrast, the count is correct when I input a gene with a UniProt ID, e.g. http://identifiers.org/ncbigene/1 returns 10.
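A check like this could be scripted so it can be rerun after a fix. The sketch below is only an illustration: `find_zero_counts` and `stub_fetch` are hypothetical names, and the stub stands in for the real API call by returning 0, mimicking the behaviour reported here, rather than contacting the server.

```python
from typing import Callable, Dict, List


def gene_uri(gene_id: int) -> str:
    """identifiers.org URI for an NCBI gene ID, as used for the API input."""
    return f"http://identifiers.org/ncbigene/{gene_id}"


def find_zero_counts(expected: Dict[int, int],
                     fetch_count: Callable[[str], int]) -> List[int]:
    """Return gene IDs whose API count is 0 although DisGeNET reports > 0."""
    return [gid for gid, n in expected.items()
            if n > 0 and fetch_count(gene_uri(gid)) == 0]


# Association counts in DisGeNET for genes without UniProt (from the table above).
disgenet_counts = {3: 1, 17: 3, 46: 4, 62: 3, 68: 1}


def stub_fetch(uri: str) -> int:
    # Stand-in for the real API call; always returns 0, as observed for these genes.
    return 0


print(find_zero_counts(disgenet_counts, stub_fetch))
```

Replacing `stub_fetch` with a function that queries the API would turn this into a small regression check; how that function parses the count out of the response is left open here.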

NuriaQueralt commented 8 years ago

For the 'associations4disease:count' API call, the following diseases returned an incorrect count, which may be due to the same issue:

+-----------+---------------------+----------------+
| diseaseid | # assoc in DisGeNET | # assoc in OPS |
+-----------+---------------------+----------------+
| C0004238  | 164                 | 162            |
| C0000737  | 29                  | 27             |
| C0000744  | 134                 | 132            |
| C0000768  | 1598                | 1495           |
+-----------+---------------------+----------------+
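The size of each gap can be read off directly. A small sketch using only the figures in the table above; whether the missing associations are exactly the ones involving genes without UniProt is an assumption that is not checked here:

```python
# (associations in DisGeNET, associations in OPS), copied from the table above.
counts = {
    "C0004238": (164, 162),
    "C0000737": (29, 27),
    "C0000744": (134, 132),
    "C0000768": (1598, 1495),
}

# Number of associations present in DisGeNET but not returned by OPS.
missing = {disease: dis - ops for disease, (dis, ops) in counts.items()}

for disease, n in missing.items():
    print(f"{disease}: {n} associations missing in OPS")
```

Three of the four diseases are short by 2 associations, while C0000768 is short by 103.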