openphacts / GLOBAL

Global project issues [private for now. owner lee harland]
Non-matching gene-disease counts #44

Closed leeharland closed 8 years ago

leeharland commented 10 years ago

[from daniela] For the same gene, if I compare the counts for Diseases for target and Associations for Target, I get different results: "extendedMetadataVersion": "https://beta.openphacts.org/1.4/disease/byTarget/count?uri=http%3A%2F%2Fidentifiers.org%2Fncbigene%2F6772&app_id=f91c5b2b&app_key=18a5d823d0e4933ac5fe22a3d52974c1&_metadata=all%2Cviews%2Cformats%2Cexecution%2Cbindings%2Csite", "primaryTopic": { "_about": "http://identifiers.org/ncbigene/6772", "diseaseCount": 35

"extendedMetadataVersion": "https://beta.openphacts.org/1.4/disease/assoc/byTarget/count?uri=http%3A%2F%2Fidentifiers.org%2Fncbigene%2F6772&app_id=f91c5b2b&app_key=18a5d823d0e4933ac5fe22a3d52974c1&_metadata=all%2Cviews%2Cformats%2Cexecution%2Cbindings%2Csite", "primaryTopic": { "_about": "http://identifiers.org/ncbigene/6772", "associationsCount": 33

I’m not sure I really understand what the difference between the diseases and associations calls is. I thought the diseases call just returns a list of diseases for a target, and the associations call then returns some more data around it. Can we actually get a disease back without having an association for it?

The diseases which are not returned by the associations call in this case are http://identifiers.org/omim/614162 and http://linkedlifedata.com/resource/umls/id/C3151088 (the first and the last disease in the list). Association for disease calls for these two diseases return a 404 response code, while target for disease returns data.
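The comparison described here can be repeated offline once both responses have been saved. This is a minimal sketch, assuming the disease URIs have already been extracted from each response into plain lists; the function name `missing_from_associations` and the placeholder URI are illustrative and not part of the Open PHACTS API.

```python
def missing_from_associations(disease_uris, association_disease_uris):
    """Return disease URIs listed by the diseases call but absent from the associations."""
    seen = set(association_disease_uris)
    # Keep the order of the diseases call so the result is easy to compare by eye.
    return [uri for uri in disease_uris if uri not in seen]


# Illustrative input only: the two URIs reported as missing, plus one
# hypothetical placeholder disease that appears in both responses.
diseases = [
    "http://identifiers.org/omim/614162",
    "http://example.org/disease/present-in-both",  # hypothetical placeholder
    "http://linkedlifedata.com/resource/umls/id/C3151088",
]
associations = ["http://example.org/disease/present-in-both"]

print(missing_from_associations(diseases, associations))
```

With the real responses, the two input lists would hold 35 and 33 entries respectively.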

leeharland commented 10 years ago

Also from Daniela: It seems that http://identifiers.org/omim/614162 is mapped to many different identifiers (including proteins such as http://www.uniprot.org/uniprot/P42224, which I guess shouldn’t be the case), while http://linkedlifedata.com/resource/umls/id/C3151088 is not mapped to anything. Antonis, can this cause problems in the OPS API when retrieving the associations?

Chris-Evelo commented 10 years ago

Not sure what the “many different identifiers” are that the OMIM entry maps to. But they should obviously both map to stat1. P42224 actually is stat1-human, so I would think that one is OK. The OMIM entry mentions other genes related to the phenotype; could that be what causes the problem (if it is one)?

Obviously the UMLS entry should also map to stat1 (possibly different identifiers for it), since the disease really is “stat1 deficiency”

Chris

NuriaQueralt commented 10 years ago

From DisGeNET: These two different diseases ("CANDIDIASIS, FAMILIAL, 7" (omim:614162) and "STAT1 DEFICIENCY, COMPLETE" (umls:C3151088)) are associated with the same gene ("STAT1 signal transducer and activator of transcription 1, 91kDa [ Homo sapiens (human) ]" geneID "6772"):

Disease1: http://identifiers.org/omim/614162 is associated with only one gene: http://identifiers.org/ncbigene/6772

see gene-disease association data: http://rdf.imim.es/describe/?url=http%3A%2F%2Fidentifiers.org%2Fomim%2F614162&sid=23

see disease data: http://rdf.imim.es/describe/?url=http%3A%2F%2Fidentifiers.org%2Fomim%2F614162&sid=23

Disease2: http://linkedlifedata.com/resource/umls/id/C3151088 is associated with only one gene: http://identifiers.org/ncbigene/6772

see gene-disease association data: http://rdf.imim.es/describe/?url=http%3A%2F%2Frdf.imim.es%2Fdisgenet-gene-disease-association.ttl%23disgenet_resource%3Aassociation_geneid%3A6772_umls%3AC3151088&sid=23

see disease data: http://rdf.imim.es/describe/?url=http%3A%2F%2Flinkedlifedata.com%2Fresource%2Fumls%2Fid%2FC3151088&sid=23

You can see the mappings for "http://identifiers.org/ncbigene/6772" at:

http://rdf.imim.es/describe/?url=http%3A%2F%2Fidentifiers.org%2Fncbigene%2F6772&sid=23

where you can see that it is mapped to "http://identifiers.org/uniprot/P42224".

If you want to browse DisGeNET data more easily, you can go to our faceted browser, where the DisGeNET data implemented in OPS is the default browsable data:

DisGeNET Faceted browser: http://rdf.imim.es/fct/

I hope this helps. best, Núria

NuriaQueralt commented 10 years ago

Hi,

I looked at DisGeNET data and these two diseases should map to stat1 P42224. I posted a comment in the [GLOBAL] entry with more detailed data.

On 04/22/2014 11:17 PM, Chris Evelo wrote:

Not sure what “many different identifiers” are that the omim entry maps to. But they should obviously both map to stat1 P42224 actually is stat1-human so I would think that one is OK. The OMIM entry mentions other genes related to the phenotype, could that be what causes the problem (if it is one)?

The OMIM phenotype entry (614162) is only annotated to this gene, 6772 (Entrez GeneID). The same holds for the UMLS phenotype entry (C3151088). These two gene-disease associations are two different entries in DisGeNET that both come from UniProt, and they have different supporting PubMed publications.

Obviously the UMLS entry should also map to stat1 (possibly different identifiers for it), since the disease really is “stat1 deficiency”

This is how it is in the DisGeNET data. So, the API should map the UMLS phenotype to the protein P42224.

Núria

Chris

— Dr. Chris Evelo Dept. Head Bioinformatics – BiGCaT Maastricht University



Núria Queralt Rosinach Research Programme on Biomedical Informatics (GRIB) Department of Experimental and Health Sciences Universitat Pompeu Fabra IMIM (Hospital del Mar Medical Research Institute) C/Dr. Aiguader 88, 08003 Barcelona, Spain Tel.: +34 93 316 0536 (1536) E-mail: nqueralt@imim.es Skype IM: nuriaqr76 http://ibi.imim.es/


danidi commented 10 years ago

"Not sure what “many different identifiers” are that the omim entry maps to." Here I was referring to mappings by the IMS. I was astonished to see a disease directly mapped to a protein ID. I thought this information should be retrieved from the rdf by disgenet, but not by directly mapping the identifiers.

danidi commented 10 years ago

For 70 different gene IDs I checked the counts of the diseases and the associations respectively. Only 2 had differences in the counts:

http://identifiers.org/ncbigene/51181 misses the association for PENTOSURIA (http://linkedlifedata.com/resource/umls/id/C0268162)

http://identifiers.org/ncbigene/6772 (mentioned above) misses the association for CANDIDIASIS, FAMILIAL, 7 (http://identifiers.org/omim/614162) and STAT1 DEFICIENCY, COMPLETE (http://linkedlifedata.com/resource/umls/id/C3151088)

Interestingly, only the diseases in capital letters are missing from the associations. Are these from a different datasource, or have a different format somehow?

NuriaQueralt commented 10 years ago

Hi,

In DisGeNET, once all disease terms from the original data sources are mapped to UMLS concepts, we apply a decision protocol to assign each UMLS concept one of the diverse disease names that it integrates from several vocabularies in the MRCONSO table of the UMLS Metathesaurus. UMLS concepts referring to disease phenotypes usually unify SNOMED, ICD9, MeSH, OMIM, NCI, ... clinical/biomedical terms, but not every UMLS concept is defined by terms from all of these vocabularies. Some UMLS concepts do not have MeSH terms, others do not have OMIM terms, and others have neither. For that reason, we have a decision protocol, following UMLS guidelines, to assign a name to each UMLS concept. When a disease name is in capital letters, it means that the name comes from OMIM. But that is all: capital letters only imply that the name comes from OMIM, and there are no implications for the format of the related data. I hope this helps.

Cheers, Núria



NuriaQueralt commented 10 years ago

Hi,

I answered Daniela's first question via email some time ago. Let me include it here to help solve this issue. Essentially, I checked that the counts match at the RDF data level, so I think the problem is with the APIs.

Hi Daniela,

On 04/14/2014 03:32 PM, Daniela Digles wrote:

Hi,

while checking the Disease API calls from OPS I found some issues:

For the same gene, if I compare the counts for Diseases for target and Associations for Target, I get different results: "extendedMetadataVersion": "https://beta.openphacts.org/1.4/disease/byTarget/count?uri=http%3A%2F%2Fidentifiers.org%2Fncbigene%2F6772&app_id=f91c5b2b&app_key=18a5d823d0e4933ac5fe22a3d52974c1&_metadata=all%2Cviews%2Cformats%2Cexecution%2Cbindings%2Csite", "primaryTopic": { "_about": "http://identifiers.org/ncbigene/6772", "diseaseCount": 35

"extendedMetadataVersion": "https://beta.openphacts.org/1.4/disease/assoc/byTarget/count?uri=http%3A%2F%2Fidentifiers.org%2Fncbigene%2F6772&app_id=f91c5b2b&app_key=18a5d823d0e4933ac5fe22a3d52974c1&_metadata=all%2Cviews%2Cformats%2Cexecution%2Cbindings%2Csite", "primaryTopic": { "_about": "http://identifiers.org/ncbigene/6772", "associationsCount": 33

I’m not sure I really understand what the difference of the diseases and associations calls are. I thought the diseases call just return a list of diseases for a target, and the associations call then returns some more data around it.

Yes, this is correct. For my part, I've checked the RDF data and it seems correct: I've checked the counts from our SPARQL endpoint [@ http://rdf.imim.es/sparql] and from our relational database, and the results are 35 for both queries and in both formats, RDF and MySQL. The queries I performed in our SPARQL endpoint were:

SPARQL QUERY for number of associations for geneID 6772

select count(distinct ?a) where { ?a sio:SIO_000628 <http://identifiers.org/ncbigene/6772> . } LIMIT 100

associations: 35

SPARQL QUERY for number of diseases for geneID 6772

select count(distinct ?b) where { ?a sio:SIO_000628 <http://identifiers.org/ncbigene/6772> . ?a sio:SIO_000628 ?b . ?b rdf:type ncit:C7057 . } LIMIT 100

diseases: 35
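For anyone who wants to rerun these two counts for another gene, the queries can be generated rather than edited by hand. This is a rough sketch, assuming the endpoint predefines the `sio:`, `ncit:` and `rdf:` prefixes (as the queries above imply) and accepts the standard `query` parameter of the SPARQL protocol; the helper names are made up for illustration. The `LIMIT 100` clause is dropped because a single count row is returned either way.

```python
from urllib.parse import urlencode

ENDPOINT = "http://rdf.imim.es/sparql"  # endpoint named in the message above


def count_associations_query(gene_uri):
    """Count distinct association nodes that refer to the gene."""
    return (
        "select count(distinct ?a) where { "
        f"?a sio:SIO_000628 <{gene_uri}> . "
        "}"
    )


def count_diseases_query(gene_uri):
    """Count distinct diseases reachable through the gene's associations."""
    return (
        "select count(distinct ?b) where { "
        f"?a sio:SIO_000628 <{gene_uri}> . "
        "?a sio:SIO_000628 ?b . "
        "?b rdf:type ncit:C7057 . "
        "}"
    )


def request_url(query, endpoint=ENDPOINT):
    """Build a GET URL; the query text is percent-encoded by urlencode."""
    return endpoint + "?" + urlencode({"query": query})


gene = "http://identifiers.org/ncbigene/6772"
print(count_associations_query(gene))
print(request_url(count_diseases_query(gene)))
```

Note that the gene URI is wrapped in angle brackets, which SPARQL requires for full IRIs.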

Can we actually get a disease back, but don’t have an association for it?

No, this shouldn't occur. Maybe it is something in the APIs?

cheers, Núria

NuriaQueralt commented 10 years ago

Daniela also mentioned:

in addition to the non-matching counts for diseases and associations, it seems we are also missing disease info on genes which are not connected to proteins. E.g. Disgenet returns "http://rdf.imim.es/disgenet-gene-disease-association.ttl#disgenet_resource:association_geneid:387578_umls:C0920350" "http://linkedlifedata.com/resource/umls/id/C0920350" "Thyroiditis, Autoimmune"

while the corresponding API call https://beta.openphacts.org/1.4/disease/byTarget?uri=http%3A%2F%2Fidentifiers.org%2Fncbigene%2F387578&app_id=15a18100&app_key=528a8272f1cd961d215f318a0315dd3d&_pageSize=all&_format=json gives a Page Not Found.

NuriaQueralt commented 10 years ago

From Christine and Daniela, with my answers embedded:

Both Daniela and I are confused about the differences between the Disease and the Association DisGeNet API calls from Open PHACTS.

The APIs targetsByDisease and diseasesByTarget give extra info related to the associated entities, e.g. if the input is a gene, the output is the list of associated diseases with their disease names, pref labels for the input target from cw... The APIs related to the associations are devoted to giving the context data of the association concept itself, which is mainly the supporting evidence of the association (original source of provenance, supporting pmids...). As one gene-disease association in DisGeNET is defined by the unique gene-disease pair, if gene x is associated with 4 diseases, there will be 4 different associations involving gene x in DisGeNET. (This definition of the association concept has changed in the upcoming version of DisGeNET RDF, so the APIs may need a little recoding to ensure the expected results, but this is a future concern.)

Should the “Target for Disease” and “Associations for Disease” calls give the same results?

Yes, the calls should give the same counts. As I answered in a previous email some time ago, I did some checks to investigate the origin of the issue: I performed the counts via SPARQL queries on our local RDF dataset (the version that is integrated in the OPS cache) and the counts match. I included my answer in the GitHub discussion thread in order to help find what is going wrong. My feeling is that it is something with the APIs, so I think Antonis could help in this regard.

If not, why are they different?

I don't know.

Núria

leeharland commented 10 years ago

OK, i think we assign this one to @antonisloizou for a next comment but i wonder if the names of the API calls need reviewing? Or at least a simple page on the help portal that describes to users what the differences in these two calls are. I've read the thread above and i think it will definitely need clarification for users - @ChristineChichester @danidi , could you take care of that part? thanks

danidi commented 9 years ago

This is still an issue on develop (for 1.5): The provided example http://identifiers.org/ncbigene/6772 gives 175 diseases and 354 associations. I could understand it if there were simply several different publications for a gene/disease relationship, leading to a different number of associations. But if you have a look at the example http://identifiers.org/ncbigene/51193 you will find 2 diseases ("Carcinoma, Squamous Cell" and "Secondary malignant neoplasm of lymph node"), but 3 associations for Carcinoma, and none for secondary malignant neoplasm of lymph node.

There are also some genes which return more diseases than associations, e.g. http://identifiers.org/ncbigene/23048 returns 2 and 1.

antonisloizou commented 9 years ago

So for the first instance we are talking about the difference between:

http://ops2.few.vu.nl/disease/byTarget?uri=http%3A%2F%2Fidentifiers.org%2Fncbigene%2F51193

and

http://ops2.few.vu.nl/disease/assoc/byTarget?uri=http%3A%2F%2Fidentifiers.org%2Fncbigene%2F51193

This is because a sio:SIO_000095 ?diseaseClass is required for Associations By Target but not for Diseases by target.

We can either make this optional in associations (bringing the count to 4 by adding the "Secondary malignant neoplasm of lymph node" one)

or

make the diseaseClass required for "Diseases by target" (i.e. it would only return 1 disease)
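The effect of the two options can be illustrated without touching the API. The following is only a sketch of the required-versus-optional behaviour, not the actual query code: the record layout, the publication ids and the class label are hypothetical placeholders, and the only facts taken from this thread are the two disease names and the counts.

```python
# Each record: (disease label, supporting publication id, disease class or None).
# Publication ids and the class label are hypothetical placeholders.
ASSOCIATIONS_51193 = [
    ("Carcinoma, Squamous Cell", "pmid-A", "class-X"),
    ("Carcinoma, Squamous Cell", "pmid-B", "class-X"),
    ("Carcinoma, Squamous Cell", "pmid-C", "class-X"),
    ("Secondary malignant neoplasm of lymph node", "pmid-D", None),
]


def select(records, require_class):
    """Mimic a required pattern (rows without a class are dropped) or an optional one."""
    if require_class:
        return [r for r in records if r[2] is not None]
    return list(records)


def summarize(records):
    """Return (number of distinct diseases, number of associations)."""
    return len({r[0] for r in records}), len(records)


print(summarize(select(ASSOCIATIONS_51193, require_class=True)))   # (1, 3)
print(summarize(select(ASSOCIATIONS_51193, require_class=False)))  # (2, 4)
```

Applying the same setting to both calls is what keeps their results consistent; mixing a required pattern in one call with an optional pattern in the other reproduces the mismatch reported above.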

BTW - it would make it quicker for me if you could paste the API requests rather than the inputs :)
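Turning an input URI into the corresponding pair of requests is mechanical, so it can be scripted. A small sketch, assuming the development host and the two paths used in the URLs above; the function name `api_requests` is invented for illustration.

```python
from urllib.parse import quote

BASE = "http://ops2.few.vu.nl"  # development host used in the URLs above


def api_requests(gene_uri, base=BASE):
    """Return the diseases and associations request URLs for one gene URI."""
    # safe="" makes quote() percent-encode ':' and '/' as well.
    encoded = quote(gene_uri, safe="")
    return {
        "diseases": f"{base}/disease/byTarget?uri={encoded}",
        "associations": f"{base}/disease/assoc/byTarget?uri={encoded}",
    }


for name, url in api_requests("http://identifiers.org/ncbigene/51193").items():
    print(name, url)
```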

antonisloizou commented 9 years ago

Changed the disease class to optional for both methods for the time being, to be discussed in the developers TC. @NuriaQueralt

NuriaQueralt commented 9 years ago

Hi all,

This is still an issue on develop (for 1.5):

QUESTION 1: The provided example http://identifiers.org/ncbigene/6772 gives 175 diseases and 354 associations. I could understand it if there were simply several different publications for a gene/disease relationship, leading to a different number of associations.

First of all, I would like to clarify that in our data model what defines each different association between a gene and a disease is the unique combination of all the evidence that we track, that is: the association type between the gene and the disease, the source from which we imported the gene-disease association, and the PubMed publication that supports it. Therefore, a gene-disease pair can have different associations based on the different association types stated, sources, or publications. For instance, a gene-disease pair may have two different associations that are stated by the same publication and gathered from the same source, but that describe two different association types.
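To make the counting rule concrete, here is a toy sketch of how the number of diseases and the number of associations diverge under this model. All identifiers in the rows are hypothetical placeholders, and the tuple layout is an assumption made for illustration, not the DisGeNET schema.

```python
from collections import namedtuple

# One row per piece of tracked evidence; field names are illustrative.
Evidence = namedtuple("Evidence", "gene disease assoc_type source pmid")

rows = [
    Evidence("gene-x", "disease-1", "type-a", "source-1", "pmid-1"),
    # Same publication and source, but a different association type.
    Evidence("gene-x", "disease-1", "type-b", "source-1", "pmid-1"),
    # Exact duplicate of the first row: does not create a new association.
    Evidence("gene-x", "disease-1", "type-a", "source-1", "pmid-1"),
    Evidence("gene-x", "disease-2", "type-a", "source-2", "pmid-2"),
]


def counts_for_gene(rows, gene):
    """Return (distinct diseases, distinct associations) for one gene."""
    mine = [r for r in rows if r.gene == gene]
    diseases = {r.disease for r in mine}
    # An association is the unique combination of all tracked fields.
    associations = set(mine)
    return len(diseases), len(associations)


print(counts_for_gene(rows, "gene-x"))  # (2, 3)
```

Under this definition the association count is expected to be at least as large as the disease count, so a gene returning more diseases than associations points to a problem in the query rather than in the data model.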

For the gene http://identifiers.org/ncbigene/6772, DisGeNET has collected 175 associated diseases with 401 different associations, not 354. This inconsistency in the total number of associations should be investigated. Bear in mind that different versions of DisGeNET are running here: the open version on the Web, i.e. without OMIM associations integrated (where you can check that there are 392 associations), and the open version with OMIM in the Open PHACTS platform (where there are 9 more associations coming from OMIM). For that reason, if you check results in our Web platform (http://www.disgenet.org/) or our public SPARQL endpoint (http://rdf.disgenet.org/sparql/), some pairs may have a lower total number of associations (see it for this gene in our browser http://www.disgenet.org/web/DisGeNET/v2.1/browser/tab8a?10&pview=default&pf=data/sources::ALL::de&pf=data/genes::6772::de, or in our SPARQL endpoint http://goo.gl/lhjAdu)

QUESTION 2: But if you have a look at the example http://identifiers.org/ncbigene/51193 you will find 2 diseases ("Carcinoma, Squamous Cell" and "Secondary malignant neoplasm of lymph node"), but 3 associations for Carcinoma, and none for secondary malignant neoplasm of lymph node.

From the api call results that Antonis passed for target 'ncbigene:51193':

http://ops2.few.vu.nl/disease/byTarget?uri=http%3A%2F%2Fidentifiers.org%2Fncbigene%2F51193

and

http://ops2.few.vu.nl/disease/assoc/byTarget?uri=http%3A%2F%2Fidentifiers.org%2Fncbigene%2F51193

What I see is that this target is associated with two disorders: "Carcinoma, Squamous Cell" and "Secondary malignant neoplasm of lymph node". There are three associations for 'Carcinoma' as they are supported by three different publications, and this disorder has a MeSH disease class associated. And there is one association for 'secondary malignant neoplasm of lymph node' as it is supported by one publication, and this disorder has NO MeSH disease class associated. All these results are correct, and they agree with what we have in the DisGeNET platform (open and private agree in this case: http://www.disgenet.org/web/DisGeNET/v2.1/browser/tab8a?3&pview=default&pf=data/sources::ALL::de&pf=data/genes::51193::de).

QUESTION 3: There are also some genes which return more diseases than associations, e.g. http://identifiers.org/ncbigene/23048 returns 2 and 1.

In DisGeNET, there are two different diseases associated with 'ncbigene:23048' : C1290884 and C0040336

These should be the results for the associations (open and private agree in this case):

http://www.disgenet.org/web/DisGeNET/v2.1/browser/tab8a?4&pview=default&pf=data/sources::ALL::de&pf=data/genes::23048::de

cheers, Núria



danidi commented 9 years ago

Hi Nuria, thank you for your detailed explanation. Antonis updated the API calls in the meantime, so the results you saw are a bit different from the ones I reported. The association count for http://identifiers.org/ncbigene/6772 is now 398 (see http://ops2.few.vu.nl/disease/assoc/byTarget/count?uri=http%3A%2F%2Fidentifiers.org%2Fncbigene%2F6772). So we are still missing 3 from DisGeNET it seems.

NuriaQueralt commented 9 years ago

Let me know what you need from me to trace this error back.

cheers, Núria



nicklynch commented 8 years ago

@NuriaQueralt Is this still an issue in 1.5?

NuriaQueralt commented 8 years ago

@nicklynch no it is not. this issue was solved. the api calls now give the correct values.

NuriaQueralt commented 8 years ago

No, that issue was solved.

cheers, n
