openphacts / GLOBAL

Global project issues [private for now. owner lee harland]

Targets for Disease: Different results between _pageSize=all &_pageSize=Default #145

Open ChristineChichester opened 10 years ago

ChristineChichester commented 10 years ago

Setting the _pageSize parameter to "all" changes the results of Targets for Disease. Without the parameter, UniProt data is also returned.

With Default (no parameter used):

items: [
  {
    _about: "http://identifiers.org/ncbigene/1000",
    inDataset: "http://rdf.imim.es/disgenet-void.ttl#gene",
    forDisease: {
      _about: "http://linkedlifedata.com/resource/umls/id/C0002395",
      inDataset: "http://rdf.imim.es/disgenet-void.ttl#disease",
      name: "Alzheimer Disease"
    },
    seeAlso: [
      {
        _about: "http://purl.uniprot.org/uniprot/P19022",
        inDataset: "http://purl.uniprot.org"
      },
      {
        _about: "http://www.conceptwiki.org/concept/97eecd42-5ddd-437d-ab43-80dc7c5b2e50",
        inDataset: "http://www.conceptwiki.org",
        prefLabel_en: "Cadherin-2 (Homo sapiens)",
        prefLabel: "Cadherin-2 (Homo sapiens)"
      }
    ],
    closeMatch: {
      _about: "http://purl.uniprot.org/uniprot/P19022",
      inDataset: "http://purl.uniprot.org"
    },
    relatedMatch: [
      "http://purl.uniprot.org/uniprot/C9JMH2",
      "http://purl.uniprot.org/uniprot/A8MWK3",
      "http://purl.uniprot.org/uniprot/C9J8J8",
      {
        _about: "http://purl.uniprot.org/uniprot/P19022",
        inDataset: "http://purl.uniprot.org"
      },
      "http://purl.uniprot.org/uniprot/C9J126"
    ]
  },

With _pageSize=all:

[
  {
    _about: "http://identifiers.org/ncbigene/1000",
    inDataset: "http://rdf.imim.es/disgenet-void.ttl#gene",
    forDisease: {
      _about: "http://linkedlifedata.com/resource/umls/id/C0002395",
      inDataset: "http://rdf.imim.es/disgenet-void.ttl#disease",
      name: "Alzheimer Disease"
    }
  },
  {
    _about: "http://identifiers.org/ncbigene/100188754",
    inDataset: "http://rdf.imim.es/disgenet-void.ttl#gene",
    forDisease: {
      _about: "http://linkedlifedata.com/resource/umls/id/C0002395",
      inDataset: "http://rdf.imim.es/disgenet-void.ttl#disease",
      name: "Alzheimer Disease"
    }
  },
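
For anyone wanting to reproduce the difference on other diseases, here is a minimal Python sketch that reports which mapping fields an item has in the default response but lacks in the _pageSize=all response. It assumes both responses have already been fetched and parsed into lists of dicts. The function name `missing_mapping_keys` and the trimmed sample data are illustrative only and are not part of the Open PHACTS API. The three field names are taken from the output above.

```python
# Fields that carry the mapped (UniProt / ConceptWiki) data in the default response.
MAPPING_KEYS = ("seeAlso", "closeMatch", "relatedMatch")


def missing_mapping_keys(default_items, all_items, keys=MAPPING_KEYS):
    """Map each item's _about URI to the keys it has by default but lacks with _pageSize=all."""
    all_by_id = {item["_about"]: item for item in all_items}
    report = {}
    for item in default_items:
        other = all_by_id.get(item["_about"])
        if other is None:
            # Item is not present in the _pageSize=all response; nothing to compare.
            continue
        lost = [key for key in keys if key in item and key not in other]
        if lost:
            report[item["_about"]] = lost
    return report


# Trimmed-down versions of the two responses shown in this issue (hypothetical sample).
default_items = [
    {
        "_about": "http://identifiers.org/ncbigene/1000",
        "seeAlso": [{"_about": "http://purl.uniprot.org/uniprot/P19022"}],
        "closeMatch": {"_about": "http://purl.uniprot.org/uniprot/P19022"},
        "relatedMatch": ["http://purl.uniprot.org/uniprot/C9JMH2"],
    }
]
all_items = [
    {"_about": "http://identifiers.org/ncbigene/1000"},
    {"_about": "http://identifiers.org/ncbigene/100188754"},
]

report = missing_mapping_keys(default_items, all_items)
print(report)
```

For the sample above, all three mapping fields are reported as missing for ncbigene/1000, which matches what is seen in the real responses.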

antonisloizou commented 9 years ago

We can't use the IMS with _pageSize=all, as queries quickly become too large to process (due to the large number of mappings) and result in an HTTP 500 Internal Server Error.

The only way to fix this would be to load the ncbigene -> uniprot and cw -> uniprot linksets in the LDC, but we generally avoid loading linksets in the LDC, as we can then no longer use lenses.

So this is probably a "will not fix".

danidi commented 9 years ago

Just to clarify the issue: the pharmacology queries work up to approx. 10000 items, and also return CW identifiers. Are these not using the IMS? What is the largest amount of data that could be returned with _pageSize=all? @NuriaQueralt, what is the highest count of targets/associations we would expect for a disease?

antonisloizou commented 9 years ago

What I mean is that with the current LDA architecture the behaviour for _pageSize=all needs to be consistent across all calls.

In the past, we found that if we pass the _pageSize=all pharmacology query through the IMS, we often got queries that were too large to process. In that case, we decided to keep OCRS->chembl and CW->chembl links in the LDC.

If we want to change it so that _pageSize=all behaves differently for different kinds of calls (e.g. if the number of targets per disease turns out to be small enough to be passed through the IMS), we need to make fairly significant changes to the LDA codebase.

danidi commented 9 years ago

Ok, I wasn't aware that the pharmacology calls are the exception (and that the mappings for them are in the LDC). I thought it was the other way round.

NuriaQueralt commented 9 years ago

@danidi the highest count of genes for a disease in this version of DisGeNET (v2.1) is 5102 genes, associated with the disease 'NEOPLASM MALIGNANT' (C0006826).