monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Filter uncertain data from ClinVar #728

Closed cmungall closed 5 years ago

cmungall commented 5 years ago

@pnrobinson noticed an FBN2 to Marfan association, which is wrong.

This comes from https://www.ncbi.nlm.nih.gov/clinvar/?term=FBN2%5Bgene%5D+marfan

If we bring in things marked "uncertain", then we should make it 100% clear that they are uncertain. But why do we bring this in at all, given that we have other, more reliable sources of Mendelian associations?

cmungall commented 5 years ago

UPDATE @kshefchek found this:

https://www.ncbi.nlm.nih.gov/clinvar/variation/375300/

which says likely pathogenic

pnrobinson commented 5 years ago

Looking at this record, I would guess that this was a data entry mistake....

kshefchek commented 5 years ago

Last summer I put in a second check for gene-disease associations from ClinVar by checking this file: ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/gene_condition_source_id

However, this has since broken (or there's an error in the Cypher query). I will take a look at both.

mbrush commented 5 years ago

Note that the record is a bit strange here, though, as the RCV linked to from this Variation page doesn't explicitly reference Marfan disease (rather, the Condition is "multiple conditions"). So I'm wondering if this is some type of entry error on ClinVar's side, as Peter suggests.

pnrobinson commented 5 years ago

CCA and Marfan partially overlap. The fact that a commercial lab says "maybe it was Marfan, maybe it was CCA" is just a sign that the submitter probably did not know which diagnosis was correct and wrote something like "Suspected Marfan vs. CCA". ClinVar should not be our source of truth about gene disease associations, no matter what!!

kshefchek commented 5 years ago

see also https://github.com/monarch-initiative/dipper/issues/593

cmungall commented 5 years ago

interesting to explore further, but...

"ClinVar should not be our source of truth about gene disease associations, no matter what!!"

+1000, we need to be more discerning

mellybelly commented 5 years ago

well, this is where we need better ways of showing what is high confidence versus the cruft. We don't want people to not have access to ClinVar, we just want them to know what is robust knowledge versus one-off variant reports (especially when they are wrong ;-))

pnrobinson commented 5 years ago

Can we find one valid D2G association that is in ClinVar but not in OMIM etc.? I really do not think it is a good idea to go without some minimum quality filter, especially in cases like this.

cmungall commented 5 years ago

And fixes at the UI level don't necessarily help, as we encourage people to fetch via API/queries/dumps/etc. In theory we could ameliorate this by somehow quarantining the lower quality stuff, e.g. behind different API calls, but this increases complexity and costs resources.

justaddcoffee commented 5 years ago

Does it make sense to add and ingest a curated whitelist of provenance/reifications that we trust enough to promote an association as real? Or a blacklist of provenance/reifications that we DON'T trust alone as proof of an association? It seems like that's what we are describing here.

E.g. in this ticket, we are saying ClinVar alone shouldn't be provenance of a variant -> disease association (in this case FBN2 -> Marfan's).
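A minimal sketch of the whitelist idea, not dipper code: an association is promoted only if at least one trusted source backs it. The `TRUSTED_SOURCES` set, the `promotable` function, and the tuple layout are all hypothetical names chosen for illustration, and which sources belong on the list is an assumption.

```python
from collections import defaultdict

# Hypothetical trust list: sources whose assertion alone is enough to
# promote an association. The membership here is an assumption.
TRUSTED_SOURCES = {"omim", "orphanet"}


def promotable(associations, trusted=TRUSTED_SOURCES):
    """Return the (subject, object) pairs backed by at least one trusted source.

    `associations` is an iterable of (subject, object, source) tuples.
    """
    sources_by_pair = defaultdict(set)
    for subject, obj, source in associations:
        sources_by_pair[(subject, obj)].add(source.lower())
    # Keep a pair only if its sources intersect the trust list.
    return {pair for pair, sources in sources_by_pair.items() if sources & trusted}


records = [
    ("FBN2", "Marfan syndrome", "clinvar"),  # ClinVar alone: not promoted
    ("FBN1", "Marfan syndrome", "clinvar"),
    ("FBN1", "Marfan syndrome", "omim"),     # also backed by a trusted source
]
print(promotable(records))
```

A blacklist would be the mirror image: drop a pair when every source backing it is on the untrusted list.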

TomConlin commented 5 years ago

This is timely. I am wondering about codifying the idea of what a source is authoritative for. Early in dipper's history there must have been an eagerness to get as much of whatever was available from wherever it was found. Now I think we would be better served by extracting less ancillary data across ingests and replacing it with more distinct and comprehensive ingests of specific facts.

For example, have ingests only output taxon identifiers, while another (NCBITaxon) ingest supplies all labels, common names, descriptions, etc. This would reduce complexity and increase consistency.

justaddcoffee commented 5 years ago

Spoke with @cmungall about this, and he favors using the dipper/sources/[source].py files themselves as the whitelist I describe above, i.e. only ingesting source info that can absolutely be trusted (rather than ingesting as much as possible and filtering downstream with an additional curated whitelist).

kshefchek commented 5 years ago

To clarify a couple things on ClinVar:

We do not assert gene to disease associations from the ClinVar data, but rather variant to disease and gene to variant. To disambiguate a causal gene-variant-disease association we use the ClinVar whitelist file, ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/gene_condition_source_id, and alter the gene to variant association to either "has_affected_feature" or "has part". @mbrush, is there a better way to model this? The intention here would be to have some relationship that states that a gene is affected by a variation (or genotype), which I believe is covered by GENO:0000418.

We infer gene to disease from these two relationships in Solr, and have the ability to filter out sources, or to require some minimum source(s) before showing an association; see https://github.com/monarch-initiative/monarch-app/pull/1637.

Note that the whitelist file does not contain the errant FBN2 to Marfan association, so there is a bug in this implementation (perhaps when a variant is associated to two or more diseases).
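As a rough sketch of the whitelist check described above (not the actual dipper implementation), the gene-variant predicate can be chosen per variant while the set of conditions that justify a gene to disease inference is tracked per condition. The column names `gene_id` and `condition_id`, all identifiers in the sample, and the function names are placeholders; the mapping of whitelisted pairs to "has_affected_feature" and everything else to "has_part" is an assumption.

```python
import csv
import io

# Tiny stand-in for the whitelist file. Column names and identifiers are
# placeholders, not real ClinVar data.
SAMPLE_WHITELIST = (
    "gene_id\tcondition_id\n"
    "GENE:1\tDISEASE:A\n"
    "GENE:2\tDISEASE:B\n"
)


def load_whitelist(handle):
    """Read tab-separated rows into a set of (gene, condition) pairs."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {(row["gene_id"], row["condition_id"]) for row in reader}


def supported_conditions(gene_id, condition_ids, whitelist):
    """Conditions of a variant for which the gene is whitelisted as causal."""
    return [c for c in condition_ids if (gene_id, c) in whitelist]


def gene_predicate(gene_id, condition_ids, whitelist):
    """Assumed rule: causal predicate only if some pair is whitelisted."""
    if supported_conditions(gene_id, condition_ids, whitelist):
        return "has_affected_feature"
    return "has_part"


whitelist = load_whitelist(io.StringIO(SAMPLE_WHITELIST))
variant_conditions = ["DISEASE:A", "DISEASE:B"]  # one variant, two conditions
print(gene_predicate("GENE:1", variant_conditions, whitelist))
print(supported_conditions("GENE:1", variant_conditions, whitelist))
```

Keeping the per-condition list separate from the predicate is one way to guard against the multi-disease case: the variant keeps its causal link to the gene, but only the whitelisted condition would be carried over into a gene to disease inference.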

justaddcoffee commented 5 years ago

thanks @kshefchek for clarifying - ping me if I can help with the bugfix