monarch-initiative / monarch-ui

The previous version of the Monarch Initiative website
https://previous.monarchinitiative.org/
BSD 3-Clause "New" or "Revised" License

Data errors in clinvar submissions (SCVs) #139

Closed pnrobinson closed 4 years ago

pnrobinson commented 5 years ago

https://beta.monarchinitiative.org/variant/ClinVarVariant:44739#disease

We should show the diseases associated with this particular variant (available from ClinVar).

kshefchek commented 5 years ago

Something has gone horribly wrong with our clinvar ingest; I'm taking a closer look. The view on production is what is intended:

https://monarchinitiative.org/variant/ClinVarVariant:36075
https://monarchinitiative.org/variant/ClinVarVariant:44739

kshefchek commented 5 years ago

Tracked this down to a one-line logic error that I made: https://github.com/monarch-initiative/dipper/pull/804/files#diff-0ec4ec6a2e21e9b4b0f99bb71d3f312aR316

Should have this fixed in the next data load

kshefchek commented 5 years ago

https://beta.monarchinitiative.org/variant/ClinVarVariant:44739 is fixed.

https://beta.monarchinitiative.org/variant/ClinVarVariant:36075 shows the 8 diseases from this RCV: https://www.ncbi.nlm.nih.gov/clinvar/RCV000768217/

@pnrobinson can you review?

pnrobinson commented 5 years ago

Unfortunately, the information on ClinVar is wrong. I suspect that the submitter just listed all of the diseases that are associated with FBN1 and not the disease that was found in this observation: https://www.ncbi.nlm.nih.gov/clinvar/RCV000768217.1/ There is no way to know 100% what the lab is reporting, but I am pretty certain that this is a mistake. Among other things, our group in Berlin was the first to describe the lipodystrophy syndrome and it is caused by a small subset of mutations in the penultimate exon -- and this is not that exon. Therefore, we need to expect that data from ClinVar that is coming from commercial labs is sloppy. I think that the best we can do without detailed curation is either:

  1. Take disease associations that are related to a specific pubmed (note -- some of the pubmeds refer to the ACMG criteria and not the disease)
  2. Drop the individual disease associations altogether. I would favor this and would hope that we can find a way to add them back with the new grant @mellybelly @cmungall -- this is a major problem with the way we are modelling ClinVar and I am proposing a major change -- please chime in and let's discuss this!

kshefchek commented 5 years ago

We could also filter based on review status, for example only keep RCVs with >2 stars, https://www.ncbi.nlm.nih.gov/clinvar/docs/review_status/
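A minimal sketch of such a review-status filter is shown here. The star values follow the labels listed on the ClinVar review status page linked above, but the exact label strings in any given ClinVar release should be checked before relying on them. The record layout (a dict with `rcv` and `review_status` keys) and the function names are hypothetical and are not taken from the Monarch ingest code.

```python
# Star ratings keyed by ClinVar review status label (lower-cased).
# Assumption: labels match those on the ClinVar review status page;
# unknown labels are treated as zero stars.
REVIEW_STATUS_STARS = {
    "practice guideline": 4,
    "reviewed by expert panel": 3,
    "criteria provided, multiple submitters, no conflicts": 2,
    "criteria provided, conflicting interpretations": 1,
    "criteria provided, single submitter": 1,
    "no assertion criteria provided": 0,
    "no assertion provided": 0,
}


def stars_for(review_status):
    """Return the star rating for a review status label (0 if unrecognized)."""
    return REVIEW_STATUS_STARS.get(review_status.strip().lower(), 0)


def filter_by_stars(records, min_stars=2):
    """Keep only records whose review status has at least `min_stars` stars.

    `records` is a hypothetical list of dicts with a "review_status" key.
    """
    return [r for r in records if stars_for(r["review_status"]) >= min_stars]
```

Read literally, ">2 stars" corresponds to `min_stars=3` (expert panel or practice guideline only). The threshold is left as a parameter because the thread does not settle on a value.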

monicacecilia commented 5 years ago

From #204 - where @pnrobinson contributed:

monicacecilia commented 5 years ago

From @kshefchek: I made a small update so that we are not calling associations where we only have evidence from ClinVar "causal" (solr only change). However, the [less than optimal] variant and G2P data is still on production and beta. You can see the current state of things on production: https://monarchinitiative.org/disease/MONDO:0007113

Notice that the MECP2 gene is on the genes (other) tab, and the MECP2 variant is still on the variant tab. Getting this right will take some updates when we ingest ClinVar to triples. My understanding is that the plan is to split up high quality vs low quality variant to disease associations by using ClinVar's scoring system, somewhat similar to Exomiser.

I can make some more quick changes at the indexing level, such as removing variant to disease associations from ClinVar. These changes are relatively simple because ClinVar is the only source where we use the pathogenic and likely pathogenic relationships. Please let me know if I should do this as well.

monicacecilia commented 5 years ago

from @pnrobinson:

It would be good to do some due diligence about where the errors are coming from and then to report this back. I have seen two types of error, and I do not think that they necessarily will relate to the ClinVar score in the way we need. @kshefchek, is it easy for you to extract a table from the ClinVar data that will show us all of the data for the genes that are not OMIM genes? This should be the sort of thing that one can do on the UI anyway, so maybe we should set this up if we do not have it?

After we have this data, it should be easier to make a decision as to what to do
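One way such a table could be extracted is sketched here, assuming the ClinVar associations are already available as rows with a gene column and that a set of OMIM gene symbols is at hand. The column names (`gene`, `variant`, `disease`) and both function names are hypothetical; the real ingest data will be shaped differently.

```python
import csv
import io


def non_omim_rows(rows, omim_genes):
    """Return the rows whose gene is not in the set of OMIM genes.

    `rows` is a hypothetical list of dicts with a "gene" key;
    `omim_genes` is a set of gene symbols.
    """
    return [row for row in rows if row["gene"] not in omim_genes]


def to_tsv(rows, fieldnames):
    """Serialize rows to a tab-separated string with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Calling `to_tsv(non_omim_rows(rows, omim_genes), ["gene", "variant", "disease"])` would give a table that can be reviewed by hand or loaded into a spreadsheet.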

monicacecilia commented 5 years ago

from @kshefchek: Here is a list of variant to disease associations that I think have a higher likelihood of being errors. I'm hoping these all have a low review status rating by ClinVar. I'll take a look myself but let us know what you find!

clinvar-variants.txt

monicacecilia commented 5 years ago

from @pnrobinson:

Thanks, Kent! Can I suggest that we do a deep dive on individual genes and try to find out how to improve the pipeline? There are lots of issues but hopefully they have related causes and if we start to solve them for the first ten genes, maybe things will start to look better overall. I have started a document here. See also below for a quick impression.

Part of the problem seems to be from incorrect MONDO associations, and part of the problem seems to be from incorrect ClinVar mapping.

FLCN - Problems

1. Disease associations

Monarch has this: https://monarchinitiative.org/gene/HGNC%3A27310#disease-causal

| Monarch | OMIM |
| --- | --- |
| Birt-Hogg-Dube syndrome | Birt-Hogg-Dube syndrome |
| familial spontaneous pneumothorax | Pneumothorax, primary spontaneous |
| familial colorectal cancer | Colorectal cancer, somatic |
| nonpapillary renal cell carcinoma | Renal carcinoma, chromophobe, somatic |

Analysis

  1. Birt-Hogg-Dube syndrome => Matches!

  2. familial spontaneous pneumothorax => We chose a name that differs from OMIM, but seems to be just as correct. Nonetheless, we are guilty of xkcd:927. Why is there a need to introduce a new name?

  3. familial colorectal cancer => Here, we have made a modelling mistake. There is a big difference between Colorectal cancer, somatic and familial colorectal cancer (the former means that a mutation occurs in colorectal tissue, and the latter implies that a mutation is transmitted in the germline). The problem appears to be that this OMIM entry includes genes that are of both categories (https://omim.org/entry/114500). As far as I can see, FLCN has never been implicated in the familial form, and the OMIM page states: Nahorski et al. (2010) did not find any germline mutations in the FLCN gene among 50 patients with familial nonsyndromic colorectal cancer.

  4. nonpapillary renal cell carcinoma. I believe that Renal carcinoma, chromophobe, somatic is a particular subtype of nonpapillary renal cell carcinoma, but these are not synonymous. See https://www.ncbi.nlm.nih.gov/medgen/463622

Mutation

ClinVarVariant:253251 NM_144997.5(FLCN):c.1429C>T (p.Arg477Ter) is noted to be related to Potocki-Lupski syndrome (PTLS) in Kent’s analysis.

Nonetheless, if I search for this mutation on the old website I get: NM_144997.5(FLCN):c.1429C>T (p.Arg477Ter) has 11935 matches

If I search on the new website, I get NM_144997.5(FLCN):c.1429C>T (p.Arg477Ter) has 297106 matches

The top hit of the search is our mutation. It is listed as being associated with all four of the Monarch diseases listed above and additionally with PTLS.

However, I cannot find the PTLS association in ClinVar https://www.ncbi.nlm.nih.gov/clinvar/variation/253251/

I am going to guess that the reason we are picking up this association is that FLCN is one of roughly 17 genes located in the CNV that causes PTLS: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2833368/

Nonetheless, this is incorrect as “the dosage-sensitive gene RAI1 is likely to be responsible for the predominant clinical features of the PTLS phenotype associated with 17p11.2 duplications.”

monicacecilia commented 5 years ago

from @kshefchek: For the issues that are coming from ClinVar I'll look at the RCVs and document the review status in hopes that this can be a good filter. It's sometimes challenging to find these and we don't cache the RCV identifier (we should update our indexing to do this). For example, the FLCN to PTLS association is from https://www.ncbi.nlm.nih.gov/clinvar/RCV000762980.1/
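If the RCV identifier were kept alongside each association, records that list several conditions at once (like the RCV with 8 diseases mentioned earlier in this thread) could be flagged automatically. This is a rough sketch under that assumption. The input format (pairs of RCV identifier and condition name) and the function names are hypothetical, not the current indexing schema.

```python
from collections import defaultdict


def conditions_per_rcv(associations):
    """Group condition names by RCV identifier.

    `associations` is a hypothetical iterable of (rcv_id, condition) pairs.
    """
    grouped = defaultdict(set)
    for rcv_id, condition in associations:
        grouped[rcv_id].add(condition)
    return dict(grouped)


def multi_condition_rcvs(associations, max_conditions=1):
    """Return RCVs that list more than `max_conditions` distinct conditions."""
    grouped = conditions_per_rcv(associations)
    return {
        rcv_id: sorted(conditions)
        for rcv_id, conditions in grouped.items()
        if len(conditions) > max_conditions
    }
```

The flagged RCVs could then be cross-checked against their review status to see whether multi-condition records cluster at the low end of the ClinVar rating.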

monicacecilia commented 5 years ago

from @pnrobinson (partial entry):

I see. This ClinVar entry is a prime example of the major mapping problem I have been seeing -- it lists multiple conditions for the same mutation. AFAIK a single clinvar entry is supposed to derive from a single observation, i.e., a single patient. This is why one can see multiple ClinVar entries that all support the pathogenicity of a single mutation. In this case, the single entry has five diseases:

kshefchek commented 5 years ago

Can we start parsing these into action items? The issues described about this page https://monarchinitiative.org/gene/HGNC%3A27310#disease-causal are possibly Mondo issues. I'm still interested in whether ClinVar has incorrect variant to disease associations with a high review status.

monicacecilia commented 5 years ago

Yep, yep. I'm working on this right now. But we first needed to collate the large volume of messages that have been exchanged outside of the repo.

monicacecilia commented 4 years ago

Current behaviour: there are 2 document types, causal and non-causal gene-to-disease data. Currently, for us to show gene-to-disease data from ClinVar, the association must have:

monicacecilia commented 4 years ago

Can we start parsing these into action items?

@kshefchek I created this ticket #1021 in the Mondo repo.

monicacecilia commented 4 years ago

See also commentary on the disease group / subclass_of associations on the Mondo repo in this ticket #685.

monicacecilia commented 4 years ago

Summary:

@kshefchek - please sanity check here.