pnrobinson closed 4 years ago
Something has gone horribly wrong with our ClinVar ingest; I'm taking a closer look. The view on production is what is intended:
https://monarchinitiative.org/variant/ClinVarVariant:36075 https://monarchinitiative.org/variant/ClinVarVariant:44739
Tracked this down to a one-line logic error that I made: https://github.com/monarch-initiative/dipper/pull/804/files#diff-0ec4ec6a2e21e9b4b0f99bb71d3f312aR316
Should have this fixed in the next data load
https://beta.monarchinitiative.org/variant/ClinVarVariant:44739 is fixed. https://beta.monarchinitiative.org/variant/ClinVarVariant:36075 shows the 8 diseases from this RCV: https://www.ncbi.nlm.nih.gov/clinvar/RCV000768217/
@pnrobinson can you review?
Unfortunately, the information on ClinVar is wrong. I suspect that this submitter just listed all of the diseases that are associated with FBN1 and not the disease that was found in this observation: https://www.ncbi.nlm.nih.gov/clinvar/RCV000768217.1/ There is no way to know 100% what the lab is reporting, but I am pretty certain that this is a mistake. Among other things, our group in Berlin was the first to describe the lipodystrophy syndrome and it is caused by a small subset of mutations in the penultimate exon -- and this is not that exon. Therefore, we need to expect that data from ClinVar that is coming from commercial labs is sloppy. I think that the best we can do without detailed curation is either
We could also filter based on review status, for example only keep RCVs with >2 stars, https://www.ncbi.nlm.nih.gov/clinvar/docs/review_status/
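A minimal sketch of what such a filter could look like. The star values follow the review status page linked above; the record layout (dicts with `rcv` and `review_status` keys) and the function names are assumptions for illustration, not how dipper stores this data. Note that ">2 stars" read strictly would mean `min_stars=3`.

```python
# Star ratings as listed on ClinVar's review status page.
# The record layout used here is hypothetical.
REVIEW_STATUS_STARS = {
    "practice guideline": 4,
    "reviewed by expert panel": 3,
    "criteria provided, multiple submitters, no conflicts": 2,
    "criteria provided, conflicting interpretations": 1,
    "criteria provided, single submitter": 1,
    "no assertion criteria provided": 0,
    "no assertion provided": 0,
}


def stars(review_status):
    """Return the star rating for a review status string (0 if unknown)."""
    return REVIEW_STATUS_STARS.get(review_status.strip().lower(), 0)


def filter_by_stars(records, min_stars=2):
    """Keep only records whose review status has at least min_stars stars."""
    return [r for r in records if stars(r["review_status"]) >= min_stars]
```

Treating unknown status strings as 0 stars is a conservative choice: anything we cannot classify gets filtered out rather than shown.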
From #204 - where @pnrobinson contributed:
From @kshefchek I made a small update so that we are not calling associations where we only have evidence from ClinVar "causal" (solr only change). However, the [less than optimal] variant and G2P data is still on production and beta. You can see the current state of things on production: https://monarchinitiative.org/disease/MONDO:0007113
Notice that the MECP2 gene is on the genes (other) tab, and the MECP2 variant is still on the variant tab. Getting this right will take some updates when we ingest ClinVar to triples. My understanding is that the plan is to split up high quality vs low quality variant to disease associations by using ClinVar's scoring system, somewhat similar to Exomiser.
I can make some more quick changes at the indexing level, such as removing variant to disease associations from ClinVar. These changes are relatively simple because ClinVar is the only source where we use the pathogenic and likely pathogenic relationships. Please let me know if I should do this as well.
from @pnrobinson
It would be good to do some due diligence about where the errors are coming from and then to report this back. I have seen two types of error, and I do not think that they necessarily will relate to the ClinVar score in the way we need. @kshefchek, is it easy for you to extract a table from the ClinVar data that will show us all of the data for the genes that are not OMIM genes? This should be the sort of thing that one can do on the UI anyway, so maybe we should set this up if we do not have it?
After we have this data, it should be easier to make a decision as to what to do
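A rough sketch of the kind of table extraction requested above, assuming the ClinVar associations are already available as rows with a `gene` column and that "OMIM genes" means a set of gene symbols with an OMIM gene-to-disease entry. The column names and both functions are hypothetical; the real ingest works on triples, not flat rows.

```python
import csv
import io


def non_omim_rows(clinvar_rows, omim_genes):
    """Return the rows whose gene symbol is not in the OMIM gene set."""
    omim = {g.upper() for g in omim_genes}
    return [row for row in clinvar_rows if row["gene"].upper() not in omim]


def to_tsv(rows, columns=("gene", "variant", "disease", "review_status")):
    """Render rows as a tab-separated table with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=columns,
        delimiter="\t",
        extrasaction="ignore",  # drop keys that are not table columns
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Something like `print(to_tsv(non_omim_rows(rows, omim_genes)))` would then give a table that can be pasted into a shared document for review.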
from @kshefchek Here is a list of variant to disease associations that I think have a higher likelihood of being errors. I'm hoping these all have a low review status rating by ClinVar. I'll take a look myself but let us know what you find!
from @pnrobinson:
Thanks, Kent! Can I suggest that we do a deep dive on individual genes and try to find out how to improve the pipeline? There are lots of issues but hopefully they have related causes and if we start to solve them for the first ten genes, maybe things will start to look better overall. I have started a document here. See also below for a quick impression.
Part of the problem seems to be from incorrect MONDO associations, and part of the problem seems to be from incorrect ClinVar mapping.
FLCN - Problems
1. Disease associations
Monarch has this: https://monarchinitiative.org/gene/HGNC%3A27310#disease-causal
Monarch | OMIM |
---|---|
Birt-Hogg-Dube syndrome | Birt-Hogg-Dube syndrome |
familial spontaneous pneumothorax | Pneumothorax, primary spontaneous |
familial colorectal cancer | Colorectal cancer, somatic |
nonpapillary renal cell carcinoma | Renal carcinoma, chromophobe, somatic |
Analysis
Birt-Hogg-Dube syndrome => Matches!
familial spontaneous pneumothorax => We chose a name that differs from OMIM, but seems to be just as correct. Nonetheless, we are guilty of xkcd:927. Why is there a need to introduce a new name?
familial colorectal cancer => Here, we have made a modelling mistake. There is a big difference between Colorectal cancer, somatic and familial colorectal cancer (the former means that a mutation occurs in colorectal tissue, and the latter implies that a mutation is transmitted in the germline). The problem appears to be that this OMIM entry includes genes that are of both categories (https://omim.org/entry/114500). As far as I can see, FLCN has never been implicated in the familial form, and the OMIM page states: Nahorski et al. (2010) did not find any germline mutations in the FLCN gene among 50 patients with familial nonsyndromic colorectal cancer.
nonpapillary renal cell carcinoma => I believe that Renal carcinoma, chromophobe, somatic is a particular subtype of nonpapillary renal cell carcinoma, but these are not synonymous. See https://www.ncbi.nlm.nih.gov/medgen/463622
Mutation
ClinVarVariant:253251 NM_144997.5(FLCN):c.1429C>T (p.Arg477Ter) is noted to be related to Potocki-Lupski syndrome (PTLS) in Kent’s analysis.
Nonetheless, if I search for this mutation on the old website I get: NM_144997.5(FLCN):c.1429C>T (p.Arg477Ter) has 11935 matches
If I search on the new website, I get NM_144997.5(FLCN):c.1429C>T (p.Arg477Ter) has 297106 matches
The top hit of the search is our mutation. It is listed as being associated with all four of the Monarch diseases listed above and additionally with PTLS.
However, I cannot find the PTLS association in ClinVar https://www.ncbi.nlm.nih.gov/clinvar/variation/253251/
I am going to guess that the reason we are picking up this association is that FLCN is one of roughly 17 genes located in the CNV that causes PTLS: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2833368/
Nonetheless, this is incorrect as “the dosage-sensitive gene RAI1 is likely to be responsible for the predominant clinical features of the PTLS phenotype associated with 17p11.2 duplications.”
from @kshefchek: For the issues that are coming from ClinVar I'll look at the RCVs and document the review status in hopes that this can be a good filter. It's sometimes challenging to find these and we don't cache the RCV identifier (we should update our indexing to do this). For example, the FLCN to PTLS association is from https://www.ncbi.nlm.nih.gov/clinvar/RCV000762980.1/
from @pnrobinson (partial entry):
I see. This ClinVar entry is a prime example of the major mapping problem I have been seeing -- it lists multiple conditions for the same mutation. AFAIK a single clinvar entry is supposed to derive from a single observation, i.e., a single patient. This is why one can see multiple ClinVar entries that all support the pathogenicity of a single mutation. In this case, the single entry has five diseases:
Can we start parsing these into action items? The issues described about this page https://monarchinitiative.org/gene/HGNC%3A27310#disease-causal are possibly Mondo issues. I'm still interested in whether ClinVar has incorrect variant to disease associations with a high review status.
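One possible action item is an automatic flag for RCVs that list several conditions for one variant, like the entries discussed above. A minimal sketch, assuming each association is a dict with `rcv` and `disease` keys (hypothetical names; as noted earlier, the RCV identifier is not currently cached in our index):

```python
from collections import defaultdict


def conditions_per_rcv(associations):
    """Group the distinct disease labels by RCV accession."""
    grouped = defaultdict(set)
    for assoc in associations:
        grouped[assoc["rcv"]].add(assoc["disease"])
    return grouped


def multi_condition_rcvs(associations, max_conditions=1):
    """Return RCVs that list more than max_conditions distinct diseases."""
    grouped = conditions_per_rcv(associations)
    return {
        rcv: sorted(diseases)
        for rcv, diseases in grouped.items()
        if len(diseases) > max_conditions
    }
```

The flagged RCVs could then be cross-checked against their review status to see whether multi-condition entries cluster at the low-star end.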
Yep, yep. I'm working on this right now. But we first needed to collate the large volume of messages that have been exchanged outside of the repo.
Current behaviour: there are 2 document types, causal and non-causal gene-to-disease data. Currently, for us to show gene-to-disease data from ClinVar, the association must have:
Can we start parsing these into action items?
@kshefchek I created this ticket #1021 in the Mondo repo.
See also commentary on the disease group / subclass_of associations on the Mondo repo in this ticket #685.
@kshefchek - please sanity check here.
https://beta.monarchinitiative.org/variant/ClinVarVariant:44739#disease
We should show the diseases associated with this particular variant (available from ClinVar).