Closed d0choa closed 4 years ago
Waiting for confirmation from @andrewhercules and @MichaelaEBI on what the table should look like
There was a preliminary discussion around combining the Common diseases and Rare diseases data tables. However, I think that given a recent Slack conversation, the two tables should remain distinct - see below:
I would also recommend that we leave the Common diseases data table as it currently is because there are few cases of n/a or unknown.
However, for the Rare diseases data table, I propose that we create two different tables depending on the source of the data:
Table 1: EVA, UniProt
Table 2: Gene2Phenotype, UniProt literature, Genomics England
Based on a cursory glance of the evidence, we do not get the mutation, mutation consequence, or clinical significance values for an evidence string from Gene2Phenotype, UniProt literature, or Genomics England. This leads to tables full of N/A and Curated evidence values.
For example, https://www.targetvalidation.org/evidence/ENSG00000213614/Orphanet_845?view=sec:genetic_association and https://www.targetvalidation.org/evidence/ENSG00000134460/EFO_0000540?view=sec:genetic_association.
@iandunham and @d0choa, would this make sense from a scientific perspective? Or would it be preferable that the data remain in a single table?
Ian, in a recent FE meeting, you had mentioned that part of the reason behind N/A is that we may get that information from a data provider. Would this be the case for the three data providers that I have identified as suitable for a second, curated evidence Rare diseases data table (Gene2Phenotype, UniProt literature, Genomics England)?
@andrewhercules @d0choa @iandunham, I'd vote for the data to remain in a single table, for several reasons (in no particular order):
These two data sources are curated by clinicians who don't tend to work with rsIDs. But we could map the HGVS notation to rsID and display the rsID in the table and the functional consequence, dropping the "curated evidence" placeholder.
From one of the links included by @andrewhercules, we can see that G2P provides one piece of evidence for HEXA in Tay-Sachs disease.
From the G2P link, we can go to Decipher and get more info on the G2P variants:
The two variants above do not have a dbSNP notation (e.g. rs123) but rather an HGVS notation, which is popular among clinicians and diagnostic labs. So we have p.Arg137Ter, which translates into: stop codon (Ter) at amino acid position 137, where the reference amino acid is Arg. The variant (or mutation) truncates the protein and is likely pathogenic.
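To make that reading of the notation concrete, here is a minimal sketch that splits a simple protein-level HGVS substitution into its parts. The function name `parse_protein_hgvs` and the returned keys are made up for illustration, and the pattern only covers three-letter substitutions such as p.Arg137Ter, not the full HGVS grammar.

```python
import re

# Matches simple protein substitutions written with three-letter codes,
# e.g. "p.Arg137Ter". "*" is accepted as an alternative stop symbol.
# This is NOT a full HGVS parser (no frameshifts, deletions, etc.).
_PROTEIN_SUBSTITUTION = re.compile(r"^p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2}|\*)$")


def parse_protein_hgvs(notation):
    """Split a notation like 'p.Arg137Ter' into reference, position and alternate."""
    match = _PROTEIN_SUBSTITUTION.match(notation)
    if match is None:
        raise ValueError(f"Unsupported protein HGVS notation: {notation!r}")
    ref, position, alt = match.groups()
    return {
        "ref": ref,
        "position": int(position),
        "alt": alt,
        # A stop gain replaces a non-stop residue with a stop codon.
        "is_stop_gain": alt in ("Ter", "*") and ref != "Ter",
    }


if __name__ == "__main__":
    print(parse_protein_hgvs("p.Arg137Ter"))
```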
Is there an rsID for p.Arg137Ter? Yes, there is:
This means this variant (a mutation in this case) should have an rsID (rs121907962) and a functional consequence (stop gain). Decipher gives a clinical significance as well, which could be pulled in by the Platform (clinical significance is available from ClinVar, but also from G2P and Decipher, and probably GEL too).
If the HGVS notation is provided in the G2P JSON file, we can use the Ensembl VEP REST API endpoint to find the rsID and the functional consequence. Note that G2P coordinates are for GRCh37. This would mean that we would no longer have N/A and "curated evidence" in the table, but rs121907962 and stop gain instead.
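A rough sketch of what that lookup could look like, assuming the G2P file gives us an HGVS string that VEP accepts (i.e. prefixed with a transcript or protein identifier). It uses the GRCh37 Ensembl REST server and the `vep/:species/hgvs/:hgvs_notation` endpoint, and reads the `most_severe_consequence` and `colocated_variants` fields of the JSON response. The helper names are hypothetical and the identifier in the usage example is a placeholder, not a real HEXA transcript.

```python
import json
import urllib.parse
import urllib.request

# GRCh37 mirror of the Ensembl REST API, since G2P coordinates are GRCh37.
GRCH37_SERVER = "https://grch37.rest.ensembl.org"


def vep_hgvs_url(hgvs, server=GRCH37_SERVER, species="human"):
    """Build the VEP HGVS endpoint URL for one HGVS notation."""
    encoded = urllib.parse.quote(hgvs, safe=":")
    return f"{server}/vep/{species}/hgvs/{encoded}?content-type=application/json"


def summarise_vep(records):
    """Reduce a VEP JSON response (a list of records) to rsIDs and consequence."""
    rows = []
    for record in records:
        # Co-located variants can include non-dbSNP identifiers; keep rsIDs only.
        rsids = [
            variant["id"]
            for variant in record.get("colocated_variants", [])
            if variant.get("id", "").startswith("rs")
        ]
        rows.append({
            "input": record.get("input"),
            "rsids": rsids,
            "consequence": record.get("most_severe_consequence"),
        })
    return rows


def lookup_hgvs(hgvs):
    """Query VEP for one HGVS notation (requires network access)."""
    with urllib.request.urlopen(vep_hgvs_url(hgvs), timeout=30) as response:
        return summarise_vep(json.load(response))


if __name__ == "__main__":
    # Placeholder notation: substitute the identifier supplied in the G2P JSON.
    print(lookup_hgvs("<transcript_or_protein_id>:p.Arg137Ter"))
```

Rows where `rsids` comes back empty would still need the "curated evidence" placeholder, so this would reduce rather than eliminate those values.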
I see the string 'Curated evidence' in our tables as a placeholder for when we don't have information on the functional consequence of the variant/mutation.
Data Download: If we were to split the rare diseases table into two, users interested in that data would have to download two tables rather than one.
Consistency: If we split the rare diseases table in two, will we split the Common diseases table as well? How about other tables in the evidence page, e.g. Somatic mutations? These could also have different data sources and N/A in some rows, e.g.
https://www.targetvalidation.org/evidence/ENSG00000139618/EFO_0000305?view=sec:somatic_mutation
Evidence source column but we can't tell if it's UniProt literature. I think that the reason that gene2phenotype exists is that the variant information cannot be viewed or downloaded outside of the terms of access of Decipher. We can't display the variants in the platform, and that includes reverse engineering them by scraping data from Decipher. What we could do, if we don't already, is to have a link back to the page.
'Curated evidence' means that the evidence we are displaying is a summary generated by a curator, which may have background supporting variants but is aggregated over several variants. In the context of the functional consequence column of the table, it means that the result is an aggregate over possibly multiple variants, so there isn't a functional consequence at the variant level. We could try to get a summary consequence like we do for Cancer Gene Census, but the data access is complex here.
So overall curated evidence means that there is either a person looking at the data or there is a pipeline aggregating the data to give a summary. In a sense this is higher quality than just observing a variant.
For Genomics England panels, they have a crowd-sourced summary of which genes should be looked at for a particular disease, without itemising variants, so again it's a curated summary of many views across clinicians.
It will contain: