Closed DSuveges closed 3 years ago
There is no action in our side, EVA need to add this to their pipeline. I'll make sure that Kirill has a ticket for this.
Better said, the action on our side is to adapt the JSON schema so that it accepts all the clinical significance values used by ClinVar. The ones currently accepted are below:
"clinical_significance": {
"type": "string",
"enum": [
"Pathogenic",
"Likely pathogenic",
"protective",
"association",
"risk_factor",
"Affects",
"drug response"
]
}
Extracted from the JSON schema.
This table in the ClinVar documentation suggests that the list above may be incomplete:
Kirill has agreed to report all variants (work can be tracked on EBIvariation/eva-opentargets#122 and he will provide a list of the possible clinical significance values soon.
@DSuveges @AsierGonzalez (also copying @tcezard @d0choa @iandunham)
Here is the distribution of all possible clinical significance levels in ClinVar:
Under the current criteria, only 188,518 out of 1,114,689 (17%) records contain clinical significance levels which are processed by our pipeline. The ones which are not processed fall into two categories:
,
or /
.To start processing the types which are currently not processed, we need to do several things:
risk_factor
/ risk factor
and affects
/Affects
. To resolve this, I suggest specifying all values in the JSON schema in lowercase and using spaces, e.g.: risk factor
, affects
, likely pathogenic
, etc. Where ClinVar spells them differently, we'll handle the conversion in our pipeline.likely pathogenic
is mapped to http://identifiers.org/cttv.activity/predicted_damaging. We need to have such mappings for all clinical significance types.not provided
; Conflicting interpretations of pathogenicity
; other
) or combine multiple values (Pathogenic, Affects
; Benign/Likely benign
).Please let me know how you would like to proceed here.
Thank you very much for doing all these extensive comparisons @tskir! It's really helpful in making the decisions. Let me go through all the points you have raised.
.target.activity
value anywhere so just omit that. We'll make that value in the json schema optional."clinical_significance": [
"pathogenic",
"risk factor"
],
@DSuveges (copying @tcezard)
I agree with all the points you raised. I will proceed to implement the changes on our side in an experimental branch.
@DSuveges I'm currently working on this issue and I have a follow-up question about the ['evidence']['gene2variant']['is_associated']
and ['evidence']['variant2disease']['is_associated']
attributes. In the existing approach, we always set them to True, because only (likely) pathogenic and other related variants are processed. Should I continue to always set them both to True, even for variants which are benign etc.?
Or rather, to be precise, there is currently a calculation of both of those attributes which looks like this:
self.association = clinvar_record.clinical_significance not in ('likely benign', 'benign')
However, under the current approach, it always sets them both to True since "benign" and "likely benign" variants are being filtered out even before going into the evidence string generation phase.
Should I keep the calculation in exactly this way? Should I add more clinical significance levels which should not be considered an "association"? Or should I just always set them to True regardless?
Hi Kiril,
We have discussed this question with the team. There's a bit of history behind this value: it supposed to distinguish between positive and negative evidence hence the true/false binary. In this interpretation, this value should remain to be set true even for benign and likely benign evidences.
Thank you so much and please just ask if you have any other questions.
@AsierGonzalez @DSuveges This is done, currently undergoing our internal review. Changes are described in depth in our PR https://github.com/EBIvariation/eva-opentargets/pull/146. Summary:
'Benign/Likely benign, risk_factor'
→ ['benign', 'likely benign', 'risk factor']
evidence.gene2variant.is_associated
and evidence.variant2disease.is_associated
fields to Truetarget.activity
attributeI also ran the evidence string comparison protocol, and, as expected, their number rose significantly: from 116023 (batch 2020-09, standard pipeline) to 499174 (batch 2020-09, updated pipeline), an increase of 4.3×
That sounds fantastic, thanks @tskir
Since there were no corrections raised for this issue, and since it already supports both types of evidence strings, it's been reviewed and merged on our side. To be compatible with the JSON schema, it will require opentargets/json_schema#95 to be merged by the time of the next release.
PR has been merged
The test file provided by @tskir on 28/09/20 contained all clinical significance levels so the same is expected for the final 20.11 file. @DSuveges is now working on the new scoring (#1138)
These are the clinical significance counts in the 20.11 submission (based on ClinVar 2020/08):
Clinical significance array | Counts |
---|---|
["uncertain significance"] | 413,652 |
["likely benign"] | 126,194 |
["pathogenic"] | 103,961 |
["benign"] | 84,477 |
["likely pathogenic"] | 41,726 |
["not provided"] | 8,864 |
["conflicting interpretations of pathogenicity"] | 8,240 |
["benign","likely benign"] | 3,445 |
["likely pathogenic","pathogenic"] | 3,004 |
["other"] | 2,098 |
["risk factor"] | 481 |
["drug response"] | 197 |
["association"] | 170 |
["affects"] | 133 |
["protective"] | 29 |
["pathogenic","risk factor"] | 11 |
["benign","other"] | 8 |
["risk factor","uncertain significance"] | 6 |
["drug response","pathogenic"] | 6 |
["affects","pathogenic"] | 4 |
["conflicting interpretations of pathogenicity","risk factor"] | 3 |
["benign","likely benign","other"] | 3 |
["association not found"] | 3 |
["other","uncertain significance"] | 2 |
["likely pathogenic","risk factor"] | 2 |
["likely benign","risk factor"] | 2 |
["likely benign","other"] | 2 |
["benign","risk factor"] | 2 |
["association","pathogenic"] | 2 |
["likely pathogenic","pathogenic","risk factor"] | 1 |
["likely pathogenic","other","pathogenic"] | 1 |
["affects","likely pathogenic"] | 1 |
Include other clinical significance variants: we’will score them as 0 but present the information