EVA/ClinVar - include other clinical significant variant

DSuveges commented 4 years ago

Include other clinical significance variants: we’will score them as 0 but present the information

AsierGonzalez commented 4 years ago

There is no action in our side, EVA need to add this to their pipeline. I'll make sure that Kirill has a ticket for this.

AsierGonzalez commented 4 years ago

Better said, the action on our side is to adapt the JSON schema so that it accepts all the clinical significance values used by ClinVar. The ones currently accepted are below:

"clinical_significance": {
                  "type": "string",
                  "enum": [
                    "Pathogenic",
                    "Likely pathogenic",
                    "protective",
                    "association",
                    "risk_factor",
                    "Affects",
                    "drug response"
                  ]
                }

Extracted from the JSON schema.

AsierGonzalez commented 4 years ago

This table in the ClinVar documentation suggests that the list above may be incomplete: Screenshot 2020-07-06 at 15 29 03

AsierGonzalez commented 4 years ago

Kirill has agreed to report all variants (work can be tracked on EBIvariation/eva-opentargets#122 and he will provide a list of the possible clinical significance values soon.

tskir commented 4 years ago

@DSuveges @AsierGonzalez (also copying @tcezard @d0choa @iandunham)

Here is the distribution of all possible clinical significance levels in ClinVar: clinical-significance

Under the current criteria, only 188,518 out of 1,114,689 (17%) records contain clinical significance levels which are processed by our pipeline. The ones which are not processed fall into two categories:

Simple, where only one value is associated with a given record;
Complex, where there are multiple values, separated using , or /.

To start processing the types which are currently not processed, we need to do several things:

The new types must be supported by the Open Targets JSON schema, as mentioned by Asier in the comment above.
Specifically, we need to sort out the issue with spellings. I'm not certain about it, but it looks like we're currently not processing two types simply because they are spelled differently between ClinVar, our pipeline, and the JSON schema: risk_factor / risk factor and affects/Affects. To resolve this, I suggest specifying all values in the JSON schema in lowercase and using spaces, e.g.: risk factor, affects, likely pathogenic, etc. Where ClinVar spells them differently, we'll handle the conversion in our pipeline.
To include the new clinical significance levels, we need to know which activity identifiers to map them to. For example, currently likely pathogenic is mapped to http://identifiers.org/cttv.activity/predicted_damaging. We need to have such mappings for all clinical significance types.
It's unclear what to do with types which are either non-specific (not provided; Conflicting interpretations of pathogenicity; other) or combine multiple values (Pathogenic, Affects; Benign/Likely benign).

Please let me know how you would like to proceed here.

DSuveges commented 4 years ago

Thank you very much for doing all these extensive comparisons @tskir! It's really helpful in making the decisions. Let me go through all the points you have raised.

Yes, to accommodate the decisions here and on other tickets #1137, #1140 we'll need to update the json schema.
Your suggestion regarding the spelling of the clinical significance terms is good: let's use only lowercase terms with spaces instead underscores.
It seems we don't really use the .target.activity value anywhere so just omit that. We'll make that value in the json schema optional.
To accommodate multiple clinical significance values, we'll change the json schema to handle list instead of a string like:

"clinical_significance": [
   "pathogenic",
   "risk factor"
],

tskir commented 4 years ago

@DSuveges (copying @tcezard)

I agree with all the points you raised. I will proceed to implement the changes on our side in an experimental branch.

tskir commented 4 years ago

@DSuveges I'm currently working on this issue and I have a follow-up question about the ['evidence']['gene2variant']['is_associated'] and ['evidence']['variant2disease']['is_associated'] attributes. In the existing approach, we always set them to True, because only (likely) pathogenic and other related variants are processed. Should I continue to always set them both to True, even for variants which are benign etc.?

tskir commented 4 years ago

Or rather, to be precise, there is currently a calculation of both of those attributes which looks like this:

self.association = clinvar_record.clinical_significance not in ('likely benign', 'benign')

However, under the current approach, it always sets them both to True since "benign" and "likely benign" variants are being filtered out even before going into the evidence string generation phase.

Should I keep the calculation in exactly this way? Should I add more clinical significance levels which should not be considered an "association"? Or should I just always set them to True regardless?

DSuveges commented 4 years ago

Hi Kiril,

We have discussed this question with the team. There's a bit of history behind this value: it supposed to distinguish between positive and negative evidence hence the true/false binary. In this interpretation, this value should remain to be set true even for benign and likely benign evidences.

Thank you so much and please just ask if you have any other questions.

tskir commented 4 years ago

@AsierGonzalez @DSuveges This is done, currently undergoing our internal review. Changes are described in depth in our PR https://github.com/EBIvariation/eva-opentargets/pull/146. Summary:

Remove all filtering by clinical significance
Process and format the clinical significance levels: 'Benign/Likely benign, risk_factor' → ['benign', 'likely benign', 'risk factor']
Always set the evidence.gene2variant.is_associated and evidence.variant2disease.is_associated fields to True
Get rid of the the obsolete target.activity attribute
Adjust the schema: https://github.com/opentargets/json_schema/pull/95

I also ran the evidence string comparison protocol, and, as expected, their number rose significantly: from 116023 (batch 2020-09, standard pipeline) to 499174 (batch 2020-09, updated pipeline), an increase of 4.3×

AsierGonzalez commented 4 years ago

That sounds fantastic, thanks @tskir

tskir commented 4 years ago

Since there were no corrections raised for this issue, and since it already supports both types of evidence strings, it's been reviewed and merged on our side. To be compatible with the JSON schema, it will require opentargets/json_schema#95 to be merged by the time of the next release.

AsierGonzalez commented 4 years ago

PR has been merged

AsierGonzalez commented 3 years ago

The test file provided by @tskir on 28/09/20 contained all clinical significance levels so the same is expected for the final 20.11 file. @DSuveges is now working on the new scoring (#1138)

AsierGonzalez commented 3 years ago

These are the clinical significance counts in the 20.11 submission (based on ClinVar 2020/08):

Clinical significance array	Counts
["uncertain significance"]	413,652
["likely benign"]	126,194
["pathogenic"]	103,961
["benign"]	84,477
["likely pathogenic"]	41,726
["not provided"]	8,864
["conflicting interpretations of pathogenicity"]	8,240
["benign","likely benign"]	3,445
["likely pathogenic","pathogenic"]	3,004
["other"]	2,098
["risk factor"]	481
["drug response"]	197
["association"]	170
["affects"]	133
["protective"]	29
["pathogenic","risk factor"]	11
["benign","other"]	8
["risk factor","uncertain significance"]	6
["drug response","pathogenic"]	6
["affects","pathogenic"]	4
["conflicting interpretations of pathogenicity","risk factor"]	3
["benign","likely benign","other"]	3
["association not found"]	3
["other","uncertain significance"]	2
["likely pathogenic","risk factor"]	2
["likely benign","risk factor"]	2
["likely benign","other"]	2
["benign","risk factor"]	2
["association","pathogenic"]	2
["likely pathogenic","pathogenic","risk factor"]	1
["likely pathogenic","other","pathogenic"]	1
["affects","likely pathogenic"]	1

opentargets / issues

EVA/ClinVar - include other clinical significant variant #1139