opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0
12 stars 2 forks source link

EVA/ClilnVar - Include mode of inheritance #1140

Closed DSuveges closed 3 years ago

AsierGonzalez commented 4 years ago

It seems that specifying the mode of inheritance in ClinVar is up to the submitters and it seems to appear in the variant page:

When a submitter specifies a mode of inheritance, that value is displayed on the variant page on the "Clinical assertions" tab in parentheses in the Conditions column, as in this example

From info here: https://www.ncbi.nlm.nih.gov/clinvar/docs/clinsig/#mode

AsierGonzalez commented 4 years ago

We need to check with Kirill if it is possible to extract this information.

AsierGonzalez commented 4 years ago

Kirill is working on this as part of his EBIvariation/eva-opentargets#123 ticket.

tskir commented 4 years ago

@DSuveges @AsierGonzalez (copying @tcezard)

Yes it's possible to extract this information, but is only appears in a minority of ClinVar records: mode-of-inheritance

Please let me know how would you like to proceed here; especially, how do I specify the inheritance mode in the evidence strings.

DSuveges commented 4 years ago

Although this inheritance information is sparse, we still believe it has high value when it is available. We would like to have it in both the genetic_association (in .evidence.variant2disease) and in somatic_mutation (in .evidence)

"mode_of_inheritance": "Autosomal dominant inheritance"

(Currently the structure of genetic_association and somatic_mutation are very different although the underlying ClinVar data model is very similar or even identical. So in the long run, we are planning to make the structure of those two types converge. But no timeline for this at the moment)

tskir commented 4 years ago

@DSuveges (copying @tcezard)

Okay, then I'll go on to implement it in the way you described. Two comments:

DSuveges commented 4 years ago

This json_schema branch now has the following definition to store mode of inheritance:

                "mode_of_inheritance": {
                  "type": "string",
                  "description": "ClinVar mode of inheritance",
                  "enum": [
                    "autosomal dominant inheritance",
                    "autosomal recessive inheritance",
                    "somatic mutation",
                    "x-linked inheritance",
                    "autosomal unknown",
                    "x-linked recessive inheritance",
                    "sporadic",
                    "mitochondrial inheritance",
                    "x-linked dominant inheritance",
                    "modeofinheritance multiple",
                    "oligogenic inheritance",
                    "other",
                    "y-linked inheritance",
                    "sex-limited autosomal dominant",
                    "autosomal dominant inheritance with maternal imprinting",
                    "genetic anticipation",
                    "co-dominant",
                    "autosomal dominant inheritance with paternal imprinting",
                    "enigma rules, 2015"
                  ]
                }

So I have added the complete list seen on Kiril's graph. I am not 100% sure if this list is complete, or should be checked with enum. However it seems there are inconsistencies, which would be good to catch (eg. autosomal unknown vs Autosomal unknown).

tskir commented 4 years ago

Thanks @DSuveges! To answer your question:

I am not 100% sure if this list is complete

It is as of the latest (mid-July) ClinVar data. Of course, in the future they might add new types

or should be checked with enum

Yes, this is a difficult choice. I guess we can go with an enum for now, and if they keep changing the list of allowed values too frequently, we can think of other ideas.

DSuveges commented 4 years ago

Thanks. Let's try to enumerate the terms and review later.

AsierGonzalez commented 4 years ago

Using an enum here is overkill in my opinion. As you have said, if the list changes in the future we will need to check it an update it accordingly. The enum would be justified if we needed the mode of inheritance to take a value from a closed list for instance because we are going to use it to score. However, we are far from there because there are at least three other data sources (Genomics England PanelApp, Gene2Phenotype and ClinGen) that contain similar information but with different possible values (table kindly provided by @emcdonagh-OT):

PanelApp (list) Gene2Phenotype ClinGen
MONOALLELIC, autosomal or pseudoautosomal, not imprinted monoallelic Autosomal Dominant
MONOALLELIC, autosomal or pseudoautosomal, maternally imprinted (paternal allele expressed) imprinted N/A
MONOALLELIC, autosomal or pseudoautosomal, paternally imprinted (maternal allele expressed) imprinted N/A
MONOALLELIC, autosomal or pseudoautosomal, imprinted status unknown monoallelic Autosomal Dominant
BIALLELIC, autosomal or pseudoautosomal biallelic Autosomal Recessive
BOTH monoallelic and biallelic, autosomal or pseudoautosomal N/A N/A
BOTH monoallelic and biallelic, autosomal or pseudoautosomal (but BIALLELIC mutations cause a more SEVERE disease form), autosomal or pseudoautosomal N/A N/A
X-LINKED: hemizygous mutation in males, biallelic mutations in females hemizygous X-linked recessive
X linked: hemizygous mutation in males, monoallelic mutations in females may cause disease (may be less severe, later onset than males) X-linked dominant X-linked
MITOCHONDRIAL Mitochondrial N/A
Unknown Uncertain N/A
Other - please specify in evaluation comments Other
N/A N/A Semidominant
N/A Monoallelic (Y) N/A
N/A Digenic N/A
N/A Mosaic N/A
N/A X-linked over-dominance N/A

I would argue that if we only want to capture the information for now we should just accept an string.

AsierGonzalez commented 4 years ago

Also, I wonder what the right place for this information is. As you already said, for somatic_mutation evidence the mode of inheritance (inheritance_pattern) is located directly under evidence because that is all the evidence strings contain apart from the target and disease information. However, I am not so sure about placing it under variant2disease for genetics associations. Is the mode of inheritance an inherent property of the variant in that disease or is it related to the target or the variant? If it's the former, then the suggested location is fine, if it's the latter it should be a property of the target or the variant.

tskir commented 4 years ago

@AsierGonzalez

Having examined your arguments, I now agree that we should drop enum for mode of inheritance. I will also not do any attempts to format the values of that field (to lowercase or otherwise), and will just report them verbatim.

Regarding your second question: as far as I know, mode of inheritance is usually considered to be a property of the disease. For example, people say that mode of inheritance of Hemophilia A is X-linked recessive. However, it is mostly influenced by the properties of the gene: where it is located (autosome, allosome, or mitochondrial genome) and how variants in that gene (especially LOF ones) impact the phenotype associated with the disease, based on their zygosity. I also believe there are some rare cases where different variants in the same gene might have different modes of inheritance for the same disease.

It might be a good idea to summon someone with a better understanding of biology and clinical genetics. As far as I see, given that mode of inheritance depends on all three concepts (variant, disease, gene), using the variant2disease section seems to be a sensible approach.

tskir commented 4 years ago

@AsierGonzalez @DSuveges

Also, as I see, currently in the JSON schema the mode_of_inheritance block is defined only once, inside variant2disease. Meaning, once again, that I will only be able to include it for genetic_association evidence strings. Is this intended?

tskir commented 4 years ago

This is now also implemented, undergoing our internal review (https://github.com/EBIvariation/eva-opentargets/pull/149), and also I submitted another PR with changes to the OT JSON schema (opentargets/json_schema/pull/97).

tskir commented 4 years ago

While testing the change, I stumbled upon the fact that there were a few dozen records which have multiple “mode of inheritance” attributes specified. In fact, they are specified in the Sankey diagram above as “ModeOfInheritance multiple”, I just forgot about them up until the testing phase.

For now, I solved this problem by amending the PR for the JSON schema to accept arrays in the mode_of_inheritance attributes as well, and by modifying our pipeline to always store those attributes in an array. If needed, we can discuss this and this behaviour can be changed.

AsierGonzalez commented 4 years ago

It might be a good idea to summon someone with a better understanding of biology and clinical genetics. As far as I see, given that mode of inheritance depends on all three concepts (variant, disease, gene), using the variant2disease section seems to be a sensible approach.

@iandunham has confirmed that this is how it should be, that is, the mode of inheritance is a property of all those together. So there is no need to relocate it.

Also, as I see, currently in the JSON schema the mode_of_inheritance block is defined only once, inside variant2disease. Meaning, once again, that I will only be able to include it for genetic_association evidence strings. Is this intended?

This was probably an oversight and we will fix it when we work on improving the alignment between the genetic associations and somatic mutations. If you are constrained by the JSON schema to generate the evidence, that is, if you can only represent fields that are explicitly defined in it, just leave it out of the somatic evidence strings for now. However, if you are flexible and you can add it to .evidence in the latter group, that would be useful for us to explore the data.

For now, I solved this problem by amending the PR for the JSON schema to accept arrays in the mode_of_inheritance attributes as well, and by modifying our pipeline to always store those attributes in an array. If needed, we can discuss this and this behaviour can be changed.

Specifying the mode of inheritance in an array is a solution but given than only 33 out of the 35,009 records have multiple modes of inheritance I would suggest that first we check what they represent and whether we could handle them in an alternative way. The mode of inheritance will be provided by other data sources so we should try to make it as generic as possible. If an array is finally required we may need to include it in a oneOf construct but we could do that ourselves.

tskir commented 4 years ago

@AsierGonzalez Thank you for the detailed comments. Now to address them:

Multiple ModeOfInheritance attributes

So that we don't have to decide blindly, here is the distribution of all cases where a ClinVar record has multiple ModeOfInheritance attributes. You may notice that the total is 42 instead of 33—this is because I'm using a newer ClinVar data release than the one in https://github.com/opentargets/platform/issues/1140#issuecomment-662010894.

     10 Autosomal dominant inheritance|Sporadic
      7 Autosomal dominant inheritance|Autosomal recessive inheritance
      6 Autosomal dominant inheritance|Somatic mutation
      5 Autosomal dominant inheritance|Autosomal unknown
      4 X-linked dominant inheritance|X-linked inheritance
      3 Autosomal recessive inheritance|autosomal unknown
      2 Sporadic|X-linked inheritance
      2 X-linked inheritance|X-linked recessive inheritance
      1 Autosomal recessive inheritance|Autosomal unknown
      1 Autosomal recessive inheritance|X-linked recessive inheritance
      1 Autosomal unknown|Sporadic

It looks to me that those combinations of values look quite sensible and should be represented in the evidence strings. Even though those cases are a minority, we should still try to represent all possible data states. (Case in point: having multiple clinical significance values is also a minority, but we support an array of them anyway.)

I agree that in this case it's better to have a oneOf choice between an array and a string, to make it easier for other submitters. However, I think that for consistency, it would be better if our pipeline always used the array, even when there is only one value (again, similarly to how we always use an array for clinical significance levels, even when it's just one).

Alignment with JSON schema changes

Just to clarify, in our pipeline we are not restricted to only specifying the values currently present in the OT JSON schema. However, I believe it is a good practice to introduce changes to the code & the schema at the same time, to avoid significant divergence between them. It's not difficult for me at all to introduce those changes and to submit the corresponding PRs.

tskir commented 4 years ago

This is now reviewed & merged on our side. The current implementation is as described in my previous comment: specifying the modes of inheritance both for genetic_association and somatic_mutation evidence strings, and always using an array.

I also updated https://github.com/opentargets/json_schema/pull/97 accordingly.

AsierGonzalez commented 3 years ago

Thank you for breaking the combinations down @tskir. You say that the combinations of values look sensible but I struggle to understand how the same disease caused by the same mutation can have multiple modes of inheritance. Is this biologically possible or a way to represent that there is uncertainty about it?

I am just trying to be sure that this is not low confidence data that should be ignored. Otherwise you solution looks good.

AsierGonzalez commented 3 years ago

Comment by @iandunham:

Some of these dual ones look like they are just evolving knowledge. i.e. the mode has got cleared up. The X linked ones are weird - could be that the dominant mode was given for a male case and recessive for a female case. Autosomal recessive inheritance|X-linked recessive inheritance is just a mess as it shouldn't be able to be both. This all looks like it requires some detailed investigation of the specific cases

So I think we should look into some of those more in detail. Could you provide the ids of those ClinVar entries so that we can look into them?

tskir commented 3 years ago

@AsierGonzalez @iandunham Sure, here is the complete table, extracted from the latest ClinVar XML as of 2020-09-16:

All cases of multiple ModeOfInheritance attributes

RCV ID Modes of inheritance
RCV000063117 Autosomal dominant inheritance; Autosomal recessive inheritance
RCV000190094 Autosomal dominant inheritance; Autosomal recessive inheritance
RCV000275624 Autosomal dominant inheritance; Autosomal recessive inheritance
RCV000399339 Autosomal dominant inheritance; Autosomal recessive inheritance
RCV000918466 Autosomal dominant inheritance; Autosomal recessive inheritance
RCV000948503 Autosomal dominant inheritance; Autosomal recessive inheritance
RCV001096156 Autosomal dominant inheritance; Autosomal recessive inheritance
RCV000086943 Autosomal dominant inheritance; Autosomal unknown
RCV000087353 Autosomal dominant inheritance; Autosomal unknown
RCV000087631 Autosomal dominant inheritance; Autosomal unknown
RCV000344502 Autosomal dominant inheritance; Autosomal unknown
RCV000495606 Autosomal dominant inheritance; Autosomal unknown
RCV000109523 Autosomal dominant inheritance; Somatic mutation
RCV000287932 Autosomal dominant inheritance; Somatic mutation
RCV000372784 Autosomal dominant inheritance; Somatic mutation
RCV000507857 Autosomal dominant inheritance; Somatic mutation
RCV000507858 Autosomal dominant inheritance; Somatic mutation
RCV001025520 Autosomal dominant inheritance; Somatic mutation
RCV000141918 Autosomal dominant inheritance; Sporadic
RCV000344359 Autosomal dominant inheritance; Sporadic
RCV000346120 Autosomal dominant inheritance; Sporadic
RCV000919753 Autosomal dominant inheritance; Sporadic
RCV000963036 Autosomal dominant inheritance; Sporadic
RCV000966069 Autosomal dominant inheritance; Sporadic
RCV001139707 Autosomal dominant inheritance; Sporadic
RCV001141206 Autosomal dominant inheritance; Sporadic
RCV001455265 Autosomal dominant inheritance; Sporadic
RCV001741488 Autosomal dominant inheritance; Sporadic
RCV000086819 Autosomal recessive inheritance; autosomal unknown
RCV000086873 Autosomal recessive inheritance; autosomal unknown
RCV000087680 Autosomal recessive inheritance; autosomal unknown
RCV000077525 Autosomal recessive inheritance; Autosomal unknown
RCV001954859 Autosomal recessive inheritance; X-linked recessive inheritance
RCV000350399 Autosomal unknown; Sporadic
RCV000070093 Sporadic; X-linked inheritance
RCV000146966 Sporadic; X-linked inheritance
RCV000070104 X-linked dominant inheritance; X-linked inheritance
RCV000398754 X-linked dominant inheritance; X-linked inheritance
RCV000401226 X-linked dominant inheritance; X-linked inheritance
RCV001647962 X-linked dominant inheritance; X-linked inheritance
RCV000087558 X-linked inheritance; X-linked recessive inheritance
RCV000529121 X-linked inheritance; X-linked recessive inheritance
Note to self — command used to extract this from logs ```bash grep '^Multiple ModeOfInheritance' log | cut -f2,3 | sort -t$'\t' -k2,2 -k1,1 | sed -e 's/|/; /g' | clipboard ```
AsierGonzalez commented 3 years ago

I have looked into those clinvar accessions and what I have seen does not look very encouraging. Summary is below but you can see all the information in the spreadsheet:

There are three things that we could check to find out how widespread this issue is:

However, considering that we are not planning to use the mode of inheritance for scoring but just to display it in the evidence table, we may just want to use it as is.

tskir commented 3 years ago

@AsierGonzalez (also copying @tcezard)

Thank you for looking into this! I have two considerations on the matter:

  1. Given that there are legitimate cases where multiple modes of inheritance can be present for the same record, I propose that we leave the current way of reporting them in our pipeline unchanged. That is, to continue always reporting an array of values.
  2. While at some point we should check all those things you mentioned, it would take a fair bit of time. Given that modes of inheritance will not yet be used for scoring, I propose that we add those investigations to the backlog to be addressed at some point in the future. I've created a ticket on our side: https://github.com/EBIvariation/eva-opentargets/issues/157
AsierGonzalez commented 3 years ago

I agree, let's use it as is for now.

tskir commented 3 years ago

As discussed here, no additional work on mode of inheritance will be required.

Regarding other ClinVar investigation findings: minor improvements on allele origin and cohortPhenotypes will be implemented separately and do not require tracking on our side.