openvar / variantValidator

Public repository for VariantValidator project
GNU Affero General Public License v3.0

Add support for mitochondrial variants #134

Closed ifokkema closed 4 years ago

ifokkema commented 4 years ago

Mitochondrial variants fail in the Variant Validator. The Variant Formatter works (partially). Example: NC_012920.1:m.3243A>G

The web interface says:

Unable to validate the submitted variant NC_012920.1:m.3243A>G against the GRCh38 assembly. The following warnings were returned:

Please check your submission and re-submit.

(GRCh37 and GRCh38 both return the same problem)

There isn't actually an error shown. Replacing m. by g. does not solve the problem.

The API returns a "flag": "warning" and a "validation_warning_1" element, which doesn't really seem to contain an error message.

Once this is fixed, I'd like to propose a method to be able to map to genes (transcripts) on the mitochondrial DNA.

Peter-J-Freeman commented 4 years ago

Thanks. Will take a look

There is currently no method for mapping onto transcripts because RefSeq have not yet released HGVS-compliant reference transcripts for the mitochondrial genome. These are due, but currently all are "model". We need to discuss this, so I am opening a new issue.

p.s. is this the "hg19" mitochondrial chromosome or the GRCh37? I ask because I know that the hg19 version is not yet in UTA or SeqRepo. We need to add it during database development. Opening a new issue

Peter-J-Freeman commented 4 years ago

https://github.com/openvar/variantValidator/issues/135

Peter-J-Freeman commented 4 years ago

For discussion over missing or model sequences, please see the new issue linked above. I'm keeping this one open to see what the VF mito behaviour is

ifokkema commented 4 years ago

> p.s. is this the "hg19" mitochondrial chromosome or the GRCh37? I ask because I know that the hg19 version is not yet in UTA or SeqRepo. We need to add it during database development. Opening a new issue

No, this is the NCBI GRCh37 one. So that's probably why VF can work with it.

Peter-J-Freeman commented 4 years ago

Thanks. I need to check the VV and VF behaviour again then. I know VV can handle it, but support for m. was never raised as a user request for VF.

Peter-J-Freeman commented 4 years ago

Hi @ifokkema . Doing some coding for a change. Woo hoo.

Here is the vv output for the variant you described

{
    "flag": "warning",
    "metadata": {
        "seqrepo_db": "2018-08-21",
        "uta_schema": "uta_20180821",
        "variantvalidator_hgvs_version": "1.2.5.vv1",
        "variantvalidator_version": "1.0.4.dev11+g97aec97"
    },
    "validation_warning_1": {
        "alt_genomic_loci": [],
        "gene_ids": {},
        "gene_symbol": "",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "slr": "",
            "tlr": ""
        },
        "hgvs_refseqgene_variant": "",
        "hgvs_transcript_variant": "",
        "primary_assembly_loci": {
            "grch37": {
                "hgvs_genomic_description": "NC_012920.1:m.3243A>G",
                "vcf": {
                    "alt": "G",
                    "chr": "M",
                    "pos": "3243",
                    "ref": "A"
                }
            },
            "grch38": {
                "hgvs_genomic_description": "NC_012920.1:m.3243A>G",
                "vcf": {
                    "alt": "G",
                    "chr": "M",
                    "pos": "3243",
                    "ref": "A"
                }
            },
            "hg19": {
                "hgvs_genomic_description": "NC_012920.1:m.3243A>G",
                "vcf": {
                    "alt": "G",
                    "chr": "chrM",
                    "pos": "3243",
                    "ref": "A"
                }
            },
            "hg38": {
                "hgvs_genomic_description": "NC_012920.1:m.3243A>G",
                "vcf": {
                    "alt": "G",
                    "chr": "chrM",
                    "pos": "3243",
                    "ref": "A"
                }
            }
        },
        "reference_sequence_records": "",
        "refseqgene_context_intronic_sequence": "",
        "submitted_variant": "NC_012920.1:m.3243A>G",
        "transcript_description": "Homo sapiens mitochondrion, complete genome",
        "validation_warnings": []
    }
}

The warning flag needs to be changed. I will create a mitochondrialvariant flag. However, I will need to review this when RefSeq release NM mitochondrial transcripts.
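To make the problem concrete, here is a minimal Python sketch of how a client could spot this case in the output above: the flag is "warning", yet `validation_warnings` is empty and the VCF coordinates are present. The helper names (`variant_records`, `is_empty_warning`, `extract_vcf`) are hypothetical and not part of VariantValidator; the only assumption about the response is the key layout shown in the JSON above.

```python
def variant_records(response: dict) -> dict:
    """Return every top-level entry that is not the flag or the metadata block."""
    return {k: v for k, v in response.items() if k not in ("flag", "metadata")}


def is_empty_warning(response: dict) -> bool:
    """True when the flag says 'warning' but no record carries a warning message."""
    records = variant_records(response)
    return response.get("flag") == "warning" and all(
        not rec.get("validation_warnings") for rec in records.values()
    )


def extract_vcf(response: dict) -> dict:
    """Map each record to {assembly: 'chr:pos:ref:alt'} from primary_assembly_loci."""
    results = {}
    for key, record in variant_records(response).items():
        loci = record.get("primary_assembly_loci", {})
        results[key] = {
            assembly: "{chr}:{pos}:{ref}:{alt}".format(**entry["vcf"])
            for assembly, entry in loci.items()
        }
    return results
```

With the output above, `is_empty_warning` would be true, which is the situation the web interface reports as a warning with no message.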

Peter-J-Freeman commented 4 years ago

Output now looks like this for variant NC_012920.1:m.3243A>G

{
    "flag": "mitochondrial",
    "metadata": {
        "seqrepo_db": "2018-08-21",
        "uta_schema": "uta_20180821",
        "variantvalidator_hgvs_version": "1.2.5.vv1",
        "variantvalidator_version": "1.0.4.dev11+g97aec97"
    },
    "mitochondrial_variant_1": {
        "alt_genomic_loci": [],
        "gene_ids": {},
        "gene_symbol": "",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "slr": "",
            "tlr": ""
        },
        "hgvs_refseqgene_variant": "",
        "hgvs_transcript_variant": "",
        "primary_assembly_loci": {
            "grch37": {
                "hgvs_genomic_description": "NC_012920.1:m.3243A>G",
                "vcf": {
                    "alt": "G",
                    "chr": "M",
                    "pos": "3243",
                    "ref": "A"
                }
            },
            "grch38": {
                "hgvs_genomic_description": "NC_012920.1:m.3243A>G",
                "vcf": {
                    "alt": "G",
                    "chr": "M",
                    "pos": "3243",
                    "ref": "A"
                }
            },
            "hg19": {
                "hgvs_genomic_description": "NC_012920.1:m.3243A>G",
                "vcf": {
                    "alt": "G",
                    "chr": "chrM",
                    "pos": "3243",
                    "ref": "A"
                }
            },
            "hg38": {
                "hgvs_genomic_description": "NC_012920.1:m.3243A>G",
                "vcf": {
                    "alt": "G",
                    "chr": "chrM",
                    "pos": "3243",
                    "ref": "A"
                }
            }
        },
        "reference_sequence_records": "",
        "refseqgene_context_intronic_sequence": "",
        "selected_assembly": "GRCh37",
        "submitted_variant": "NC_012920.1:m.3243A>G",
        "transcript_description": "Homo sapiens mitochondrion, complete genome",
        "validation_warnings": []
    }
}

Peter-J-Freeman commented 4 years ago

Hi @ifokkema . I have done a little more digging. NC_001807.4 is NOT the GRCh37 mito genome, it is the hg19 mito genome. The GRCh37 mito genome is NC_012920.1, which is also used in GRCh38. So there is a discrepancy between GRCh37 and hg19.

It seems to be working with VV, but VF gives the warning "Failed to fetch NC_001807.4 from SeqRepo". So I suspect I did something to VV to sort the issue. I'll keep digging.
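The build/accession relationship described in this comment can be written down as a small lookup. This is only an illustrative Python sketch, not code from VV or VF: the table and function names are hypothetical, and the hg38 entry is an assumption based on the VV output earlier in this thread, which reports NC_012920.1 for hg38.

```python
# Mitochondrial reference accession per genome build, as described in this thread.
# The hg38 entry is an assumption taken from the VV output shown earlier.
MITO_ACCESSION_BY_BUILD = {
    "grch37": "NC_012920.1",
    "grch38": "NC_012920.1",
    "hg38": "NC_012920.1",
    "hg19": "NC_001807.4",
}


def mito_accession(build: str) -> str:
    """Return the mitochondrial accession expected for a build name (case-insensitive)."""
    try:
        return MITO_ACCESSION_BY_BUILD[build.lower()]
    except KeyError:
        raise ValueError(f"Unknown genome build: {build!r}") from None


def build_matches_accession(build: str, accession: str) -> bool:
    """True if the submitted accession is the one expected for the selected build."""
    return mito_accession(build) == accession
```

For example, `build_matches_accession("hg19", "NC_012920.1")` is false, which is the mismatch that leads to the SeqRepo fetch warning for NC_001807.4.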

Peter-J-Freeman commented 4 years ago

OK, so VF returns the warning "Failed to fetch NC_001807.4 from SeqRepo." if hg19 is selected. What's happening here is that when hg19 is selected, VF sees M and automatically selects NC_001807.4 when reverse mapping to VCF. I'm not entirely sure how to correct the issue because it's a quirk. I suggest sticking to GRCh37 rather than hg19 for m. variants.

{
  "NC_012920.1:m.3243A>G": {
    "NC_012920.1:m.3243A>G": {
      "g_hgvs": "NC_012920.1:g.3243A>G",
      "genomic_variant_error": null,
      "hgvs_t_and_p": {
        "intergenic": {
          "alt_genomic_loci": {
            "grch37": {},
            "grch38": {},
            "hg19": {},
            "hg38": {}
          },
          "primary_assembly_loci": {
            "grch37": {
              "NC_012920.1": {
                "hgvs_genomic_description": "NC_012920.1:g.3243A>G",
                "vcf": {
                  "alt": "G",
                  "chr": "M",
                  "pos": "3243",
                  "ref": "A"
                }
              }
            },
            "grch38": {},
            "hg19": {
              "NC_012920.1": {
                "hgvs_genomic_description": "NC_012920.1:g.3243A>G",
                "vcf": {
                  "alt": "G",
                  "chr": "chrM",
                  "pos": "3243",
                  "ref": "A"
                }
              }
            },
            "hg38": {}
          }
        }
      },
      "p_vcf": "M:3243:A:G",
      "selected_build": "GRCh37"
    },
    "errors": [],
    "flag": null
  },
  "metadata": {
    "seqrepo_db": "/local/seqrepo/2018-08-21",
    "uta_schema": "uta_20180821",
    "variantformatter_version": "1.0.2.dev8+gc4645b3",
    "variantvalidator_hgvs_version": "1.2.5.vv1",
    "variantvalidator_version": "1.0.4.dev11+g97aec97"
  }
}

This is as full a validation as this tool can manage. There are currently no mito NM_ transcripts. We can consider adding transcripts. Also, the liftover does not include mito, but the GRCh37 and GRCh38 mito genomes are identical.
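For anyone consuming the VF output above, the nesting is deeper than in VV (submitted variant, then variant again, then `hgvs_t_and_p`, context, build, accession). A minimal Python sketch for pulling out the genomic description per build is shown here; the function name `vf_primary_loci` is hypothetical, and the key layout is assumed to be exactly as in the JSON above.

```python
def vf_primary_loci(vf_response: dict) -> dict:
    """Collect {submitted_variant: {build: hgvs_genomic_description}} from a VF-style response."""
    out = {}
    for submitted, block in vf_response.items():
        if submitted == "metadata":
            continue
        for key, detail in block.items():
            # "errors" and "flag" sit next to the per-variant record; skip them.
            if key in ("errors", "flag"):
                continue
            contexts = detail.get("hgvs_t_and_p") or {}
            for context in contexts.values():
                loci = context.get("primary_assembly_loci", {})
                for build, accessions in loci.items():
                    # Builds with no mapping are empty dicts and contribute nothing.
                    for entry in accessions.values():
                        out.setdefault(submitted, {})[build] = entry[
                            "hgvs_genomic_description"
                        ]
    return out
```

Applied to the output above, only `grch37` and `hg19` would appear in the result, because the `grch38` and `hg38` entries are empty when GRCh37 is the selected build.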

Peter-J-Freeman commented 4 years ago

So, now I need to establish what happens with NC_001807.4

Variant NC_001807.4:m.2941C>G

VV returns as expected

{
    "flag": "mitochondrial",
    "metadata": {
        "seqrepo_db": "2018-08-21",
        "uta_schema": "uta_20180821",
        "variantvalidator_hgvs_version": "1.2.5.vv1",
        "variantvalidator_version": "1.0.4.dev11+g97aec97"
    },
    "mitochondrial_variant_1": {
        "alt_genomic_loci": [],
        "gene_ids": {},
        "gene_symbol": "",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "slr": "",
            "tlr": ""
        },
        "hgvs_refseqgene_variant": "",
        "hgvs_transcript_variant": "",
        "primary_assembly_loci": {},
        "reference_sequence_records": "",
        "refseqgene_context_intronic_sequence": "",
        "selected_assembly": "hg19",
        "submitted_variant": "NC_001807.4:m.2941C>G",
        "transcript_description": "",
        "validation_warnings": [
            "Failed to fetch NC_001807.4 from SeqRepo (/Users/Shared/seqrepo_dumps/2018-08-21) (Alias NC_001807.4 (namespace: None))"
        ]
    }
}

I have asked John to add the reference sequence to SeqRepo.

Peter-J-Freeman commented 4 years ago

Unable to validate the submitted variant NC_012920.1:m.3243A>G against the GRCh38 assembly. The following warnings were returned:

Please check your submission and re-submit.

This is looking like a Web interface issue also.

ifokkema commented 4 years ago

Thank you for your work on this! I assume this also applies to the LOVD endpoint?

Although internally, LOVD uses hg19, we're actually using NC_012920.1 so technically we're using the wrong build name. :fearful: For chrM variants, I will send GRCh37 as build if hg19 is rejected.

Peter-J-Freeman commented 4 years ago

Closes for web https://github.com/openvar/VVweb/commit/99ad694b87c774ed3095d68a5161134a210133ed
Closes for VV https://github.com/openvar/variantValidator/commit/616aa27803955b14e33abfb5abeaa8ee39a5ec86

Peter-J-Freeman commented 4 years ago

Hi @ifokkema . Yep, I have not had to change VF so the LOVD endpoint is fine. Just needs GRCh37 for mito variants as you identified

Peter-J-Freeman commented 4 years ago

Updated vvWeb and dev server with new code. Good to go

ifokkema commented 4 years ago

Excellent, thanks! I've modified my code to now always send the NCBI's build names instead of the UCSC build names.

I noticed VV corrects my m. notation to g. notation, although currently, the HGVS nomenclature still states that m. should be used for mitochondrial DNA. Is that something you want to include as well? If not, I'll compensate by changing the g. back to an m. myself.
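As an illustration of the two client-side adjustments mentioned in this comment, here is a minimal Python sketch. It is not the LOVD code: the names `UCSC_TO_NCBI`, `MITO_ACCESSIONS`, `ncbi_build_name` and `restore_mito_prefix` are hypothetical, and the set of mitochondrial accessions is limited to the two discussed in this thread.

```python
# UCSC build names and the NCBI names sent in their place.
UCSC_TO_NCBI = {"hg19": "GRCh37", "hg38": "GRCh38"}

# Mitochondrial accessions mentioned in this thread (assumed, not exhaustive).
MITO_ACCESSIONS = {"NC_012920.1", "NC_001807.4"}


def ncbi_build_name(build: str) -> str:
    """Translate a UCSC build name to its NCBI name; other names pass through unchanged."""
    return UCSC_TO_NCBI.get(build.lower(), build)


def restore_mito_prefix(hgvs: str) -> str:
    """Change 'g.' back to 'm.' for descriptions on a mitochondrial accession."""
    accession, sep, rest = hgvs.partition(":")
    if sep and accession in MITO_ACCESSIONS and rest.startswith("g."):
        return f"{accession}:m.{rest[2:]}"
    return hgvs
```

For example, `restore_mito_prefix("NC_012920.1:g.3243A>G")` gives `NC_012920.1:m.3243A>G`, while descriptions on other accessions are returned untouched.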

Peter-J-Freeman commented 4 years ago

Hi @ifokkema . I'm going to make sure both VV and VF output m. for the mito genomes rather than g. I will update in a later version and let you know when I intend to roll it out. Just wanted to make sure you are aware

ifokkema commented 4 years ago

Excellent, thank you! Then I will leave my code as-is :wink: I'll add some unit tests later and will keep this in mind.