openvar / variantValidator

Public repository for VariantValidator project
GNU Affero General Public License v3.0
67 stars 21 forks source link

Variant Validator doesn't really support NG variants. #423

Closed ifokkema closed 1 year ago

ifokkema commented 2 years ago

Sorry for causing more headaches...!

Describe the bug VV doesn't really support NG variants, and if it sort of works, it shows some redundant validation warnings when processing NG variants. I actually found out my VV library doesn't process NG variants at all as I don't use them in LOVD and apparently never implemented it. But for the new HGVS validation interface, I need to add support for it, and I ran into some issues.

To Reproduce My use case is checking NG_011986.1:g.6001del. It should be corrected to NG_011986.1:g.6002del. Calling VV, picking a random build and selecting all transcripts: https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg19/NG_011986.1%3Ag.6001del/all?content-type=application%2Fjson

  1. OK, I can deal with VV being focused on transcripts and therefore returning me NM-based variants while I inserted an NG-based one. But I have to really search very hard for my corrected variant. The first transcript mapping, NM_001159508.1:c.153+515del, has absolutely no mention of any NG-based variant. The second mapping, NM_001159508.2:c.144+515del, also has an empty hgvs_refseqgene_variant field. More than a dozen transcript mappings later, I finally find my corrected input. Transcript mapping NM_002225.3:c.153+515del mentions "hgvs_refseqgene_variant": "NG_011986.1:g.6002del". Interestingly enough, transcript mapping NM_002225.5:c.144+515del mentions NG_011986.2:g.6002del, which is a different NG version! None of the other 21 transcript mappings report my corrected NG variant. I assume this is because these transcripts are not in the NG. In that case, I don't want to see them, either. I can't just randomly select a transcript since I just have the NG input. I could have used "select" here as a filter (but not "mane_select"), although I'm afraid that won't always help me. Also, dropping the 22 transcripts that I can't use would speed up the VV process a lot.
  2. The validation warnings often start with Removing redundant reference bases from variant description. But, I didn't provide any redundant reference bases. I believe this is an artifact of an internal VV mapping to the NC. I think dropping the mapping to the NC to fix the first issue above here will likely resolve this issue as well.
  3. For every transcript mapping provided, VV mentions dozens of warnings about other reference sequences that have a new version. E.g., for NM_001159508.1:c.153+515del, I get (shortened JSON):
    {
    "NM_001159508.1:c.153+515del": {
    "hgvs_refseqgene_variant": "",
    "selected_assembly": "hg19",
    "submitted_variant": "NG_011986.1:g.6001del",
    "validation_warnings": [
      "Removing redundant reference bases from variant description",
      "A more recent version of the selected reference sequence NM_001159508.1 is available (NM_001159508.3): NM_001159508.3:c.153+515del MUST be fully validated prior to use in reports: select_variants=NM_001159508.3:c.153+515del",
      "A more recent version of the selected reference sequence NM_001159508.2 is available (NM_001159508.3): NM_001159508.3:c.144+515del MUST be fully validated prior to use in reports: select_variants=NM_001159508.3:c.144+515del",
      "A more recent version of the selected reference sequence NM_001354597.1 is available (NM_001354597.3): NM_001354597.3:c.96+136del MUST be fully validated prior to use in reports: select_variants=NM_001354597.3:c.96+136del",
      "A more recent version of the selected reference sequence NM_001354597.2 is available (NM_001354597.3): NM_001354597.3:c.96+136del MUST be fully validated prior to use in reports: select_variants=NM_001354597.3:c.96+136del",
      "A more recent version of the selected reference sequence NM_001354598.1 is available (NM_001354598.3): NM_001354598.3:c.153+515del MUST be fully validated prior to use in reports: select_variants=NM_001354598.3:c.153+515del",
      "A more recent version of the selected reference sequence NM_001354598.2 is available (NM_001354598.3): NM_001354598.3:c.144+515del MUST be fully validated prior to use in reports: select_variants=NM_001354598.3:c.144+515del",
      "A more recent version of the selected reference sequence NM_001354599.1 is available (NM_001354599.3): NM_001354599.3:c.153+515del MUST be fully validated prior to use in reports: select_variants=NM_001354599.3:c.153+515del",
      "A more recent version of the selected reference sequence NM_001354599.2 is available (NM_001354599.3): NM_001354599.3:c.144+515del MUST be fully validated prior to use in reports: select_variants=NM_001354599.3:c.144+515del",
      "A more recent version of the selected reference sequence NM_001354600.1 is available (NM_001354600.3): NM_001354600.3:c.153+515del MUST be fully validated prior to use in reports: select_variants=NM_001354600.3:c.153+515del",
      "A more recent version of the selected reference sequence NM_001354600.2 is available (NM_001354600.3): NM_001354600.3:c.144+515del MUST be fully validated prior to use in reports: select_variants=NM_001354600.3:c.144+515del",
      "A more recent version of the selected reference sequence NM_001354601.1 is available (NM_001354601.3): NM_001354601.3:c.153+515del MUST be fully validated prior to use in reports: select_variants=NM_001354601.3:c.153+515del",
      "A more recent version of the selected reference sequence NM_001354601.2 is available (NM_001354601.3): NM_001354601.3:c.144+515del MUST be fully validated prior to use in reports: select_variants=NM_001354601.3:c.144+515del",
      "A more recent version of the selected reference sequence NM_002225.3 is available (NM_002225.5): NM_002225.5:c.153+515del MUST be fully validated prior to use in reports: select_variants=NM_002225.5:c.153+515del",
      "A more recent version of the selected reference sequence NM_002225.4 is available (NM_002225.5): NM_002225.5:c.144+515del MUST be fully validated prior to use in reports: select_variants=NM_002225.5:c.144+515del",
      "A more recent version of the selected reference sequence NR_148925.1 is available (NR_148925.2): NR_148925.2:n.554+136del MUST be fully validated prior to use in reports: select_variants=NR_148925.2:n.554+136del"
    ]
    }
  4. Finally, some variants are regarded intergenic. E.g., trying to figure out if NG_011986.1:g.1001del is a correct variant description is not possible because no transcripts map to that position. But, I don't need mapping, just validation.

Expected behavior Either restrict the output to transcripts mapping on the NG, allowing me to find my input easily. Or, allowing NG variants through the VF and LOVD endpoints, where I can turn off mapping and just validate the variant.

Peter-J-Freeman commented 2 years ago

For every transcript mapping provided, VV mentions dozens of warnings about other reference sequences that have a new version. E.g., for NM_001159508.1:c.153+515del, I get (shortened JSON):

Yeah. That's on my radar. Thanks for opening this

OK, I can deal with VV being focused on transcripts

Not really. It is genome/transcript based. RefSeqGenes are kind of a side-show because they are poorly maintained by RefSeq. To be honest, RefSeq don't really appear to believe in them, especialy since LRG died off! If they don't create good data, we are limited, so the following workflow I believe was chose (although I would have to check)

RefSeqGene to a transcript, then to the genome build, then to all overlapping transcripts.

This was requested by clinicians who were fed up with not being able to map RSG variants to the transcripts they actually wanted and (like LRG) were fed up that barely any transcripts are annotated in RefSeq Gene records.

I really don't want to spend too much time on this, (OPTION 1) but I could perhaps add code that takes the argument select_transcripts=refseqgene which would only return the variant in the context of transcripts that actually are annotated to the specified RSG derived description. The output would be like this

https://rest.variantvalidator.org/VariantValidator/variantvalidator/GRCh38/NG_011986.1%3Ag.6001del/NM_002225.3?content-type=application%2Fjson. The advantage here would be that you would get a faster response, but you would not get mappings to any transcripts not annotated in the variant description.

Alternatively, (OPTION 2) I could filter out all results where hgvs_refseqgenevariant is not "", or pre-filter for mappings from / to a NG which would achieve the same but be loads quicker. It would look like this for this particular variant https://rest.variantvalidator.org/VariantValidator/variantvalidator/GRCh38/NG_011986.1%3Ag.6001del/NM_002225.3%7CNM_002225.5?content-type=application%2Fjson

What do you fancy????

VV doesn't really support NG variants, and if it sort of works

Er, we are thinking of dropping support for RefSeqGenes and LRGs, but will keep it since we have a good use case!-

Either restrict the output to transcripts mapping on the NG, allowing me to find my input easily. Or, allowing NG variants through the VF and LOVD endpoints, where I can turn off mapping and just validate the variant.

Technically, this is the case, you should just have to loop through for hgvs_refseqgene_variant is not "", but as I said above, this is not ideal. I have provided a couple of better options.

Can think about adding this to LOVD. Shouldn't take much additional coding once VV has been updated with one of the options.

ifokkema commented 2 years ago

Er, we are thinking of dropping support for RefSeqGenes and LRGs, but will keep it since we have a good use case!-

Hehe, well, there's a reason that I haven't run into this yet :wink: If LRGs and NGs were to be deprecated, then we can stop having to worry about them. But until that point, I think we're sort of forced to accept them :unamused:

What do you fancy???? (...) Can think about adding this to LOVD. Shouldn't take much additional coding once VV has been updated with one of the options.

If this would be an option, that would have my preference! Since the LOVD endpoint can turn off pretty much everything, I can then just have the variant validated and forget about the rest. If I would ever really need mapping to the genome or so, I would be fine using the transcript from the NG for that with a second request.

Otherwise, if the extension of VF/the LOVD endpoint isn't feasible, then I believe the select_transcripts=refseqgene option would be best as it'll result in the fastest execution of VV, focusing only on the transcript(s) in the NG file.

I see, by the way, that the same applies to LRGs.

Peter-J-Freeman commented 2 years ago

If this would be an option, that would have my preference! Since the LOVD endpoint can turn off pretty much everything, I can then just have the variant validated and forget about the rest. If I would ever really need mapping to the genome or so, I would be fine using the transcript from the NG for that with a second request.

Sorry, I was not clear. You need to pick a VV update option, then once I have fixed VV, I can integrate it into VF/LOVD. I think select_transcripts=refseqgene is the way to go, but do we focus on the single RefSeqGene submitted (OPTION1), or do you want liftover onto updated RefSeqGenes etc??? Or onto overlapping RefSeqGenes (OPTION 2)? I don't mind which option. We could perhaps do option 2 for VV and in VF add the additional parameter of liftover=true/false just as we do with genome buids???

I see, by the way, that the same applies to LRGs.

That's true, because we do not hold any LRG records, all LRGs are in facr RefSeqGenes but with 3rd party annotation ;P

ifokkema commented 2 years ago

Sorry, I was not clear. You need to pick a VV update option, then once I have fixed VV, I can integrate it into VF/LOVD. I think select_transcripts=refseqgene is the way to go (...)

Yes, I agree!

but do we focus on the single RefSeqGene submitted (OPTION1), or do you want liftover onto updated RefSeqGenes etc???

No, I really just want to validate the variant. All the mapping and liftover and other fancy things; we use NCs and NMs for. I just need the most rudimentary support for NGs for the HGVS validation interface - is the variant description correct, or not?

We could perhaps do option 2 for VV and in VF add the additional parameter of liftover=true/false just as we do with genome buids???

That'll depend on what other users would need... For me, the fewer transcripts I see, the better :grin:

I see, by the way, that the same applies to LRGs.

That's true, because we do not hold any LRG records, all LRGs are in facr RefSeqGenes but with 3rd party annotation ;P

That explains it :laughing:

Peter-J-Freeman commented 2 years ago

That'll depend on what other users would need... For me, the fewer transcripts I see, the better 😁

Yeah, but its my project and I want to keep it simple so OPTION 1 it is ;P

That explains it 😆

What happened to working on LOVD for 20 years. You didn't know this???? Basically, LRG is, RefSeq make a gene reference sequence, Ensembl fabricate some less that reliable annotation and call it an LRG. I have some alarming examples of this. We all thought MANE would be a terrible thing, but in fact it's been a complete boon. It means we can use RefSeq annotation from source and believe what we see and not have to double check what ensembl currated datasets tell us

Happy with option 1 then mate?

ifokkema commented 2 years ago

That'll depend on what other users would need... For me, the fewer transcripts I see, the better grin

Yeah, but its my project and I want to keep it simple so OPTION 1 it is ;P

Sounds like a win/win!

That explains it laughing

What happened to working on LOVD for 20 years. You didn't know this????

Haha I didn't know that you don't store any LRGs. Yes, I know there first had to be an NG before the LRG was created, but since they have actual records (XML - yuck, but better than plain-text NGs) they can be processed independently. I didn't know you didn't, but just took the NG directly. I'm kinda wondering how you do the transcript mapping then, since, I believe, the transcripts mappings are in the updatable part of the LRG. Anyway, not really relevant but well, you were making fun of me :stuck_out_tongue_closed_eyes:

Happy with option 1 then mate?

Absolutely!

Peter-J-Freeman commented 2 years ago

I'm kinda wondering how you do the transcript mapping then, since, I believe, the transcripts mappings are in the updatable part of the LRG

They are, but it's rare and RefSeqGene will add the mappings if Ensembl do for LRG RefSeqGenes at least. So, all our mappings come from source at RefSeq. Find this more trust worthy than the data from the LRG project.

Great, I'll get this done ASAP. I have to do other stuff like teaching as well at the moment so time is limited, but I will get on with this and https://github.com/openvar/variantValidator/issues/419 ASAP and ask you to test it once done.

I also need to get you on the the new system. We are applying a log-in to the API because it will have commercial endpoints embedded now. LOVD qualifies for free access for multiple reasons. Will catch you offline to set up a meeting and discuss and hook you up on the new server

Peter-J-Freeman commented 1 year ago

OK, option 1 as described above is complete

Now the RefSeqGene only gives 1 transcript

>>> import json
>>> import VariantValidator
>>> vval = VariantValidator.Validator()
>>> variant = 'NG_011986.1:g.6001del' # variant 1
>>> genome_build = 'GRCh38'
>>> select_transcripts = 'refseqgene'. # Flag set here
>>> transcript_set = 'refseq'
>>> validate = vval.validate(variant, genome_build, select_transcripts, transcript_set)
>>> validation = validate.format_as_dict(with_meta=True)
>>> print(json.dumps(validation, sort_keys=True, indent=4, separators=(',', ': ')))
{
    "NM_002225.3:c.153+515del": {
        "alt_genomic_loci": [],
        "annotations": {
            "chromosome": "15",
            "db_xref": {
                "CCDS": "CCDS10057.2",
                "ensemblgene": null,
                "hgnc": "HGNC:6186",
                "ncbigene": "3712",
                "select": false
            },
            "ensembl_select": false,
            "mane_plus_clinical": false,
            "mane_select": false,
            "map": "15q15.1",
            "note": "isovaleryl-CoA dehydrogenase",
            "refseq_select": false,
            "variant": "1"
        },
        "gene_ids": {
            "ccds_ids": [
                "CCDS10057",
                "CCDS53930"
            ],
            "ensembl_gene_id": "ENSG00000128928",
            "entrez_gene_id": "3712",
            "hgnc_id": "HGNC:6186",
            "omim_id": [
                "607036"
            ],
            "ucsc_id": "uc001zls.4"
        },
        "gene_symbol": "IVD",
        "genome_context_intronic_sequence": "NC_000015.10(NM_002225.3):c.153+515del",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "",
            "lrg_tlr": "",
            "slr": "NP_002216.2:p.?",
            "tlr": "NP_002216.2:p.?"
        },
        "hgvs_refseqgene_variant": "NG_011986.1:g.6002del",
        "hgvs_transcript_variant": "NM_002225.3:c.153+515del",
        "primary_assembly_loci": {
            "grch37": {
                "hgvs_genomic_description": "NC_000015.9:g.40698687del",
                "vcf": {
                    "alt": "G",
                    "chr": "15",
                    "pos": "40698685",
                    "ref": "GA"
                }
            },
            "grch38": {
                "hgvs_genomic_description": "NC_000015.10:g.40406486del",
                "vcf": {
                    "alt": "G",
                    "chr": "15",
                    "pos": "40406484",
                    "ref": "GA"
                }
            },
            "hg19": {
                "hgvs_genomic_description": "NC_000015.9:g.40698687del",
                "vcf": {
                    "alt": "G",
                    "chr": "chr15",
                    "pos": "40698685",
                    "ref": "GA"
                }
            },
            "hg38": {
                "hgvs_genomic_description": "NC_000015.10:g.40406486del",
                "vcf": {
                    "alt": "G",
                    "chr": "chr15",
                    "pos": "40406484",
                    "ref": "GA"
                }
            }
        },
        "reference_sequence_records": {
            "protein": "https://www.ncbi.nlm.nih.gov/nuccore/NP_002216.2",
            "refseqgene": "https://www.ncbi.nlm.nih.gov/nuccore/NG_011986.1",
            "transcript": "https://www.ncbi.nlm.nih.gov/nuccore/NM_002225.3"
        },
        "refseqgene_context_intronic_sequence": "NG_011986.1(NM_002225.3):c.153+515del",
        "rna_variant_descriptions": null,
        "selected_assembly": "GRCh38",
        "submitted_variant": "NG_011986.1:g.6001del",
        "transcript_description": "Homo sapiens isovaleryl-CoA dehydrogenase (IVD), transcript variant 1, mRNA",
        "validation_warnings": [
            "A more recent version of the selected reference sequence NM_002225.3 is available (NM_002225.5): NM_002225.5:c.153+515del MUST be fully validated prior to use in reports: select_variants=NM_002225.5:c.153+515del"
        ],
        "variant_exonic_positions": {
            "NC_000015.10": {
                "end_exon": "1i",
                "start_exon": "1i"
            },
            "NC_000015.9": {
                "end_exon": "1i",
                "start_exon": "1i"
            },
            "NG_011986.1": {
                "end_exon": "1i",
                "start_exon": "1i"
            }
        }
    },
    "flag": "gene_variant",
    "metadata": {
        "variantvalidator_hgvs_version": "2.0.1.dev3+g6ecbf8e.d20220915",
        "variantvalidator_version": "1.0.5.dev295+g795e091.d20221025",
        "vvdb_version": "vvdb_2022_04",
        "vvseqrepo_db": "VV_SR_2022_02/master",
        "vvta_version": "vvta_2022_02"
    }
}
ifokkema commented 1 year ago

Excellent, thank you! Let me know when I can test it!

Does this solve also the last point, that I can't validate NG_011986.1:g.1001del, because it's considered intergenic?

Peter-J-Freeman commented 1 year ago

Yeah, I have done some debugging. Do you want a new release on the API now? I have time.

ifokkema commented 1 year ago

No rush at all, I've got plenty of other stuff to do. I just must keep it on my list until I have validated the fix, since then I need to revert the changes I made to the API, marking "NG" variants as "not supported by VariantValidator". So I'll just keep this open until the API has been updated.

Peter-J-Freeman commented 1 year ago

I'm updating now. Running tests. Will do both APIs old and new. The VVWeb will not be updated just yet though

Peter-J-Freeman commented 1 year ago

Should be running now

ifokkema commented 1 year ago

Hi Pete, it doesn't seem that https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg19/NG_011986.1%3Ag.6001del/refseqgene?content-type=application%2Fjson selects the refseqgene transcripts only. In fact, it doesn't matter whether I send "all" or "refseqgene", it returns the same output.

Peter-J-Freeman commented 1 year ago

Possible/probable the server is not updated yet. I'm concentrating on getting the new build created, then will update all sservers. Bug me again soon so I remember to get you to test it.

Peter-J-Freeman commented 1 year ago

I thought I had updated though. Possible I have not installed the correct VV build yet.

Peter-J-Freeman commented 1 year ago

I will let you know when all servers are updated