Closed ifokkema closed 1 year ago
For every transcript mapping provided, VV mentions dozens of warnings about other reference sequences that have a new version. E.g., for NM_001159508.1:c.153+515del, I get (shortened JSON):
Yeah. That's on my radar. Thanks for opening this
OK, I can deal with VV being focused on transcripts
Not really. It is genome/transcript based. RefSeqGenes are kind of a side-show because they are poorly maintained by RefSeq. To be honest, RefSeq don't really appear to believe in them, especialy since LRG died off! If they don't create good data, we are limited, so the following workflow I believe was chose (although I would have to check)
RefSeqGene to a transcript, then to the genome build, then to all overlapping transcripts.
This was requested by clinicians who were fed up with not being able to map RSG variants to the transcripts they actually wanted and (like LRG) were fed up that barely any transcripts are annotated in RefSeq Gene records.
I really don't want to spend too much time on this, (OPTION 1) but I could perhaps add code that takes the argument select_transcripts=refseqgene which would only return the variant in the context of transcripts that actually are annotated to the specified RSG derived description. The output would be like this
https://rest.variantvalidator.org/VariantValidator/variantvalidator/GRCh38/NG_011986.1%3Ag.6001del/NM_002225.3?content-type=application%2Fjson. The advantage here would be that you would get a faster response, but you would not get mappings to any transcripts not annotated in the variant description.
Alternatively, (OPTION 2) I could filter out all results where hgvs_refseqgenevariant is not "", or pre-filter for mappings from / to a NG which would achieve the same but be loads quicker. It would look like this for this particular variant https://rest.variantvalidator.org/VariantValidator/variantvalidator/GRCh38/NG_011986.1%3Ag.6001del/NM_002225.3%7CNM_002225.5?content-type=application%2Fjson
What do you fancy????
VV doesn't really support NG variants, and if it sort of works
Er, we are thinking of dropping support for RefSeqGenes and LRGs, but will keep it since we have a good use case!-
Either restrict the output to transcripts mapping on the NG, allowing me to find my input easily. Or, allowing NG variants through the VF and LOVD endpoints, where I can turn off mapping and just validate the variant.
Technically, this is the case, you should just have to loop through for hgvs_refseqgene_variant is not "", but as I said above, this is not ideal. I have provided a couple of better options.
Can think about adding this to LOVD. Shouldn't take much additional coding once VV has been updated with one of the options.
Er, we are thinking of dropping support for RefSeqGenes and LRGs, but will keep it since we have a good use case!-
Hehe, well, there's a reason that I haven't run into this yet :wink: If LRGs and NGs were to be deprecated, then we can stop having to worry about them. But until that point, I think we're sort of forced to accept them :unamused:
What do you fancy???? (...) Can think about adding this to LOVD. Shouldn't take much additional coding once VV has been updated with one of the options.
If this would be an option, that would have my preference! Since the LOVD endpoint can turn off pretty much everything, I can then just have the variant validated and forget about the rest. If I would ever really need mapping to the genome or so, I would be fine using the transcript from the NG for that with a second request.
Otherwise, if the extension of VF/the LOVD endpoint isn't feasible, then I believe the select_transcripts=refseqgene
option would be best as it'll result in the fastest execution of VV, focusing only on the transcript(s) in the NG file.
I see, by the way, that the same applies to LRGs.
If this would be an option, that would have my preference! Since the LOVD endpoint can turn off pretty much everything, I can then just have the variant validated and forget about the rest. If I would ever really need mapping to the genome or so, I would be fine using the transcript from the NG for that with a second request.
Sorry, I was not clear. You need to pick a VV update option, then once I have fixed VV, I can integrate it into VF/LOVD. I think select_transcripts=refseqgene
is the way to go, but do we focus on the single RefSeqGene submitted (OPTION1), or do you want liftover onto updated RefSeqGenes etc??? Or onto overlapping RefSeqGenes (OPTION 2)? I don't mind which option. We could perhaps do option 2 for VV and in VF add the additional parameter of liftover=true/false just as we do with genome buids???
I see, by the way, that the same applies to LRGs.
That's true, because we do not hold any LRG records, all LRGs are in facr RefSeqGenes but with 3rd party annotation ;P
Sorry, I was not clear. You need to pick a VV update option, then once I have fixed VV, I can integrate it into VF/LOVD. I think
select_transcripts=refseqgene
is the way to go (...)
Yes, I agree!
but do we focus on the single RefSeqGene submitted (OPTION1), or do you want liftover onto updated RefSeqGenes etc???
No, I really just want to validate the variant. All the mapping and liftover and other fancy things; we use NCs and NMs for. I just need the most rudimentary support for NGs for the HGVS validation interface - is the variant description correct, or not?
We could perhaps do option 2 for VV and in VF add the additional parameter of liftover=true/false just as we do with genome buids???
That'll depend on what other users would need... For me, the fewer transcripts I see, the better :grin:
I see, by the way, that the same applies to LRGs.
That's true, because we do not hold any LRG records, all LRGs are in facr RefSeqGenes but with 3rd party annotation ;P
That explains it :laughing:
That'll depend on what other users would need... For me, the fewer transcripts I see, the better 😁
Yeah, but its my project and I want to keep it simple so OPTION 1 it is ;P
That explains it 😆
What happened to working on LOVD for 20 years. You didn't know this???? Basically, LRG is, RefSeq make a gene reference sequence, Ensembl fabricate some less that reliable annotation and call it an LRG. I have some alarming examples of this. We all thought MANE would be a terrible thing, but in fact it's been a complete boon. It means we can use RefSeq annotation from source and believe what we see and not have to double check what ensembl currated datasets tell us
Happy with option 1 then mate?
That'll depend on what other users would need... For me, the fewer transcripts I see, the better grin
Yeah, but its my project and I want to keep it simple so OPTION 1 it is ;P
Sounds like a win/win!
That explains it laughing
What happened to working on LOVD for 20 years. You didn't know this????
Haha I didn't know that you don't store any LRGs. Yes, I know there first had to be an NG before the LRG was created, but since they have actual records (XML - yuck, but better than plain-text NGs) they can be processed independently. I didn't know you didn't, but just took the NG directly. I'm kinda wondering how you do the transcript mapping then, since, I believe, the transcripts mappings are in the updatable part of the LRG. Anyway, not really relevant but well, you were making fun of me :stuck_out_tongue_closed_eyes:
Happy with option 1 then mate?
Absolutely!
I'm kinda wondering how you do the transcript mapping then, since, I believe, the transcripts mappings are in the updatable part of the LRG
They are, but it's rare and RefSeqGene will add the mappings if Ensembl do for LRG RefSeqGenes at least. So, all our mappings come from source at RefSeq. Find this more trust worthy than the data from the LRG project.
Great, I'll get this done ASAP. I have to do other stuff like teaching as well at the moment so time is limited, but I will get on with this and https://github.com/openvar/variantValidator/issues/419 ASAP and ask you to test it once done.
I also need to get you on the the new system. We are applying a log-in to the API because it will have commercial endpoints embedded now. LOVD qualifies for free access for multiple reasons. Will catch you offline to set up a meeting and discuss and hook you up on the new server
OK, option 1 as described above is complete
Now the RefSeqGene only gives 1 transcript
>>> import json
>>> import VariantValidator
>>> vval = VariantValidator.Validator()
>>> variant = 'NG_011986.1:g.6001del' # variant 1
>>> genome_build = 'GRCh38'
>>> select_transcripts = 'refseqgene'. # Flag set here
>>> transcript_set = 'refseq'
>>> validate = vval.validate(variant, genome_build, select_transcripts, transcript_set)
>>> validation = validate.format_as_dict(with_meta=True)
>>> print(json.dumps(validation, sort_keys=True, indent=4, separators=(',', ': ')))
{
"NM_002225.3:c.153+515del": {
"alt_genomic_loci": [],
"annotations": {
"chromosome": "15",
"db_xref": {
"CCDS": "CCDS10057.2",
"ensemblgene": null,
"hgnc": "HGNC:6186",
"ncbigene": "3712",
"select": false
},
"ensembl_select": false,
"mane_plus_clinical": false,
"mane_select": false,
"map": "15q15.1",
"note": "isovaleryl-CoA dehydrogenase",
"refseq_select": false,
"variant": "1"
},
"gene_ids": {
"ccds_ids": [
"CCDS10057",
"CCDS53930"
],
"ensembl_gene_id": "ENSG00000128928",
"entrez_gene_id": "3712",
"hgnc_id": "HGNC:6186",
"omim_id": [
"607036"
],
"ucsc_id": "uc001zls.4"
},
"gene_symbol": "IVD",
"genome_context_intronic_sequence": "NC_000015.10(NM_002225.3):c.153+515del",
"hgvs_lrg_transcript_variant": "",
"hgvs_lrg_variant": "",
"hgvs_predicted_protein_consequence": {
"lrg_slr": "",
"lrg_tlr": "",
"slr": "NP_002216.2:p.?",
"tlr": "NP_002216.2:p.?"
},
"hgvs_refseqgene_variant": "NG_011986.1:g.6002del",
"hgvs_transcript_variant": "NM_002225.3:c.153+515del",
"primary_assembly_loci": {
"grch37": {
"hgvs_genomic_description": "NC_000015.9:g.40698687del",
"vcf": {
"alt": "G",
"chr": "15",
"pos": "40698685",
"ref": "GA"
}
},
"grch38": {
"hgvs_genomic_description": "NC_000015.10:g.40406486del",
"vcf": {
"alt": "G",
"chr": "15",
"pos": "40406484",
"ref": "GA"
}
},
"hg19": {
"hgvs_genomic_description": "NC_000015.9:g.40698687del",
"vcf": {
"alt": "G",
"chr": "chr15",
"pos": "40698685",
"ref": "GA"
}
},
"hg38": {
"hgvs_genomic_description": "NC_000015.10:g.40406486del",
"vcf": {
"alt": "G",
"chr": "chr15",
"pos": "40406484",
"ref": "GA"
}
}
},
"reference_sequence_records": {
"protein": "https://www.ncbi.nlm.nih.gov/nuccore/NP_002216.2",
"refseqgene": "https://www.ncbi.nlm.nih.gov/nuccore/NG_011986.1",
"transcript": "https://www.ncbi.nlm.nih.gov/nuccore/NM_002225.3"
},
"refseqgene_context_intronic_sequence": "NG_011986.1(NM_002225.3):c.153+515del",
"rna_variant_descriptions": null,
"selected_assembly": "GRCh38",
"submitted_variant": "NG_011986.1:g.6001del",
"transcript_description": "Homo sapiens isovaleryl-CoA dehydrogenase (IVD), transcript variant 1, mRNA",
"validation_warnings": [
"A more recent version of the selected reference sequence NM_002225.3 is available (NM_002225.5): NM_002225.5:c.153+515del MUST be fully validated prior to use in reports: select_variants=NM_002225.5:c.153+515del"
],
"variant_exonic_positions": {
"NC_000015.10": {
"end_exon": "1i",
"start_exon": "1i"
},
"NC_000015.9": {
"end_exon": "1i",
"start_exon": "1i"
},
"NG_011986.1": {
"end_exon": "1i",
"start_exon": "1i"
}
}
},
"flag": "gene_variant",
"metadata": {
"variantvalidator_hgvs_version": "2.0.1.dev3+g6ecbf8e.d20220915",
"variantvalidator_version": "1.0.5.dev295+g795e091.d20221025",
"vvdb_version": "vvdb_2022_04",
"vvseqrepo_db": "VV_SR_2022_02/master",
"vvta_version": "vvta_2022_02"
}
}
Excellent, thank you! Let me know when I can test it!
Does this solve also the last point, that I can't validate NG_011986.1:g.1001del
, because it's considered intergenic?
Yeah, I have done some debugging. Do you want a new release on the API now? I have time.
No rush at all, I've got plenty of other stuff to do. I just must keep it on my list until I have validated the fix, since then I need to revert the changes I made to the API, marking "NG" variants as "not supported by VariantValidator". So I'll just keep this open until the API has been updated.
I'm updating now. Running tests. Will do both APIs old and new. The VVWeb will not be updated just yet though
Should be running now
Hi Pete, it doesn't seem that https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg19/NG_011986.1%3Ag.6001del/refseqgene?content-type=application%2Fjson selects the refseqgene transcripts only. In fact, it doesn't matter whether I send "all" or "refseqgene", it returns the same output.
Possible/probable the server is not updated yet. I'm concentrating on getting the new build created, then will update all sservers. Bug me again soon so I remember to get you to test it.
I thought I had updated though. Possible I have not installed the correct VV build yet.
I will let you know when all servers are updated
Sorry for causing more headaches...!
Describe the bug VV doesn't really support NG variants, and if it sort of works, it shows some redundant validation warnings when processing NG variants. I actually found out my VV library doesn't process NG variants at all as I don't use them in LOVD and apparently never implemented it. But for the new HGVS validation interface, I need to add support for it, and I ran into some issues.
To Reproduce My use case is checking
NG_011986.1:g.6001del
. It should be corrected toNG_011986.1:g.6002del
. Calling VV, picking a random build and selecting all transcripts: https://rest.variantvalidator.org/VariantValidator/variantvalidator/hg19/NG_011986.1%3Ag.6001del/all?content-type=application%2FjsonNM_001159508.1:c.153+515del
, has absolutely no mention of any NG-based variant. The second mapping,NM_001159508.2:c.144+515del
, also has an emptyhgvs_refseqgene_variant
field. More than a dozen transcript mappings later, I finally find my corrected input. Transcript mappingNM_002225.3:c.153+515del
mentions"hgvs_refseqgene_variant": "NG_011986.1:g.6002del"
. Interestingly enough, transcript mappingNM_002225.5:c.144+515del
mentionsNG_011986.2:g.6002del
, which is a different NG version! None of the other 21 transcript mappings report my corrected NG variant. I assume this is because these transcripts are not in the NG. In that case, I don't want to see them, either. I can't just randomly select a transcript since I just have the NG input. I could have used "select" here as a filter (but not "mane_select"), although I'm afraid that won't always help me. Also, dropping the 22 transcripts that I can't use would speed up the VV process a lot.Removing redundant reference bases from variant description
. But, I didn't provide any redundant reference bases. I believe this is an artifact of an internal VV mapping to the NC. I think dropping the mapping to the NC to fix the first issue above here will likely resolve this issue as well.NM_001159508.1:c.153+515del
, I get (shortened JSON):NG_011986.1:g.1001del
is a correct variant description is not possible because no transcripts map to that position. But, I don't need mapping, just validation.Expected behavior Either restrict the output to transcripts mapping on the NG, allowing me to find my input easily. Or, allowing NG variants through the VF and LOVD endpoints, where I can turn off mapping and just validate the variant.