Open sbenny1230 opened 2 years ago
Irrespective of the the transcript set that you are selecting, 11-5248232-T-A is intergenic in the context of GRCh38 but lies within the HBB gene in GRCh37. Variant 11-5248232-T-A in GRCh38 is equivalent to 11-5269462-T-A in GRCh37. That much is certain.
I notice that you are trying to validate ENST00000330597.3:c.*127A>T which is presumably valid in the context of reference sequence ENST00000330597.3 and which lies in the 3´ UTR of the transcript. It's worth noting that ENST00000330597.3 is NOT the MANE Select transcript for HBB.
What happens when you validate ENST00000330597.5:c.*127A>T instead?
I got confused by my gene symbols. ENST00000330597.3 and ENST00000330597.are transcripts for HBG1, not HBB.
The reason you are getting gene information for the Ensembl method and not the RefSeq is that you hit a transcript in the ensembl but do not in the refseq. That is certainly as expected.
RefSeq has a transcript for HBG1, https://www.ncbi.nlm.nih.gov/nuccore/NM_000559.3
What is odd is that the GRCh37 Variant NC_000011.9:g.5269462T>A does indeed hit a UTR that will only be found in the Ensembl transcript
The Ensembl transcript was shortened from v.3 to v.5 to become MANE Select
So, what is actually weird is that transcript ENST00000330597.3 is being picked up when you provide a GRCh38 coordinate.
variant = '11-5248232-T-A'
genome_build = 'GRCh38'
Perhaps this one needs to be discussed. We need to determine where the transcript is being identified and why. My guess is that ENST00000330597.3 was aligned to GRCh38 pre MANE project but Ensembl's lack of identifiable revision history makes this difficult to pin down. Certainly it is in VVTA and it would certainly be correctly picked up by variant = '11-5248232-T-A' GRCh38. @John-F-Wagstaff , is there a quick way to see if VVTA contains mappings for ENST00000330597.3 with GRCh38. Having looked at Ensembl GRCh37, I suspect it should not have been, but who knows???
@sbenny1230 , do you want me to pull this and look where the transcripts are being picked up???
@sbenny1230 Are you using
variant = '11-5248232-T-A'
genome_build = 'GRCh38'
and
variant = '11-5248232-T-A'
genome_build = 'GRCh37'
If so, this is not correct. As @leicray states correctly 11-5248232-T-A in GRCh38 is equivalent to 11-5269462-T-A in GRCh37.
I would expect
import json
import VariantValidator
vval = VariantValidator.Validator()
variant = '11-5248232-T-A'
genome_build = 'GRCh38'
select_transcripts = 'all'
transcript_set = 'ensembl'
validate = vval.validate(variant, genome_build, select_transcripts, transcript_set)
validation = validate.format_as_dict(with_meta=True)
print(json.dumps(validation, sort_keys=True, indent=4, separators=(',', ': ')))
To miss all transcripts
But
import json
import VariantValidator
vval = VariantValidator.Validator()
variant = '11-5269462-T-A' # Note the changed coordinate because the position is different in GRCh37
genome_build = 'GRCh37'
select_transcripts = 'all'
transcript_set = 'ensembl'
validate = vval.validate(variant, genome_build, select_transcripts, transcript_set)
validation = validate.format_as_dict(with_meta=True)
print(json.dumps(validation, sort_keys=True, indent=4, separators=(',', ': ')))
To hit ENST00000330597.3
import json
import VariantValidator
vval = VariantValidator.Validator()
variant = '11-5248232-T-A'
genome_build = 'GRCh37'
select_transcripts = 'all'
transcript_set = 'ensembl'
validate = vval.validate(variant, genome_build, select_transcripts, transcript_set)
validation = validate.format_as_dict(with_meta=True)
print(json.dumps(validation, sort_keys=True, indent=4, separators=(',', ': ')))
May well hit HBB because it is a different variant to 11-5248232-T-A GRCh38
Look here
"primary_assembly_loci": {
"grch37": {
"hgvs_genomic_description": "NC_000011.9:g.5269462T>A", # position
"vcf": {
"alt": "A",
"chr": "11",
"pos": "5269462",
"ref": "T"
}
},
"grch38": {
"hgvs_genomic_description": "NC_000011.10:g.5248232T>A", # position
"vcf": {
"alt": "A",
"chr": "11",
"pos": "5248232",
"ref": "T"
}
},
Here is the evidence,
variant = '11-5269462-T-A' # Note the changed coordinate because the position is different in GRCh37 genome_build = 'GRCh37'
maps to HBB not HBG1
@Peter-J-Freeman If this is needed for bug-checking later then the quickest way to know which accessions a transcript is mapped to in the raw database, if you have a local copy, is to login via PostgreSQL, set the search path, and do one of two queries using the transcript accession. The first is to search the current valid mapped transcript spans view, to see whether the transcript exists and is visible to a by position search (i.e has not been masked as "removed") the second is to search the underlying mapped_exon_set table to see weather the mapping exists in the raw data e.g.
select * from current_valid_mapped_transcript_spans_mv where tx_ac = 'ENST00000330597.3';
and/or
select * from mapped_exon_set where tx_ac = 'ENST00000330597.3';
For the latter you can then use the mapped exon set IDs in the return to run second query if you need to check the mapping locations e.g.
select * from mapped_exon where exon_set_id = 400504
This does not test everything, and is not quite as convenient as going from the genome id, but it should provide a quick confirmation, and the transcript archive database has no concept of genome anyway.
So validating ENST00000330597.5:c.*127A>T
gives me the following output:
{
"flag": "warning",
"metadata": {
"variantvalidator_hgvs_version": "2.0.2.dev1+g6ecbf8e",
"variantvalidator_version": "1.0.5.dev324+g714b8d1.d20220718",
"vvdb_version": "vvdb_2022_04",
"vvseqrepo_db": "VV_SR_2022_02/master",
"vvta_version": "vvta_2022_02"
},
"validation_warning_1": {
"alt_genomic_loci": [],
"annotations": {},
"gene_ids": {},
"gene_symbol": "",
"genome_context_intronic_sequence": "",
"hgvs_lrg_transcript_variant": "",
"hgvs_lrg_variant": "",
"hgvs_predicted_protein_consequence": {
"lrg_slr": "",
"lrg_tlr": "",
"slr": "",
"tlr": ""
},
"hgvs_refseqgene_variant": "",
"hgvs_transcript_variant": "",
"primary_assembly_loci": {},
"reference_sequence_records": "",
"refseqgene_context_intronic_sequence": "",
"selected_assembly": "GRCh38",
"submitted_variant": "ENST00000330597.5:c.*127A>T",
"transcript_description": "",
"validation_warnings": [
"Using a transcript reference sequence to specify a variant position that lies outside of the reference sequence is not HGVS-compliant: Instead re-submit NC_000011.10:g.5248232T>A"
],
"variant_exonic_positions": null
}
}
And when using the suggested variant description NC_000011.10:g.5248232T>A
I get:
{
"ENST00000330597.3:c.*127A>T": {
"alt_genomic_loci": [],
"annotations": {
"chromosome": "11",
"db_xref": {
"CCDS": null,
"ensemblgene": "ENSG00000213934",
"hgnc": "4831",
"ncbigene": null,
"select": "Ensembl"
},
"ensembl_select": true,
"mane_plus_clinical": false,
"mane_select": false,
"map": "11p15.4",
"note": "hemoglobin subunit gamma 1",
"refseq_select": false,
"variant": "001"
},
"gene_ids": {
"ccds_ids": [
"CCDS7754"
],
"ensembl_gene_id": "ENSG00000213934",
"entrez_gene_id": "3047",
"hgnc_id": "HGNC:4831",
"omim_id": [
"142200"
],
"ucsc_id": "uc001mah.2"
},
"gene_symbol": "HBG1",
"genome_context_intronic_sequence": "",
"hgvs_lrg_transcript_variant": "",
"hgvs_lrg_variant": "",
"hgvs_predicted_protein_consequence": {
"lrg_slr": "",
"lrg_tlr": "",
"slr": "ENSP00000327431.3:p.?",
"tlr": "ENSP00000327431.3:p.?"
},
"hgvs_refseqgene_variant": "",
"hgvs_transcript_variant": "ENST00000330597.3:c.*127A>T",
"primary_assembly_loci": {
"grch37": {
"hgvs_genomic_description": "NC_000011.9:g.5269462T>A",
"vcf": {
"alt": "A",
"chr": "11",
"pos": "5269462",
"ref": "T"
}
},
"grch38": {
"hgvs_genomic_description": "NC_000011.10:g.5248232T>A",
"vcf": {
"alt": "A",
"chr": "11",
"pos": "5248232",
"ref": "T"
}
},
"hg19": {
"hgvs_genomic_description": "NC_000011.9:g.5269462T>A",
"vcf": {
"alt": "A",
"chr": "chr11",
"pos": "5269462",
"ref": "T"
}
},
"hg38": {
"hgvs_genomic_description": "NC_000011.10:g.5248232T>A",
"vcf": {
"alt": "A",
"chr": "chr11",
"pos": "5248232",
"ref": "T"
}
}
},
"reference_sequence_records": {
"protein": "https://www.ensembl.org/Homo_sapiens/Transcript/ProteinSummary?db=core;p=ENSP00000327431.3",
"transcript": "https://www.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;t=ENST00000330597.3"
},
"refseqgene_context_intronic_sequence": "",
"selected_assembly": "GRCh38",
"submitted_variant": "NC_000011.10:g.5248232T>A",
"transcript_description": "HBG1-001",
"validation_warnings": [
"A more recent version of the selected reference sequence ENST00000330597.3 is available (ENST00000330597.5): ENST00000330597.5:c.*127A>T MUST be fully validated prior to use in reports: select_variants=ENST00000330597.5:c.*127A>T, genome_build=GRCh38"
],
"variant_exonic_positions": null
},
"flag": "gene_variant",
"metadata": {
"variantvalidator_hgvs_version": "2.0.2.dev1+g6ecbf8e",
"variantvalidator_version": "1.0.5.dev324+g714b8d1.d20220718",
"vvdb_version": "vvdb_2022_04",
"vvseqrepo_db": "VV_SR_2022_02/master",
"vvta_version": "vvta_2022_02"
}
}
This is with the genome build as GRCh38 and the ensembl method.
@sbenny1230 Are you using
variant = '11-5248232-T-A' genome_build = 'GRCh38'
and
variant = '11-5248232-T-A' genome_build = 'GRCh37'
If so, this is not correct. As @leicray states correctly 11-5248232-T-A in GRCh38 is equivalent to 11-5269462-T-A in GRCh37.
I would expect
import json import VariantValidator vval = VariantValidator.Validator() variant = '11-5248232-T-A' genome_build = 'GRCh38' select_transcripts = 'all' transcript_set = 'ensembl' validate = vval.validate(variant, genome_build, select_transcripts, transcript_set) validation = validate.format_as_dict(with_meta=True) print(json.dumps(validation, sort_keys=True, indent=4, separators=(',', ': ')))
To miss all transcripts
But
import json import VariantValidator vval = VariantValidator.Validator() variant = '11-5269462-T-A' # Note the changed coordinate because the position is different in GRCh37 genome_build = 'GRCh37' select_transcripts = 'all' transcript_set = 'ensembl' validate = vval.validate(variant, genome_build, select_transcripts, transcript_set) validation = validate.format_as_dict(with_meta=True) print(json.dumps(validation, sort_keys=True, indent=4, separators=(',', ': ')))
To hit ENST00000330597.3
import json import VariantValidator vval = VariantValidator.Validator() variant = '11-5248232-T-A' genome_build = 'GRCh37' select_transcripts = 'all' transcript_set = 'ensembl' validate = vval.validate(variant, genome_build, select_transcripts, transcript_set) validation = validate.format_as_dict(with_meta=True) print(json.dumps(validation, sort_keys=True, indent=4, separators=(',', ': ')))
May well hit HBB because it is a different variant to 11-5248232-T-A GRCh38
Look here
"primary_assembly_loci": { "grch37": { "hgvs_genomic_description": "NC_000011.9:g.5269462T>A", # position "vcf": { "alt": "A", "chr": "11", "pos": "5269462", "ref": "T" } }, "grch38": { "hgvs_genomic_description": "NC_000011.10:g.5248232T>A", # position "vcf": { "alt": "A", "chr": "11", "pos": "5248232", "ref": "T" } },
@Peter-J-Freeman
So when using
variant = '11-5248232-T-A'
genome_build = 'GRCh38'
I get ENST00000330597.3 and when it's GRCh37, I get two different transcripts ENST00000485743.1 and ENST00000335295.4 which from what you've said is not the expected output for both cases.
Yeah, so that's a bit weird. I will take a look at where this is being picked up then. Which branch please @sbenny1230
With a query of variant = '11-5248232-T-A' and genome_build = 'GRCh38' you would expect to see ENST00000330597.3 as in NC_000011.10 ENST00000330597.3 has a span of 5248082 to 5249892
With a query of variant = '11-5248232-T-A' and genome_build = 'GRCh37' you would expect to see ENST00000335295.4 as in NC_000011.9 ENST00000335295.4 has a span of 5246693 to 5248301 and ENST00000485743.1 as in NC_000011.9 ENST00000485743.1 has a span of 5247492 to 5248302.
@sbenny1230 Are you sure you are switching the variants location to match the genome as well as the genome? it should be either
variant = '11-5248232-T-A'
genome_build = 'GRCh38'
or
variant = '11-5269462-T-A'
genome_build = 'GRCh37'
not any other combination. As @leicray said earlier the sequence of the reference has been updated between versions, and this moves the coordinates of nearly everything.
@Peter-J-Freeman if we can confirm that this is happening with the correct locations to match the genome build then it looks like something is causing GRCh37 coordinates to be lifted over to GRCh38, but then passed down to the lower levels of the code still tagged as GRCh37 locations? or something like that
@John-F-Wagstaff Hi, I got both of the expected outputs you mentioned. I hadn't changed the variant input to take into account the change in the location for GRCh37.
@Peter-J-Freeman Latest changes are on the branch vv_ensembl_dev_susmi
@John-F-Wagstaff and @sbenny1230
So, I'm looking at the code and I can confirm I am running
import json
import VariantValidator
vval = VariantValidator.Validator()
variant = '11-5248232-T-A'
genome_build = 'GRCh38'
select_transcripts = 'all'
transcript_set = 'ensembl'
validate = vval.validate(variant, genome_build, select_transcripts, transcript_set)
validation = validate.format_as_dict(with_meta=True)
print(json.dumps(validation, sort_keys=True, indent=4, separators=(',', ': ')))
The genomic variant is converted to NC_000011.10:g.5248232T>A which looks correct
When we search VVTA using
hdp.get_tx_for_region(hgvs_genomic.ac, alt_aln_method,
hgvs_genomic.posedit.pos.start.base-1,
hgvs_genomic.posedit.pos.end.base-1)
which is here
the response is
[['ENST00000330597.3', 'NC_000011.10', -1, 'genebuild', 5248082, 5249892]]
This looks like VVTA has mapping data for 'ENST00000330597.3', 'NC_000011.10'
If I run
import json
import VariantValidator
vval = VariantValidator.Validator()
variant = '11-5269462-T-A' # Note the changed coordinate because the position is different in GRCh37
genome_build = 'GRCh37'
select_transcripts = 'all'
transcript_set = 'ensembl'
validate = vval.validate(variant, genome_build, select_transcripts, transcript_set)
validation = validate.format_as_dict(with_meta=True)
print(json.dumps(validation, sort_keys=True, indent=4, separators=(',', ': ')))
I get
NC_000011.9:g.5269462T>A
[['ENST00000330597.3', 'NC_000011.9', -1, 'genebuild', 5269312, 5271122]]
Which shows GRCh37 mapping data for ENST00000330597.3
https://www.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;g=ENSG00000213934;r=11:5248269-5249857;t=ENST00000330597 shows that the GRCh38 version should most likely be ENST00000330597.5
https://grch37.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;g=ENSG00000213934;r=11:5269313-5271122;t=ENST00000330597 shows the .3 version.
So my guess is that since we take VVTA data straight from Ensembl files, this is an instance where a transcript has been aligned to both genome builds without a version history revision???????
For further evidence I did
import json
import VariantValidator
vval = VariantValidator.Validator()
variant = 'NC_000011.10:g.5249804T>A' # variant 1
genome_build = 'GRCh38'
select_transcripts = 'all'
transcript_set = 'ensembl'
validate = vval.validate(variant, genome_build, select_transcripts, transcript_set)
validation = validate.format_as_dict(with_meta=True)
print(json.dumps(validation, sort_keys=True, indent=4, separators=(',', ': ')))
and got 3 transcripts for GRCh38, .3 and .5 and ENST00000632727.1
NC_000011.10:g.5249804T>A
[['ENST00000330597.3', 'NC_000011.10', -1, 'genebuild', 5248082, 5249892],
['ENST00000330597.5', 'NC_000011.10', -1, 'genebuild', 5248268, 5249857],
['ENST00000632727.1', 'NC_000011.10', -1, 'genebuild', 5248427, 5249836]]
But only 2 should be expected, .5 and the ENST00000632727.1 transcript
So as discussed @sbenny1230 , .5 is in the database and is aligned to GRCh38, its just that NC_000011.10:g.5248232T>A does not overlap with the transcript, but NC_000011.10:g.5249804T>A does
If I try
import json
import VariantValidator
vval = VariantValidator.Validator()
variant = 'NC_000011.9:g.5271034T>A' # variant 1
genome_build = 'GRCh37'
select_transcripts = 'all'
transcript_set = 'ensembl'
validate = vval.validate(variant, genome_build, select_transcripts, transcript_set)
validation = validate.format_as_dict(with_meta=True)
print(json.dumps(validation, sort_keys=True, indent=4, separators=(',', ': ')))
I get
NC_000011.9:g.5271034T>A
[['ENST00000330597.3', 'NC_000011.9', -1, 'genebuild', 5269312, 5271122]]
Which is expected.
So, my overall conclusion is that we must have found alignment data for 'ENST00000330597.3', 'NC_000011.10' from somewhere, likely in an ensembl alignment file. The questions are, why has this happened because I thought Ensembl enforced a version number increase when a new genome build was released because the alignment had to change based on new genomic coordinates. Also, is the sequence of ENST00000330597.3 GRCh38 == ENST00000330597.3 GRCh37 (John, is this one of those badly handled transcripts you found previously?). Finally, where did this 'ENST00000330597.3', 'NC_000011.10' alignment come from????? We could do to track down the file. I will start digging this afternoon.
Here is the HBG1 gene data from Ensembl build 94 GRCh38
http://ftp.ensembl.org/pub/release-94/gff3/homo_sapiens/
11 ensembl_havana gene 5248079 5249859 . - . ID=gene:ENSG00000213934;Name=HBG1;biotype=protein_coding;description=hemoglobin subunit gamma 1 [Source:HGNC Symbol%3BAcc:HGNC:4831];gene_id=ENSG00000213934;logic_name=ensembl_havana_gene;version=8
11 ensembl_havana mRNA 5248079 5249859 . - . ID=transcript:ENST00000330597;Parent=gene:ENSG00000213934;Name=HBG1-201;biotype=protein_coding;ccdsid=CCDS7754.1;tag=basic;transcript_id=ENST00000330597;transcript_support_level=1 (assigned to previous version 3);version=4
11 ensembl_havana three_prime_UTR 5248079 5248358 . - . Parent=transcript:ENST00000330597
11 ensembl_havana exon 5248079 5248487 . - . Parent=transcript:ENST00000330597;Name=ENSE00001873394;constitutive=0;ensembl_end_phase=-1;ensembl_phase=0;exon_id=ENSE00001873394;rank=3;version=2
11 ensembl_havana CDS 5248359 5248487 . - 0 ID=CDS:ENSP00000327431;Parent=transcript:ENST00000330597;protein_id=ENSP00000327431
11 ensembl_havana exon 5249368 5249590 . - . Parent=transcript:ENST00000330597;Name=ENSE00001598355;constitutive=0;ensembl_end_phase=0;ensembl_phase=2;exon_id=ENSE00001598355;rank=2;version=1
11 ensembl_havana CDS 5249368 5249590 . - 1 ID=CDS:ENSP00000327431;Parent=transcript:ENST00000330597;protein_id=ENSP00000327431
11 ensembl_havana CDS 5249713 5249804 . - 0 ID=CDS:ENSP00000327431;Parent=transcript:ENST00000330597;protein_id=ENSP00000327431
11 ensembl_havana exon 5249713 5249859 . - . Parent=transcript:ENST00000330597;Name=ENSE00001956149;constitutive=0;ensembl_end_phase=2;ensembl_phase=-1;exon_id=ENSE00001956149;rank=1;version=2
11 ensembl_havana five_prime_UTR 5249805 5249859 . - . Parent=transcript:ENST00000330597
11 havana lnc_RNA 5248270 5249855 . - . ID=transcript:ENST00000648735;Parent=gene:ENSG00000213934;Name=HBG1-203;biotype=retained_intron;transcript_id=ENST00000648735;version=1
11 havana exon 5248270 5249590 . - . Parent=transcript:ENST00000648735;Name=ENSE00003835834;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003835834;rank=2;version=1
11 havana exon 5249713 5249855 . - . Parent=transcript:ENST00000648735;Name=ENSE00003840601;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003840601;rank=1;version=1
11 havana mRNA 5248428 5249836 . - . ID=transcript:ENST00000632727;Parent=gene:ENSG00000213934;Name=HBG1-202;biotype=protein_coding;tag=basic;transcript_id=ENST00000632727;transcript_support_level=3;version=1
11 havana exon 5248428 5248487 . - . Parent=transcript:ENST00000632727;Name=ENSE00003780536;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003780536;rank=3;version=1
11 havana three_prime_UTR 5248428 5248487 . - . Parent=transcript:ENST00000632727
11 havana three_prime_UTR 5249368 5249551 . - . Parent=transcript:ENST00000632727
11 havana exon 5249368 5249590 . - . Parent=transcript:ENST00000632727;Name=ENSE00003783162;constitutive=0;ensembl_end_phase=-1;ensembl_phase=0;exon_id=ENSE00003783162;rank=2;version=1
11 havana CDS 5249552 5249590 . - 0 ID=CDS:ENSP00000488759;Parent=transcript:ENST00000632727;protein_id=ENSP00000488759
11 havana CDS 5249751 5249804 . - 0 ID=CDS:ENSP00000488759;Parent=transcript:ENST00000632727;protein_id=ENSP00000488759
11 havana exon 5249751 5249836 . - . Parent=transcript:ENST00000632727;Name=ENSE00003775891;constitutive=0;ensembl_end_phase=0;ensembl_phase=-1;exon_id=ENSE00003775891;rank=1;version=1
11 havana five_prime_UTR 5249805 5249836 . - . Parent=transcript:ENST00000632727
And from build 90
http://ftp.ensembl.org/pub/release-90/gff3/homo_sapiens/
Build 90 IS used in VVTA https://github.com/openvar/VVTA/tree/master/EnsemblUTA/ensembl_90
11 ensembl_havana gene 5248083 5249892 . - . ID=gene:ENSG00000213934;Name=HBG1;biotype=protein_coding;description=hemoglobin subunit gamma 1 [Source:HGNC Symbol%3BAcc:HGNC:4831];gene_id=ENSG00000213934;logic_name=ensembl_havana_gene;version=6
11 ensembl_havana mRNA 5248083 5249892 . - . ID=transcript:ENST00000330597;Parent=gene:ENSG00000213934;Name=HBG1-201;biotype=protein_coding;ccdsid=CCDS7754.1;tag=basic;transcript_id=ENST00000330597;transcript_support_level=1;version=3
11 ensembl_havana three_prime_UTR 5248083 5248358 . - . Parent=transcript:ENST00000330597
11 ensembl_havana exon 5248083 5248487 . - . Parent=transcript:ENST00000330597;Name=ENSE00001873394;constitutive=0;ensembl_end_phase=-1;ensembl_phase=0;exon_id=ENSE00001873394;rank=3;version=1
11 ensembl_havana CDS 5248359 5248487 . - 0 ID=CDS:ENSP00000327431;Parent=transcript:ENST00000330597;protein_id=ENSP00000327431
11 ensembl_havana exon 5249368 5249590 . - . Parent=transcript:ENST00000330597;Name=ENSE00001598355;constitutive=0;ensembl_end_phase=0;ensembl_phase=2;exon_id=ENSE00001598355;rank=2;version=1
11 ensembl_havana CDS 5249368 5249590 . - 1 ID=CDS:ENSP00000327431;Parent=transcript:ENST00000330597;protein_id=ENSP00000327431
11 ensembl_havana CDS 5249713 5249804 . - 0 ID=CDS:ENSP00000327431;Parent=transcript:ENST00000330597;protein_id=ENSP00000327431
11 ensembl_havana exon 5249713 5249892 . - . Parent=transcript:ENST00000330597;Name=ENSE00001956149;constitutive=0;ensembl_end_phase=2;ensembl_phase=-1;exon_id=ENSE00001956149;rank=1;version=1
11 ensembl_havana five_prime_UTR 5249805 5249892 . - . Parent=transcript:ENST00000330597
11 havana mRNA 5248428 5249836 . - . ID=transcript:ENST00000632727;Parent=gene:ENSG00000213934;Name=HBG1-202;biotype=protein_coding;tag=basic;transcript_id=ENST00000632727;transcript_support_level=3;version=1
11 havana exon 5248428 5248487 . - . Parent=transcript:ENST00000632727;Name=ENSE00003780536;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003780536;rank=3;version=1
11 havana three_prime_UTR 5248428 5248487 . - . Parent=transcript:ENST00000632727
11 havana three_prime_UTR 5249368 5249551 . - . Parent=transcript:ENST00000632727
11 havana exon 5249368 5249590 . - . Parent=transcript:ENST00000632727;Name=ENSE00003783162;constitutive=0;ensembl_end_phase=-1;ensembl_phase=0;exon_id=ENSE00003783162;rank=2;version=1
11 havana CDS 5249552 5249590 . - 0 ID=CDS:ENSP00000488759;Parent=transcript:ENST00000632727;protein_id=ENSP00000488759
11 havana CDS 5249751 5249804 . - 0 ID=CDS:ENSP00000488759;Parent=transcript:ENST00000632727;protein_id=ENSP00000488759
11 havana exon 5249751 5249836 . - . Parent=transcript:ENST00000632727;Name=ENSE00003775891;constitutive=0;ensembl_end_phase=0;ensembl_phase=-1;exon_id=ENSE00003775891;rank=1;version=1
11 havana five_prime_UTR 5249805 5249836 . - . Parent=transcript:ENST00000632727
These lines are the smoking guns
11 ensembl_havana mRNA 5248083 5249892 . - . ID=transcript:ENST00000330597;Parent=gene:ENSG00000213934;Name=HBG1-201;biotype=protein_coding;ccdsid=CCDS7754.1;tag=basic;transcript_id=ENST00000330597;transcript_support_level=1;version=3
and
11 ensembl_havana mRNA 5248079 5249859 . - . ID=transcript:ENST00000330597;Parent=gene:ENSG00000213934;Name=HBG1-201;biotype=protein_coding;ccdsid=CCDS7754.1;tag=basic;transcript_id=ENST00000330597;transcript_support_level=1 (assigned to previous version 3);version=4
Clearly the .3 version "WAS" aligned to GRCh38 in build 90, and there seems to have been a build v .4 by by build 94 (Not in VVTA).
Just gonna check the sequences from GRCh38 and GRCh37, which is now completed and the sequences are identical. So, I am now really stumped as to when Ensembl enforce version updates. I assumed that a genome build would cause an update because genomic coords must change. Any ideas @leicray, @ifokkema @sbenny1230 @John-F-Wagstaff
@Peter-J-Freeman
I thought Ensembl enforced a version number increase when a new genome build was released because the alignment had to change based on new genomic coordinates.
No they do not, at the very least if the exon start stop positions within the transcript are the same, and the sequence is the same, they keep the version number equal or at least have done so as a matter of course. If they align to an alt they have a completely new ID, nowadays, but not for version updates of the same sequence. I understood originally that the version number would go up for any sequence / alignment change wrt the transcript. Unfortunately as we have discovered it is in fact worse than this, probably only the alignment structure, i.e. within transcript exon position or transcript length changes that will cause a version update, not sequence changes.
We know this because one of the things we had to handle with the new VVTA release was transcripts with the same name and version but different sequences for the GRCh37 and GRCh38 builds. This causes problems since, if you are not mad, you would naturally would assume same id same version number means same sequence and everything code wise, whether it was written by us or others, assumes this is true. We suffixed all of these to stop the problem, but this still means that if your variant maps to one of these transcripts you will have to use the refSeq version, as the Ensembl version is not hgvs compliant.
I'm losing the will to live.
Hahaha. Yep, that's an appropriate response.
Anyway, fact is @sbenny1230 I should have said, your code is working perfectly. Carry on :
@John-F-Wagstaff
This is not actually bad news. The liftover tool works much faster if it can find a transcript aligned to both genome builds.
I think what we need to consider is adding all Ensembl builds to VVTA. May be a bit of a task, but it will fill in multiple blanks for ensembl transcripts which have been rapidly updated e.g. ENST00000330597.3 > .4 > .5. Ensembl do not provide a decent revision history , but having this data in VVTA would allow us to provide one.
We have discussed this as one of our long term plans, time/funding permitting, before. This would take some time to work through but is a solid doable feature, the only problem is that as we add more we slow the database down a bit and fatten it up too. As such I also suggested we might want to create a full Ensembl archive, for interested users, instead, separate from the normal version. Doing something similar for RefSeq too, if we can get the archived RefSeq alignments, was also discussed, but given the time implications of either no concrete plans where made, just long term maybes.
There is a difference in the result between refseq and ensembl and I'm not too sure why.
Basically when the genome build is GRCh38 and transcript_set is 'refseq', the variant
11-5248232-T-A
it says there are no fully overlapping transcripts but for ensembl it finds one transcript.I also noticed the gene symbol is HBB when the genome build is GRCh37 and refseq. But when it's ensembl and the genome build is GRCh38 it is HBG1. I'm not too sure why this is either.
Input:
When the transcript_set is 'refseq':