openvar / variantValidator

Public repository for VariantValidator project
GNU Affero General Public License v3.0
70 stars 21 forks source link

Mismatch in results between ensembl and refseq #405

Open sbenny1230 opened 2 years ago

sbenny1230 commented 2 years ago

There is a difference in the result between refseq and ensembl and I'm not too sure why.

Basically when the genome build is GRCh38 and transcript_set is 'refseq', the variant 11-5248232-T-A it says there are no fully overlapping transcripts but for ensembl it finds one transcript.

I also noticed the gene symbol is HBB when the genome build is GRCh37 and refseq. But when it's ensembl and the genome build is GRCh38 it is HBG1. I'm not too sure why this is either.

Input:

import json
import VariantValidator
vval = VariantValidator.Validator()
variant = '11-5248232-T-A'
genome_build = 'GRCh38'
select_transcripts = 'all'
transcript_set = 'ensembl'
validate = vval.validate(variant, genome_build, select_transcripts, transcript_set)
validation = validate.format_as_dict(with_meta=True)
print(json.dumps(validation, sort_keys=True, indent=4, separators=(',', ': ')))
{
    "ENST00000330597.3:c.*127A>T": {
        "alt_genomic_loci": [],
        "annotations": {
            "chromosome": "11",
            "db_xref": {
                "CCDS": null,
                "ensemblgene": "ENSG00000213934",
                "hgnc": "4831",
                "ncbigene": null,
                "select": "Ensembl"
            },
            "ensembl_select": true,
            "mane_plus_clinical": false,
            "mane_select": false,
            "map": "11p15.4",
            "note": "hemoglobin subunit gamma 1",
            "refseq_select": false,
            "variant": "001"
        },
        "gene_ids": {
            "ccds_ids": [
                "CCDS7754"
            ],
            "ensembl_gene_id": "ENSG00000213934",
            "entrez_gene_id": "3047",
            "hgnc_id": "HGNC:4831",
            "omim_id": [
                "142200"
            ],
            "ucsc_id": "uc001mah.2"
        },
        "gene_symbol": "HBG1",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "",
            "lrg_tlr": "",
            "slr": "ENSP00000327431.3:p.?",
            "tlr": "ENSP00000327431.3:p.?"
        },
        "hgvs_refseqgene_variant": "",
        "hgvs_transcript_variant": "ENST00000330597.3:c.*127A>T",
        "primary_assembly_loci": {
            "grch37": {
                "hgvs_genomic_description": "NC_000011.9:g.5269462T>A",
                "vcf": {
                    "alt": "A",
                    "chr": "11",
                    "pos": "5269462",
                    "ref": "T"
                }
            },
            "grch38": {
                "hgvs_genomic_description": "NC_000011.10:g.5248232T>A",
                "vcf": {
                    "alt": "A",
                    "chr": "11",
                    "pos": "5248232",
                    "ref": "T"
                }
            },
            "hg19": {
                "hgvs_genomic_description": "NC_000011.9:g.5269462T>A",
                "vcf": {
                    "alt": "A",
                    "chr": "chr11",
                    "pos": "5269462",
                    "ref": "T"
                }
            },
            "hg38": {
                "hgvs_genomic_description": "NC_000011.10:g.5248232T>A",
                "vcf": {
                    "alt": "A",
                    "chr": "chr11",
                    "pos": "5248232",
                    "ref": "T"
                }
            }
        },
        "reference_sequence_records": {
            "protein": "https://www.ensembl.org/Homo_sapiens/Transcript/ProteinSummary?db=core;p=ENSP00000327431.3",
            "transcript": "https://www.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;t=ENST00000330597.3"
        },
        "refseqgene_context_intronic_sequence": "",
        "selected_assembly": "GRCh38",
        "submitted_variant": "11-5248232-T-A",
        "transcript_description": "HBG1-001",
        "validation_warnings": [
            "A more recent version of the selected reference sequence ENST00000330597.3 is available (ENST00000330597.5): ENST00000330597.5:c.*127A>T MUST be fully validated prior to use in reports: select_variants=ENST00000330597.5:c.*127A>T, genome_build=GRCh38"
        ],
        "variant_exonic_positions": null
    },
    "flag": "gene_variant",
    "metadata": {
        "variantvalidator_hgvs_version": "2.0.2.dev1+g6ecbf8e",
        "variantvalidator_version": "1.0.5.dev324+g714b8d1.d20220718",
        "vvdb_version": "vvdb_2022_04",
        "vvseqrepo_db": "VV_SR_2022_02/master",
        "vvta_version": "vvta_2022_02"
    }
}

When the transcript_set is 'refseq':

{
    "flag": "intergenic",
    "intergenic_variant_1": {
        "alt_genomic_loci": [],
        "annotations": {},
        "gene_ids": {},
        "gene_symbol": "",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "",
            "lrg_tlr": "",
            "slr": "",
            "tlr": ""
        },
        "hgvs_refseqgene_variant": "NG_000007.3:g.49384A>T",
        "hgvs_transcript_variant": "",
        "primary_assembly_loci": {
            "grch37": {
                "hgvs_genomic_description": "NC_000011.9:g.5269462T>A",
                "vcf": {
                    "alt": "A",
                    "chr": "11",
                    "pos": "5269462",
                    "ref": "T"
                }
            },
            "grch38": {
                "hgvs_genomic_description": "NC_000011.10:g.5248232T>A",
                "vcf": {
                    "alt": "A",
                    "chr": "11",
                    "pos": "5248232",
                    "ref": "T"
                }
            },
            "hg19": {
                "hgvs_genomic_description": "NC_000011.9:g.5269462T>A",
                "vcf": {
                    "alt": "A",
                    "chr": "chr11",
                    "pos": "5269462",
                    "ref": "T"
                }
            },
            "hg38": {
                "hgvs_genomic_description": "NC_000011.10:g.5248232T>A",
                "vcf": {
                    "alt": "A",
                    "chr": "chr11",
                    "pos": "5248232",
                    "ref": "T"
                }
            }
        },
        "reference_sequence_records": {
            "refseqgene": "https://www.ncbi.nlm.nih.gov/nuccore/NG_000007.3"
        },
        "refseqgene_context_intronic_sequence": "",
        "selected_assembly": "GRCh38",
        "submitted_variant": "11-5248232-T-A",
        "transcript_description": "",
        "validation_warnings": [
            "No transcripts found that fully overlap the described variation in the genomic sequence"
        ],
        "variant_exonic_positions": null
    },
    "metadata": {
        "variantvalidator_hgvs_version": "2.0.2.dev1+g6ecbf8e",
        "variantvalidator_version": "1.0.5.dev324+g714b8d1.d20220718",
        "vvdb_version": "vvdb_2022_04",
        "vvseqrepo_db": "VV_SR_2022_02/master",
        "vvta_version": "vvta_2022_02"
    }
}
leicray commented 2 years ago

Irrespective of the the transcript set that you are selecting, 11-5248232-T-A is intergenic in the context of GRCh38 but lies within the HBB gene in GRCh37. Variant 11-5248232-T-A in GRCh38 is equivalent to 11-5269462-T-A in GRCh37. That much is certain.

I notice that you are trying to validate ENST00000330597.3:c.*127A>T which is presumably valid in the context of reference sequence ENST00000330597.3 and which lies in the 3´ UTR of the transcript. It's worth noting that ENST00000330597.3 is NOT the MANE Select transcript for HBB.

What happens when you validate ENST00000330597.5:c.*127A>T instead?

leicray commented 2 years ago

I got confused by my gene symbols. ENST00000330597.3 and ENST00000330597.are transcripts for HBG1, not HBB.

Peter-J-Freeman commented 2 years ago

The reason you are getting gene information for the Ensembl method and not the RefSeq is that you hit a transcript in the ensembl but do not in the refseq. That is certainly as expected.

RefSeq has a transcript for HBG1, https://www.ncbi.nlm.nih.gov/nuccore/NM_000559.3

What is odd is that the GRCh37 Variant NC_000011.9:g.5269462T>A does indeed hit a UTR that will only be found in the Ensembl transcript

image

The Ensembl transcript was shortened from v.3 to v.5 to become MANE Select

image

So, what is actually weird is that transcript ENST00000330597.3 is being picked up when you provide a GRCh38 coordinate.

variant = '11-5248232-T-A'
genome_build = 'GRCh38'

Perhaps this one needs to be discussed. We need to determine where the transcript is being identified and why. My guess is that ENST00000330597.3 was aligned to GRCh38 pre MANE project but Ensembl's lack of identifiable revision history makes this difficult to pin down. Certainly it is in VVTA and it would certainly be correctly picked up by variant = '11-5248232-T-A' GRCh38. @John-F-Wagstaff , is there a quick way to see if VVTA contains mappings for ENST00000330597.3 with GRCh38. Having looked at Ensembl GRCh37, I suspect it should not have been, but who knows???

@sbenny1230 , do you want me to pull this and look where the transcripts are being picked up???

Peter-J-Freeman commented 2 years ago

@sbenny1230 Are you using

variant = '11-5248232-T-A'
genome_build = 'GRCh38'

and

variant = '11-5248232-T-A'
genome_build = 'GRCh37'

If so, this is not correct. As @leicray states correctly 11-5248232-T-A in GRCh38 is equivalent to 11-5269462-T-A in GRCh37.

I would expect

import json
import VariantValidator
vval = VariantValidator.Validator()
variant = '11-5248232-T-A'
genome_build = 'GRCh38'
select_transcripts = 'all'
transcript_set = 'ensembl'
validate = vval.validate(variant, genome_build, select_transcripts, transcript_set)
validation = validate.format_as_dict(with_meta=True)
print(json.dumps(validation, sort_keys=True, indent=4, separators=(',', ': ')))

To miss all transcripts

But

import json
import VariantValidator
vval = VariantValidator.Validator()
variant = '11-5269462-T-A' # Note the changed coordinate because the position is different in GRCh37
genome_build = 'GRCh37'
select_transcripts = 'all'
transcript_set = 'ensembl'
validate = vval.validate(variant, genome_build, select_transcripts, transcript_set)
validation = validate.format_as_dict(with_meta=True)
print(json.dumps(validation, sort_keys=True, indent=4, separators=(',', ': ')))

To hit ENST00000330597.3

import json
import VariantValidator
vval = VariantValidator.Validator()
variant = '11-5248232-T-A'
genome_build = 'GRCh37'
select_transcripts = 'all'
transcript_set = 'ensembl'
validate = vval.validate(variant, genome_build, select_transcripts, transcript_set)
validation = validate.format_as_dict(with_meta=True)
print(json.dumps(validation, sort_keys=True, indent=4, separators=(',', ': ')))

May well hit HBB because it is a different variant to 11-5248232-T-A GRCh38

Look here

        "primary_assembly_loci": {
            "grch37": {
                "hgvs_genomic_description": "NC_000011.9:g.5269462T>A", # position
                "vcf": {
                    "alt": "A",
                    "chr": "11",
                    "pos": "5269462",
                    "ref": "T"
                }
            },
            "grch38": {
                "hgvs_genomic_description": "NC_000011.10:g.5248232T>A", # position
                "vcf": {
                    "alt": "A",
                    "chr": "11",
                    "pos": "5248232",
                    "ref": "T"
                }
            },
Peter-J-Freeman commented 2 years ago

Here is the evidence,

variant = '11-5269462-T-A' # Note the changed coordinate because the position is different in GRCh37 genome_build = 'GRCh37'

maps to HBB not HBG1

image

John-F-Wagstaff commented 2 years ago

@Peter-J-Freeman If this is needed for bug-checking later then the quickest way to know which accessions a transcript is mapped to in the raw database, if you have a local copy, is to login via PostgreSQL, set the search path, and do one of two queries using the transcript accession. The first is to search the current valid mapped transcript spans view, to see whether the transcript exists and is visible to a by position search (i.e has not been masked as "removed") the second is to search the underlying mapped_exon_set table to see weather the mapping exists in the raw data e.g.

select * from current_valid_mapped_transcript_spans_mv where tx_ac = 'ENST00000330597.3'; and/or select * from mapped_exon_set where tx_ac = 'ENST00000330597.3'; For the latter you can then use the mapped exon set IDs in the return to run second query if you need to check the mapping locations e.g. select * from mapped_exon where exon_set_id = 400504

This does not test everything, and is not quite as convenient as going from the genome id, but it should provide a quick confirmation, and the transcript archive database has no concept of genome anyway.

sbenny1230 commented 2 years ago

So validating ENST00000330597.5:c.*127A>T gives me the following output:

{
    "flag": "warning",
    "metadata": {
        "variantvalidator_hgvs_version": "2.0.2.dev1+g6ecbf8e",
        "variantvalidator_version": "1.0.5.dev324+g714b8d1.d20220718",
        "vvdb_version": "vvdb_2022_04",
        "vvseqrepo_db": "VV_SR_2022_02/master",
        "vvta_version": "vvta_2022_02"
    },
    "validation_warning_1": {
        "alt_genomic_loci": [],
        "annotations": {},
        "gene_ids": {},
        "gene_symbol": "",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "",
            "lrg_tlr": "",
            "slr": "",
            "tlr": ""
        },
        "hgvs_refseqgene_variant": "",
        "hgvs_transcript_variant": "",
        "primary_assembly_loci": {},
        "reference_sequence_records": "",
        "refseqgene_context_intronic_sequence": "",
        "selected_assembly": "GRCh38",
        "submitted_variant": "ENST00000330597.5:c.*127A>T",
        "transcript_description": "",
        "validation_warnings": [
            "Using a transcript reference sequence to specify a variant position that lies outside of the reference sequence is not HGVS-compliant: Instead re-submit NC_000011.10:g.5248232T>A"
        ],
        "variant_exonic_positions": null
    }
}

And when using the suggested variant description NC_000011.10:g.5248232T>A I get:

{
    "ENST00000330597.3:c.*127A>T": {
        "alt_genomic_loci": [],
        "annotations": {
            "chromosome": "11",
            "db_xref": {
                "CCDS": null,
                "ensemblgene": "ENSG00000213934",
                "hgnc": "4831",
                "ncbigene": null,
                "select": "Ensembl"
            },
            "ensembl_select": true,
            "mane_plus_clinical": false,
            "mane_select": false,
            "map": "11p15.4",
            "note": "hemoglobin subunit gamma 1",
            "refseq_select": false,
            "variant": "001"
        },
        "gene_ids": {
            "ccds_ids": [
                "CCDS7754"
            ],
            "ensembl_gene_id": "ENSG00000213934",
            "entrez_gene_id": "3047",
            "hgnc_id": "HGNC:4831",
            "omim_id": [
                "142200"
            ],
            "ucsc_id": "uc001mah.2"
        },
        "gene_symbol": "HBG1",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "",
            "lrg_tlr": "",
            "slr": "ENSP00000327431.3:p.?",
            "tlr": "ENSP00000327431.3:p.?"
        },
        "hgvs_refseqgene_variant": "",
        "hgvs_transcript_variant": "ENST00000330597.3:c.*127A>T",
        "primary_assembly_loci": {
            "grch37": {
                "hgvs_genomic_description": "NC_000011.9:g.5269462T>A",
                "vcf": {
                    "alt": "A",
                    "chr": "11",
                    "pos": "5269462",
                    "ref": "T"
                }
            },
            "grch38": {
                "hgvs_genomic_description": "NC_000011.10:g.5248232T>A",
                "vcf": {
                    "alt": "A",
                    "chr": "11",
                    "pos": "5248232",
                    "ref": "T"
                }
            },
            "hg19": {
                "hgvs_genomic_description": "NC_000011.9:g.5269462T>A",
                "vcf": {
                    "alt": "A",
                    "chr": "chr11",
                    "pos": "5269462",
                    "ref": "T"
                }
            },
            "hg38": {
                "hgvs_genomic_description": "NC_000011.10:g.5248232T>A",
                "vcf": {
                    "alt": "A",
                    "chr": "chr11",
                    "pos": "5248232",
                    "ref": "T"
                }
            }
        },
        "reference_sequence_records": {
            "protein": "https://www.ensembl.org/Homo_sapiens/Transcript/ProteinSummary?db=core;p=ENSP00000327431.3",
            "transcript": "https://www.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;t=ENST00000330597.3"
        },
        "refseqgene_context_intronic_sequence": "",
        "selected_assembly": "GRCh38",
        "submitted_variant": "NC_000011.10:g.5248232T>A",
        "transcript_description": "HBG1-001",
        "validation_warnings": [
            "A more recent version of the selected reference sequence ENST00000330597.3 is available (ENST00000330597.5): ENST00000330597.5:c.*127A>T MUST be fully validated prior to use in reports: select_variants=ENST00000330597.5:c.*127A>T, genome_build=GRCh38"
        ],
        "variant_exonic_positions": null
    },
    "flag": "gene_variant",
    "metadata": {
        "variantvalidator_hgvs_version": "2.0.2.dev1+g6ecbf8e",
        "variantvalidator_version": "1.0.5.dev324+g714b8d1.d20220718",
        "vvdb_version": "vvdb_2022_04",
        "vvseqrepo_db": "VV_SR_2022_02/master",
        "vvta_version": "vvta_2022_02"
    }
}

This is with the genome build as GRCh38 and the ensembl method.

sbenny1230 commented 2 years ago

@sbenny1230 Are you using

variant = '11-5248232-T-A'
genome_build = 'GRCh38'

and

variant = '11-5248232-T-A'
genome_build = 'GRCh37'

If so, this is not correct. As @leicray states correctly 11-5248232-T-A in GRCh38 is equivalent to 11-5269462-T-A in GRCh37.

I would expect

import json
import VariantValidator
vval = VariantValidator.Validator()
variant = '11-5248232-T-A'
genome_build = 'GRCh38'
select_transcripts = 'all'
transcript_set = 'ensembl'
validate = vval.validate(variant, genome_build, select_transcripts, transcript_set)
validation = validate.format_as_dict(with_meta=True)
print(json.dumps(validation, sort_keys=True, indent=4, separators=(',', ': ')))

To miss all transcripts

But

import json
import VariantValidator
vval = VariantValidator.Validator()
variant = '11-5269462-T-A' # Note the changed coordinate because the position is different in GRCh37
genome_build = 'GRCh37'
select_transcripts = 'all'
transcript_set = 'ensembl'
validate = vval.validate(variant, genome_build, select_transcripts, transcript_set)
validation = validate.format_as_dict(with_meta=True)
print(json.dumps(validation, sort_keys=True, indent=4, separators=(',', ': ')))

To hit ENST00000330597.3

import json
import VariantValidator
vval = VariantValidator.Validator()
variant = '11-5248232-T-A'
genome_build = 'GRCh37'
select_transcripts = 'all'
transcript_set = 'ensembl'
validate = vval.validate(variant, genome_build, select_transcripts, transcript_set)
validation = validate.format_as_dict(with_meta=True)
print(json.dumps(validation, sort_keys=True, indent=4, separators=(',', ': ')))

May well hit HBB because it is a different variant to 11-5248232-T-A GRCh38

Look here

        "primary_assembly_loci": {
            "grch37": {
                "hgvs_genomic_description": "NC_000011.9:g.5269462T>A", # position
                "vcf": {
                    "alt": "A",
                    "chr": "11",
                    "pos": "5269462",
                    "ref": "T"
                }
            },
            "grch38": {
                "hgvs_genomic_description": "NC_000011.10:g.5248232T>A", # position
                "vcf": {
                    "alt": "A",
                    "chr": "11",
                    "pos": "5248232",
                    "ref": "T"
                }
            },

@Peter-J-Freeman

So when using

variant = '11-5248232-T-A'
genome_build = 'GRCh38'

I get ENST00000330597.3 and when it's GRCh37, I get two different transcripts ENST00000485743.1 and ENST00000335295.4 which from what you've said is not the expected output for both cases.

Peter-J-Freeman commented 2 years ago

Yeah, so that's a bit weird. I will take a look at where this is being picked up then. Which branch please @sbenny1230

John-F-Wagstaff commented 2 years ago

With a query of variant = '11-5248232-T-A' and genome_build = 'GRCh38' you would expect to see ENST00000330597.3 as in NC_000011.10 ENST00000330597.3 has a span of 5248082 to 5249892

With a query of variant = '11-5248232-T-A' and genome_build = 'GRCh37' you would expect to see ENST00000335295.4 as in NC_000011.9 ENST00000335295.4 has a span of 5246693 to 5248301 and ENST00000485743.1 as in NC_000011.9 ENST00000485743.1 has a span of 5247492 to 5248302.

@sbenny1230 Are you sure you are switching the variants location to match the genome as well as the genome? it should be either

variant = '11-5248232-T-A' 
genome_build = 'GRCh38'

or

variant = '11-5269462-T-A' 
genome_build = 'GRCh37'

not any other combination. As @leicray said earlier the sequence of the reference has been updated between versions, and this moves the coordinates of nearly everything.

@Peter-J-Freeman if we can confirm that this is happening with the correct locations to match the genome build then it looks like something is causing GRCh37 coordinates to be lifted over to GRCh38, but then passed down to the lower levels of the code still tagged as GRCh37 locations? or something like that

sbenny1230 commented 2 years ago

@John-F-Wagstaff Hi, I got both of the expected outputs you mentioned. I hadn't changed the variant input to take into account the change in the location for GRCh37.

@Peter-J-Freeman Latest changes are on the branch vv_ensembl_dev_susmi

Peter-J-Freeman commented 2 years ago

@John-F-Wagstaff and @sbenny1230

So, I'm looking at the code and I can confirm I am running

import json
import VariantValidator
vval = VariantValidator.Validator()
variant = '11-5248232-T-A'
genome_build = 'GRCh38'
select_transcripts = 'all'
transcript_set = 'ensembl'
validate = vval.validate(variant, genome_build, select_transcripts, transcript_set)
validation = validate.format_as_dict(with_meta=True)
print(json.dumps(validation, sort_keys=True, indent=4, separators=(',', ': ')))

The genomic variant is converted to NC_000011.10:g.5248232T>A which looks correct

When we search VVTA using

hdp.get_tx_for_region(hgvs_genomic.ac, alt_aln_method,
                                              hgvs_genomic.posedit.pos.start.base-1,
                                              hgvs_genomic.posedit.pos.end.base-1)

which is here

the response is

[['ENST00000330597.3', 'NC_000011.10', -1, 'genebuild', 5248082, 5249892]]

This looks like VVTA has mapping data for 'ENST00000330597.3', 'NC_000011.10'

If I run

import json
import VariantValidator
vval = VariantValidator.Validator()
variant = '11-5269462-T-A' # Note the changed coordinate because the position is different in GRCh37
genome_build = 'GRCh37'
select_transcripts = 'all'
transcript_set = 'ensembl'
validate = vval.validate(variant, genome_build, select_transcripts, transcript_set)
validation = validate.format_as_dict(with_meta=True)
print(json.dumps(validation, sort_keys=True, indent=4, separators=(',', ': ')))

I get

NC_000011.9:g.5269462T>A
[['ENST00000330597.3', 'NC_000011.9', -1, 'genebuild', 5269312, 5271122]]

Which shows GRCh37 mapping data for ENST00000330597.3

https://www.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;g=ENSG00000213934;r=11:5248269-5249857;t=ENST00000330597 shows that the GRCh38 version should most likely be ENST00000330597.5

https://grch37.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;g=ENSG00000213934;r=11:5269313-5271122;t=ENST00000330597 shows the .3 version.

So my guess is that since we take VVTA data straight from Ensembl files, this is an instance where a transcript has been aligned to both genome builds without a version history revision???????

For further evidence I did

import json
import VariantValidator
vval = VariantValidator.Validator()
variant = 'NC_000011.10:g.5249804T>A' # variant 1
genome_build = 'GRCh38'
select_transcripts = 'all'
transcript_set = 'ensembl'
validate = vval.validate(variant, genome_build, select_transcripts, transcript_set)
validation = validate.format_as_dict(with_meta=True)
print(json.dumps(validation, sort_keys=True, indent=4, separators=(',', ': ')))

and got 3 transcripts for GRCh38, .3 and .5 and ENST00000632727.1

NC_000011.10:g.5249804T>A
[['ENST00000330597.3', 'NC_000011.10', -1, 'genebuild', 5248082, 5249892], 
['ENST00000330597.5', 'NC_000011.10', -1, 'genebuild', 5248268, 5249857], 
['ENST00000632727.1', 'NC_000011.10', -1, 'genebuild', 5248427, 5249836]]

But only 2 should be expected, .5 and the ENST00000632727.1 transcript

image

So as discussed @sbenny1230 , .5 is in the database and is aligned to GRCh38, its just that NC_000011.10:g.5248232T>A does not overlap with the transcript, but NC_000011.10:g.5249804T>A does

If I try

import json
import VariantValidator
vval = VariantValidator.Validator()
variant = 'NC_000011.9:g.5271034T>A' # variant 1
genome_build = 'GRCh37'
select_transcripts = 'all'
transcript_set = 'ensembl'
validate = vval.validate(variant, genome_build, select_transcripts, transcript_set)
validation = validate.format_as_dict(with_meta=True)
print(json.dumps(validation, sort_keys=True, indent=4, separators=(',', ': ')))

I get

NC_000011.9:g.5271034T>A
[['ENST00000330597.3', 'NC_000011.9', -1, 'genebuild', 5269312, 5271122]]

Which is expected.

So, my overall conclusion is that we must have found alignment data for 'ENST00000330597.3', 'NC_000011.10' from somewhere, likely in an ensembl alignment file. The questions are, why has this happened because I thought Ensembl enforced a version number increase when a new genome build was released because the alignment had to change based on new genomic coordinates. Also, is the sequence of ENST00000330597.3 GRCh38 == ENST00000330597.3 GRCh37 (John, is this one of those badly handled transcripts you found previously?). Finally, where did this 'ENST00000330597.3', 'NC_000011.10' alignment come from????? We could do to track down the file. I will start digging this afternoon.

Peter-J-Freeman commented 2 years ago

Here is the HBG1 gene data from Ensembl build 94 GRCh38

http://ftp.ensembl.org/pub/release-94/gff3/homo_sapiens/

11  ensembl_havana  gene    5248079 5249859 .   -   .   ID=gene:ENSG00000213934;Name=HBG1;biotype=protein_coding;description=hemoglobin subunit gamma 1 [Source:HGNC Symbol%3BAcc:HGNC:4831];gene_id=ENSG00000213934;logic_name=ensembl_havana_gene;version=8
11  ensembl_havana  mRNA    5248079 5249859 .   -   .   ID=transcript:ENST00000330597;Parent=gene:ENSG00000213934;Name=HBG1-201;biotype=protein_coding;ccdsid=CCDS7754.1;tag=basic;transcript_id=ENST00000330597;transcript_support_level=1 (assigned to previous version 3);version=4
11  ensembl_havana  three_prime_UTR 5248079 5248358 .   -   .   Parent=transcript:ENST00000330597
11  ensembl_havana  exon    5248079 5248487 .   -   .   Parent=transcript:ENST00000330597;Name=ENSE00001873394;constitutive=0;ensembl_end_phase=-1;ensembl_phase=0;exon_id=ENSE00001873394;rank=3;version=2
11  ensembl_havana  CDS 5248359 5248487 .   -   0   ID=CDS:ENSP00000327431;Parent=transcript:ENST00000330597;protein_id=ENSP00000327431
11  ensembl_havana  exon    5249368 5249590 .   -   .   Parent=transcript:ENST00000330597;Name=ENSE00001598355;constitutive=0;ensembl_end_phase=0;ensembl_phase=2;exon_id=ENSE00001598355;rank=2;version=1
11  ensembl_havana  CDS 5249368 5249590 .   -   1   ID=CDS:ENSP00000327431;Parent=transcript:ENST00000330597;protein_id=ENSP00000327431
11  ensembl_havana  CDS 5249713 5249804 .   -   0   ID=CDS:ENSP00000327431;Parent=transcript:ENST00000330597;protein_id=ENSP00000327431
11  ensembl_havana  exon    5249713 5249859 .   -   .   Parent=transcript:ENST00000330597;Name=ENSE00001956149;constitutive=0;ensembl_end_phase=2;ensembl_phase=-1;exon_id=ENSE00001956149;rank=1;version=2
11  ensembl_havana  five_prime_UTR  5249805 5249859 .   -   .   Parent=transcript:ENST00000330597
11  havana  lnc_RNA 5248270 5249855 .   -   .   ID=transcript:ENST00000648735;Parent=gene:ENSG00000213934;Name=HBG1-203;biotype=retained_intron;transcript_id=ENST00000648735;version=1
11  havana  exon    5248270 5249590 .   -   .   Parent=transcript:ENST00000648735;Name=ENSE00003835834;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003835834;rank=2;version=1
11  havana  exon    5249713 5249855 .   -   .   Parent=transcript:ENST00000648735;Name=ENSE00003840601;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003840601;rank=1;version=1
11  havana  mRNA    5248428 5249836 .   -   .   ID=transcript:ENST00000632727;Parent=gene:ENSG00000213934;Name=HBG1-202;biotype=protein_coding;tag=basic;transcript_id=ENST00000632727;transcript_support_level=3;version=1
11  havana  exon    5248428 5248487 .   -   .   Parent=transcript:ENST00000632727;Name=ENSE00003780536;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003780536;rank=3;version=1
11  havana  three_prime_UTR 5248428 5248487 .   -   .   Parent=transcript:ENST00000632727
11  havana  three_prime_UTR 5249368 5249551 .   -   .   Parent=transcript:ENST00000632727
11  havana  exon    5249368 5249590 .   -   .   Parent=transcript:ENST00000632727;Name=ENSE00003783162;constitutive=0;ensembl_end_phase=-1;ensembl_phase=0;exon_id=ENSE00003783162;rank=2;version=1
11  havana  CDS 5249552 5249590 .   -   0   ID=CDS:ENSP00000488759;Parent=transcript:ENST00000632727;protein_id=ENSP00000488759
11  havana  CDS 5249751 5249804 .   -   0   ID=CDS:ENSP00000488759;Parent=transcript:ENST00000632727;protein_id=ENSP00000488759
11  havana  exon    5249751 5249836 .   -   .   Parent=transcript:ENST00000632727;Name=ENSE00003775891;constitutive=0;ensembl_end_phase=0;ensembl_phase=-1;exon_id=ENSE00003775891;rank=1;version=1
11  havana  five_prime_UTR  5249805 5249836 .   -   .   Parent=transcript:ENST00000632727

And from build 90

http://ftp.ensembl.org/pub/release-90/gff3/homo_sapiens/

Build 90 IS used in VVTA https://github.com/openvar/VVTA/tree/master/EnsemblUTA/ensembl_90

11  ensembl_havana  gene    5248083 5249892 .   -   .   ID=gene:ENSG00000213934;Name=HBG1;biotype=protein_coding;description=hemoglobin subunit gamma 1 [Source:HGNC Symbol%3BAcc:HGNC:4831];gene_id=ENSG00000213934;logic_name=ensembl_havana_gene;version=6
11  ensembl_havana  mRNA    5248083 5249892 .   -   .   ID=transcript:ENST00000330597;Parent=gene:ENSG00000213934;Name=HBG1-201;biotype=protein_coding;ccdsid=CCDS7754.1;tag=basic;transcript_id=ENST00000330597;transcript_support_level=1;version=3
11  ensembl_havana  three_prime_UTR 5248083 5248358 .   -   .   Parent=transcript:ENST00000330597
11  ensembl_havana  exon    5248083 5248487 .   -   .   Parent=transcript:ENST00000330597;Name=ENSE00001873394;constitutive=0;ensembl_end_phase=-1;ensembl_phase=0;exon_id=ENSE00001873394;rank=3;version=1
11  ensembl_havana  CDS 5248359 5248487 .   -   0   ID=CDS:ENSP00000327431;Parent=transcript:ENST00000330597;protein_id=ENSP00000327431
11  ensembl_havana  exon    5249368 5249590 .   -   .   Parent=transcript:ENST00000330597;Name=ENSE00001598355;constitutive=0;ensembl_end_phase=0;ensembl_phase=2;exon_id=ENSE00001598355;rank=2;version=1
11  ensembl_havana  CDS 5249368 5249590 .   -   1   ID=CDS:ENSP00000327431;Parent=transcript:ENST00000330597;protein_id=ENSP00000327431
11  ensembl_havana  CDS 5249713 5249804 .   -   0   ID=CDS:ENSP00000327431;Parent=transcript:ENST00000330597;protein_id=ENSP00000327431
11  ensembl_havana  exon    5249713 5249892 .   -   .   Parent=transcript:ENST00000330597;Name=ENSE00001956149;constitutive=0;ensembl_end_phase=2;ensembl_phase=-1;exon_id=ENSE00001956149;rank=1;version=1
11  ensembl_havana  five_prime_UTR  5249805 5249892 .   -   .   Parent=transcript:ENST00000330597
11  havana  mRNA    5248428 5249836 .   -   .   ID=transcript:ENST00000632727;Parent=gene:ENSG00000213934;Name=HBG1-202;biotype=protein_coding;tag=basic;transcript_id=ENST00000632727;transcript_support_level=3;version=1
11  havana  exon    5248428 5248487 .   -   .   Parent=transcript:ENST00000632727;Name=ENSE00003780536;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003780536;rank=3;version=1
11  havana  three_prime_UTR 5248428 5248487 .   -   .   Parent=transcript:ENST00000632727
11  havana  three_prime_UTR 5249368 5249551 .   -   .   Parent=transcript:ENST00000632727
11  havana  exon    5249368 5249590 .   -   .   Parent=transcript:ENST00000632727;Name=ENSE00003783162;constitutive=0;ensembl_end_phase=-1;ensembl_phase=0;exon_id=ENSE00003783162;rank=2;version=1
11  havana  CDS 5249552 5249590 .   -   0   ID=CDS:ENSP00000488759;Parent=transcript:ENST00000632727;protein_id=ENSP00000488759
11  havana  CDS 5249751 5249804 .   -   0   ID=CDS:ENSP00000488759;Parent=transcript:ENST00000632727;protein_id=ENSP00000488759
11  havana  exon    5249751 5249836 .   -   .   Parent=transcript:ENST00000632727;Name=ENSE00003775891;constitutive=0;ensembl_end_phase=0;ensembl_phase=-1;exon_id=ENSE00003775891;rank=1;version=1
11  havana  five_prime_UTR  5249805 5249836 .   -   .   Parent=transcript:ENST00000632727

These lines are the smoking guns

11  ensembl_havana  mRNA    5248083 5249892 .   -   .   ID=transcript:ENST00000330597;Parent=gene:ENSG00000213934;Name=HBG1-201;biotype=protein_coding;ccdsid=CCDS7754.1;tag=basic;transcript_id=ENST00000330597;transcript_support_level=1;version=3

and

11  ensembl_havana  mRNA    5248079 5249859 .   -   .   ID=transcript:ENST00000330597;Parent=gene:ENSG00000213934;Name=HBG1-201;biotype=protein_coding;ccdsid=CCDS7754.1;tag=basic;transcript_id=ENST00000330597;transcript_support_level=1 (assigned to previous version 3);version=4

Clearly the .3 version "WAS" aligned to GRCh38 in build 90, and there seems to have been a build v .4 by by build 94 (Not in VVTA).

Just gonna check the sequences from GRCh38 and GRCh37, which is now completed and the sequences are identical. So, I am now really stumped as to when Ensembl enforce version updates. I assumed that a genome build would cause an update because genomic coords must change. Any ideas @leicray, @ifokkema @sbenny1230 @John-F-Wagstaff

John-F-Wagstaff commented 2 years ago

@Peter-J-Freeman

I thought Ensembl enforced a version number increase when a new genome build was released because the alignment had to change based on new genomic coordinates.

No they do not, at the very least if the exon start stop positions within the transcript are the same, and the sequence is the same, they keep the version number equal or at least have done so as a matter of course. If they align to an alt they have a completely new ID, nowadays, but not for version updates of the same sequence. I understood originally that the version number would go up for any sequence / alignment change wrt the transcript. Unfortunately as we have discovered it is in fact worse than this, probably only the alignment structure, i.e. within transcript exon position or transcript length changes that will cause a version update, not sequence changes.

We know this because one of the things we had to handle with the new VVTA release was transcripts with the same name and version but different sequences for the GRCh37 and GRCh38 builds. This causes problems since, if you are not mad, you would naturally would assume same id same version number means same sequence and everything code wise, whether it was written by us or others, assumes this is true. We suffixed all of these to stop the problem, but this still means that if your variant maps to one of these transcripts you will have to use the refSeq version, as the Ensembl version is not hgvs compliant.

leicray commented 2 years ago

I'm losing the will to live.

Peter-J-Freeman commented 2 years ago

Hahaha. Yep, that's an appropriate response.

Anyway, fact is @sbenny1230 I should have said, your code is working perfectly. Carry on :

Peter-J-Freeman commented 2 years ago

@John-F-Wagstaff

This is not actually bad news. The liftover tool works much faster if it can find a transcript aligned to both genome builds.

I think what we need to consider is adding all Ensembl builds to VVTA. May be a bit of a task, but it will fill in multiple blanks for ensembl transcripts which have been rapidly updated e.g. ENST00000330597.3 > .4 > .5. Ensembl do not provide a decent revision history , but having this data in VVTA would allow us to provide one.

John-F-Wagstaff commented 2 years ago

We have discussed this as one of our long term plans, time/funding permitting, before. This would take some time to work through but is a solid doable feature, the only problem is that as we add more we slow the database down a bit and fatten it up too. As such I also suggested we might want to create a full Ensembl archive, for interested users, instead, separate from the normal version. Doing something similar for RefSeq too, if we can get the archived RefSeq alignments, was also discussed, but given the time implications of either no concrete plans where made, just long term maybes.