Open dhimmel opened 1 year ago
The error occurred for alt_allele_group_id = 44458
:
SELECT
gene.stable_id AS ensembl_gene_id,
alt_allele.alt_allele_group_id,
alt_allele_attrib.attrib = "IS_REPRESENTATIVE" AS alt_allele_is_representative,
assembly_exception.exc_seq_region_id IS NULL AS primary_assembly,
seq_region.name AS seq_region,
alt_allele_attrib.attrib AS alt_allele_attrib,
gene.created_date AS ensembl_created_date
FROM alt_allele
INNER JOIN gene
ON gene.gene_id = alt_allele.gene_id
INNER JOIN alt_allele_attrib
ON alt_allele.alt_allele_id = alt_allele_attrib.alt_allele_id
INNER JOIN seq_region
ON gene.seq_region_id = seq_region.seq_region_id
LEFT JOIN assembly_exception
ON seq_region.seq_region_id = assembly_exception.seq_region_id
-- keep exc_type in (PATCH_FIX, PATCH_NOVEL, HAP)
-- refs internal Related Sciences issue 606.
AND NOT assembly_exception.exc_type <=> "PAR"
-- all genes were current when query was written, ensure this is always the case
WHERE gene.is_current
AND alt_allele_group_id = 44458
ORDER BY alt_allele_group_id, alt_allele_is_representative DESC, primary_assembly DESC, ensembl_created_date, ensembl_gene_id
ensembl_gene_id | alt_allele_group_id | alt_allele_is_representative | primary_assembly | seq_region | alt_allele_attrib | ensembl_created_date |
---|---|---|---|---|---|---|
ENSG00000273592 | 44458 | 0 | 1 | HSCHR19LRC_PGF1_CTG3_1 | AUTOMATICALLY_ASSIGNED | 2014-06-09 10:49:07 |
ENSG00000276935 | 44458 | 0 | 1 | HSCHR19LRC_COX1_CTG3_1 | AUTOMATICALLY_ASSIGNED | 2014-06-09 10:49:07 |
It looks like the problem could be that primary_assembly
is true here, since it looks like both genes are on scaffolds. We detect primary_assembly
based on whether exc_seq_region_id IS NULL
.
UPDATE: Here's the same query on ensembl 109:
ensembl_gene_id | alt_allele_group_id | alt_allele_is_representative | primary_assembly | seq_region | alt_allele_attrib | ensembl_created_date |
---|---|---|---|---|---|---|
ENSG00000273592 | 44458 | 0 | 0 | CHR_HSCHR19LRC_PGF1_CTG3_1 | AUTOMATICALLY_ASSIGNED | 2014-06-09 10:49:07 |
ENSG00000276935 | 44458 | 0 | 0 | CHR_HSCHR19LRC_COX1_CTG3_1 | AUTOMATICALLY_ASSIGNED | 2014-06-09 10:49:07 |
The alt allele group contains the same genes, but primary_assembly
is detected as False. Note that the alt_allele group does not contain all MBOAT7 genes as seen below and discussed previously in https://github.com/related-sciences/ensembl-genes/issues/9.
Here are the results of our genes query filtered for symbol MBOAT7 on release 110 and 109.
ensembl_gene_id | ensembl_gene_version | gene_symbol | gene_symbol_source_db | gene_biotype | ensembl_source | ensembl_created_date | ensembl_modified_date | coord_system_version | coord_system | chromosome | seq_region_exc_type | seq_region | seq_region_start | seq_region_end | seq_region_strand | primary_assembly |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ENSG00000125505 | 17 | MBOAT7 | HGNC | protein_coding | ensembl_havana | 2008-04-29 11:17:41 | 2018-11-21 17:23:49 | GRCh38 | chromosome | 19 | None | 19 | 54173412 | 54189882 | -1 | 1 |
ENSG00000273592 | 4 | MBOAT7 | HGNC | protein_coding | ensembl_havana | 2014-06-09 10:49:07 | 2015-06-01 18:57:05 | GRCh38 | scaffold | None | None | HSCHR19LRC_PGF1_CTG3_1 | 148191 | 164846 | -1 | 1 |
ENSG00000274194 | 4 | MBOAT7 | HGNC | protein_coding | ensembl | 2014-06-09 10:49:07 | 2015-06-01 18:57:05 | GRCh38 | scaffold | None | None | HSCHR19LRC_PGF2_CTG3_1 | 148221 | 164847 | -1 | 1 |
ENSG00000275118 | 4 | MBOAT7 | HGNC | protein_coding | ensembl_havana | 2014-06-09 10:49:07 | 2015-06-01 18:57:05 | GRCh38 | scaffold | None | None | HSCHR19_4_CTG3_1 | 148220 | 164780 | -1 | 1 |
ENSG00000276935 | 4 | MBOAT7 | HGNC | protein_coding | ensembl_havana | 2014-06-09 10:49:07 | 2015-06-01 18:57:05 | GRCh38 | scaffold | None | None | HSCHR19LRC_COX1_CTG3_1 | 148185 | 164679 | -1 | 1 |
ENSG00000277025 | 4 | MBOAT7 | HGNC | protein_coding | ensembl | 2014-06-09 10:49:07 | 2015-06-01 18:57:05 | GRCh38 | scaffold | None | None | HSCHR19LRC_LRC_J_CTG3_1 | 148221 | 164847 | -1 | 1 |
ENSG00000277733 | 4 | MBOAT7 | HGNC | protein_coding | ensembl | 2014-06-09 10:49:07 | 2015-06-01 18:57:05 | GRCh38 | scaffold | None | None | HSCHR19LRC_LRC_S_CTG3_1 | 148221 | 164847 | -1 | 1 |
ENSG00000277923 | 4 | MBOAT7 | HGNC | protein_coding | ensembl | 2014-06-09 10:49:07 | 2015-06-01 18:57:05 | GRCh38 | scaffold | None | None | HSCHR19LRC_LRC_T_CTG3_1 | 148221 | 164847 | -1 | 1 |
ENSG00000278322 | 4 | MBOAT7 | HGNC | protein_coding | ensembl | 2014-06-09 10:49:07 | 2015-06-01 18:57:05 | GRCh38 | scaffold | None | None | HSCHR19LRC_COX2_CTG3_1 | 148221 | 164847 | -1 | 1 |
ENSG00000278519 | 4 | MBOAT7 | HGNC | protein_coding | ensembl | 2014-06-09 10:49:07 | 2015-06-01 18:57:05 | GRCh38 | scaffold | None | None | HSCHR19LRC_LRC_I_CTG3_1 | 148221 | 164847 | -1 | 1 |
ensembl_gene_id | ensembl_gene_version | gene_symbol | gene_symbol_source_db | gene_biotype | ensembl_source | ensembl_created_date | ensembl_modified_date | coord_system_version | coord_system | chromosome | seq_region_exc_type | seq_region | seq_region_start | seq_region_end | seq_region_strand | primary_assembly |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ENSG00000125505 | 17 | MBOAT7 | HGNC | protein_coding | ensembl_havana | 2008-04-29 11:17:41 | 2018-11-21 17:23:49 | GRCh38 | chromosome | 19 | None | 19 | 54173412 | 54189882 | -1 | 1 |
ENSG00000273592 | 4 | MBOAT7 | HGNC | protein_coding | ensembl_havana | 2014-06-09 10:49:07 | 2015-06-01 18:57:05 | GRCh38 | chromosome | 19 | HAP | CHR_HSCHR19LRC_PGF1_CTG3_1 | 54173824 | 54190479 | -1 | 0 |
ENSG00000274194 | 4 | MBOAT7 | HGNC | protein_coding | ensembl | 2014-06-09 10:49:07 | 2015-06-01 18:57:05 | GRCh38 | chromosome | 19 | HAP | CHR_HSCHR19LRC_PGF2_CTG3_1 | 54173854 | 54190480 | -1 | 0 |
ENSG00000275118 | 4 | MBOAT7 | HGNC | protein_coding | ensembl_havana | 2014-06-09 10:49:07 | 2015-06-01 18:57:05 | GRCh38 | chromosome | 19 | HAP | CHR_HSCHR19_4_CTG3_1 | 54173853 | 54190413 | -1 | 0 |
ENSG00000276935 | 4 | MBOAT7 | HGNC | protein_coding | ensembl_havana | 2014-06-09 10:49:07 | 2015-06-01 18:57:05 | GRCh38 | chromosome | 19 | HAP | CHR_HSCHR19LRC_COX1_CTG3_1 | 54173818 | 54190312 | -1 | 0 |
ENSG00000277025 | 4 | MBOAT7 | HGNC | protein_coding | ensembl | 2014-06-09 10:49:07 | 2015-06-01 18:57:05 | GRCh38 | chromosome | 19 | HAP | CHR_HSCHR19LRC_LRC_J_CTG3_1 | 54173854 | 54190480 | -1 | 0 |
ENSG00000277733 | 4 | MBOAT7 | HGNC | protein_coding | ensembl | 2014-06-09 10:49:07 | 2015-06-01 18:57:05 | GRCh38 | chromosome | 19 | HAP | CHR_HSCHR19LRC_LRC_S_CTG3_1 | 54173854 | 54190480 | -1 | 0 |
ENSG00000277923 | 4 | MBOAT7 | HGNC | protein_coding | ensembl | 2014-06-09 10:49:07 | 2015-06-01 18:57:05 | GRCh38 | chromosome | 19 | HAP | CHR_HSCHR19LRC_LRC_T_CTG3_1 | 54173854 | 54190480 | -1 | 0 |
ENSG00000278322 | 4 | MBOAT7 | HGNC | protein_coding | ensembl | 2014-06-09 10:49:07 | 2015-06-01 18:57:05 | GRCh38 | chromosome | 19 | HAP | CHR_HSCHR19LRC_COX2_CTG3_1 | 54173854 | 54190480 | -1 | 0 |
ENSG00000278519 | 4 | MBOAT7 | HGNC | protein_coding | ensembl | 2014-06-09 10:49:07 | 2015-06-01 18:57:05 | GRCh38 | chromosome | 19 | HAP | CHR_HSCHR19LRC_LRC_I_CTG3_1 | 54173854 | 54190480 | -1 | 0 |
In both releases, we can see that ENSG00000125505
is the representative gene (because its the only one on a chromosome instead of a scaffold). What has changed is that:
seq_region_exc_type
is now null instead of HAP
(meaning haplotype).exc_seq_region_id
, we assume they are on the primary assemblyseq_region
values are slightly different: HSCHR19LRC_PGF1_CTG3_1
in 110, CHR_HSCHR19LRC_PGF1_CTG3_1
in 109Looks like the assembly_exception
table for humans in release 110 is empty, as the following returns no results:
SELECT *
FROM assembly_exception
LIMIT 10
The API docs describe the assembly_exception
table as:
Allows multiple sequence regions to point to the same sequence, analogous to a symbolic link in a filesystem pointing to the actual file. This mechanism has been implemented specifically to support haplotypes and PARs, but may be useful for other similar structures in the future.
The "List of species with populated data:" is only Danio rerio (Zebrafish). So is it intentional that this table is no longer in use for humans?
Without this assembly_exception
is it possible to know that seq_region HSCHR19LRC_PGF1_CTG3_1
is on chromosome 19? Is there a species agnostic way to tell whether a gene is on the primary assembly?
Aleena Mushtaq, Ensembl Outreach Officer and orcid:0000-0003-1624-458X
, provided super helpful answers via the Helpdesk, so pasting below:
It seems that the MBOAT7 alt allele group should include more genes. We will add them in the next release, which might be Ensembl 112 as it may be too late for release 111.
Regarding the questions about the assembly exceptions;
Is it intentional that this table is no longer in use for humans?
The assembly_exception
table has been emptied intentionally. This is related to the changes regarding the annotation of the chromosome Y PAR and that patches and haplotypes are only stored as scaffolds. The table was emptied because otherwise the core API could produce unexpected results with the new changes.
As well as causing issues with the core API when dumping out files, the Y PAR-related changes meant that we no longer needed the PAR-based database entries for any API-projecting of X PAR data onto Y PARs.
We see from the GitHub issue that you are trying to write SQL queries for our core db, so if you wanted to be able to pull out/exclude sequence regions relating to PATCH_FIX
and PATCH_NOVEL
regions, you could use the seq_region_attrib
table, where the attrib_type_id
313
relates to assembly patch fixes, and 314
relates to assembly patch novel regions.
Without this assembly_exception is it possible to know that seq_region
HSCHR19LRC_PGF1_CTG3_1
is on chromosome 19?
It is not possible to link an alternate region with the corresponding primary assembly region in the release 110 database. This information will be present in the dna_align_feature
table in the release 111 database.
Is there a species agnostic way to tell whether a gene is on the primary assembly?
In the seq_region_attrib
table, the primary assembly regions have a toplevel
attribute but do NOT have a non_ref
attribute. At the moment, there is no unique attribute for primary assembly regions only. This should work for other species.
Just to add here in case it helps at all, the toplevel
attribute is attrib_type_id
6 in the db, and non_ref
is attrib_type_id
16. Both of which can be found in the attrib_type
database table.
Ensembl 110 was released on 2023-07-17 and includes a note:
ensembl_genes datasets --species=human --release=110
querieshomo_sapiens_core_110_38
and fails with:The two genes are ENSG00000273592 and ENSG00000276935, both which have symbol MBOAT7.