related-sciences / ensembl-genes

Extract the Ensembl genes catalog to simple tables
Other
17 stars 4 forks source link

Ensembl human release 110: assembly_exception table is empty #22

Open dhimmel opened 1 year ago

dhimmel commented 1 year ago

Ensembl 110 was released on 2023-07-17 and includes a note:

human genome assembly has been updated to the latest patch release GRCh38.p14. Note, however, that genes on patches will only appear on scaffold coordinates. Further, in the GFF3 annotation files, you will now find that MANE and Ensembl canonical attributes have been added as tags. Y pseudoautosomal region (PAR) genes are now stand-alone genes and are no longer taken from X, but MANE attributes remain on X PAR genes only.

ensembl_genes datasets --species=human --release=110 queries homo_sapiens_core_110_38 and fails with:

ValueError: expected at most 1 primary assembly gene per alt_allele_group

The two genes are ENSG00000273592 and ENSG00000276935, both which have symbol MBOAT7.

dhimmel commented 1 year ago

The error occurred for alt_allele_group_id = 44458:

SELECT
  gene.stable_id AS ensembl_gene_id,
  alt_allele.alt_allele_group_id,
  alt_allele_attrib.attrib = "IS_REPRESENTATIVE" AS alt_allele_is_representative,
  assembly_exception.exc_seq_region_id IS NULL AS primary_assembly,
  seq_region.name AS seq_region,
  alt_allele_attrib.attrib AS alt_allele_attrib,
  gene.created_date AS ensembl_created_date
FROM alt_allele
INNER JOIN gene
  ON gene.gene_id = alt_allele.gene_id
INNER JOIN alt_allele_attrib
  ON alt_allele.alt_allele_id = alt_allele_attrib.alt_allele_id
INNER JOIN seq_region
  ON gene.seq_region_id = seq_region.seq_region_id
LEFT JOIN assembly_exception
  ON seq_region.seq_region_id = assembly_exception.seq_region_id
  -- keep exc_type in (PATCH_FIX, PATCH_NOVEL, HAP)
  -- refs internal Related Sciences issue 606.
  AND NOT assembly_exception.exc_type <=> "PAR"
-- all genes were current when query was written, ensure this is always the case
WHERE gene.is_current
  AND alt_allele_group_id = 44458
ORDER BY alt_allele_group_id, alt_allele_is_representative DESC, primary_assembly DESC, ensembl_created_date, ensembl_gene_id
ensembl_gene_id alt_allele_group_id alt_allele_is_representative primary_assembly seq_region alt_allele_attrib ensembl_created_date
ENSG00000273592 44458 0 1 HSCHR19LRC_PGF1_CTG3_1 AUTOMATICALLY_ASSIGNED 2014-06-09 10:49:07
ENSG00000276935 44458 0 1 HSCHR19LRC_COX1_CTG3_1 AUTOMATICALLY_ASSIGNED 2014-06-09 10:49:07

It looks like the problem could be that primary_assembly is true here, since it looks like both genes are on scaffolds. We detect primary_assembly based on whether exc_seq_region_id IS NULL.

UPDATE: Here's the same query on ensembl 109:

ensembl_gene_id alt_allele_group_id alt_allele_is_representative primary_assembly seq_region alt_allele_attrib ensembl_created_date
ENSG00000273592 44458 0 0 CHR_HSCHR19LRC_PGF1_CTG3_1 AUTOMATICALLY_ASSIGNED 2014-06-09 10:49:07
ENSG00000276935 44458 0 0 CHR_HSCHR19LRC_COX1_CTG3_1 AUTOMATICALLY_ASSIGNED 2014-06-09 10:49:07

The alt allele group contains the same genes, but primary_assembly is detected as False. Note that the alt_allele group does not contain all MBOAT7 genes as seen below and discussed previously in https://github.com/related-sciences/ensembl-genes/issues/9.

dhimmel commented 1 year ago

Here are the results of our genes query filtered for symbol MBOAT7 on release 110 and 109.

release 110

ensembl_gene_id ensembl_gene_version gene_symbol gene_symbol_source_db gene_biotype ensembl_source ensembl_created_date ensembl_modified_date coord_system_version coord_system chromosome seq_region_exc_type seq_region seq_region_start seq_region_end seq_region_strand primary_assembly
ENSG00000125505 17 MBOAT7 HGNC protein_coding ensembl_havana 2008-04-29 11:17:41 2018-11-21 17:23:49 GRCh38 chromosome 19 None 19 54173412 54189882 -1 1
ENSG00000273592 4 MBOAT7 HGNC protein_coding ensembl_havana 2014-06-09 10:49:07 2015-06-01 18:57:05 GRCh38 scaffold None None HSCHR19LRC_PGF1_CTG3_1 148191 164846 -1 1
ENSG00000274194 4 MBOAT7 HGNC protein_coding ensembl 2014-06-09 10:49:07 2015-06-01 18:57:05 GRCh38 scaffold None None HSCHR19LRC_PGF2_CTG3_1 148221 164847 -1 1
ENSG00000275118 4 MBOAT7 HGNC protein_coding ensembl_havana 2014-06-09 10:49:07 2015-06-01 18:57:05 GRCh38 scaffold None None HSCHR19_4_CTG3_1 148220 164780 -1 1
ENSG00000276935 4 MBOAT7 HGNC protein_coding ensembl_havana 2014-06-09 10:49:07 2015-06-01 18:57:05 GRCh38 scaffold None None HSCHR19LRC_COX1_CTG3_1 148185 164679 -1 1
ENSG00000277025 4 MBOAT7 HGNC protein_coding ensembl 2014-06-09 10:49:07 2015-06-01 18:57:05 GRCh38 scaffold None None HSCHR19LRC_LRC_J_CTG3_1 148221 164847 -1 1
ENSG00000277733 4 MBOAT7 HGNC protein_coding ensembl 2014-06-09 10:49:07 2015-06-01 18:57:05 GRCh38 scaffold None None HSCHR19LRC_LRC_S_CTG3_1 148221 164847 -1 1
ENSG00000277923 4 MBOAT7 HGNC protein_coding ensembl 2014-06-09 10:49:07 2015-06-01 18:57:05 GRCh38 scaffold None None HSCHR19LRC_LRC_T_CTG3_1 148221 164847 -1 1
ENSG00000278322 4 MBOAT7 HGNC protein_coding ensembl 2014-06-09 10:49:07 2015-06-01 18:57:05 GRCh38 scaffold None None HSCHR19LRC_COX2_CTG3_1 148221 164847 -1 1
ENSG00000278519 4 MBOAT7 HGNC protein_coding ensembl 2014-06-09 10:49:07 2015-06-01 18:57:05 GRCh38 scaffold None None HSCHR19LRC_LRC_I_CTG3_1 148221 164847 -1 1

release 109

ensembl_gene_id ensembl_gene_version gene_symbol gene_symbol_source_db gene_biotype ensembl_source ensembl_created_date ensembl_modified_date coord_system_version coord_system chromosome seq_region_exc_type seq_region seq_region_start seq_region_end seq_region_strand primary_assembly
ENSG00000125505 17 MBOAT7 HGNC protein_coding ensembl_havana 2008-04-29 11:17:41 2018-11-21 17:23:49 GRCh38 chromosome 19 None 19 54173412 54189882 -1 1
ENSG00000273592 4 MBOAT7 HGNC protein_coding ensembl_havana 2014-06-09 10:49:07 2015-06-01 18:57:05 GRCh38 chromosome 19 HAP CHR_HSCHR19LRC_PGF1_CTG3_1 54173824 54190479 -1 0
ENSG00000274194 4 MBOAT7 HGNC protein_coding ensembl 2014-06-09 10:49:07 2015-06-01 18:57:05 GRCh38 chromosome 19 HAP CHR_HSCHR19LRC_PGF2_CTG3_1 54173854 54190480 -1 0
ENSG00000275118 4 MBOAT7 HGNC protein_coding ensembl_havana 2014-06-09 10:49:07 2015-06-01 18:57:05 GRCh38 chromosome 19 HAP CHR_HSCHR19_4_CTG3_1 54173853 54190413 -1 0
ENSG00000276935 4 MBOAT7 HGNC protein_coding ensembl_havana 2014-06-09 10:49:07 2015-06-01 18:57:05 GRCh38 chromosome 19 HAP CHR_HSCHR19LRC_COX1_CTG3_1 54173818 54190312 -1 0
ENSG00000277025 4 MBOAT7 HGNC protein_coding ensembl 2014-06-09 10:49:07 2015-06-01 18:57:05 GRCh38 chromosome 19 HAP CHR_HSCHR19LRC_LRC_J_CTG3_1 54173854 54190480 -1 0
ENSG00000277733 4 MBOAT7 HGNC protein_coding ensembl 2014-06-09 10:49:07 2015-06-01 18:57:05 GRCh38 chromosome 19 HAP CHR_HSCHR19LRC_LRC_S_CTG3_1 54173854 54190480 -1 0
ENSG00000277923 4 MBOAT7 HGNC protein_coding ensembl 2014-06-09 10:49:07 2015-06-01 18:57:05 GRCh38 chromosome 19 HAP CHR_HSCHR19LRC_LRC_T_CTG3_1 54173854 54190480 -1 0
ENSG00000278322 4 MBOAT7 HGNC protein_coding ensembl 2014-06-09 10:49:07 2015-06-01 18:57:05 GRCh38 chromosome 19 HAP CHR_HSCHR19LRC_COX2_CTG3_1 54173854 54190480 -1 0
ENSG00000278519 4 MBOAT7 HGNC protein_coding ensembl 2014-06-09 10:49:07 2015-06-01 18:57:05 GRCh38 chromosome 19 HAP CHR_HSCHR19LRC_LRC_I_CTG3_1 54173854 54190480 -1 0

In both releases, we can see that ENSG00000125505 is the representative gene (because its the only one on a chromosome instead of a scaffold). What has changed is that:

dhimmel commented 1 year ago

Looks like the assembly_exception table for humans in release 110 is empty, as the following returns no results:

SELECT *
FROM assembly_exception
LIMIT 10

The API docs describe the assembly_exception table as:

Allows multiple sequence regions to point to the same sequence, analogous to a symbolic link in a filesystem pointing to the actual file. This mechanism has been implemented specifically to support haplotypes and PARs, but may be useful for other similar structures in the future.

The "List of species with populated data:" is only Danio rerio (Zebrafish). So is it intentional that this table is no longer in use for humans?

Without this assembly_exception is it possible to know that seq_region HSCHR19LRC_PGF1_CTG3_1 is on chromosome 19? Is there a species agnostic way to tell whether a gene is on the primary assembly?

dhimmel commented 1 year ago

Aleena Mushtaq, Ensembl Outreach Officer and orcid:0000-0003-1624-458X, provided super helpful answers via the Helpdesk, so pasting below:


It seems that the MBOAT7 alt allele group should include more genes. We will add them in the next release, which might be Ensembl 112 as it may be too late for release 111.

Regarding the questions about the assembly exceptions;

Is it intentional that this table is no longer in use for humans?

The assembly_exception table has been emptied intentionally. This is related to the changes regarding the annotation of the chromosome Y PAR and that patches and haplotypes are only stored as scaffolds. The table was emptied because otherwise the core API could produce unexpected results with the new changes.

As well as causing issues with the core API when dumping out files, the Y PAR-related changes meant that we no longer needed the PAR-based database entries for any API-projecting of X PAR data onto Y PARs.

We see from the GitHub issue that you are trying to write SQL queries for our core db, so if you wanted to be able to pull out/exclude sequence regions relating to PATCH_FIX and PATCH_NOVEL regions, you could use the seq_region_attrib table, where the attrib_type_id 313 relates to assembly patch fixes, and 314 relates to assembly patch novel regions.

Without this assembly_exception is it possible to know that seq_region HSCHR19LRC_PGF1_CTG3_1 is on chromosome 19?

It is not possible to link an alternate region with the corresponding primary assembly region in the release 110 database. This information will be present in the dna_align_feature table in the release 111 database.

Is there a species agnostic way to tell whether a gene is on the primary assembly?

In the seq_region_attrib table, the primary assembly regions have a toplevel attribute but do NOT have a non_ref attribute. At the moment, there is no unique attribute for primary assembly regions only. This should work for other species.

Just to add here in case it helps at all, the toplevel attribute is attrib_type_id 6 in the db, and non_ref is attrib_type_id 16. Both of which can be found in the attrib_type database table.