Open dhimmel opened 3 years ago
Here's a spreadsheet that lists all of the representative genes with duplicate symbols in homo_sapiens_core_104_38
: ensembl-duplicate-genes-not-in-alt-allele.xlsx. The spreadsheet contains two sheets. The first summarizes duplicate symbols. For example, the GP6 row is:
gene_symbol | n_genes | n_primary_assembly_genes | examples |
---|---|---|---|
GP6 | 9 | 1 | ENSG00000088053, ENSG00000274050, ENSG00000274566 |
The second lists all representative ensembl genes with duplicate symbols:
ensembl_gene_id | gene_symbol | gene_symbol_source_db | gene_biotype | chromosome | seq_region_exc_type | seq_region | primary_assembly | mhc |
---|---|---|---|---|---|---|---|---|
ENSG00000088053 | GP6 | HGNC | protein_coding | 19 | 19 | 1 | no | |
ENSG00000274050 | GP6 | EntrezGene | protein_coding | 19 | HAP | CHR_HSCHR19LRC_LRC_T_CTG3_1 | 0 | no |
ENSG00000274566 | GP6 | EntrezGene | protein_coding | 19 | HAP | CHR_HSCHR19LRC_LRC_S_CTG3_1 | 0 | no |
ENSG00000275633 | GP6 | EntrezGene | protein_coding | 19 | HAP | CHR_HSCHR19LRC_COX2_CTG3_1 | 0 | no |
ENSG00000275931 | GP6 | EntrezGene | protein_coding | 19 | HAP | CHR_HSCHR19LRC_PGF2_CTG3_1 | 0 | no |
ENSG00000276065 | GP6 | EntrezGene | protein_coding | 19 | HAP | CHR_HSCHR19LRC_LRC_J_CTG3_1 | 0 | no |
ENSG00000276211 | GP6 | EntrezGene | protein_coding | 19 | HAP | CHR_HSCHR19LRC_COX1_CTG3_1 | 0 | no |
ENSG00000278316 | GP6 | EntrezGene | protein_coding | 19 | HAP | CHR_HSCHR19LRC_PGF1_CTG3_1 | 0 | no |
ENSG00000278670 | GP6 | EntrezGene | protein_coding | 19 | HAP | CHR_HSCHR19LRC_LRC_I_CTG3_1 | 0 | no |
Looking at the GP6 genes, it appears ENSG00000088053
should be representative and the rest should be non-representative alt alleles. However, these genes are not grouped in the alt_allele table.
Some stats regarding the extent of duplicated symbols:
Unfortunately, we had trouble setting the alt_allele table. This is currently being reviewed for human and mouse by the Havana team.
Good to know. Curious whether this will be fixed in time for release 105.
I'm afraid it's too late for release 105, as it's scheduled for the next week.
Rerunning the snippet above for homo_sapiens_core_106_38
(output data from 753aa431e0b6ded055843720f7778f6bb17757a5), it looks like this issue hasn't been fixed in release 106. See ensembl-106-duplicate-genes-not-in-alt-allele.xlsx
.
@michalszpak any updates from the Havana team? Should we switch to using gene symbols to detect alternative alleles rather than the alt_allele table?
Hi @dhimmel - thank you for your patience while we looked into this in more detail. We are currently working to fix the issue with the alt_allele table in Ensembl 108, which is scheduled for later this year.
We are currently working to fix the issue with the alt_allele table in Ensembl 108, which is scheduled for later this year.
Great to hear! Thanks @Ben-Ensembl for looking into this.
Was excited to see ensembl genes 108 released today! I reran the analysis above based on the output in 57e3c3a5e8d9e3740a48401a4dd1696eda2a4bf9 from homo_sapiens_core_108_38
to produce ensembl-108-duplicate-genes-not-in-alt-allele.xlsx.
In Ensembl 104, there were 151 groups of genes with duplicate symbols that were not grouped by the alt_allele
table and where only a single gene was on the primary assembly. This number is down to 18, so that is a positive development! Here are some examples from those 18:
ensembl_gene_id | gene_symbol | gene_symbol_source_db | gene_biotype | chromosome | seq_region_exc_type | seq_region | primary_assembly | mhc |
---|---|---|---|---|---|---|---|---|
ENSG00000276070 | CCL4L2 | HGNC | protein_coding | 17 | 17 | TRUE | no | |
ENSG00000276125 | CCL4L2 | HGNC | protein_coding | 17 | HAP | CHR_HSCHR17_7_CTG4 | FALSE | no |
ENSG00000176797 | DEFB103A | HGNC | protein_coding | 8 | 8 | TRUE | no | |
ENSG00000285376 | DEFB103A | HGNC | protein_coding | 8 | PATCH_FIX | CHR_HG76_PATCH | FALSE | no |
ENSG00000278662 | GOLGA6L10 | HGNC | protein_coding | 15 | 15 | TRUE | no | |
ENSG00000274042 | GOLGA6L10 | EntrezGene | protein_coding | 15 | HAP | CHR_HSCHR15_5_CTG8 | FALSE | no |
Question: should these examples also be added to the alt_allele table? What prevented them from being grouped via alt alleles in release 108?
There is a new source of duplicate gene symbols that are not grouped by the alt_allele
table related to this change from https://www.ensembl.info/2022/10/20/ensembl-108-has-been-released/:
The
transcribed_unprocessed_pseudogene
biotype was changed to have separate pseudogene and lncRNA gene IDs in human and mouse, which increased the number of lncRNA genes in the total gene count.
Examples:
ensembl_gene_id | gene_symbol | gene_symbol_source_db | gene_biotype | chromosome | seq_region_exc_type | seq_region | primary_assembly | mhc |
---|---|---|---|---|---|---|---|---|
ENSG00000256069 | A2MP1 | HGNC | transcribed_unprocessed_pseudogene | 12 | 12 | TRUE | no | |
ENSG00000291190 | A2MP1 | EntrezGene | lncRNA | 12 | 12 | TRUE | no | |
ENSG00000250420 | AACSP1 | HGNC | transcribed_unprocessed_pseudogene | 5 | 5 | TRUE | no | |
ENSG00000291019 | AACSP1 | HGNC | lncRNA | 5 | 5 | TRUE | no | |
ENSG00000240602 | AADACP1 | HGNC | transcribed_unprocessed_pseudogene | 3 | 3 | TRUE | no | |
ENSG00000291076 | AADACP1 | EntrezGene | lncRNA | 3 | 3 | TRUE | no |
Question: what is the rationale for having two gene records for transcribed unprocessed pseudogenes? If a biologist mentions A2MP1, which one are they referring to? Are they conceptually different? Should a single gene be selected as representative via the alt_allele table for these cases?
There are also some additional cases with duplicate symbols that do not fall into these two classes such as the plentiful Y_RNA
and Metazoa_SRP
symbols as well as examples like:
ensembl_gene_id | gene_symbol | gene_symbol_source_db | gene_biotype | chromosome | seq_region_exc_type | seq_region | primary_assembly | mhc |
---|---|---|---|---|---|---|---|---|
ENSG00000256150 | ITFG2-AS1 | HGNC | lncRNA | 12 | 12 | TRUE | no | |
ENSG00000258325 | ITFG2-AS1 | HGNC | lncRNA | 12 | 12 | TRUE | no |
General question: our goal is to create a gene catalog with alternative alleles collapsed onto representative genes. Ideally there would be one representative ensembl gene per symbol, since anything else tends to mess up downstream bioinformatics analyses by duplicating certain genes and confusing users. Should we switch from using the alt_allele
table to detect groups from which to select representative genes to just grouping by symbols? I'm asking primarily about human.
@Ben-Ensembl are you able to help me with these questions? Thanks ahead of time.
Hi @dhimmel
Thank you for your patience whilst we looked into your query. Please find below comments from the Ensembl Genebuild and HAVANA team.
Q1: When I looked into the missing alt alleles for release 108, my search was based on gene symbols and genomic locations. As I am aware that gene symbols are not always assigned correctly by the Xref pipeline (question 2 is an example of that), I set a stringent filter so that there should be only an alt allele for each reference allele on each alternate region. In addition, the alt allele location on the alternate region should map to the location of the reference allele on the reference chromosome. In the examples provided by you, another gene having the same symbol in the same alternate region has already been added to the alt_allele table. For instance, there is another DEFB103A alt allele gene on CHR_HG76_PATCH, and another GOLGA6L10 alt allele gene on CHR_HSCHR15_5_CTG8. The missing CCL4L2 alt allele on CHR_HSCHR17_7_CTG4 seems a genuine error and we will look into it.
Q2: This is an unintended consequence of having split transcribed_unprocessed_pseudogene genes into their pseudogene and lncRNA components as separate genes. The gene symbol should be assigned to the pseudogene and the lncRNA should have a separate symbol. Seems that the Xref pipeline doesn’t like this . We wanted to simplify our biotypes as we had a large number of different pseudogene biotypes. Moving forward all pseudogene objects will be “pseudogene” with an attribute to add extra information, which is “transcribed” in this case. The lncRNA’s were also getting ignored as they were hidden within the transcribed pseudogene object as they were seen as pseudogenes. This issue must be fixed but we cannot give an exact timeline for this, as it may require a modification of the Xref pipeline.
Regarding ITFG2-AS1 as another example of missing alt alleles. In this case, both genes are annotated on chromosome 12 and overlap each other. It seems that they could be merged. In principle, alt_allele search avoided calling alt_allele genes on the same reference chromosome. My understanding is that the alt_allele table should link genes between reference and alternate regions. Going forward, I think that dubious cases like this should be assessed manually by Havana.
Q3: We plan to revise the alt alleles for release 110 with the addition of new patch regions from GRCh38.p14. For instance, it is unlikely that we add alt alleles for the Y_RNA and Metazoa_SRP symbols and other symbols for small RNA genes that can be found multiple times on the primary assembly. I don't think we will add the transcribed unprocessed pseudogenes from Q2 either. From that point of view, if you must choose a representative gene for every gene symbol, you may need to switch from relying on the alt_allele table to doing your own gene grouping by symbol.
I hope this helps, Ensembl team
Thanks @amushtaq102 and Ensembl Team for that nice explanation.
I've updated the analysis for the human release 111: ensembl-111-duplicate-genes-not-in-alt-allele.xlsx
. After excluding genes grouped by the alt_allele
table, there are now 507 symbols that are duplicated across ensembl gene IDs.
Many of these are the transcribed_unprocessed_pseudogene / lncRNA pairs you touch on. However, there are still many examples that it seems the alt_allele table should pick up, like where only a single ensembl gene is on the primary assembly from the symbol group.
if you must choose a representative gene for every gene symbol, you may need to switch from relying on the alt_allele table to doing your own gene grouping by symbol
The key question we're after is whether the ensembl genes with duplicate symbols conceptually represent a single gene or multiple. I agree the Y_RNA and Metazoa_SRP symbols could be referring to conceptually different genes. But I think the vast majority of the 507 duplicated symbols that are not resolved in the alt_allele
table are conceptually singular gene entities. So I'm inclined to group by symbol to effectively broaden the coverage of the alt_allele
table. The problem we encounter is that we want a catalog of human genes and it is confusing to users when a single symbol appears as multiple genes.
Ideally there'd be a way to identify the valid exceptions to the one-symbol-one-gene paradigm like the small RNA genes. But absent that, I think our users will benefit more from no duplicated gene symbols among representative genes, even if that is technically incorrect for situations like Y_RNA and Metazoa_SRP.
We select a single representative genes for groups of alternative allele genes. These groups are based on the upstream
alt_allele
table, which provides a mapping betweengene_id
s andalt_allele_group_id
s.However, there appears to be groups of genes that are alternative alleles of each other that are not included in this table. One example is the set of human genes with the symbol GP6. Will elaborate further in subsequent comments.