dhimmel commented 3 years ago

We select a single representative genes for groups of alternative allele genes. These groups are based on the upstream alt_allele table, which provides a mapping between gene_ids and alt_allele_group_ids.

However, there appears to be groups of genes that are alternative alleles of each other that are not included in this table. One example is the set of human genes with the symbol GP6. Will elaborate further in subsequent comments.

dhimmel commented 2 years ago

Here's a spreadsheet that lists all of the representative genes with duplicate symbols in homo_sapiens_core_104_38: ensembl-duplicate-genes-not-in-alt-allele.xlsx. The spreadsheet contains two sheets. The first summarizes duplicate symbols. For example, the GP6 row is:

gene_symbol	n_genes	n_primary_assembly_genes	examples
GP6	9	1	ENSG00000088053, ENSG00000274050, ENSG00000274566

The second lists all representative ensembl genes with duplicate symbols:

ensembl_gene_id	gene_symbol	gene_symbol_source_db	gene_biotype	chromosome	seq_region_exc_type	seq_region	primary_assembly	mhc
ENSG00000088053	GP6	HGNC	protein_coding	19		19	1	no
ENSG00000274050	GP6	EntrezGene	protein_coding	19	HAP	CHR_HSCHR19LRC_LRC_T_CTG3_1	0	no
ENSG00000274566	GP6	EntrezGene	protein_coding	19	HAP	CHR_HSCHR19LRC_LRC_S_CTG3_1	0	no
ENSG00000275633	GP6	EntrezGene	protein_coding	19	HAP	CHR_HSCHR19LRC_COX2_CTG3_1	0	no
ENSG00000275931	GP6	EntrezGene	protein_coding	19	HAP	CHR_HSCHR19LRC_PGF2_CTG3_1	0	no
ENSG00000276065	GP6	EntrezGene	protein_coding	19	HAP	CHR_HSCHR19LRC_LRC_J_CTG3_1	0	no
ENSG00000276211	GP6	EntrezGene	protein_coding	19	HAP	CHR_HSCHR19LRC_COX1_CTG3_1	0	no
ENSG00000278316	GP6	EntrezGene	protein_coding	19	HAP	CHR_HSCHR19LRC_PGF1_CTG3_1	0	no
ENSG00000278670	GP6	EntrezGene	protein_coding	19	HAP	CHR_HSCHR19LRC_LRC_I_CTG3_1	0	no

Looking at the GP6 genes, it appears ENSG00000088053 should be representative and the rest should be non-representative alt alleles. However, these genes are not grouped in the alt_allele table.

Some stats regarding the extent of duplicated symbols:

there are 232 duplicated symbols in total across 1952 representative ensembl gene IDs
151 of these duplicate symbol gene sets have a single gene on the primary assembly
6 of these duplicate symbol gene sets have zero genes on the primary assembly
75 of these duplicate symbol gene sets have two or more genes on the primary assembly
815 duplicated genes have symbol Y_RNA
211 duplicated genes have symbol Metazoa_SRP

Expand for source code

```py import pandas as pd commit = "c87a3194704e073db841c0643f566bc5036e9f75" # homo_sapiens_core_104_38 url = f"https://github.com/related-sciences/ensembl-genes/raw/{commit}/genes.snappy.parquet" genes_df = pd.read_parquet(url) # filter to representative genes, which restricts to 1 gene per alt_allele group repr_genes_df = genes_df.query("ensembl_gene_id == ensembl_representative_gene_id") # find representative genes with duplicate symbols dupl_repr_genes_df = ( repr_genes_df [repr_genes_df.gene_symbol.duplicated(keep=False)] .sort_values(["gene_symbol", "ensembl_gene_id"]) [["ensembl_gene_id", "gene_symbol", "gene_symbol_source_db", "gene_biotype", "chromosome", "seq_region_exc_type", "seq_region", "primary_assembly", "mhc"]] ) # summarize number of duplicates for each symbol def summarize(df): return pd.Series({ "n_genes": len(df), "n_primary_assembly_genes": sum(df.primary_assembly), "examples": ", ".join(df.ensembl_gene_id.head(3)), }) dupl_gene_symbol_df = dupl_repr_genes_df.groupby("gene_symbol").apply(summarize).reset_index().sort_values("n_genes", ascending=False) # export to excel writer = pd.ExcelWriter(path="ensembl-duplicate-genes-not-in-alt-allele.xlsx") with writer: dupl_gene_symbol_df.to_excel(writer, sheet_name="symbols", index=False, freeze_panes=(1, 0)) dupl_repr_genes_df.to_excel(writer, sheet_name="genes", index=False, freeze_panes=(1, 0)) ```

michalszpak commented 2 years ago

Unfortunately, we had trouble setting the alt_allele table. This is currently being reviewed for human and mouse by the Havana team.

dhimmel commented 2 years ago

Good to know. Curious whether this will be fixed in time for release 105.

michalszpak commented 2 years ago

I'm afraid it's too late for release 105, as it's scheduled for the next week.

dhimmel commented 2 years ago

Rerunning the snippet above for homo_sapiens_core_106_38 (output data from 753aa431e0b6ded055843720f7778f6bb17757a5), it looks like this issue hasn't been fixed in release 106. See ensembl-106-duplicate-genes-not-in-alt-allele.xlsx.

@michalszpak any updates from the Havana team? Should we switch to using gene symbols to detect alternative alleles rather than the alt_allele table?

Ben-Ensembl commented 2 years ago

Hi @dhimmel - thank you for your patience while we looked into this in more detail. We are currently working to fix the issue with the alt_allele table in Ensembl 108, which is scheduled for later this year.

dhimmel commented 2 years ago

We are currently working to fix the issue with the alt_allele table in Ensembl 108, which is scheduled for later this year.

Great to hear! Thanks @Ben-Ensembl for looking into this.

dhimmel commented 2 years ago

Ensembl 108

Was excited to see ensembl genes 108 released today! I reran the analysis above based on the output in 57e3c3a5e8d9e3740a48401a4dd1696eda2a4bf9 from homo_sapiens_core_108_38 to produce ensembl-108-duplicate-genes-not-in-alt-allele.xlsx.

In Ensembl 104, there were 151 groups of genes with duplicate symbols that were not grouped by the alt_allele table and where only a single gene was on the primary assembly. This number is down to 18, so that is a positive development! Here are some examples from those 18:

ensembl_gene_id	gene_symbol	gene_symbol_source_db	gene_biotype	chromosome	seq_region_exc_type	seq_region	primary_assembly	mhc
ENSG00000276070	CCL4L2	HGNC	protein_coding	17		17	TRUE	no
ENSG00000276125	CCL4L2	HGNC	protein_coding	17	HAP	CHR_HSCHR17_7_CTG4	FALSE	no
ENSG00000176797	DEFB103A	HGNC	protein_coding	8		8	TRUE	no
ENSG00000285376	DEFB103A	HGNC	protein_coding	8	PATCH_FIX	CHR_HG76_PATCH	FALSE	no
ENSG00000278662	GOLGA6L10	HGNC	protein_coding	15		15	TRUE	no
ENSG00000274042	GOLGA6L10	EntrezGene	protein_coding	15	HAP	CHR_HSCHR15_5_CTG8	FALSE	no

Question: should these examples also be added to the alt_allele table? What prevented them from being grouped via alt alleles in release 108?

There is a new source of duplicate gene symbols that are not grouped by the alt_allele table related to this change from https://www.ensembl.info/2022/10/20/ensembl-108-has-been-released/:

The transcribed_unprocessed_pseudogene biotype was changed to have separate pseudogene and lncRNA gene IDs in human and mouse, which increased the number of lncRNA genes in the total gene count.

Examples:

ensembl_gene_id	gene_symbol	gene_symbol_source_db	gene_biotype	chromosome	seq_region	primary_assembly	mhc
ENSG00000256069	A2MP1	HGNC	transcribed_unprocessed_pseudogene	12	12	TRUE	no
ENSG00000291190	A2MP1	EntrezGene	lncRNA	12	12	TRUE	no
ENSG00000250420	AACSP1	HGNC	transcribed_unprocessed_pseudogene	5	5	TRUE	no
ENSG00000291019	AACSP1	HGNC	lncRNA	5	5	TRUE	no
ENSG00000240602	AADACP1	HGNC	transcribed_unprocessed_pseudogene	3	3	TRUE	no
ENSG00000291076	AADACP1	EntrezGene	lncRNA	3	3	TRUE	no

Question: what is the rationale for having two gene records for transcribed unprocessed pseudogenes? If a biologist mentions A2MP1, which one are they referring to? Are they conceptually different? Should a single gene be selected as representative via the alt_allele table for these cases?

There are also some additional cases with duplicate symbols that do not fall into these two classes such as the plentiful Y_RNA and Metazoa_SRP symbols as well as examples like:

ensembl_gene_id	gene_symbol	gene_symbol_source_db	gene_biotype	chromosome	seq_region_exc_type	seq_region	primary_assembly	mhc
ENSG00000256150	ITFG2-AS1	HGNC	lncRNA	12		12	TRUE	no
ENSG00000258325	ITFG2-AS1	HGNC	lncRNA	12		12	TRUE	no

General question: our goal is to create a gene catalog with alternative alleles collapsed onto representative genes. Ideally there would be one representative ensembl gene per symbol, since anything else tends to mess up downstream bioinformatics analyses by duplicating certain genes and confusing users. Should we switch from using the alt_allele table to detect groups from which to select representative genes to just grouping by symbols? I'm asking primarily about human.

@Ben-Ensembl are you able to help me with these questions? Thanks ahead of time.

amushtaq102 commented 2 years ago

Hi @dhimmel

Thank you for your patience whilst we looked into your query. Please find below comments from the Ensembl Genebuild and HAVANA team.

Q1: When I looked into the missing alt alleles for release 108, my search was based on gene symbols and genomic locations. As I am aware that gene symbols are not always assigned correctly by the Xref pipeline (question 2 is an example of that), I set a stringent filter so that there should be only an alt allele for each reference allele on each alternate region. In addition, the alt allele location on the alternate region should map to the location of the reference allele on the reference chromosome. In the examples provided by you, another gene having the same symbol in the same alternate region has already been added to the alt_allele table. For instance, there is another DEFB103A alt allele gene on CHR_HG76_PATCH, and another GOLGA6L10 alt allele gene on CHR_HSCHR15_5_CTG8. The missing CCL4L2 alt allele on CHR_HSCHR17_7_CTG4 seems a genuine error and we will look into it.

Q2: This is an unintended consequence of having split transcribed_unprocessed_pseudogene genes into their pseudogene and lncRNA components as separate genes. The gene symbol should be assigned to the pseudogene and the lncRNA should have a separate symbol. Seems that the Xref pipeline doesn’t like this . We wanted to simplify our biotypes as we had a large number of different pseudogene biotypes. Moving forward all pseudogene objects will be “pseudogene” with an attribute to add extra information, which is “transcribed” in this case. The lncRNA’s were also getting ignored as they were hidden within the transcribed pseudogene object as they were seen as pseudogenes. This issue must be fixed but we cannot give an exact timeline for this, as it may require a modification of the Xref pipeline.

Regarding ITFG2-AS1 as another example of missing alt alleles. In this case, both genes are annotated on chromosome 12 and overlap each other. It seems that they could be merged. In principle, alt_allele search avoided calling alt_allele genes on the same reference chromosome. My understanding is that the alt_allele table should link genes between reference and alternate regions. Going forward, I think that dubious cases like this should be assessed manually by Havana.

Q3: We plan to revise the alt alleles for release 110 with the addition of new patch regions from GRCh38.p14. For instance, it is unlikely that we add alt alleles for the Y_RNA and Metazoa_SRP symbols and other symbols for small RNA genes that can be found multiple times on the primary assembly. I don't think we will add the transcribed unprocessed pseudogenes from Q2 either. From that point of view, if you must choose a representative gene for every gene symbol, you may need to switch from relying on the alt_allele table to doing your own gene grouping by symbol.

I hope this helps, Ensembl team

dhimmel commented 8 months ago

Thanks @amushtaq102 and Ensembl Team for that nice explanation.

I've updated the analysis for the human release 111: ensembl-111-duplicate-genes-not-in-alt-allele.xlsx. After excluding genes grouped by the alt_allele table, there are now 507 symbols that are duplicated across ensembl gene IDs.

Many of these are the transcribed_unprocessed_pseudogene / lncRNA pairs you touch on. However, there are still many examples that it seems the alt_allele table should pick up, like where only a single ensembl gene is on the primary assembly from the symbol group.

if you must choose a representative gene for every gene symbol, you may need to switch from relying on the alt_allele table to doing your own gene grouping by symbol

The key question we're after is whether the ensembl genes with duplicate symbols conceptually represent a single gene or multiple. I agree the Y_RNA and Metazoa_SRP symbols could be referring to conceptually different genes. But I think the vast majority of the 507 duplicated symbols that are not resolved in the alt_allele table are conceptually singular gene entities. So I'm inclined to group by symbol to effectively broaden the coverage of the alt_allele table. The problem we encounter is that we want a catalog of human genes and it is confusing to users when a single symbol appears as multiple genes.

Ideally there'd be a way to identify the valid exceptions to the one-symbol-one-gene paradigm like the small RNA genes. But absent that, I think our users will benefit more from no duplicated gene symbols among representative genes, even if that is technically incorrect for situations like Y_RNA and Metazoa_SRP.

Expand for code

```py import pandas as pd commit = "8f756b808e7ef75eaa3e32c35b4e7bb1c594ff26" # homo_sapiens_core_111_38 url = f"https://github.com/related-sciences/ensembl-genes/raw/{commit}/genes.snappy.parquet" genes_df = pd.read_parquet(url) # filter to representative genes, which restricts to 1 gene per alt_allele group repr_genes_df = genes_df.query("ensembl_gene_id == ensembl_representative_gene_id") # find representative genes with duplicate symbols dupl_repr_genes_df = ( repr_genes_df [repr_genes_df.gene_symbol.duplicated(keep=False)] .sort_values(["gene_symbol", "ensembl_gene_id"]) [["ensembl_gene_id", "gene_symbol", "gene_symbol_source_db", "gene_biotype", "chromosome", "seq_region", "primary_assembly", "mhc"]] ) # summarize number of duplicates for each symbol def summarize(df): return pd.Series({ "n_genes": len(df), "n_primary_assembly_genes": sum(df.primary_assembly), "examples": ", ".join(df.ensembl_gene_id.head(3)), }) dupl_gene_symbol_df = dupl_repr_genes_df.groupby("gene_symbol").apply(summarize).reset_index().sort_values("n_genes", ascending=False) # export to excel writer = pd.ExcelWriter(path="ensembl-duplicate-genes-not-in-alt-allele.xlsx") with writer: dupl_gene_symbol_df.to_excel(writer, sheet_name="symbols", index=False, freeze_panes=(1, 0)) dupl_repr_genes_df.to_excel(writer, sheet_name="genes", index=False, freeze_panes=(1, 0)) ```

related-sciences / ensembl-genes

Ensembl alt_allele tables does not contains all alternative allele gene groups #9

Ensembl 108