homo_sapiens_core_104_38: SMN2 xrefs SMN1 in EntrezGene

dhimmel commented 2 years ago

In the homo_sapiens_core_104_38 database, ensembl gene SMN2 (ENSG00000205571) maps to two ncbigenes: SMN1 (6606) and SMN2 (6607). This can be seen in the following table that shows all ensembl gene mappings to ncbigenes for SMN1 & SMN2:

| ensembl_gene_id | gene_symbol | ensembl_representative_gene_id | is_representative | xref_source | xref_accession | xref_label | xref_description | xref_info_type | xref_linkage_annotation |
|---|---|---|---|---|---|---|---|---|---|
| ENSG00000172062 | SMN1 | ENSG00000172062 | True | EntrezGene | 6606 | SMN1 | survival of motor neuron 1, telomeric | DEPENDENT | None |
| ENSG00000275349 | SMN1 | ENSG00000172062 | False | EntrezGene | 6606 | SMN1 | survival of motor neuron 1, telomeric | DEPENDENT | None |
| ENSG00000205571 | SMN2 | ENSG00000205571 | True | EntrezGene | 6606 | SMN1 | survival of motor neuron 1, telomeric | DEPENDENT | None |
| ENSG00000205571 | SMN2 | ENSG00000205571 | True | EntrezGene | 6607 | SMN2 | survival of motor neuron 2, centromeric | DEPENDENT | None |
| ENSG00000273772 | SMN2 | ENSG00000205571 | False | EntrezGene | 6606 | SMN1 | survival of motor neuron 1, telomeric | DEPENDENT | None |
| ENSG00000273772 | SMN2 | ENSG00000205571 | False | EntrezGene | 6607 | SMN2 | survival of motor neuron 2, centromeric | DEPENDENT | None |
| ENSG00000277773 | SMN2 | ENSG00000205571 | False | EntrezGene | 6606 | SMN1 | survival of motor neuron 1, telomeric | DEPENDENT | None |
| ENSG00000277773 | SMN2 | ENSG00000205571 | False | EntrezGene | 6607 | SMN2 | survival of motor neuron 2, centromeric | DEPENDENT | None |

Some notes from the table:

- ENSG00000172062 / SMN1 only maps to SMN1 in ncbigene and not SMN2.
- ENSG00000172062 / SMN1 has a single non-representative alt-allele, which is ENSG00000275349.
- ENSG00000205571 / SMN2 has two non-representative alt-alleles, which are ENSG00000273772 and ENSG00000277773.
- Alt-alleles have the same mappings as their representative gene, so any fix to the mappings of ENSG00000205571 should also be applied to its alt-alleles.
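The last note can be checked mechanically. The following is a minimal sketch that rebuilds a three-column subset of the table by hand and lists any gene whose set of NCBI accessions differs from that of its representative gene. The variable names (`table_df`, `xref_sets`, `rep_of`, `inconsistent`) are made up for this illustration, and the accessions are written as strings, which may not match the dtype in the parquet file.

```python
import pandas as pd

# Rows copied from the table above: (gene, representative gene, NCBI accession).
rows = [
    ("ENSG00000172062", "ENSG00000172062", "6606"),
    ("ENSG00000275349", "ENSG00000172062", "6606"),
    ("ENSG00000205571", "ENSG00000205571", "6606"),
    ("ENSG00000205571", "ENSG00000205571", "6607"),
    ("ENSG00000273772", "ENSG00000205571", "6606"),
    ("ENSG00000273772", "ENSG00000205571", "6607"),
    ("ENSG00000277773", "ENSG00000205571", "6606"),
    ("ENSG00000277773", "ENSG00000205571", "6607"),
]
table_df = pd.DataFrame(
    rows,
    columns=["ensembl_gene_id", "ensembl_representative_gene_id", "xref_accession"],
)

# Set of NCBI accessions mapped to each Ensembl gene.
xref_sets = {
    gene_id: frozenset(sub_df["xref_accession"])
    for gene_id, sub_df in table_df.groupby("ensembl_gene_id")
}

# Representative gene for each Ensembl gene (one row per gene).
rep_of = (
    table_df
    .drop_duplicates("ensembl_gene_id")
    .set_index("ensembl_gene_id")["ensembl_representative_gene_id"]
)

# Genes whose accession set differs from their representative's set.
inconsistent = [
    gene_id for gene_id, rep_id in rep_of.items()
    if xref_sets[gene_id] != xref_sets[rep_id]
]
print(inconsistent)
```

An empty list means every alt-allele in this subset carries exactly the mappings of its representative gene.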

I'll forward this issue to the Ensembl helpdesk to see if they have any insights on why SMN2 is mapping to both SMN1 & SMN2 in ncbigene and whether this is an error that should be fixed.

Python code to generate the table above:

```py
import pandas as pd

commit = "c87a3194704e073db841c0643f566bc5036e9f75" # homo_sapiens_core_104_38

# Load the gene and xref tables exported at this commit
url = f"https://github.com/related-sciences/ensembl-genes/raw/{commit}/genes.snappy.parquet"
genes_df = pd.read_parquet(url)
url = f"https://github.com/related-sciences/ensembl-genes/raw/{commit}/xrefs.snappy.parquet"
xrefs_df = pd.read_parquet(url)

# Keep only EntrezGene xrefs labeled SMN1 or SMN2
smn_symbols = {"SMN1", "SMN2"}
smn_df = (
    xrefs_df
    .query("xref_source == 'EntrezGene'")
    .query("xref_label in @smn_symbols")
)

# Attach gene symbols and flag representative genes
smn_df = (
    genes_df
    [["ensembl_gene_id", "gene_symbol", "ensembl_representative_gene_id"]]
    .eval("is_representative = ensembl_gene_id == ensembl_representative_gene_id")
    .merge(smn_df)
    .sort_values(["gene_symbol", "ensembl_gene_id"])
)
smn_df
```

michalszpak commented 2 years ago

These genes code for the same protein product (also reflected by the UniProt mappings). The cross-reference pipeline attempts to compare exon structure and position when mapping RefSeq transcripts. It allows for some mismatches but if a RefSeq mRNA has matching exons with an Ensembl transcript, then they’ll be matched.

SMN1 https://www.ncbi.nlm.nih.gov/gene/6606 survival motor neuron protein isoform d

NP_000335.1 survival motor neuron protein isoform d [Homo sapiens] MAMSSGGSGGGVPEQEDSVLFRRGTGQSDDSDIWDDTALIKAYDKAVASFKHALKNGDICETSGKPKTTPKRKPAKKNKSQKKNTAASLQQWKVGDKCSAIWSEDGCIYPATIASIDFKRETCVVVYTGYGNREEQNLSDLLSPICEVANNIEQNAQENENESQVSTDESENSRSPGNKSDNIKPKSAPWNSFLPPPPPMPGPRLGPGKPGLKFNGPPPPPPPPPPHLLSCWLPPFPSGPPIIPPPPPICPDSLDDADALGSMLISWYMSGYHTGYYMGFRQNQKEGRCSHSLN

SMN2 https://www.ncbi.nlm.nih.gov/gene/6607 survival motor neuron protein isoform d

NP_059107.1 survival motor neuron protein isoform d [Homo sapiens] MAMSSGGSGGGVPEQEDSVLFRRGTGQSDDSDIWDDTALIKAYDKAVASFKHALKNGDICETSGKPKTTPKRKPAKKNKSQKKNTAASLQQWKVGDKCSAIWSEDGCIYPATIASIDFKRETCVVVYTGYGNREEQNLSDLLSPICEVANNIEQNAQENENESQVSTDESENSRSPGNKSDNIKPKSAPWNSFLPPPPPMPGPRLGPGKPGLKFNGPPPPPPPPPPHLLSCWLPPFPSGPPIIPPPPPICPDSLDDADALGSMLISWYMSGYHTGYYMGFRQNQKEGRCSHSLN

Cross-references on haplotypes and patches are projected from the alt_allele on the primary assembly.

dhimmel commented 2 years ago

Thanks @michalszpak for your help! Much appreciated.

> These genes code for the same protein product

Fascinating! I read a bit more about it:

> The full-size protein made from the SMN2 gene is identical to the protein made from a similar gene called SMN1; however, only 10 to 15 percent of all functional SMN protein is produced from the SMN2 gene (the rest is produced from the SMN1 gene). Typically, people have two copies of the SMN1 gene and one to two copies of the SMN2 gene in each cell. However, the number of copies of the SMN2 gene varies, with some people having up to eight copies.

So ensembl genes are mapped to NCBI genes using a transcript matching approach, which in the case of ensembl:ENSG00000205571-to-ncbigene:6606 creates a spurious mapping.

I wonder whether this repository should pick a "primary" mapped NCBI gene for each ensembl gene. When an ensembl gene maps to multiple ncbi genes, we'd compare the ensembl and ncbi gene symbols (gene_symbol and xref_label columns above) to select the primary-mapped-ncbigene for each ensembl gene. Any other heuristics we could use to select the most similar ncbi gene from many? Would this work for human, rat, mouse, and beyond?

Another motivation besides removing spurious mappings is that many use cases for mappings benefit from one-to-one mappings. The proposed approach would create many-to-one mappings, which is still preferable to the current many-to-many.
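To make the proposal concrete, here is a rough sketch of symbol-based selection on the SMN rows from the table above. `pick_primary` is a hypothetical helper, not something this repository provides. The fallback for Ensembl genes with no symbol match (take the first accession in sort order) is an arbitrary assumption made only so the sketch always returns one row per gene.

```python
import pandas as pd

# Rows copied from the table above: (Ensembl gene, Ensembl symbol, NCBI accession, NCBI symbol).
rows = [
    ("ENSG00000172062", "SMN1", "6606", "SMN1"),
    ("ENSG00000275349", "SMN1", "6606", "SMN1"),
    ("ENSG00000205571", "SMN2", "6606", "SMN1"),
    ("ENSG00000205571", "SMN2", "6607", "SMN2"),
    ("ENSG00000273772", "SMN2", "6606", "SMN1"),
    ("ENSG00000273772", "SMN2", "6607", "SMN2"),
    ("ENSG00000277773", "SMN2", "6606", "SMN1"),
    ("ENSG00000277773", "SMN2", "6607", "SMN2"),
]
xref_df = pd.DataFrame(
    rows, columns=["ensembl_gene_id", "gene_symbol", "xref_accession", "xref_label"]
)


def pick_primary(df: pd.DataFrame) -> pd.DataFrame:
    """Keep one NCBI gene per Ensembl gene, preferring a matching symbol.

    Hypothetical helper: when no symbol matches, the first accession in
    sort order is kept, which is an arbitrary tie-break for this sketch.
    """
    return (
        df
        .assign(symbol_match=lambda d: d["gene_symbol"] == d["xref_label"])
        # Symbol matches sort first within each Ensembl gene.
        .sort_values(
            ["ensembl_gene_id", "symbol_match", "xref_accession"],
            ascending=[True, False, True],
        )
        .drop_duplicates("ensembl_gene_id", keep="first")
        .reset_index(drop=True)
    )


primary_df = pick_primary(xref_df)
print(primary_df[["ensembl_gene_id", "gene_symbol", "xref_accession", "xref_label"]])
```

On these rows each Ensembl gene ends up with a single NCBI gene, while several Ensembl genes (a representative and its alt-alleles) still share one NCBI gene, which is the many-to-one shape described above.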

dhimmel commented 2 years ago

Here are all the instances where the Ensembl gene_symbol does not match the xref_label (NCBI symbol) for human release 104: ensembl-gene-ncbi-mapping-symbol-mismatch.xlsx. This dataset is helpful for this issue and #5.

Source code:

```py
import pandas as pd
commit = "c87a3194704e073db841c0643f566bc5036e9f75" # homo_sapiens_core_104_38
url = f"https://github.com/related-sciences/ensembl-genes/raw/{commit}/genes.snappy.parquet"
genes_df = pd.read_parquet(url)
url = f"https://github.com/related-sciences/ensembl-genes/raw/{commit}/xrefs.snappy.parquet"
ncbi_xref_df = pd.read_parquet(url).query("xref_source == 'EntrezGene'")
ncbi_xref_df = (
    genes_df
    [["ensembl_gene_id", "gene_symbol", "gene_description", "ensembl_representative_gene_id"]]
    .eval("is_representative = ensembl_gene_id == ensembl_representative_gene_id")
    .merge(ncbi_xref_df)
    .sort_values(["gene_symbol", "ensembl_gene_id"])
)
(
    ncbi_xref_df
    .query("gene_symbol != xref_label")
    .to_excel("ensembl-gene-ncbi-mapping-symbol-mismatch.xlsx", freeze_panes=(1, 0), index=False)
)
```

michalszpak commented 2 years ago

Essentially, Ensembl features are mapped to NCBI features based on sequence matching and mRNA location information, which improves the accuracy of the mapping. Due to intrinsic differences between these annotations and the fact that different loci in the genome might code for the same product, the relationship between Ensembl and NCBI features is not necessarily 1-to-1. If you'd like to further filter these mappings then you'll need to use your own judgement, but it will certainly result in information loss, as some mappings might be equally good (100% sequence identity). Please bear in mind that assigned gene symbols are also external mappings and might be unstable or missing (especially in non-human species). I'd suggest taking into account the location information.

related-sciences / ensembl-genes

homo_sapiens_core_104_38: SMN2 xrefs SMN1 in EntrezGene #10