related-sciences / ensembl-genes

Extract the Ensembl genes catalog to simple tables

homo_sapiens_core_104_38: SMN2 xrefs SMN1 in EntrezGene #10

Closed dhimmel closed 2 years ago

dhimmel commented 2 years ago

In the homo_sapiens_core_104_38 database, ensembl gene SMN2 (ENSG00000205571) maps to two ncbigenes: SMN1 (6606) and SMN2 (6607). This can be seen in the following table that shows all ensembl gene mappings to ncbigenes for SMN1 & SMN2:

| ensembl_gene_id | gene_symbol | ensembl_representative_gene_id | is_representative | xref_source | xref_accession | xref_label | xref_description | xref_info_type | xref_linkage_annotation |
|---|---|---|---|---|---|---|---|---|---|
| ENSG00000172062 | SMN1 | ENSG00000172062 | True | EntrezGene | 6606 | SMN1 | survival of motor neuron 1, telomeric | DEPENDENT | None |
| ENSG00000275349 | SMN1 | ENSG00000172062 | False | EntrezGene | 6606 | SMN1 | survival of motor neuron 1, telomeric | DEPENDENT | None |
| ENSG00000205571 | SMN2 | ENSG00000205571 | True | EntrezGene | 6606 | SMN1 | survival of motor neuron 1, telomeric | DEPENDENT | None |
| ENSG00000205571 | SMN2 | ENSG00000205571 | True | EntrezGene | 6607 | SMN2 | survival of motor neuron 2, centromeric | DEPENDENT | None |
| ENSG00000273772 | SMN2 | ENSG00000205571 | False | EntrezGene | 6606 | SMN1 | survival of motor neuron 1, telomeric | DEPENDENT | None |
| ENSG00000273772 | SMN2 | ENSG00000205571 | False | EntrezGene | 6607 | SMN2 | survival of motor neuron 2, centromeric | DEPENDENT | None |
| ENSG00000277773 | SMN2 | ENSG00000205571 | False | EntrezGene | 6606 | SMN1 | survival of motor neuron 1, telomeric | DEPENDENT | None |
| ENSG00000277773 | SMN2 | ENSG00000205571 | False | EntrezGene | 6607 | SMN2 | survival of motor neuron 2, centromeric | DEPENDENT | None |

Some notes from the table:

I'll forward this issue to the Ensembl helpdesk to see if they have any insights on why SMN2 is mapping to both SMN1 & SMN2 in ncbigene and whether this is an error that should be fixed.

Python code to generate the table above:

```py
import pandas as pd

# Pin to the commit that holds the homo_sapiens_core_104_38 export
commit = "c87a3194704e073db841c0643f566bc5036e9f75"  # homo_sapiens_core_104_38
url = f"https://github.com/related-sciences/ensembl-genes/raw/{commit}/genes.snappy.parquet"
genes_df = pd.read_parquet(url)
url = f"https://github.com/related-sciences/ensembl-genes/raw/{commit}/xrefs.snappy.parquet"
xrefs_df = pd.read_parquet(url)

# Keep only EntrezGene (ncbigene) xrefs whose label is SMN1 or SMN2
smn_symbols = {"SMN1", "SMN2"}
smn_df = (
    xrefs_df
    .query("xref_source == 'EntrezGene'")
    .query("xref_label in @smn_symbols")
)

# Attach gene-level columns and flag representative genes
smn_df = (
    genes_df
    [["ensembl_gene_id", "gene_symbol", "ensembl_representative_gene_id"]]
    .eval("is_representative = ensembl_gene_id == ensembl_representative_gene_id")
    .merge(smn_df)
    .sort_values(["gene_symbol", "ensembl_gene_id"])
)
smn_df
```

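
To find other Ensembl genes in the same situation as SMN2, one option is to count distinct ncbigene accessions per Ensembl gene. This is only a minimal sketch on three toy rows copied from the table above; the variable names `xrefs`, `n_ncbi`, and `multi_mapped` are illustrative and not part of the repository.

```python
import pandas as pd

# Toy rows mirroring two columns of the table above (not the full dataset).
xrefs = pd.DataFrame(
    {
        "ensembl_gene_id": ["ENSG00000172062", "ENSG00000205571", "ENSG00000205571"],
        "xref_accession": ["6606", "6606", "6607"],
    }
)

# Count distinct NCBI genes per Ensembl gene and keep those with more than one.
n_ncbi = xrefs.groupby("ensembl_gene_id")["xref_accession"].nunique()
multi_mapped = n_ncbi[n_ncbi > 1]
print(multi_mapped)
```

On the real data, the same `groupby` could presumably be applied to the EntrezGene rows of `xrefs_df`.
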
michalszpak commented 2 years ago

These genes code for the same protein product (also reflected by the UniProt mappings). The cross-reference pipeline attempts to compare exon structure and position when mapping RefSeq transcripts. It allows for some mismatches but if a RefSeq mRNA has matching exons with an Ensembl transcript, then they’ll be matched.

SMN1 https://www.ncbi.nlm.nih.gov/gene/6606 survival motor neuron protein isoform d

NP_000335.1 survival motor neuron protein isoform d [Homo sapiens] MAMSSGGSGGGVPEQEDSVLFRRGTGQSDDSDIWDDTALIKAYDKAVASFKHALKNGDICETSGKPKTTPKRKPAKKNKSQKKNTAASLQQWKVGDKCSAIWSEDGCIYPATIASIDFKRETCVVVYTGYGNREEQNLSDLLSPICEVANNIEQNAQENENESQVSTDESENSRSPGNKSDNIKPKSAPWNSFLPPPPPMPGPRLGPGKPGLKFNGPPPPPPPPPPHLLSCWLPPFPSGPPIIPPPPPICPDSLDDADALGSMLISWYMSGYHTGYYMGFRQNQKEGRCSHSLN

SMN2 https://www.ncbi.nlm.nih.gov/gene/6607 survival motor neuron protein isoform d

NP_059107.1 survival motor neuron protein isoform d [Homo sapiens] MAMSSGGSGGGVPEQEDSVLFRRGTGQSDDSDIWDDTALIKAYDKAVASFKHALKNGDICETSGKPKTTPKRKPAKKNKSQKKNTAASLQQWKVGDKCSAIWSEDGCIYPATIASIDFKRETCVVVYTGYGNREEQNLSDLLSPICEVANNIEQNAQENENESQVSTDESENSRSPGNKSDNIKPKSAPWNSFLPPPPPMPGPRLGPGKPGLKFNGPPPPPPPPPPHLLSCWLPPFPSGPPIIPPPPPICPDSLDDADALGSMLISWYMSGYHTGYYMGFRQNQKEGRCSHSLN

Cross-references on haplotypes and patches are projected from the alt_allele on the primary assembly.

dhimmel commented 2 years ago

Thanks @michalszpak for your help! Much appreciated.

> These genes code for the same protein product

Fascinating! I read a bit more about it:

> The full-size protein made from the SMN2 gene is identical to the protein made from a similar gene called SMN1; however, only 10 to 15 percent of all functional SMN protein is produced from the SMN2 gene (the rest is produced from the SMN1 gene). Typically, people have two copies of the SMN1 gene and one to two copies of the SMN2 gene in each cell. However, the number of copies of the SMN2 gene varies, with some people having up to eight copies.

So ensembl genes are mapped to NCBI genes using a transcript matching approach, which in the case of ensembl:ENSG00000205571-to-ncbigene:6606 creates a spurious mapping.

I wonder whether this repository should pick a "primary" mapped NCBI gene for each ensembl gene. When an ensembl gene maps to multiple ncbi genes, we'd compare the ensembl and ncbi gene symbols (gene_symbol and xref_label columns above) to select the primary-mapped-ncbigene for each ensembl gene. Any other heuristics we could use to select the most similar ncbi gene from many? Would this work for human, rat, mouse, and beyond?

Another motivation besides removing spurious mappings is that many use cases for mappings benefit from one-to-one mappings. The proposed approach would create many-to-one mappings, which is still preferable to the current many-to-many.
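
The symbol-comparison idea above could be sketched roughly as follows. This is a hypothetical helper, not something the repository implements: `select_primary_ncbi` is an invented name, and the tie-break on `xref_accession` when no symbol matches (or several do) is an arbitrary assumption that would need a better heuristic.

```python
import pandas as pd


def select_primary_ncbi(xref_df: pd.DataFrame) -> pd.DataFrame:
    """Keep one NCBI xref per Ensembl gene, preferring rows whose symbols match.

    Hypothetical heuristic: ties are broken by xref_accession (string order),
    which is an arbitrary placeholder.
    """
    ranked = xref_df.assign(
        symbol_match=xref_df["gene_symbol"] == xref_df["xref_label"]
    ).sort_values(
        ["ensembl_gene_id", "symbol_match", "xref_accession"],
        ascending=[True, False, True],  # symbol matches sort first within a gene
    )
    return ranked.drop_duplicates("ensembl_gene_id", keep="first").reset_index(drop=True)


# Toy rows taken from the SMN1/SMN2 table earlier in this issue.
example = pd.DataFrame(
    {
        "ensembl_gene_id": ["ENSG00000172062", "ENSG00000205571", "ENSG00000205571"],
        "gene_symbol": ["SMN1", "SMN2", "SMN2"],
        "xref_accession": ["6606", "6606", "6607"],
        "xref_label": ["SMN1", "SMN1", "SMN2"],
    }
)
primary = select_primary_ncbi(example)
print(primary[["ensembl_gene_id", "xref_accession", "symbol_match"]])
```

On these rows the sketch keeps 6606 for ENSG00000172062 and 6607 for ENSG00000205571. Keeping the `symbol_match` column makes it easy to see which selections fell back to the arbitrary tie-break.
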

dhimmel commented 2 years ago

Here are all the instances where the ensembl gene_symbol does not match the xref_label (ncbi symbol) for human release 104: ensembl-gene-ncbi-mapping-symbol-mismatch.xlsx. This dataset is helpful for this issue and #5.

Source code:

```py
import pandas as pd

commit = "c87a3194704e073db841c0643f566bc5036e9f75"  # homo_sapiens_core_104_38
url = f"https://github.com/related-sciences/ensembl-genes/raw/{commit}/genes.snappy.parquet"
genes_df = pd.read_parquet(url)
url = f"https://github.com/related-sciences/ensembl-genes/raw/{commit}/xrefs.snappy.parquet"
ncbi_xref_df = pd.read_parquet(url).query("xref_source == 'EntrezGene'")
ncbi_xref_df = (
    genes_df
    [["ensembl_gene_id", "gene_symbol", "gene_description", "ensembl_representative_gene_id"]]
    .eval("is_representative = ensembl_gene_id == ensembl_representative_gene_id")
    .merge(ncbi_xref_df)
    .sort_values(["gene_symbol", "ensembl_gene_id"])
)
(
    ncbi_xref_df
    .query("gene_symbol != xref_label")
    .to_excel("ensembl-gene-ncbi-mapping-symbol-mismatch.xlsx", freeze_panes=(1, 0), index=False)
)
```

michalszpak commented 2 years ago

Essentially, Ensembl features are mapped to NCBI features based on sequence matching and mRNA location information, which improves the accuracy of the mapping. Due to intrinsic differences between these annotations and the fact that different loci in the genome might code for the same product, the relationship between Ensembl and NCBI features is not necessarily 1-to-1. If you'd like to further filter these mappings then you'll need to use your own judgement, but it will certainly result in information loss, as some mappings might be equally good (100% sequence identity). Please bear in mind that assigned gene symbols are also external mappings and might be unstable or missing (especially in non-human species). I'd suggest taking into account the location information.
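
One way the location suggestion might be used is to score each candidate NCBI gene by how much its genomic interval overlaps the Ensembl gene's interval. The sketch below is an assumption-laden illustration: the function `interval_jaccard`, the candidate labels, and all coordinates are made up for the example, and neither NCBI coordinates nor such a score are part of the tables in this repository as shown in this thread.

```python
def interval_jaccard(start_a: int, end_a: int, start_b: int, end_b: int) -> float:
    """Jaccard overlap of two closed, 1-based intervals.

    Assumes both intervals are on the same chromosome and strand.
    """
    intersection = max(0, min(end_a, end_b) - max(start_a, start_b) + 1)
    union = (end_a - start_a + 1) + (end_b - start_b + 1) - intersection
    return intersection / union


# Made-up coordinates for one Ensembl gene and two candidate NCBI genes.
ensembl_interval = (1_000, 1_999)
candidates = {
    "ncbi_gene_A": (1_000, 1_999),  # identical span
    "ncbi_gene_B": (5_000, 5_999),  # same length, different locus
}

# Score each candidate and pick the one with the largest overlap.
scores = {
    name: interval_jaccard(*ensembl_interval, *interval)
    for name, interval in candidates.items()
}
best = max(scores, key=scores.get)
print(scores, best)
```

Such a score would only complement, not replace, the sequence-based mapping, since equally good mappings (100% sequence identity) at different loci are exactly the case where filtering loses information.
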