Closed dhimmel closed 2 years ago
These genes code for the same protein product (also reflected by the UniProt mappings). The cross-reference pipeline attempts to compare exon structure and position when mapping RefSeq transcripts. It allows for some mismatches but if a RefSeq mRNA has matching exons with an Ensembl transcript, then they’ll be matched.
SMN1 https://www.ncbi.nlm.nih.gov/gene/6606 survival motor neuron protein isoform d
NP_000335.1 survival motor neuron protein isoform d [Homo sapiens] MAMSSGGSGGGVPEQEDSVLFRRGTGQSDDSDIWDDTALIKAYDKAVASFKHALKNGDICETSGKPKTTPKRKPAKKNKSQKKNTAASLQQWKVGDKCSAIWSEDGCIYPATIASIDFKRETCVVVYTGYGNREEQNLSDLLSPICEVANNIEQNAQENENESQVSTDESENSRSPGNKSDNIKPKSAPWNSFLPPPPPMPGPRLGPGKPGLKFNGPPPPPPPPPPHLLSCWLPPFPSGPPIIPPPPPICPDSLDDADALGSMLISWYMSGYHTGYYMGFRQNQKEGRCSHSLN
SMN2 https://www.ncbi.nlm.nih.gov/gene/6607 survival motor neuron protein isoform d
NP_059107.1 survival motor neuron protein isoform d [Homo sapiens] MAMSSGGSGGGVPEQEDSVLFRRGTGQSDDSDIWDDTALIKAYDKAVASFKHALKNGDICETSGKPKTTPKRKPAKKNKSQKKNTAASLQQWKVGDKCSAIWSEDGCIYPATIASIDFKRETCVVVYTGYGNREEQNLSDLLSPICEVANNIEQNAQENENESQVSTDESENSRSPGNKSDNIKPKSAPWNSFLPPPPPMPGPRLGPGKPGLKFNGPPPPPPPPPPHLLSCWLPPFPSGPPIIPPPPPICPDSLDDADALGSMLISWYMSGYHTGYYMGFRQNQKEGRCSHSLN
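A minimal sketch for checking that the two RefSeq protein sequences quoted above really are identical. The sequences are copied from the two records; the helper name `first_difference` is only illustrative.

```python
# Protein sequences copied from the NP_000335.1 (SMN1) and NP_059107.1 (SMN2) records above.
SMN1_NP_000335_1 = "MAMSSGGSGGGVPEQEDSVLFRRGTGQSDDSDIWDDTALIKAYDKAVASFKHALKNGDICETSGKPKTTPKRKPAKKNKSQKKNTAASLQQWKVGDKCSAIWSEDGCIYPATIASIDFKRETCVVVYTGYGNREEQNLSDLLSPICEVANNIEQNAQENENESQVSTDESENSRSPGNKSDNIKPKSAPWNSFLPPPPPMPGPRLGPGKPGLKFNGPPPPPPPPPPHLLSCWLPPFPSGPPIIPPPPPICPDSLDDADALGSMLISWYMSGYHTGYYMGFRQNQKEGRCSHSLN"
SMN2_NP_059107_1 = "MAMSSGGSGGGVPEQEDSVLFRRGTGQSDDSDIWDDTALIKAYDKAVASFKHALKNGDICETSGKPKTTPKRKPAKKNKSQKKNTAASLQQWKVGDKCSAIWSEDGCIYPATIASIDFKRETCVVVYTGYGNREEQNLSDLLSPICEVANNIEQNAQENENESQVSTDESENSRSPGNKSDNIKPKSAPWNSFLPPPPPMPGPRLGPGKPGLKFNGPPPPPPPPPPHLLSCWLPPFPSGPPIIPPPPPICPDSLDDADALGSMLISWYMSGYHTGYYMGFRQNQKEGRCSHSLN"


def first_difference(a, b):
    """Return the 0-based index of the first differing position, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One sequence is a strict prefix of the other: they differ where the shorter ends.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


# None here means the two protein products match residue for residue.
print(first_difference(SMN1_NP_000335_1, SMN2_NP_059107_1))
```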
Cross-references on haplotypes and patches are projected from the alt_allele on the primary assembly.
Thanks @michalszpak for your help! Much appreciated.
These genes code for the same protein product
Fascinating! I read a bit more about it:
The full-size protein made from the SMN2 gene is identical to the protein made from a similar gene called SMN1; however, only 10 to 15 percent of all functional SMN protein is produced from the SMN2 gene (the rest is produced from the SMN1 gene). Typically, people have two copies of the SMN1 gene and one to two copies of the SMN2 gene in each cell. However, the number of copies of the SMN2 gene varies, with some people having up to eight copies.
So ensembl genes are mapped to NCBI genes using a transcript matching approach, which in the case of ensembl:ENSG00000205571-to-ncbigene:6606 creates a spurious mapping.
I wonder whether this repository should pick a "primary" mapped NCBI gene for each ensembl gene. When an ensembl gene maps to multiple ncbi genes, we'd compare the ensembl and ncbi gene symbols (gene_symbol and xref_label columns above) to select the primary-mapped-ncbigene for each ensembl gene. Any other heuristics we could use to select the most similar ncbi gene from many? Would this work for human, rat, mouse, and beyond?
Another motivation besides removing spurious mappings is that many use cases for mappings benefit from one-to-one mappings. The proposed approach would create many-to-one mappings, which is still preferable to the current many-to-many.
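A rough sketch of what the proposed symbol-based selection could look like. This is not code from the repository: the `ensembl_gene_id` and `ncbigene_id` field names and the `select_primary` helper are assumptions made for illustration, while `gene_symbol` and `xref_label` follow the column names mentioned in the proposal. The rows reproduce the SMN1/SMN2 mappings discussed in this thread.

```python
from collections import defaultdict

# One row per ensembl-gene-to-ncbigene mapping (field names partly assumed, see above).
mappings = [
    {"ensembl_gene_id": "ENSG00000172062", "gene_symbol": "SMN1", "ncbigene_id": "6606", "xref_label": "SMN1"},
    {"ensembl_gene_id": "ENSG00000205571", "gene_symbol": "SMN2", "ncbigene_id": "6606", "xref_label": "SMN1"},
    {"ensembl_gene_id": "ENSG00000205571", "gene_symbol": "SMN2", "ncbigene_id": "6607", "xref_label": "SMN2"},
]


def select_primary(rows):
    """Pick at most one primary ncbigene per ensembl gene.

    A gene with a single mapping keeps it. A gene with several mappings keeps
    the one whose ncbi symbol equals the ensembl symbol, provided exactly one
    such mapping exists; otherwise the gene is left unresolved (None).
    """
    by_gene = defaultdict(list)
    for row in rows:
        by_gene[row["ensembl_gene_id"]].append(row)

    primary = {}
    for gene_id, candidates in by_gene.items():
        if len(candidates) == 1:
            primary[gene_id] = candidates[0]["ncbigene_id"]
            continue
        matches = [
            c for c in candidates
            if c["gene_symbol"] and c["gene_symbol"] == c["xref_label"]
        ]
        primary[gene_id] = matches[0]["ncbigene_id"] if len(matches) == 1 else None
    return primary


primary = select_primary(mappings)
print(primary)
```

Leaving ambiguous genes as None rather than picking arbitrarily keeps the unresolved cases visible, which matters for species where symbols are missing or unstable.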
Here are all the instances where the ensembl gene_symbol does not match the xref_label (ncbi symbol) for human release 104: ensembl-gene-ncbi-mapping-symbol-mismatch.xlsx. This dataset is helpful for this issue and #5.
Essentially, Ensembl features are mapped to NCBI features based on sequence matching and mRNA location information, which improves the accuracy of the mapping. Due to intrinsic differences between these annotations and the fact that different loci in the genome might code for the same product, the relationship between Ensembl and NCBI features is not necessarily 1-to-1. If you'd like to further filter these mappings then you'll need to use your own judgement, but it will certainly result in information loss, as some mappings might be equally good (100% sequence identity). Please bear in mind that assigned gene symbols are also external mappings and might be unstable or missing (especially in non-human species). I'd suggest taking into account the location information.
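One way to take location into account, sketched under assumptions: score each candidate NCBI gene by how much its genomic interval overlaps the Ensembl gene's interval. The `Locus` type, the helper names, and the coordinates below are all hypothetical placeholders, not real annotation values or Ensembl's own method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Locus:
    chrom: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: int  # +1 or -1


def overlap_fraction(a, b):
    """Shared bases divided by the combined span; 0.0 if on different chromosomes or strands."""
    if a.chrom != b.chrom or a.strand != b.strand:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start) + 1
    if shared <= 0:
        return 0.0
    span = max(a.end, b.end) - min(a.start, b.start) + 1
    return shared / span


def best_by_location(ensembl_locus, candidates):
    """Return the candidate id with the highest overlap, or None if nothing overlaps."""
    scored = {cid: overlap_fraction(ensembl_locus, loc) for cid, loc in candidates.items()}
    best = max(scored, key=scored.get, default=None)
    if best is None or scored[best] == 0.0:
        return None
    return best


# Placeholder coordinates for illustration only (NOT real gene positions).
ensembl_gene = Locus("5", 1000, 2000, 1)
candidates = {
    "ncbi_A": Locus("5", 1000, 2000, 1),
    "ncbi_B": Locus("5", 9000, 10000, 1),
}
best = best_by_location(ensembl_gene, candidates)
print(best)
```

Ties are broken by insertion order here; a real filter would probably need to report equally good candidates instead of silently choosing one.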
In the homo_sapiens_core_104_38 database, ensembl gene SMN2 (ENSG00000205571) maps to two ncbigenes: SMN1 (6606) and SMN2 (6607). This can be seen in the following table that shows all ensembl gene mappings to ncbigenes for SMN1 & SMN2:
Some notes from the table:
ENSG00000172062 / SMN1 only maps to SMN1 in ncbigene and not SMN2
ENSG00000172062 / SMN1 has a single non-representative alt-allele, which is ENSG00000275349
ENSG00000205571 / SMN2 has two non-representative alt-alleles, which are ENSG00000273772 and ENSG00000277773.
ENSG00000205571 should also be applied to the alt alleles.
I'll forward this issue to the Ensembl helpdesk to see if they have any insights on why SMN2 is mapping to both SMN1 & SMN2 in ncbigene and whether this is an error that should be fixed.
Python code to generate the table above: