imgag / ngs-bits

Short-read sequencing tools
MIT License

NGSDImportEnsembl: fix SNORD118 transcripts on several chromosomes #497

Closed alex-seitz closed 8 months ago

alex-seitz commented 8 months ago

Hi, we noticed an error in the import of Ensembl data into the NGSD database. The GFF file contains several genes named U8.1 through U8.22. These are different genes on different chromosomes. Apparently the gene SNORD118 has the alias U8, and somehow all of these transcripts are matched to SNORD118. Additionally, the gene U8 appears more than once in Ensembl, e.g. under the following IDs: ENSG00000199713, ENSG00000239148

Is there any way to remedy this?

Best, Alex

marc-sturm commented 8 months ago

I will have a look at it asap

MarvinDo commented 8 months ago

Here is another example which possibly boils down to the same problem: the gene DDX11L2 (ENSG00000236397) has two transcripts on two different chromosomes in NGSD:

ensgid           version  source   chromosome  strand  biotype
ENST00000437401  1        ensembl  2           -       unprocessed pseudogene
ENST00000456328  2        ensembl  1           +       lncRNA

In Ensembl they map to two different genes: ENSG00000236397 & ENSG00000290825
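A check for this kind of inconsistency could be sketched as below. This is only an illustration, not how NGSD or NGSDImportEnsembl does it: the function name and the record layout (gene symbol, transcript ID, chromosome) are assumptions, and the two records are the DDX11L2 transcripts listed above.

```python
from collections import defaultdict

# Illustrative records: (gene symbol, transcript ID, chromosome).
# The values are the two DDX11L2 transcripts from this comment; the tuple
# layout is an assumption, not the NGSD schema.
transcripts = [
    ("DDX11L2", "ENST00000437401", "2"),
    ("DDX11L2", "ENST00000456328", "1"),
]

def genes_on_multiple_chromosomes(records):
    """Return {gene: sorted chromosomes} for genes whose transcripts
    are located on more than one chromosome."""
    chromosomes_by_gene = defaultdict(set)
    for gene, _transcript, chromosome in records:
        chromosomes_by_gene[gene].add(chromosome)
    return {
        gene: sorted(chromosomes)
        for gene, chromosomes in chromosomes_by_gene.items()
        if len(chromosomes) > 1
    }

print(genes_on_multiple_chromosomes(transcripts))
# prints {'DDX11L2': ['1', '2']}
```

Run over all imported transcripts, such a check would list every gene affected by the problem, not only the ones found by hand.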

marc-sturm commented 8 months ago

Fixed in commit: bdd495a9473f70c1384e1345978a113e9b84a2d6

marc-sturm commented 8 months ago

There are still genes with transcripts on several chromosomes: ANKRD20A5P, DDX11L16, DDX11L2, LSP1P5, RPL23AP7, SNORA62, SNORA63, SNORA70, SNORA72, SNORA75, SNORD27, SNORD30, SNORD33, SNORD63, SNORD81

However, they are not easily fixable, as they are caused by several Ensembl genes sharing the same HGNC-approved gene name, e.g.:
www.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000456328
http://www.ensembl.org/Homo_sapiens/Transcript/Summary?g=ENSG00000236397;r=2:113599036-113601261;t=ENST00000437401
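To find such cases before import, one could group Ensembl gene IDs by gene name and report names that occur under more than one ID. The following is a minimal sketch under stated assumptions: the function name and the (gene ID, gene name) pair layout are hypothetical, and the example pairs follow what this thread implies for DDX11L2 rather than a fresh Ensembl lookup.

```python
from collections import defaultdict

# Illustrative (Ensembl gene ID, gene name) pairs. The IDs are the two
# genes mentioned in this thread for DDX11L2; that both carry this name
# is taken from the discussion, not verified here.
ensembl_genes = [
    ("ENSG00000236397", "DDX11L2"),
    ("ENSG00000290825", "DDX11L2"),
]

def names_shared_by_several_genes(genes):
    """Return {gene name: sorted Ensembl gene IDs} for names that are
    used by more than one Ensembl gene."""
    ids_by_name = defaultdict(set)
    for gene_id, name in genes:
        ids_by_name[name].add(gene_id)
    return {
        name: sorted(ids)
        for name, ids in ids_by_name.items()
        if len(ids) > 1
    }

print(names_shared_by_several_genes(ensembl_genes))
# prints {'DDX11L2': ['ENSG00000236397', 'ENSG00000290825']}
```

Names reported this way cannot be resolved by name alone; deciding which Ensembl gene a transcript belongs to would need the gene ID or the chromosome as an additional key.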