gpertea / stringtie

Transcript assembly and quantification for RNA-Seq
MIT License

Issue with reference annotation genes with identical coordinates #430

Open N-Hoffmann opened 1 month ago

N-Hoffmann commented 1 month ago

Hello @gpertea, thanks for developing stringtie!

I am working on integrating stringtie2 into a Nextflow pipeline that produces extended annotations from a user-provided reference annotation and long-read sequencing data. We use stringtie to assemble novel genes and transcripts from a BAM file, which we then add to the annotation.

I am running into an issue when I use an annotation from RefSeq or Ensembl that contains overlapping reference genes with two distinct reference gene_id values. It seems that stringtie removes one of the two genes from its output. This is the case, for instance, with "CHTF8" and "DERPC" (RefSeq NM_001039690 and NM_001366606, respectively; GENCODE ENSG00000168802.14 and ENSG00000286140.2).

[Screenshot: 2024-06-04 at 15:05:13]
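
To see whether other gene pairs in the annotation share the same coordinates, a minimal Python sketch like the one below could be used. It assumes a standard tab-separated GTF with a `gene_id` attribute in column 9 and derives each gene's span from all of its features. The function names `gene_spans` and `genes_with_identical_spans` are hypothetical helpers for illustration, not part of stringtie.

```python
import re
from collections import defaultdict

# Matches the gene_id attribute but not ref_gene_id (no word boundary inside it).
GENE_ID_RE = re.compile(r'\bgene_id "([^"]+)"')


def gene_spans(gtf_lines):
    """Map (gene_id, chrom, strand) to the (start, end) span over all its features."""
    spans = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip blank or malformed lines
        match = GENE_ID_RE.search(fields[8])
        if match is None:
            continue
        key = (match.group(1), fields[0], fields[6])
        start, end = int(fields[3]), int(fields[4])
        if key in spans:
            # Extend the span with every additional feature of the same gene.
            start = min(start, spans[key][0])
            end = max(end, spans[key][1])
        spans[key] = (start, end)
    return spans


def genes_with_identical_spans(gtf_lines):
    """Return {(chrom, start, end, strand): [gene_ids]} for spans shared by 2+ genes."""
    by_span = defaultdict(list)
    for (gene_id, chrom, strand), (start, end) in gene_spans(gtf_lines).items():
        by_span[(chrom, start, end, strand)].append(gene_id)
    return {span: sorted(ids) for span, ids in by_span.items() if len(ids) > 1}
```

Passing an open file handle for the reference GTF to `genes_with_identical_spans` would list every group of genes that occupy exactly the same interval on the same strand. Genes that merely overlap without matching start and end are not reported by this sketch.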

When I run stringtie with this annotation, only DERPC appears in the output GTF. Ideally, I would like both genes to be present in the stringtie output.
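
To list which reference genes are absent from the output, a rough Python sketch such as the following could compare the two files. It assumes that output transcripts matched to the reference carry a `ref_gene_id` attribute, and that the reference uses `gene_id`. The helper names `attribute_values` and `missing_reference_genes` are hypothetical. Note that a reference gene may also be absent simply because it has no read support, so the result is only a starting point.

```python
import re

# \b keeps this pattern from matching inside ref_gene_id.
GENE_ID_RE = re.compile(r'\bgene_id "([^"]+)"')
REF_GENE_ID_RE = re.compile(r'\bref_gene_id "([^"]+)"')


def attribute_values(gtf_lines, pattern):
    """Collect every value captured by `pattern` in the attribute column of a GTF."""
    values = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip blank or malformed lines
        match = pattern.search(fields[8])
        if match:
            values.add(match.group(1))
    return values


def missing_reference_genes(ref_lines, out_lines):
    """Return reference gene_ids that never appear as ref_gene_id in the output."""
    reference_ids = attribute_values(ref_lines, GENE_ID_RE)
    retained_ids = attribute_values(out_lines, REF_GENE_ID_RE)
    return sorted(reference_ids - retained_ids)
```

Called with open handles for the reference GTF and the assembled GTF, `missing_reference_genes` returns a sorted list of the reference gene IDs that have no matching transcript in the output.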

I am using v2.2.3 with this command:

stringtie \
-L \
-p ${params.maxCpu} \
-G ${ref} \
-o ${bam.baseName}_stringtie_assemble.gtf \
${bam}

Do you have any advice or suggestions for this case? Thank you very much!