gpertea / stringtie

Transcript assembly and quantification for RNA-Seq
MIT License

segmentation fault with long read data [ stringtie v2.2.1 ] #356

Closed zhixue closed 2 years ago

zhixue commented 2 years ago

Hi, thanks for the wonderful tool for long RNA reads analysis~

I am trying to run StringTie2 on rice ONT raw reads (FASTQ) by first running the uLTRA aligner (v0.0.4) and then providing the generated sorted BAM file to StringTie (v2.2.1). I have checked the BAM file in IGV and it looks fine.

Moreover, I split the BAM file by chromosome to test, and found that the 90,001st to 95,000th sorted reads cause the "segmentation fault".

The bam file is here (5.8M) chr3_head90000_95000.bam.

The commands are as follows:

# ultra
uLTRA align $ref ${sample}.fq ${sample} --index ~/ricerna/ultra_bam/msu7_ultra_index --ont --t ${th} --prefix ${sample}_msu7 --use_NAM_seeds

# stringtie 
~/tool/stringtie-2.2.1.Linux_x86_64/stringtie -p 1 -L -l N1i1Chr3 -o chr3_head90000_95000.gtf chr3_head90000_95000.bam
#### Segmentation fault ####

gpertea commented 2 years ago

Thank you for reporting this and providing the debug data -- it seems that there is a particular BAM record in the uLTRA output that stringtie has trouble parsing properly, I'll be fixing that shortly.

gpertea commented 2 years ago

The problem seems to be related to record 880af412-ef82-474a-9e85-e6df5784e5ac having an alignment that ends with an intron (the CIGAR string ends with =9I4=1X2=1X5D230N). That does not quite make sense by itself, unless it is a peculiar way of suggesting that the read alignment ends exactly at an intron boundary? However, that alignment does not have a transcription strand assigned, which makes that explanation rather unlikely.

I can modify StringTie to ignore that kind of unusual alignment (hanging intron with no terminal exon), but I suspect the problem might be deeper: it could be an alignment bug, and perhaps it should be reported to the uLTRA aligner author.

Most SAM processing tools, including IGV, seem to silently ignore this issue, so I guess I'll do the same. That is certainly preferable to the current crash, which is caused by the unexpected structural anomaly: a mismatch between the number of "exons" and the number of introns detected in that alignment.
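
To illustrate the anomaly described above, here is a minimal Python sketch that counts exon blocks and introns in a CIGAR string. It is not StringTie's code; the function names (`parse_cigar`, `count_exons_and_introns`, `is_consistent`) are hypothetical, and it assumes that an "exon" is any run of operations between N operations that contains at least one M, = or X. Only the tail of the CIGAR was quoted in this thread, so the example uses that tail with the leading partial operation dropped.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
ALIGNED = set("M=X")  # operations that align read bases to reference bases


def parse_cigar(cigar):
    """Split a CIGAR string into (length, op) pairs; reject unparsed input."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return ops


def count_exons_and_introns(cigar):
    """Return (exon_blocks, introns) for a spliced alignment.

    Assumption for this sketch: a block only counts as an exon if it
    contains at least one M, = or X operation.
    """
    exons = 0
    introns = 0
    block_has_aligned = False
    for _, op in parse_cigar(cigar):
        if op == "N":
            introns += 1
            exons += int(block_has_aligned)
            block_has_aligned = False
        elif op in ALIGNED:
            block_has_aligned = True
    exons += int(block_has_aligned)  # close the final block
    return exons, introns


def is_consistent(cigar):
    """A well-formed spliced alignment has one more exon than introns."""
    exons, introns = count_exons_and_introns(cigar)
    return exons == introns + 1


# Tail of the CIGAR quoted above (leading partial operation dropped).
print(count_exons_and_introns("9I4=1X2=1X5D230N"))  # (1, 1)
print(is_consistent("9I4=1X2=1X5D230N"))            # False
print(is_consistent("10M100N10M"))                  # True
```

Code that assumes every intron is flanked by two exons would see one exon and one intron here, which is the kind of mismatch mentioned above.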

gpertea commented 2 years ago

addressed by commit 996f585

zhixue commented 2 years ago

Thanks for your rapid response!

I have re-downloaded the latest version of stringtie and run this sample successfully! I have also reported this case to the uLTRA aligner author.

Maybe the uLTRA output contains something unexpected in its SAM records, because I have another sample that causes a "segmentation fault". In a similar way, I have narrowed the problem down to a subset of the alignment records on Chr1, but I cannot infer more.

The bam file is here (28K) Sample2_Chr1head10900_11000.bam.

The command is as follows:

# stringtie 
~/tool/stringtie_996f585/stringtie -p 1 -L -l S2c1 -o Sample2_Chr1head10900_11000.gtf Sample2_Chr1head10900_11000.bam
#### Segmentation fault ####

gpertea commented 2 years ago

Hmm, this was the same issue of a hanging intron with no terminal exon, but this time capped by an insertion (the CIGAR of read 03debcb9-2135-431b-bbf0-ff10c64983d1 ends with 1X3=3I1=59N2I).

I'll add a more robust check there: if there is no M/X/= operation preceding the first intron (N) or following the last intron, that intron should be discarded.
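
The check described above could be sketched in Python roughly as follows. This is an illustration only, not the code in the actual commit: the helper names (`parse_cigar`, `drop_hanging_introns`, `to_string`) are hypothetical, and the sketch only edits the list of CIGAR operations. It does not adjust the alignment start position, which a real implementation would need to handle when a leading intron is discarded.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
ALIGNED = set("M=X")  # operations that align read bases to reference bases


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, op) pairs."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def to_string(ops):
    """Join (length, op) pairs back into a CIGAR string."""
    return "".join(f"{n}{op}" for n, op in ops)


def drop_hanging_introns(ops):
    """Discard introns (N) with no M/X/= before the first or after the last.

    Returns a new list; the input is left unchanged.
    """
    ops = list(ops)

    # Leading side: drop the first N while nothing aligned precedes it.
    while True:
        first = next((i for i, (_, op) in enumerate(ops) if op == "N"), None)
        if first is None or any(op in ALIGNED for _, op in ops[:first]):
            break
        del ops[first]

    # Trailing side: drop the last N while nothing aligned follows it.
    while True:
        last = next(
            (i for i in range(len(ops) - 1, -1, -1) if ops[i][1] == "N"), None
        )
        if last is None or any(op in ALIGNED for _, op in ops[last + 1:]):
            break
        del ops[last]

    return ops


# CIGAR tails quoted in this thread (first one without its leading partial op).
for tail in ("9I4=1X2=1X5D230N", "1X3=3I1=59N2I", "10M100N10M"):
    print(tail, "->", to_string(drop_hanging_introns(parse_cigar(tail))))
# 9I4=1X2=1X5D230N -> 9I4=1X2=1X5D
# 1X3=3I1=59N2I -> 1X3=3I1=2I
# 10M100N10M -> 10M100N10M
```

Testing for an aligned operation on each side, rather than testing whether the CIGAR literally ends with N, is what lets the same check cover the second case, where the hanging intron is followed by an insertion.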

gpertea commented 2 years ago

Addressed by 62551bb.

zhixue commented 2 years ago

It works. Thank you~