gpertea / stringtie

Transcript assembly and quantification for RNA-Seq
MIT License

when dealing with big genomes, the error occurs: "the input alignment file is not sorted!" #391

Open dujuanhua6308 opened 1 year ago

dujuanhua6308 commented 1 year ago

Hi, I ran into a similar problem when using StringTie: "Error: the input alignment file is not sorted! Alignments (965520) already found for chr1_1 !". I was mapping to the genome of Pinus tabuliformis, which is more than 20GB in total, with each chromosome larger than 2GB. My colleague worked around this problem by dividing each chromosome into two parts, after which the error no longer appeared. However, the genome she worked with was small, with chromosomes of about 1GB each, whereas my chromosomes are much larger. Should I follow her method and divide each chromosome into at least four parts? Are there any other methods? I hope to get your response.

gpertea commented 1 year ago

I see, I have to switch to 64-bit integers for all the genomic coordinates in the data structures I'm using. Until then, indeed the workaround is to split every scaffold larger than 2GB into <2GB long parts. I know it's a terrible workaround but that's the only one I can think of right now before a proper fix in the StringTie code, sorry.
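
The splitting workaround described above can be sketched as a small coordinate calculation. This is only an illustration, not part of StringTie: the function name `split_scaffold`, the `name_1`, `name_2` naming scheme, and the choice of the signed 32-bit maximum as the default limit are all assumptions made for the example.

```python
from typing import Iterator, Tuple

# Largest value a signed 32-bit integer can hold; assumed here to be the
# relevant limit, based on the "switch to 64-bit integers" remark above.
INT32_MAX = 2**31 - 1


def split_scaffold(
    name: str, length: int, max_len: int = INT32_MAX
) -> Iterator[Tuple[str, int, int]]:
    """Yield (part_name, start, end) pieces no longer than max_len.

    Coordinates are 0-based and half-open. Scaffolds that already fit
    are yielded unchanged under their original name.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if length <= 0:
        return
    if length <= max_len:
        yield name, 0, length
        return
    n_parts = -(-length // max_len)   # ceiling division
    part_len = -(-length // n_parts)  # balance the part sizes
    for i, start in enumerate(range(0, length, part_len), start=1):
        # Hypothetical naming scheme: chr1 -> chr1_1, chr1_2, ...
        yield f"{name}_{i}", start, min(start + part_len, length)


if __name__ == "__main__":
    for part in split_scaffold("chr1", 2_500_000_000):
        print(part)
```

Any annotation used alongside the split genome would need its coordinates shifted by each part's `start` offset, and features spanning a cut point would need separate handling.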

dujuanhua6308 commented 1 year ago

Thank you for your kind reply. Actually, we found that each scaffold should be less than 600M; then the "hisat2, samtools sort, stringtie" workflow works. What also puzzles me is that with the big genome and no scaffold splitting, the samtools-sorted BAM file also fails to index with samtools, even though the SAM-to-BAM conversion works and I checked the BAM file and found nothing wrong. I'm not sure whether the big genome is also unsuitable for samtools sort. What do you think about this problem? Thank you for your help.
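
One possible factor in the indexing failure, not confirmed anywhere in this thread: the default BAI index format only handles reference sequences up to 2^29 bases (512 Mbp), and longer references need a CSI index, which `samtools index -c` writes. That limit is close to the roughly 600M threshold reported above, though whether it is the actual cause here is an assumption. A minimal sketch for flagging affected sequences, with the helper name `needs_csi` made up for the example:

```python
from typing import Dict, List

# Maximum reference length the BAI index format can address (2^29 bases).
BAI_MAX_LEN = 2**29


def needs_csi(reference_lengths: Dict[str, int]) -> List[str]:
    """Return the names of sequences too long for a BAI index, sorted."""
    return sorted(
        name
        for name, length in reference_lengths.items()
        if length > BAI_MAX_LEN
    )


if __name__ == "__main__":
    # Illustrative lengths only, not taken from any real assembly.
    lengths = {"chr1": 2_300_000_000, "scaffold_7": 400_000_000}
    print(needs_csi(lengths))
```

The sequence lengths themselves can be read from the `@SQ` lines of the BAM header, for example with `samtools view -H`.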

cynthiawebster commented 1 year ago

Hi @dujuanhua6308, I had the same error running hisat2-samtools-stringtie on a 7GB genome. Originally I was running samtools/htslib version 1.9, but resolved the problem by switching to the newest version (1.17). This update provides support for very long input lines (> 2Gbyte). Hope this helps!