gpertea / stringtie

Transcript assembly and quantification for RNA-Seq

output files shorter than others #319

Closed gianfilippo closed 3 years ago

gianfilippo commented 3 years ago

Hi,

I am running StringTie on a large number of files, about 450, and in a few cases the output files, both the "_gene_abund.tab" and the ".gtf", have fewer lines than all the others, which causes problems down the line with the merge process. I am running stringtie ver 2.1.4. The command is as follows: stringtie -p 20 -e -G Homo_sapiens.GRCh38.84.gtf -o CGATGT_E9.gtf CGATGT_E9.flt.bam -A CGATGT_E9_gene_abund.tab

Specifically, I end up with 1374653 lines in the GTF for most of the files, and in a few exceptions I get fewer. Reprocessing the affected files fixed most of them, and now I am left with a single problem file.

There is no error message, and every time I reprocess it I get a different number of lines.

Remapping does not help either.

Do you have any suggestions?

Thanks

gpertea commented 3 years ago

This is quite unusual and unexpected behavior, as using the -e option should make the output transcripts the same as the input transcripts. Since you have now identified a specific BAM file where the output is different every time (which is really worrying!), is it possible to share that .bam file with me somehow (along with the Homo_sapiens.GRCh38.84.gtf file)?

Can you try running that sample again without the -p option, just to see if the output is then as expected? I am concerned that perhaps something goes wrong with the thread handling for that particular bam file and some sort of race condition occurs causing this apparently non-deterministic behavior (which is of course unacceptable and should be fixed).

gianfilippo commented 3 years ago

Hi,

Thanks for the suggestion. I am rerunning it without the -p option and will let you know what I get. I can share this BAM file. How can I upload it?

Best

gpertea commented 3 years ago

Sent you an email with upload instructions. Is the GTF file the unmodified one from Ensembl release 84? It's safer if you also upload the copy you actually have there, in case it became corrupted or otherwise differs.

gpertea commented 3 years ago

After further investigation, it turned out that the cause of these symptoms was a storage quota limit being hit on a computing cluster. Adding error checking to fprintf() (and fclose()) in StringTie would likely have caught this kind of write error early and avoided the time spent investigating this issue (ouch).
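
For illustration, here is a minimal sketch of the kind of checking meant above. It is not StringTie's actual code: the helper name write_lines_checked and its signature are hypothetical, and it only assumes standard C stdio. The point is that both fprintf() and fclose() report failures through their return values, so a full disk or exceeded quota can be turned into a visible error instead of a silently truncated file.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of StringTie): writes n lines to path.
 * Returns 0 on success, -1 if opening, writing, or closing fails. */
int write_lines_checked(const char *path, const char *const *lines, size_t n)
{
    FILE *fp = fopen(path, "w");
    if (fp == NULL) {
        fprintf(stderr, "Error: cannot open %s: %s\n", path, strerror(errno));
        return -1;
    }

    for (size_t i = 0; i < n; i++) {
        /* fprintf() returns a negative value when an output error occurs. */
        if (fprintf(fp, "%s\n", lines[i]) < 0) {
            fprintf(stderr, "Error: write to %s failed: %s\n",
                    path, strerror(errno));
            fclose(fp);
            return -1;
        }
    }

    /* Output is buffered, so data may only reach the disk when the stream
     * is flushed at fclose(). A write failure can therefore surface here
     * even if every fprintf() call above appeared to succeed. */
    if (fclose(fp) != 0) {
        fprintf(stderr, "Error: closing %s failed: %s\n",
                path, strerror(errno));
        return -1;
    }
    return 0;
}
```

A caller would treat a nonzero return as fatal and exit with a nonzero status, so that a pipeline processing hundreds of samples stops at the first truncated output rather than passing it on to the merge step.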