gpertea / stringtie

Transcript assembly and quantification for RNA-Seq
MIT License

Segmentation Fault with annotation file #203

Open EreboPSilva opened 6 years ago

EreboPSilva commented 6 years ago

I'm opening this issue even though I already commented on the old issue about this topic, since I'm not entirely sure how commenting on closed issues works. I apologize if this is repetitive. I'll paste my original message and a link to the old issue:

Hi! I'm experiencing this same error. For details, I'm using version 1.3.4d. In my case the problem also always occurs in the same scaffold. I've checked the reads (in the sorted BAM) for that scaffold and everything looks fine. The GTF file should be fine, since I used the very same file for a different run (same genome file, same GTF, only the .bam changed) and it worked. Also, in my case, no temporary result is created (a file is created, but it's completely empty, 0 KB). The .bam was created using STAR and is sorted by coordinate.

I'm including the verbose output of the run for clarification:

Running StringTie 1.3.4d. Command line:
stringtie ravr_30_a/Aligned.sortedByCoord.out.bam -v -o ravr_30_a_clout/transcripts.gtf -p 4 -G ravr.gff -A ravr_30_a_gene_abund.tab
[10/22 17:16:28] Loading reference annotation (guides)..
[10/22 17:16:28] 198 reference transcripts loaded.
Default stack size for threads: 8388608
[10/22 17:17:00]>bundle Scaffold001:10-9333030 [12512970 alignments (4518214 distinct), 85971 junctions, 20 guides] begins processing...
[10/22 17:17:32]>bundle Scaffold002:17-9028361 [11965275 alignments (4799645 distinct), 90368 junctions, 29 guides] begins processing...
[10/22 17:17:55]>bundle Scaffold003:14-5402067 [6839606 alignments (2945339 distinct), 50595 junctions, 29 guides] begins processing...
[10/22 17:18:11]>bundle Scaffold004:1-4740330 [4902272 alignments (2441825 distinct), 39772 junctions, 28 guides] begins processing...
Segmentation fault

Thanks a lot for your help.

gpertea commented 6 years ago

It is good that you re-posted this as a new issue: that made it easier to spot, and I also think your case is unrelated to the other, closed issue.

It is hard to see the reason for the crash without the data to reproduce it here in a "debug" environment. But it looks like you have a bunch of very large bundles, with a suspiciously high number of junctions in each one, suggesting bad (noisy) alignments, and using -p 4 made stringtie load those 4 bundles and process all 4 at the same time. So one thing to check is whether the crash is caused by an out-of-memory situation: how much RAM do you have on that machine?

Try running without the -p option (single processing thread). That's the only way to identify a specific bundle that might be causing a crash (please see this document for hints about how to identify a problem bundle and submit data for debugging), and it also lowers memory usage.

The fact that there are so many junctions may also indicate that your data is noisy (low-quality alignments, and/or a multiplexed sample? Please don't use multiplexed data with stringtie!). STAR is nice and fast, but it might generate a lot of false positives unless used with more stringent alignment options (I cannot help you with STAR options, sorry, I don't use it). It looks like each bundle covers almost an entire scaffold by itself, with far too many junctions in each bundle. You might be able to filter out some of the spurious junctions by increasing the value of the -j parameter (see some discussion about lowering memory usage here: https://github.com/gpertea/stringtie/issues/164#issuecomment-363597528). If the problem persists and is not really memory related (it doesn't exhaust your RAM), try to prepare bundle data for debugging, if you can share it with me.
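In a single-threaded run, bundles are processed one at a time, so the last ">bundle" line printed before the crash names the bundle that was being processed. The following is a minimal Python sketch for pulling that line out of saved verbose output. It assumes the bundle lines keep exactly the format shown in the log earlier in this thread; parse_bundles and last_bundle are hypothetical helper names, not part of StringTie.

```python
import re

# Matches the ">bundle" lines as they appear in the verbose log in this thread.
# Assumption: the format is stable across the whole log.
BUNDLE_RE = re.compile(
    r">bundle (?P<ref>\S+):(?P<start>\d+)-(?P<end>\d+) "
    r"\[(?P<alignments>\d+) alignments \((?P<distinct>\d+) distinct\), "
    r"(?P<junctions>\d+) junctions, (?P<guides>\d+) guides\]"
)

NUMERIC_FIELDS = ("start", "end", "alignments", "distinct", "junctions", "guides")


def parse_bundles(log_text):
    """Return one dict per ">bundle" line, in the order they were logged."""
    bundles = []
    for match in BUNDLE_RE.finditer(log_text):
        bundle = {"ref": match.group("ref")}
        for field in NUMERIC_FIELDS:
            bundle[field] = int(match.group(field))
        bundles.append(bundle)
    return bundles


def last_bundle(log_text):
    """Return the last bundle that began processing, or None if there is none."""
    bundles = parse_bundles(log_text)
    return bundles[-1] if bundles else None


# Two lines copied from the log posted above, followed by the crash message.
SAMPLE_LOG = """\
[10/22 17:17:55]>bundle Scaffold003:14-5402067 [6839606 alignments (2945339 distinct), 50595 junctions, 29 guides] begins processing...
[10/22 17:18:11]>bundle Scaffold004:1-4740330 [4902272 alignments (2441825 distinct), 39772 junctions, 28 guides] begins processing...
Segmentation fault
"""

if __name__ == "__main__":
    suspect = last_bundle(SAMPLE_LOG)
    print(suspect["ref"], suspect["start"], suspect["end"], suspect["junctions"])
```

With -p greater than 1, several bundles are in flight at once, so the last logged bundle is not necessarily the one that crashed; the sketch is only meaningful for a single-threaded run.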

EreboPSilva commented 6 years ago

We have around 128 GB of RAM, so maybe that is the problem. I've re-run it without -p and this is the output now:

jmgps@bq078:sra$ stringtie rvar_30_active/Aligned.sortedByCoord.out.bam -v -o rvar_30_active_clout/transcripts.gtf -G rvar.gff -A rvar_30_active_gene_abund.tab
Running StringTie 1.3.4d. Command line:
stringtie rvar_30_active/Aligned.sortedByCoord.out.bam -v -o rvar_30_active_clout/transcripts.gtf -G rvar.gff -A rvar_30_active_gene_abund.tab
[10/25 17:25:58] Loading reference annotation (guides)..
[10/25 17:25:59] 198 reference transcripts loaded.
Default stack size for threads: 8388608
[10/25 17:26:34]>bundle Scaffold001:10-9333030 [12512970 alignments (4518214 distinct), 85971 junctions, 20 guides] begins processing...
Segmentation fault

I will check both the STAR documentation and the document you linked, and I'll retry.

Thanks!

gpertea commented 6 years ago

128GB should be enough for regular data, especially when using a single thread, and it should be easy to check how much memory stringtie is using before it crashes. You can do that just visually using top, but better yet you could run that stringtie command through the "time" program with the -f option (format string) set to include %M (maximum resident set size).
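A comparable measurement can be sketched in Python. This is a swapped-in alternative to the "time" approach described above, not the same command: the standard resource module reports the peak resident set size of child processes that have finished. The function name run_with_peak_rss is hypothetical. On Linux, ru_maxrss is given in kilobytes (macOS reports bytes), and the value is the maximum over all children the script has waited for, so the sketch runs only one command per invocation.

```python
import resource
import subprocess
import sys


def run_with_peak_rss(cmd):
    """Run cmd and return (return code, peak RSS of finished child processes).

    On Linux the peak is in kilobytes. A negative return code -N means the
    child was killed by signal N (for example, -11 for a segmentation fault).
    """
    completed = subprocess.run(cmd)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    return completed.returncode, usage.ru_maxrss


if __name__ == "__main__":
    # Pass the command to measure as arguments, e.g. the stringtie command
    # from the post above:
    #   python peak_rss.py stringtie rvar_30_active/Aligned.sortedByCoord.out.bam \
    #       -v -o rvar_30_active_clout/transcripts.gtf -G rvar.gff \
    #       -A rvar_30_active_gene_abund.tab
    if len(sys.argv) > 1:
        code, peak = run_with_peak_rss(sys.argv[1:])
        print(f"return code: {code}, peak RSS: {peak}", file=sys.stderr)
```

If the reported peak stays far below the installed RAM when the crash happens, an out-of-memory condition becomes a less likely explanation.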