Closed: mukarram-ak closed this issue 7 years ago
Wow, you have some mighty large BAM files there. StringTie can usually deal with highly expressed or over-sampled transcripts if there are good quality alignments in that region and a limited number of variants expressed for that gene, but it cannot do much when there is a large influx of spliced alignments with many (potentially spurious) junctions all over the place. A large amount of messy/tangled alignments can sometimes make the splice graph's complexity grow uncontrollably, and consequently the memory usage shoots up. Obviously the unstranded sample is more prone to such alignment issues.
I would suggest taking a look at the alignments in the region which caused StringTie to run out of memory (literally take a look, with IGV or similar). Since you were running with -v for that 2nd command, what are the last stderr lines captured for that run? The genomic region and the number of reads in that region should be shown, and you can actually isolate the alignments from that region in a separate .bam file (see the wiki guide for how to do this). It could be that the aligner made a mess there for a high-variability gene.
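For reference, isolating such a region into its own BAM can be sketched with samtools; the region coordinates and file names below are hypothetical placeholders, and this assumes a coordinate-sorted BAM:

```shell
# Hypothetical region -- substitute the coordinates reported in stringtie's -v output.
REGION="chr1:1000000-2000000"
if command -v samtools >/dev/null 2>&1; then
  samtools index big.bam                       # needed once, if no .bai index exists yet
  samtools view -b big.bam "$REGION" > region.bam
  samtools index region.bam                    # so the region can be browsed in IGV
fi
```

The `region.bam` produced this way can then be loaded into IGV, or fed to StringTie on its own for experiments.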
You could try to perform more aggressive filtering of the spliced alignments (increase the -j and perhaps -c values). However, that might cause underestimation or outright missing of other real but low-expression isoforms -- so perhaps, as a one-time solution, you could run with such filtering parameters only for the region which makes StringTie run out of memory (unfortunately this involves breaking up the BAM file, so that you can still use the default parameters for the rest of it).
Anyway, for now I would be curious whether the genomic region where that OOM crash happened has any meaning (it could be a gene with naturally high genetic variability, like MHC). If you suspect the STAR alignments were too loose in the first place, you could also try to filter them to discard those with a larger edit distance, with soft trimming above a certain value/percentage, or with "introns" above a certain length. Such a filter could be applied on the fly, while streaming the BAM in a pipeline command (since StringTie can take SAM lines at stdin).
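A rough sketch of such an on-the-fly filter (file names and thresholds are made up -- adjust them to your data; STAR by default records mismatches in the nM tag, while other aligners use NM):

```shell
# awk program: keep SAM headers; drop alignments with edit distance > max_ed
# or with a soft-clipped (S) run in the CIGAR longer than max_clip bases.
FILTER='
  /^@/ { print; next }                        # pass SAM header lines through
  {
    ed = 0
    if (match($0, /[Nn]M:i:[0-9]+/))          # STAR writes nM:, other aligners NM:
      ed = substr($0, RSTART + 5, RLENGTH - 5) + 0
    clip = 0; cig = $6                        # find the largest soft-clip run
    while (match(cig, /[0-9]+S/)) {
      n = substr(cig, RSTART, RLENGTH - 1) + 0
      if (n > clip) clip = n
      cig = substr(cig, RSTART + RLENGTH)
    }
    if (ed <= max_ed && clip <= max_clip) print
  }
'
# Guarded pipeline sketch: stream the BAM as SAM, filter, assemble from stdin.
if command -v samtools >/dev/null 2>&1 && command -v stringtie >/dev/null 2>&1; then
  samtools view -h pooled.bam \
    | awk -v max_ed=5 -v max_clip=20 "$FILTER" \
    | stringtie - -o filtered.gtf -j 3 -c 7.5 -v
fi
```

The `-` tells StringTie to read SAM from stdin; the whole filter runs in a single pass with no intermediate files.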
Thanks a lot for your comprehensive answer. The last stderr line was pointing at the whole chromosome 1; I presume there are many spurious reads/junctions that got stitched together (considering it's unstranded as well), hence the large bundle? Is there any way to cap the size of bundles, as a cheat for this kind of case (though I presume this might introduce significant assembly errors)?
Anyway, I have extracted all the chr1-aligned reads from my BAM file (together with the GTF reference), and it seems like it's still running for quite a long time without producing any lines in the tmp file. I would guess it will be similar to the previous runs if I don't filter. I will try to apply the filters (both -j and -c, as well as BAM filtering) and see what happens. I assume that if I have enough coverage for the regions, then applying -j 3 and -c 7.5 would not cause a significant loss of real but lowly expressed transcripts -- or is that too high? In any case, I will post an update on what works for this case (hopefully not involving breaking up the BAM file, although that might be the last resort).
Also, I was wondering if there is any plan for implementing strand imputation in StringTie?
I think having such a large "bundle" (the whole chr1!?) should be the focus here. How is that possible? Never having seen something like this before in our data, I am inclined to question the quality of those reads or the way the alignments were performed. These are RNA-Seq data, right? Can there really be such "noisy", full-blown transcription going on across chr1?
You can also enforce strict overlapping of alignments for bundles (using -g 1), as the default bundle bridging window is 50bp. However, this might not help considering the amount of noise we seem to be dealing with here.
In my opinion, the place to start in order to really address this problem would be eliminating the bad reads and/or alignments from that huge BAM file. Besides large edit distances and/or large soft clipping, maybe you could also check whether STAR somehow allowed huge, unrealistic "introns" for those read alignments, letting some of them stretch across many genes.
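An "intron" check of that kind can also be done with a small awk pass over the SAM stream; the file name and the 500kb ceiling below are made-up placeholders (N in the CIGAR is the skipped-region operator that spliced aligners use for introns):

```shell
# awk program: keep SAM headers; drop alignments whose largest N (skipped
# region, i.e. "intron") in the CIGAR exceeds max_intron bases.
INTRON_FILTER='
  /^@/ { print; next }
  {
    big = 0; cig = $6                         # find the largest N run in the CIGAR
    while (match(cig, /[0-9]+N/)) {
      n = substr(cig, RSTART, RLENGTH - 1) + 0
      if (n > big) big = n
      cig = substr(cig, RSTART + RLENGTH)
    }
    if (big <= max_intron) print
  }
'
# Guarded sketch: write a cleaned BAM for downstream use.
if command -v samtools >/dev/null 2>&1; then
  samtools view -h pooled.bam \
    | awk -v max_intron=500000 "$INTRON_FILTER" \
    | samtools view -b -o intron_filtered.bam -
fi
```

This could also be combined with the edit-distance/soft-clip filter above into a single awk pass, since the stream only needs to be read once.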
Sorry for taking so long to answer; I was trying several combinations of parameters, and it seems that increasing -j and -c indeed works well.
Yes, it is RNA-seq data, but as I said, these were pooled files (hence the size). Filtering the BAM file might also work, but I guess StringTie already filters suspicious alignments (it gives a warning and discards those reads, e.g. the ones with short mapping length). Anyway, thanks a lot for your comprehensive thoughts above.
Indeed you mentioned pooling but somehow I missed that, sorry. I don't have experience with such pooled BAM files (not sure when such a workflow would be applicable), but StringTie was definitely not meant to operate in such a context, so this might be a misapplication of the program. Some core assumptions of StringTie are based around the coherent expression of transcripts within a single RNA-Seq sample (sequencing run). I think expression (abundance) estimation cannot even work in a pooled BAM file (unless all pooled samples are replicates.. but even then, sequencing differences between samples would still distort the estimations). If all you want is the structural assembly of the transcripts, I would suggest assembling each RNA-Seq sample separately (which would be easier to parallelize and definitely would require less memory) and then playing with the --merge mode (with the filtering options for that mode).
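The per-sample-assembly-then-merge workflow can be sketched like this (the sample file names are hypothetical, and the stringtie calls are guarded so the sketch does nothing where the tool is absent):

```shell
# Assemble each sample (sequencing run) separately -- easy to parallelize.
if command -v stringtie >/dev/null 2>&1; then
  for bam in sample1.bam sample2.bam sample3.bam; do
    stringtie "$bam" -p 8 -o "${bam%.bam}.gtf"    # one assembly per sample
  done
fi
# Collect the per-sample assemblies into a list file for --merge.
ls sample*.gtf > gtf_list.txt 2>/dev/null || true
if command -v stringtie >/dev/null 2>&1; then
  # --merge builds one non-redundant transcript set; its -m/-c/-F/-T options
  # filter out short or low-support transcripts at the merge step.
  stringtie --merge -G reference.gtf -o merged.gtf gtf_list.txt
fi
```

The merged GTF can then be fed back to StringTie with -G -e per sample if per-sample abundance estimates are needed.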
Yes, I agree with your points above. Indeed, I have tried that without problems (merging individually assembled samples), and I was curious to see whether there are effects that only appear when pooling the data (e.g. that previous "noise" could actually be real, lowly expressed transcripts). I guess for now separate assemblies followed by merging is the way to go, then.
Hello,
I am running StringTie on sorted, merged BAM files with both the guided (-G) and unguided options. Interestingly, I can't get the process to finish for the unstranded BAM file (the size is around 260 GB; I have 2 large BAM files, pooled by strand: reverse and unstranded), but it is running for the reverse BAM file (--rf). Initially I was running with the -p 16 option and it ran for 2 hours using 417.84 GB of RAM, but stopped with this error in the last line of the log:
/nobackup/private/RNA-seq/bam/callStringtieUnstranded.sh: line 13: 27261 Segmentation fault (core dumped) stringtie -p 16 -o ${DIR_OUT}/unguided.transcripts.gtf -v ${BAM}
I then tried to drop the -p option, as I saw that suggestion in one of the posted issues, and it seems like it needs even more memory:
/nobackup/private/RNA-seq/bam/callStringtieUnstranded.sh: line 13: 20098 Killed stringtie -o ${DIR_OUT}/unguided.transcripts.gtf -v ${BAM}
slurmstepd: error: Exceeded step memory limit at some point.
My script consists of a guided command followed by an unguided one. The process for the reverse BAM file is still ongoing in the guided part, yet both errors above came from the unguided part of the script (which shouldn't be the case; it seems to have bypassed the guided one). Both the guided and unguided GTF files in the tmp folders are empty (but exist). Is it possible that I can't run StringTie concurrently in the same parent directory (I assume this is not the issue, as the tmp folders are unique), or am I missing something here? Why does it need such a huge amount of RAM in this case (but not for the stranded one)?
I am using StringTie version 1.3.3 and my bam files were generated by STAR.
Thanks, Kadir