smahaffey opened this issue 6 years ago
This seems more similar to #149 actually - when multiple samples are pooled together (merged), there are some (noisy?) regions that gather an unusually large number of potential junctions, which makes the splice graph "explode" in terms of number of nodes (and thus memory usage). It's not clear to me why that is happening; it could be a side effect of oversampling in some areas -- as if "systematic" outliers that would have been discarded as noise in a single sample suddenly gain relevance when multiple samples are put together. Interesting. However, if you already expect a very high number of splice variants in an already large sample, that high memory usage is somewhat expected behavior - the larger the splice graph, the more memory is needed.
Just like in the case of #149, it seems that running the samples individually clears the issue (barely). It'd be interesting to see what's in that set of samples with 112GB max memory usage -- I suppose that means there was one sample where that max memory was used, right?
There was some advice in that thread about things one could do to further investigate (-v, no -p) or even to attempt to alleviate the issue for pooled samples (increase the -j and perhaps -c values). Running with a single worker thread and with the -v option can capture some useful stats about the bundles (regions) being processed. The ones with a larger size and/or a larger number of junctions/reads (and a long processing time) would also very likely consume the most memory. We do have an internal debug version that could even print memory usage information per bundle, but the number of junctions/reads alone is usually a very good predictor.
Note that simply not using the -p option is another (cheap) way of minimizing memory usage and thus preventing out-of-memory errors -- assuming there is more than one large, highly expressed region in a sample (input BAM).
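To be concrete, a diagnostic run along these lines should be enough (just a sketch -- the file names are placeholders, and the -j/-c values are only illustrative starting points):

# single worker thread (no -p), verbose bundle reporting; capture all output for later inspection
stringtie merged_sample.bam -G annotation.gtf -o merged_sample.gtf -v > merged_sample.stringtie.log 2>&1

# a more aggressive run for pooled samples: raise the junction (-j) and coverage (-c) thresholds
stringtie merged_sample.bam -G annotation.gtf -o merged_sample.filtered.gtf -v -j 3 -c 5 > merged_sample.filtered.log 2>&1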
I don't know what else I can add here without taking a look at the data. It would be good to investigate more closely the sample which triggered the 112GB memory usage -- but I suppose it would be hard to share that BAM file. There is another potential issue with the way the bundles are created -- if there are spurious alignments connecting multiple highly-spliced regions (genes), we might end up with very "wide" bundles, which can also be very memory consuming. I hope that HISAT2 was not run with very permissive alignment options that allow low-scoring alignments, as those could enable such multi-gene bundles in the BAM file.
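One quick way to check for that is to look at the mapping-quality distribution of the alignments in a suspect region with samtools (a sketch only -- the region and the MAPQ threshold below are arbitrary examples, and the BAM must be coordinate-sorted and indexed):

# tally alignments per MAPQ value in a suspect region
samtools index merged_sample.bam
samtools view merged_sample.bam 14:46100000-46200000 | cut -f5 | sort -n | uniq -c

# total alignments vs. those with MAPQ >= 10 vs. primary-only (no secondary/supplementary)
samtools view -c merged_sample.bam 14:46100000-46200000
samtools view -c -q 10 merged_sample.bam 14:46100000-46200000
samtools view -c -F 0x900 merged_sample.bam 14:46100000-46200000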
Thank you for the suggestions and the quick response. I'm sorry I missed the previous issue and those suggestions. I will rerun it based on all of your feedback. It makes sense that, since we're running it with -p 12 or 16, just using a single thread may prevent the problem altogether.
Yes, the 112GB was the max from running 3 of the samples that were previously merged. I don't know which sample was responsible for the max memory usage, but I can rerun them to see how each sample looked. We could probably make the BAM files available if you don't mind downloading them; I'll have to check on that.
We have quite a few of these samples and were concerned that, even individually, some were close to running out of memory. Thank you again, all of this was very helpful for deciding what to try next.
I ran them without the multithreaded option and the memory usage on the individual samples was greatly reduced: 2-30GB per sample, with all but one under 10GB. I just reran the merged samples and both that I tested were killed due to memory usage, so I'm rerunning with the verbose option so I can provide that output. Next I'll look at the -j and -c options. Thank you.
Thanks for the update -- indeed I would be curious to see what happens there, i.e. whether it's an extensive merge (a very wide bundle) that is being formed when samples are merged. Actually I now realize that the -c option might not help at all -- I think that's mostly used for discarding spurious but rather isolated read mappings which form small, low-count bundles.
Increasing the -j value might be the only way to limit this problem, but it comes with the downside of a potential decrease in sensitivity (i.e. the ability to detect low-expression transcripts). For merged samples this might not be a problem though.
Taking a look in IGV at the reported bundle region which is causing the out-of-memory crash might give us an idea about what happens there.
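Something like the following makes that much easier in practice -- extract just the reported bundle region into a small, indexed BAM and load that in IGV instead of the whole file (coordinates and file names are placeholders; use whatever region the -v output reports):

# pull the problematic bundle region into its own BAM for IGV (input must be sorted and indexed)
samtools view -b merged_sample.bam 14:46100000-46800000 > bundle_region.bam
samtools index bundle_region.bam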
Thank you. I reran the merged files now with the verbose output option. Here are the last few lines of both samples. It seems to be something about that region. We can provide one of the .bam files if you want to take a look at it. I think our only feasible option for now is to just run them individually and then merge them. Thank you for your help.
Here are the last few lines of the first sample:
[02/08 15:41:15]>bundle 14:46046217-46046495(4) (0 guides) loaded, begins processing...
[02/08 15:41:15]^bundle 14:46046217-46046495(4) done (0 processed potential transcripts).
[02/08 15:41:15]>bundle 14:46046637-46047043(10) (0 guides) loaded, begins processing...
[02/08 15:41:15]^bundle 14:46046637-46047043(8) done (0 processed potential transcripts).
[02/08 15:41:15]>bundle 14:46047300-46048545(69) (0 guides) loaded, begins processing...
[02/08 15:41:15]^bundle 14:46047300-46048545(52) done (1 processed potential transcripts).
[02/08 15:41:15]>bundle 14:46048616-46134823(11794) (2 guides) loaded, begins processing...
[02/08 15:41:15]^bundle 14:46048616-46134823(5626) done (38 processed potential transcripts).
[02/08 15:41:15]>bundle 14:46134896-46174385(11626) (1 guides) loaded, begins processing...
[02/08 15:41:15]^bundle 14:46134896-46174385(7223) done (13 processed potential transcripts).
[02/08 15:41:15]>bundle 14:46174484-46175393(38) (0 guides) loaded, begins processing...
[02/08 15:41:15]^bundle 14:46174484-46175393(31) done (1 processed potential transcripts).
[02/08 15:41:15]>bundle 14:46175500-46188062(1660) (0 guides) loaded, begins processing...
[02/08 15:41:15]^bundle 14:46175500-46188062(1189) done (2 processed potential transcripts).
[02/08 19:36:29]>bundle 14:46188284-46745884(954834869) (24 guides) loaded, begins processing...
And the sample I'm providing a link to:
[02/08 15:37:51]>bundle 14:46091553-46093036(78) (0 guides) loaded, begins processing...
[02/08 15:37:51]^bundle 14:46091553-46093036(60) done (1 processed potential transcripts).
[02/08 15:37:51]>bundle 14:46093390-46093492(1) (0 guides) loaded, begins processing...
[02/08 15:37:51]^bundle 14:46093390-46093492(1) done (0 processed potential transcripts).
[02/08 15:37:51]>bundle 14:46093770-46103991(421) (0 guides) loaded, begins processing...
[02/08 15:37:51]^bundle 14:46093770-46103991(354) done (4 processed potential transcripts).
[02/08 15:37:51]>bundle 14:46104115-46109335(231) (0 guides) loaded, begins processing...
[02/08 15:37:51]^bundle 14:46104115-46109335(210) done (1 processed potential transcripts).
[02/08 15:37:51]>bundle 14:46109455-46111567(77) (0 guides) loaded, begins processing...
[02/08 15:37:51]^bundle 14:46109455-46111567(68) done (1 processed potential transcripts).
[02/08 15:37:51]>bundle 14:46111665-46111872(5) (0 guides) loaded, begins processing...
[02/08 15:37:51]^bundle 14:46111665-46111872(5) done (0 processed potential transcripts).
[02/08 15:37:51]>bundle 14:46111943-46113136(76) (0 guides) loaded, begins processing...
[02/08 15:37:51]^bundle 14:46111943-46113136(60) done (1 processed potential transcripts).
[02/08 15:37:51]>bundle 14:46113295-46114548(4) (0 guides) loaded, begins processing...
[02/08 15:37:51]^bundle 14:46113295-46114548(4) done (0 processed potential transcripts).
[02/08 17:38:42]>bundle 14:46114655-46745887(550103375) (25 guides) loaded, begins processing...
Here is the link to the smaller file, but still 58GB: http://phenogen.ucdenver.edu/PhenoGen/tmpData/BXH9.bam
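A quick one-liner along these lines can rank the bundles in a saved -v log by read count (the number in parentheses on each ">bundle" line) to spot the worst offenders; the log file name here is just a placeholder:

grep '>bundle' merged_sample.stringtie.log \
  | sed 's/.*>bundle \([^(]*\)(\([0-9]*\)).*/\2 \1/' \
  | sort -k1,1nr | head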
Thank you for the BAM file; I've been looking at it and it does look like there are some regions with far too many alignments and distinct junctions. I could not even take a look at the chromosome 14 region, as IGV was choking on it when I tried to zoom in to see some junctions (too many millions of reads there!). Meanwhile I also discussed this with the main author of StringTie (Mihaela), and she strongly discourages such attempts to run StringTie on pooled/merged samples. StringTie was designed to analyze (assemble) a single sequencing sample at a time (a single sequencing experiment), as the whole "network flow" algorithm is based on the assumption that the given set of read alignments was observed in a single sequencing run/experiment. Merging the read alignments from multiple experiments, even when they are meant to be "replicates", inevitably breaks this assumption by introducing biases and variations which derail StringTie's assembly algorithm and also severely compound the "alignment noise" problem.
That being said, I was still curious what's in these extremely large bundles, so I modified StringTie's -v output slightly to show how many distinct read alignments and junctions are in there. Here is the relevant output for that bundle:
[02/20 13:02:24]>bundle 14:46114655-46745887 [550103375 alignments (2663243 distinct), 49359 junctions, 0 guides] begins processing...
The process dies on a 512GB server at this point (the OOM killer kicked in). Obviously stringtie chokes on that ~630kb region with 550 million alignments (most of them redundant), but it's the almost 50 thousand junctions that are ruining the memory usage there. I'm going to try increasing the -j value just as a rough junction noise filter -- but anyway, as I mentioned above, there is not much point in wrestling with such merged samples.
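For what it's worth, a common alternative (just a sketch; file names are placeholders) is to assemble each replicate on its own and then combine the per-replicate GTFs with stringtie --merge, which produces a consolidated, non-redundant transcript set without ever pooling the alignments:

# assemble each replicate separately (the -j/-c adjustments above can still be applied here)
stringtie rep1.bam -G annotation.gtf -o rep1.gtf
stringtie rep2.bam -G annotation.gtf -o rep2.gtf
stringtie rep3.bam -G annotation.gtf -o rep3.gtf

# merge the per-replicate assemblies into one transcript set
stringtie --merge -G annotation.gtf -o merged_transcripts.gtf rep1.gtf rep2.gtf rep3.gtf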
Thank you for all of your help looking into this. We still had one individual sample that was also killed at 188GB in that same region on chr14. Using -j 2 we could run this sample, and its max memory usage was reduced to 52GB. We want to capture low-abundance transcripts, and I think that's one of the reasons they were trying to run the merged files -- to capture as many low-abundance transcripts as possible. I see your point about the assumptions behind the algorithm and not using it on merged samples. Thank you for all your help.
We have some larger BAM files (40-90GB), aligned with HISAT2 and merged from 3 replicates of the same sample, that we've been running StringTie on. We're using version 1.3.3b on RHEL7 with 128GB of RAM. At some point the memory usage increases dramatically and the process gets killed by the OS. It seems like issue #18, but it sounds like that was fixed in 1.1.0. We've run this successfully for a different tissue with similar-sized BAM files. In the tissue where the error occurs we expect many more splice variants, which makes it seem like issue #18 again. We just ran the samples individually and they completed with a max memory usage of 43GB for one set and 112GB for another set of samples.
Thank you for any suggestions you might have.