jason-weirather / AlignQC

Long read alignment analysis. Generates reports on sequence alignments covering mappability vs read size, error patterns, annotations, and rarefaction curve analysis. The most basic analysis only requires a BAM file, and outputs a web browser-compatible XHTML file to visualize/share/store/extract analysis results.
Apache License 2.0

Memory Requirements #22

Closed: martynakgajos closed 5 years ago

martynakgajos commented 5 years ago

I keep getting a MemoryError in traverse_preprocessed.py. How can I properly estimate the memory requirements of the AlignQC pipeline given the number of reads? I am looking forward to the results ;)

jason-weirather commented 5 years ago

Hi @martynakgajos .. the memory requirements of AlignQC are something I unfortunately have not had time to revisit. Sorry for some off-the-cuff guesses, but with modest memory available (i.e. 20GB) you should be able to handle small batches of long reads fine, i.e. a few thousand reads. Beyond roughly 100k reads the memory requirements may be considerably higher, i.e. 100GB or more, which makes large sequencing runs like an Illumina HiSeq memory prohibitive. If you have a large batch of reads and memory issues, I would recommend downsampling your reads prior to processing if you want to look at things like the error profile.

Also, multiprocessing in AlignQC is not implemented very nicely: it does not use shared-memory objects, so each additional process you add needs that much additional memory available. So my main recommendations are to a) set the number of threads you tell it to use for multiprocessing and keep that number small, or b) downsample the input reads.
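For the downsampling route, a minimal sketch along these lines (not part of AlignQC; the file names and sampling fraction are placeholders, and it assumes pysam is installed) can thin a BAM before handing it to the pipeline. `samtools view -s` achieves the same thing from the command line.

```python
# Hypothetical downsampling sketch (not part of AlignQC): keep a random
# fraction of reads from a BAM so the analysis stays within memory limits.
# Assumes pysam is installed; paths and FRACTION are placeholders.
import random
import pysam

FRACTION = 0.01  # keep roughly 1% of reads
random.seed(42)  # make the subsample reproducible

with pysam.AlignmentFile("input.bam", "rb") as infile, \
     pysam.AlignmentFile("downsampled.bam", "wb", template=infile) as outfile:
    for read in infile:
        if random.random() < FRACTION:
            outfile.write(read)
```

The downsampled BAM can then be passed to AlignQC as usual, ideally with a small thread count so the per-process memory overhead stays low.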

martynakgajos commented 5 years ago

So I guess that, dealing with almost 3 million long reads, subsampling is my only option.

jason-weirather commented 5 years ago

That's the easiest approach @martynakgajos. The next best option would be to run with a single thread on a machine with a lot of memory and see how it goes, but that would probably take days to run.

martynakgajos commented 5 years ago

I was finally able to run it in a reasonable time (74 minutes, 35 GB) on 1% of the reads. For 10% of the reads I saw no progress after 3 days (max memory usage: 410 GB), and traverse_preprocessed.py seemed to be the problematic part for the bigger sample. However, I really love the insight into my data that the reports give me, thank you!