Closed martynakgajos closed 5 years ago
Hi @martynakgajos .. the memory requirements of AlignQC are something I unfortunately have not had time to revisit. Sorry for some off-the-cuff guesses, but if you have modest memory available, i.e. 20GB, you should be able to handle small batches of long reads fine (a few thousand reads), but beyond 100k reads the memory requirements may be considerably higher, i.e. 100GB or more. This makes running large sequencing runs like Illumina HiSeq memory prohibitive. If you have a large batch of reads and memory issues, I would recommend downsampling your reads prior to processing if you want to look at things like the error profile. Also, multiprocessing in AlignQC is not implemented very nicely: it does not use shared memory objects, so each additional process you add needs that additional amount of memory available. So my main recommendations are to a) set the number of threads you tell it to use for multiprocessing to a small number, or b) downsample the input reads.
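For the downsampling step, a tool like seqtk's `sample` subcommand is the usual route. If you'd rather not add a dependency, a minimal hand-rolled sketch in Python is below; it reservoir-samples a fixed number of reads from a FASTQ while streaming, so memory stays flat regardless of input size. The function names here are illustrative, not part of AlignQC:

```python
import random

def read_fastq(path):
    """Yield 4-line FASTQ records lazily so memory use stays flat."""
    with open(path) as fh:
        while True:
            rec = tuple(fh.readline() for _ in range(4))
            if not rec[0]:
                break
            yield rec

def downsample_fastq(records, k, seed=0):
    """Reservoir-sample k records from an iterable of FASTQ records.

    Each incoming record replaces a kept one with probability k/(i+1),
    which gives every read an equal chance of ending up in the sample.
    """
    rng = random.Random(seed)
    sample = []
    for i, rec in enumerate(records):
        if i < k:
            sample.append(rec)
        else:
            j = rng.randrange(i + 1)
            if j < k:
                sample[j] = rec
    return sample
```

Usage would be something like `downsample_fastq(read_fastq("reads.fastq"), 30000)` and writing the result back out before alignment, i.e. downsample the reads themselves rather than the BAM.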
So I guess dealing with almost 3 million long reads, subsampling is my only option.
That's the easiest approach @martynakgajos. The next best option would be to run with a single thread on a machine with a lot of memory and see how it goes, but that would probably take days to run.
I was finally able to run it in reasonable time (74 minutes, 35 GB) for 1% of the reads. For 10% of the reads, I wasn't seeing any progress after 3 days (max memory usage: 410 GB), and traverse_preprocessed.py seemed to be the problematic part for the bigger sample. However, I really love the insight into my data that the reports give me, thank you!
I keep getting a MemoryError in traverse_preprocessed.py. How can I properly estimate the memory requirements of the AlignQC pipeline given the number of reads? I am looking forward to the results ;)