Open SHuang-Broad opened 4 years ago
Hi Steve, this will be a bit of a hassle to implement, because the read file is in memory during the whole run. Windows which are created only contain pointers to the sequences, so we do not copy the data unnecessarily. I guess we could store windows which contain actual sequences to disk, and then use a different subroutine for the multiple sequence alignment. I will have to think about the best way to do this, and I cannot guarantee when this request will be implemented.
Best regards, Robert
Thanks Robert!
So please help me understand the situation here a bit better.
What I observe is that for each window/batch (where the number of batches is set by --split), racon (the Python wrapper) loads all data into memory (appears to be single-threaded) and processes the reads in that batch/window. Is that right?
Now, since there is already an overlap file to begin with, would it help to use that overlap file so that loading all data into memory becomes unnecessary, and only the reads that "map" to the current window are loaded?
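To make the idea concrete, here is a rough sketch (my own, not racon's actual code) of what such overlap-driven filtering could look like: parse the PAF overlaps for the current window, then keep only the reads that touch it. The function names are hypothetical; only the standard PAF column semantics are assumed.

```python
def reads_in_window(paf_lines, target, win_start, win_end):
    """Collect names of reads whose overlaps intersect [win_start, win_end)
    on contig `target`.

    PAF columns used: 0 = query name, 5 = target name,
    7 = target start, 8 = target end.
    """
    names = set()
    for line in paf_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[5] != target:
            continue
        t_start, t_end = int(cols[7]), int(cols[8])
        if t_start < win_end and t_end > win_start:  # interval intersection
            names.add(cols[0])
    return names


def filter_fasta(fasta_lines, keep):
    """Yield only the FASTA records whose name is in `keep`."""
    emit = False
    for line in fasta_lines:
        if line.startswith(">"):
            emit = line[1:].split()[0] in keep
        if emit:
            yield line
```

Streaming the read file through such a filter before polishing each batch would bound memory by the reads a batch actually needs, at the cost of re-reading the overlap file.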
Best, Steve
The --split option will split the assembly into batches and then start polishing on each of them by invoking Racon in a sequential manner.
Indeed, for this use case it would be better to first load the overlap file and drop everything from the read file that is not needed, and thus decrease the memory consumption. I cannot remember why we implemented it the other way around.
I totally understand there could be delicate reasons for not doing so.
As I watch my job progress, another optimization that could be implemented (when a GPU is available) is to start loading the next batch of sequences while the GPU is doing the polishing (not the alignment), to save some time. The loading is usually single-threaded and IO-bound, hence has much higher latency, and most CPU threads are idle while the GPU is working hard.
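For illustration, a minimal double-buffering sketch of that idea (my own, not racon's code): a background thread loads batch N+1 while the main thread works on batch N. `load` and `polish` are placeholder callables standing in for file parsing and the GPU consensus step.

```python
import queue
import threading


def pipeline(batches, load, polish):
    """Overlap IO-bound loading of batch N+1 with compute on batch N.

    A one-slot queue is enough for double buffering: the loader blocks
    while the previous batch is still waiting to be consumed.
    """
    q = queue.Queue(maxsize=1)

    def loader():
        for b in batches:
            q.put(load(b))  # blocks until the consumer takes the slot
        q.put(None)         # sentinel: no more batches

    threading.Thread(target=loader, daemon=True).start()

    results = []
    while (data := q.get()) is not None:
        results.append(polish(data))
    return results
```

With this shape, the next batch's IO latency hides behind the current batch's polishing, as long as polishing takes at least as long as loading.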
The complete sequence file is loaded at the beginning of the run, and usually should not take that much time. We will explore other options to see if we can reduce memory consumption on bigger genomes.
So this is what I observe, by running
./racon_wrapper \
-u \
-t 32 \
-c 4 \
--cudaaligner-batches 50 \
--split 18000000 \
${READS} ${OVP} ${DRAFT}
on a primate genome:
Using 2 GPU(s) to perform polishing
Initialize device 0
Initialize device 1
[CUDAPolisher] Constructed.
[racon::Polisher::initialize] loaded target sequences 0.382995 s
[racon::Polisher::initialize] loaded sequences 2165.672248 s
[racon::Polisher::initialize] loaded overlaps 46.699042 s
[racon::CUDAPolisher::initialize] allocated memory on GPUs for alignment 0.624735 s
[racon::CUDAPolisher::initialize] aligning overlaps [====================] 29.238104 s
[racon::Polisher::initialize] aligning overlaps [====================] 80.019571 s
[racon::Polisher::initialize] transformed data into windows 4.801252 s
[racon::CUDAPolisher::polish] allocated memory on GPUs for polishing 10.350098 s
[racon::CUDAPolisher::polish] generating consensus [====================] 63.771369 s
[racon::CUDAPolisher::polish] polished windows on GPU 73.660493 s
[racon::CUDAPolisher::polish] generated consensus 0.279268 s
[racon::Polisher::] total = 2628.970957 s
Using 2 GPU(s) to perform polishing
Initialize device 0
Initialize device 1
[CUDAPolisher] Constructed.
[racon::Polisher::initialize] loaded target sequences 0.859031 s
[racon::Polisher::initialize] loaded sequences 1996.871102 s
[racon::Polisher::initialize] loaded overlaps 45.387511 s
[racon::CUDAPolisher::initialize] allocated memory on GPUs for alignment 0.517121 s
[racon::CUDAPolisher::initialize] aligning overlaps [====================] 26.356452 s
[racon::Polisher::initialize] aligning overlaps [====================] 78.293230 s
[racon::Polisher::initialize] transformed data into windows 4.440666 s
[racon::CUDAPolisher::polish] allocated memory on GPUs for polishing 9.928194 s
[racon::CUDAPolisher::polish] generating consensus [====================] 59.604708 s
[racon::CUDAPolisher::polish] polished windows on GPU 69.994798 s
[racon::CUDAPolisher::polish] generated consensus 0.183795 s
[racon::Polisher::] total = 2462.638074 s
# the racon blocks continue
And each time the devices/GPUs are (re-)initialized, the memory used by racon drops to almost zero, and it seems to me that all reads are reloaded, taking considerable time.
Am I running things in a bad manner?
And I've attached the monitoring over the last 12 hours below (the time stamp on the top right is noise). It looks like IO has periodic peaks (reading), indicating reloading of the reads.
Unfortunately, it was designed that way. Any reason why you use 18Mbp as the split size?
Ah, I see.
I was just playing with the parameters, as I wasn't quite sure exactly what the --split parameter means. What I observed is that, for my data, the loading uses ~141 GB of memory; then, in between the two GPU alignment and polishing steps, racon uses the desired number of threads while memory peaks at about 148 GB.
So to make sure I understand: at the beginning of each batch (the number of batches being roughly the total assembly size divided by the --split size), all reads are loaded, then the batch is processed by the GPU/CPU. The batch size (set by --split) will directly affect the memory "overhead" on top of holding all reads in memory.
Is that right?
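If I read this right, the splitting can be modeled roughly like so (a simplified sketch of my understanding, not racon_wrapper's exact packing logic):

```python
def split_into_batches(contig_lengths, split):
    """Greedily pack contigs into batches of at most ~`split` bp each.

    This is a simplified model of the --split behaviour: each batch is
    polished in its own racon invocation, and the reads are re-parsed
    once per batch.
    """
    batches, current, size = [], [], 0
    for length in contig_lengths:
        if current and size + length > split:
            batches.append(current)
            current, size = [], 0
        current.append(length)
        size += length
    if current:
        batches.append(current)
    return batches
```

Under this model, halving --split roughly doubles the batch count, and with it the number of times the read file is reloaded.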
Indeed. I usually set the --split parameter a bit larger than the longest contig.
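Since --split should then exceed the longest contig, a quick way to find that length (a throwaway helper of mine; any FASTA parser would do):

```python
def longest_contig(fasta_lines):
    """Return the length of the longest sequence in a FASTA stream,
    useful for picking a --split value a bit above it."""
    longest = current = 0
    for line in fasta_lines:
        if line.startswith(">"):
            longest = max(longest, current)
            current = 0
        else:
            current += len(line.strip())
    return max(longest, current)
```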
Thanks Robert! That tip is super helpful.
You will still see a slowdown, because in each batch the reads are parsed anew :/
That I understand, and now that you are aware of it, I expect improvements coming, maybe not soon but sometime.
Hopefully soon :D
This is related to #24.
My hands are tied in the following sense, when polishing assemblies of large genomes with deep-coverage data:
Hence I am wondering if it is possible for racon to expose CLI parameters that permit jobs to be run in stages. This way, users can then configure VMs of different specifications for different stages and resume work.
I know this might be a big request, but it would make our lives easier.
Thanks!
Steve