lbcb-sci / racon

Ultrafast consensus module for raw de novo genome assembly of long uncorrected reads
MIT License

Feature request: is it possible to divide the work into stages with appropriate CLI #26

Open SHuang-Broad opened 4 years ago

SHuang-Broad commented 4 years ago

This is related to #24 .

My hands are tied in the following sense when polishing assemblies of large genomes with deep-coverage data:

  1. I want to make use of GPU acceleration
  2. Using a GPU limits the memory allocation for my VM (cloud vendor restriction)
  3. racon tends to load all sequences into memory for preprocessing, potentially demanding a lot of memory (depending on genome size and coverage)

Hence I am wondering if it is possible for racon to expose CLI parameters that permit jobs to be run in stages. Users could then configure VMs with different specifications for the different stages and resume work.

I know this might be a big request, but it would make our lives easier.

Thanks!

Steve

rvaser commented 4 years ago

Hi Steve, this will be a bit of a hassle to implement, because the read file is kept in memory during the whole run. The windows that are created only contain pointers to the sequences, so we do not copy the data unnecessarily. I guess we could store windows that contain the actual sequences to disk, and then use a different subroutine to do the multiple sequence alignment. I will have to think about the best way to do this, and I cannot guarantee when this request will be implemented.

Best regards, Robert
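
For illustration only, here is a minimal Python sketch of such a staged flow, assuming hypothetical window objects that carry copies of their sequences rather than pointers into the read file; this is not racon's actual API, just one way the two stages could be separated:

import pickle

def stage1_write_windows(windows, out_path):
    # Stage 1: serialize self-contained windows (holding sequence copies,
    # not pointers into the read file) so a later run can resume from disk.
    with open(out_path, "wb") as fh:
        for window in windows:
            pickle.dump(window, fh)

def stage2_polish(windows_path, consensus_fn):
    # Stage 2: stream the windows back from disk and run the multiple
    # sequence alignment / consensus step on each one, without the read
    # file ever being held in memory.
    with open(windows_path, "rb") as fh:
        while True:
            try:
                window = pickle.load(fh)
            except EOFError:
                return
            yield consensus_fn(window)

The two stages could then be run on differently sized VMs, with only the windows file shared between them.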

SHuang-Broad commented 4 years ago

Thanks Robert!

So please help me understand the situation here a bit better. What I observe is that for each batch/window (the number of batches is determined by --split), racon (the Python wrapper) loads all data into memory (this appears to be single-threaded) and processes the reads in that batch/window. Is that right? Now, since there is already an overlap file to begin with, would it help to use that overlap file so that loading all data into memory becomes unnecessary, and only the reads that "map" to the current window are loaded?

Best, Steve

rvaser commented 4 years ago

The --split option will split the assembly into batches and then start polishing on each of them by invoking Racon in a sequential manner.

Indeed, for this use case it would be better to first load the overlap file and drop everything from the read file that is not needed, and thus decrease the memory consumption. I cannot remember why we implemented it the other way around.
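
As a rough illustration of both points, assuming PAF-format overlaps (query name in column 1, target name in column 6) and a hypothetical list of (contig name, length) pairs, the batching and the overlap-first filtering could look like this; it is only a sketch, not the wrapper's actual code:

def batch_contigs(contig_lengths, split_size):
    # Group whole contigs into batches whose total length stays below
    # split_size; each batch is then polished by a separate racon run.
    batches, current, total = [], [], 0
    for name, length in contig_lengths:
        if current and total + length > split_size:
            batches.append(current)
            current, total = [], 0
        current.append(name)
        total += length
    if current:
        batches.append(current)
    return batches

def reads_needed(paf_path, batch_targets):
    # Scan the overlap file first and collect only the names of reads that
    # map onto a contig of the current batch; everything else could be
    # skipped while parsing the read file.
    wanted = set()
    with open(paf_path) as fh:
        for line in fh:
            fields = line.split("\t")
            if fields[5] in batch_targets:
                wanted.add(fields[0])
    return wanted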

SHuang-Broad commented 4 years ago

I totally understand there could be delicate reasons for not doing so.

SHuang-Broad commented 4 years ago

As I watch my job progress, another optimization that could be implemented when a GPU is available is to start loading the next batch of sequences while the GPU is doing the polishing (not the alignment), to save some time. The loading is usually single-threaded and IO-bound, hence high-latency, and most CPU threads are idle while the GPU is working hard.
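
As a generic sketch of that idea, not tied to racon's internals, a single background thread could parse the next batch while the current one is being polished:

from concurrent.futures import ThreadPoolExecutor

def polish_batches(batch_paths, load_fn, polish_fn):
    # While polish_fn works on the current batch (e.g. on the GPU), a
    # background thread already parses the sequences of the next batch.
    if not batch_paths:
        return
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_fn, batch_paths[0])
        for i in range(len(batch_paths)):
            batch = pending.result()  # wait for the prefetched batch
            if i + 1 < len(batch_paths):
                pending = pool.submit(load_fn, batch_paths[i + 1])
            polish_fn(batch)  # next batch loads in the background meanwhile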

rvaser commented 4 years ago

The complete sequence file is loaded at the beginning of the run, and usually should not take that much time. We will explore other options to see if we can reduce memory consumption on bigger genomes.

SHuang-Broad commented 4 years ago

So this is what I observe when running

./racon_wrapper \
    -u \
    -t 32 \
    -c 4 \
    --cudaaligner-batches 50 \
    --split 18000000 \
    ${READS} ${OVP} ${DRAFT}

on a primate genome:

Using 2 GPU(s) to perform polishing
Initialize device 0
Initialize device 1
[CUDAPolisher] Constructed.
[racon::Polisher::initialize] loaded target sequences 0.382995 s
[racon::Polisher::initialize] loaded sequences 2165.672248 s
[racon::Polisher::initialize] loaded overlaps 46.699042 s
[racon::CUDAPolisher::initialize] allocated memory on GPUs for alignment 0.624735 s
[racon::CUDAPolisher::initialize] aligning overlaps [====================] 29.238104 s
[racon::Polisher::initialize] aligning overlaps [====================] 80.019571 s
[racon::Polisher::initialize] transformed data into windows 4.801252 s
[racon::CUDAPolisher::polish] allocated memory on GPUs for polishing 10.350098 s
[racon::CUDAPolisher::polish] generating consensus [====================] 63.771369 s
[racon::CUDAPolisher::polish] polished windows on GPU 73.660493 s
[racon::CUDAPolisher::polish] generated consensus 0.279268 s
[racon::Polisher::] total = 2628.970957 s
Using 2 GPU(s) to perform polishing
Initialize device 0
Initialize device 1
[CUDAPolisher] Constructed.
[racon::Polisher::initialize] loaded target sequences 0.859031 s
[racon::Polisher::initialize] loaded sequences 1996.871102 s
[racon::Polisher::initialize] loaded overlaps 45.387511 s
[racon::CUDAPolisher::initialize] allocated memory on GPUs for alignment 0.517121 s
[racon::CUDAPolisher::initialize] aligning overlaps [====================] 26.356452 s
[racon::Polisher::initialize] aligning overlaps [====================] 78.293230 s
[racon::Polisher::initialize] transformed data into windows 4.440666 s
[racon::CUDAPolisher::polish] allocated memory on GPUs for polishing 9.928194 s
[racon::CUDAPolisher::polish] generating consensus [====================] 59.604708 s
[racon::CUDAPolisher::polish] polished windows on GPU 69.994798 s
[racon::CUDAPolisher::polish] generated consensus 0.183795 s
[racon::Polisher::] total = 2462.638074 s
# the racon blocks continue

And each time the devices/GPUs are (re-)initialized, the memory used by racon drops to almost zero, and it seems to me that all reads are loaded again, taking considerable time.

Am I running things in a bad manner?

SHuang-Broad commented 4 years ago

And I've attached the monitoring over the last 12 hours below (the timestamp in the top right is noise). It looks like IO has periodic peaks (reading), indicating that the reads are being reloaded.

[Attached: two monitoring screenshots, 2020-03-09]

rvaser commented 4 years ago

Unfortunately, it was designed that way. Any reason why you use 18 Mbp as the split size?

SHuang-Broad commented 4 years ago

Ah, I see.

I was just playing with the parameters, as I wasn't quite sure what exactly the --split parameter means. What I observed is that for my data the loading uses ~141 GB of memory; then, in between the two GPU alignment and polishing steps, racon uses the desired number of threads while memory peaks at about 148 GB.

So to make sure I understand: at the beginning of each batch (with the number of batches being the assembly size divided by the --split size), all reads are loaded, then the batch is processed on the GPU/CPU. The batch size (set by --split) directly affects the memory "overhead" on top of holding all reads in memory. Is that right?

rvaser commented 4 years ago

Indeed. I usually set the --split parameter a bit larger than the longest contig.
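
For example, with "draft.fasta" as a placeholder path, the longest contig can be found with a few lines of Python and the --split value rounded up from it:

def longest_contig(fasta_path):
    # Return the length of the longest sequence in a FASTA file.
    longest, current = 0, 0
    with open(fasta_path) as fh:
        for line in fh:
            if line.startswith(">"):
                longest = max(longest, current)
                current = 0
            else:
                current += len(line.strip())
    return max(longest, current)

split_size = int(longest_contig("draft.fasta") * 1.1)  # a bit larger than the longest contig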

SHuang-Broad commented 4 years ago

Thanks Robert! That tip is super helpful.

rvaser commented 4 years ago

You will still get a slowdown, because in each batch the reads are parsed anew :/
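
To put a rough number on that, using the ~2,000 s per-batch sequence-loading time from the log above and assuming, purely for illustration, a ~3 Gbp assembly with the original 18 Mbp split:

assembly_bp = 3_000_000_000   # assumed primate-scale assembly size
split_bp = 18_000_000         # --split value used in the run above
load_seconds = 2_000          # approximate per-batch read-loading time from the log

batches = -(-assembly_bp // split_bp)         # ceiling division -> 167 batches
reload_hours = batches * load_seconds / 3600  # ~93 h spent just re-parsing reads
print(batches, round(reload_hours, 1))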

SHuang-Broad commented 4 years ago

That I understand, and now that you are aware of it, I expect improvements will come, maybe not soon but sometime.

rvaser commented 4 years ago

Hopefully soon :D