isovic / racon

Ultrafast consensus module for raw de novo genome assembly of long uncorrected reads. http://genome.cshlp.org/content/early/2017/01/18/gr.214270.116 Note: This was the original repository which will no longer be officially maintained. Please use the new official repository here:
https://github.com/lbcb-sci/racon
MIT License

Performance questions #123

Closed fbemm closed 5 years ago

fbemm commented 5 years ago

I am running racon on 120x of mapping data. The target is 650Mb. The PAF input file is around 1Gb (~7.2M alignments, 3.2M reads). Target sequences are loaded within seconds (4s), but loading the alignments takes a long time (1232s), uses a lot of memory (128Gb), and runs single-threaded. Is that expected behaviour? Reads have an N50 of about 45kb. I tried the wrapper, but chunking does not speed things up, I guess because I am still in the alignment loading stage. Is there any way to speed up the alignment loading?

rvaser commented 5 years ago

Hi Felix, unfortunately yes, that is the expected behaviour (the whole read set is stored in memory, and ~120x of FASTQ comes to around 130Gb). You can speed up alignment parsing by using the MHAP format or by replacing read headers with numeric identifiers, which will drastically decrease the size of the PAF file. Also, if you happen to have CIGAR strings in the PAF, racon will not use them, so you can remove them as well.
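As a rough illustration of the two PAF suggestions above, here is a minimal Python sketch. It assumes a standard tab-separated PAF with 12 mandatory columns followed by optional SAM-style tags. The function name `compact_paf` and the numbering scheme are hypothetical, not part of racon. The read file would need the same renaming (the returned mapping is meant for that), otherwise the names would no longer match.

```python
def compact_paf(src, dst, id_map=None):
    """Copy PAF records from src to dst with numeric query names and no cg:Z: tag.

    src and dst are text file objects. Returns the {read name: integer} mapping
    so the same renaming can be applied to the read file (hypothetical helper,
    not part of racon).
    """
    id_map = {} if id_map is None else id_map
    for line in src:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip anything that is not a complete PAF record
        # Give each previously unseen read name the next integer.
        fields[0] = str(id_map.setdefault(fields[0], len(id_map)))
        # Columns 13+ are optional tags; drop only the CIGAR tag (cg:Z:).
        tags = [tag for tag in fields[12:] if not tag.startswith("cg:Z:")]
        dst.write("\t".join(fields[:12] + tags) + "\n")
    return id_map
```

Both arguments can be ordinary files opened in text mode, so the function streams the input line by line and only keeps the name mapping in memory.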

Best regards, Robert

fbemm commented 5 years ago

Hi Robert,

I guess one could split the PAF file per target and then run racon chunked. I will try to convert PAF to MHAP and also look into shortening the read IDs. Thanks for pointing this out. Do you know by chance whether the NVIDIA Clara Genomics implementation of racon runs into the same bottleneck?
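A minimal sketch of the per-target split, assuming a standard PAF where column 6 holds the target name. The function name `split_paf_by_target` and the `target_<n>.paf` file naming are hypothetical. It covers only the overlap file and keeps one output handle open per target, which may hit the open-file limit on assemblies with very many contigs.

```python
import os

def split_paf_by_target(paf_path, out_dir):
    """Write one PAF file per target name and return {target name: output path}."""
    os.makedirs(out_dir, exist_ok=True)
    handles = {}
    paths = {}
    try:
        with open(paf_path) as paf:
            for line in paf:
                fields = line.split("\t")
                if len(fields) < 12:
                    continue  # skip anything that is not a complete PAF record
                target = fields[5]  # column 6: target sequence name
                if target not in handles:
                    # Number the files so target names need not be valid file names.
                    paths[target] = os.path.join(out_dir, f"target_{len(paths)}.paf")
                    handles[target] = open(paths[target], "w")
                handles[target].write(line)
    finally:
        for handle in handles.values():
            handle.close()
    return paths
```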

All the best, Felix

rvaser commented 5 years ago

I did not have the chance to see the source code yet, but most probably they did not touch the parser. I think that on larger inputs the parsing is not that big of a problem, for now :)

However, the PAF parsing step includes multithreaded alignment (I totally forgot that). Did you run racon without the option -t <threads> (1 thread is used by default)? Was only one thread active the whole time racon was parsing the PAF file?

fbemm commented 5 years ago

I supplied 32 threads but see CPU usage capped at 100% (a single CPU) the whole time. I am copying the PAF file to a fast storage device now, but I don't think that this is the issue.

rvaser commented 5 years ago

My bad again, the time printed for parsing is separate from the alignment step. I'll try to parse something that big locally.

fbemm commented 5 years ago

Ok I actually got things wrong. Sequence loading is the problem.

[racon::Polisher::initialize] loaded target sequences 3.730 s
[racon::Polisher::initialize] loaded sequences 100776.275 s
[racon::Polisher::initialize] loaded overlaps 20.544 s

rvaser commented 5 years ago

Is the sequence file gzipped?

fbemm commented 5 years ago

Yes, that might be the issue. Stupid me. It is way faster if the FASTQ is uncompressed. Maybe add that to the README.md for people like me ;)
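For anyone hitting the same thing, one way to decompress the reads before running racon is sketched below in Python. The function name `decompress_gzip` and the chunk size are illustrative assumptions. It streams the data, so the whole file never has to fit in memory.

```python
import gzip
import shutil

def decompress_gzip(src_path, dst_path, chunk_size=1 << 20):
    """Stream-decompress a gzip file (e.g. reads.fastq.gz) to dst_path."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Copy in fixed-size chunks instead of reading everything at once.
        shutil.copyfileobj(src, dst, chunk_size)
    return dst_path
```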

rvaser commented 5 years ago

Will do:)