Closed fbemm closed 5 years ago
Hi Felix, unfortunately yes, that is the expected behaviour (the whole read set is stored in memory, and if you have ~120x FASTQ that is around ~130Gb). You can speed up alignment parsing by using the MHAP format or by replacing read headers with numeric identifiers, which will drastically decrease the PAF file size. Also, if you by any chance have CIGAR strings in the PAF, racon will not use those, so you can remove them as well.
Best regards, Robert
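Removing CIGAR strings as suggested above can be done by keeping only the 12 mandatory PAF columns; CIGARs live in the optional `cg:Z:` tag after column 12. A minimal sketch (file names are hypothetical):

```shell
# PAF has 12 mandatory tab-separated columns; optional tags such as
# cg:Z:<CIGAR> come after them, so cutting to columns 1-12 drops them.
# Build a tiny example PAF with a CIGAR tag (hypothetical data):
printf 'read1\t100\t0\t100\t+\ttgt1\t200\t0\t100\t95\t100\t60\tcg:Z:100M\n' > with_cigar.paf

# Keep only the mandatory columns:
cut -f1-12 with_cigar.paf > no_cigar.paf
```

On a real multi-gigabyte PAF this can shrink the file considerably, since CIGAR strings for long reads are often longer than the rest of the record.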
Hi Robert,
I guess one could split the PAF file per target and then run racon chunked. I will try to convert PAF to MHAP and also look into shortening the read IDs. Thanks for pointing this out. Do you know by chance if the NVIDIA Clara Genomics implementation of racon is running through the same bottleneck?
All the best, Felix
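The per-target split mentioned above could be sketched with awk, writing each alignment to a file named after its target (PAF column 6). This is just an illustration with hypothetical file names:

```shell
# Two toy PAF records hitting different targets (hypothetical data):
printf 'r1\t10\t0\t10\t+\ttgtA\t20\t0\t10\t9\t10\t60\n' >  all.paf
printf 'r2\t10\t0\t10\t+\ttgtB\t20\t0\t10\t9\t10\t60\n' >> all.paf

# Column 6 is the target name; append each line to "<target>.paf".
# close() keeps the number of simultaneously open files low.
awk '{ out = $6 ".paf"; print >> out; close(out) }' all.paf
```

Each chunk could then be polished independently, though racon would still load the full read set for every chunk unless the FASTQ is subset accordingly.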
I did not have the chance to see the source code yet, but most probably they did not touch the parser. I think that on larger inputs the parsing is not that big of a problem, for now :)
Although, the PAF parsing step includes multithreaded alignment (I totally forgot that). Did you run racon without the option -t <threads> (1 thread is used by default)? Was that 1 thread active the whole time racon was parsing the PAF file?
I supplied 32 threads but see a CPU cap at 100% (a single CPU) all the time. I am copying the PAF file to a fast storage device now, but I don't think that is the issue.
My bad again, the time printed for parsing is separate from the alignment step. I'll try to parse something that big locally.
Ok I actually got things wrong. Sequence loading is the problem.
[racon::Polisher::initialize] loaded target sequences 3.730 s
[racon::Polisher::initialize] loaded sequences 100776.275 s
[racon::Polisher::initialize] loaded overlaps 20.544 s
Is the sequence file gzipped?
Yes, that might be the issue. Stupid me. Way faster if the FASTQ is uncompressed. Maybe add that to the README.md for people like me ;)
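Decompressing the FASTQ up front is a one-liner; `reads.fastq.gz` below is a hypothetical path:

```shell
# Build a tiny gzipped FASTQ (hypothetical data) to stand in for the real input:
printf '@r1\nACGT\n+\nIIII\n' | gzip > reads.fastq.gz

# Decompress to a plain FASTQ while keeping the original archive:
gzip -dc reads.fastq.gz > reads.fastq
```

For very large files, a parallel decompressor such as pigz (`pigz -dc`) may be faster, if available.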
Will do :)
I am running racon on 120x of mapping data. The target is 650Mb. The PAF input file is around 1Gb (~7.2M alignments, 3.2M reads). Target sequences are loaded within seconds (4s), but loading the alignments takes a long time (1232s), uses quite a lot of memory (128Gb), and runs single-threaded. Is that the expected behaviour? Reads have an N50 of about 45kb. I tried the wrapper, but that does not speed things up when chunking, I guess because I am still in the alignment loading stage. Is there any way to speed up the alignment loading?