isovic / racon

Ultrafast consensus module for raw de novo genome assembly of long uncorrected reads. http://genome.cshlp.org/content/early/2017/01/18/gr.214270.116 Note: this is the original repository and is no longer officially maintained. Please use the new official repository here:
https://github.com/lbcb-sci/racon
MIT License

racon_wrapper #111

Open mictadlo opened 5 years ago

mictadlo commented 5 years ago

Hi, I am running Racon with Illumina paired-end reads as described here. However, it needs around 1 TB of memory. You provide racon_wrapper, and I wonder how to determine its additional parameters:

    --split <int>
        split target sequences into chunks of desired size in bytes
    --subsample <int> <int>
        subsample sequences to desired coverage (2nd argument) given the
        reference length (1st argument)

Does racon_wrapper run the chunks in sequence or in parallel, and how can memory be saved?

Additionally, I found a snakemake pipeline for Racon here.

Thank you in advance,

Michal

rvaser commented 5 years ago

Hi Michal, if you want to decrease memory usage you can use --split <longest contig length>, --subsample <reference length> 50 (or a lower coverage), or both. The chunks obtained by splitting are run in sequence.

Best regards, Robert
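(For reference: both the longest contig length, for --split, and the total assembly length, for --subsample, can be read directly off the assembly FASTA. A minimal sketch in plain Python; the script and file names are hypothetical, not part of Racon.)

    # longest_contig.py -- print the longest contig length (a starting
    # point for --split) and the total assembly length (the first
    # argument to --subsample) of a FASTA file. Plain Python, no
    # dependencies; a hypothetical helper, not part of Racon.
    import sys

    def contig_lengths(path):
        """Yield the length of every sequence in a FASTA file."""
        length = 0
        with open(path) as fasta:
            for line in fasta:
                if line.startswith(">"):
                    if length:
                        yield length
                    length = 0
                else:
                    length += len(line.strip())
        if length:
            yield length

    lengths = list(contig_lengths(sys.argv[1]))
    print("longest contig (for --split):", max(lengths))
    print("assembly length (for --subsample):", sum(lengths))

Run as python longest_contig.py assembly.fasta.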

000generic commented 4 years ago

I'm having a similar issue. I'm on a machine with 1024 GB of RAM and 48 CPUs, and my input files are:

    Reads (FASTQ): 369,668,281,386 bytes
    BAM:           573,985,197,116 bytes
    FASTA:           1,649,234,789 bytes

i.e., ~945 GB in total.

As I understand it, Racon's memory requirement can be estimated as the sum of the input file sizes plus some overhead. Depending on the overhead, I'm guessing I would be under the RAM maximum by ~50 GB...? However, the run crashed, I think due to memory limits.

I then used racon_wrapper with --split set to 1.1 * the longest contig length to reduce memory requirements, but it still crashed out, again due to memory limits I think. In both cases I watched memory usage creep up until the crash.

racon_wrapper illumina-paradoxus.fq minimap2-illumina_X_flye-ont-polished.sam flye-assembly_racon-ont-polish.fasta -t 45 --split 8483560 > flye-assembly_racon-illumina-polish.fasta

    [RaconWrapper::run] preparing data with rampler
    [RaconWrapper::run] total number of splits: 294
    [RaconWrapper::run] processing data with racon
    [racon::Polisher::initialize] loaded target sequences 0.029969 s
    terminate called after throwing an instance of 'std::bad_alloc'
      what():  std::bad_alloc

I can try --subsample next, and was wondering how to estimate the reference length to use.

Or let me know if you have any other suggestions.

Thank you !

rvaser commented 4 years ago

Hello Eric, the overhead of storing SGS sequences is high relative to the sequence file itself, so the total amount of memory needed is roughly 1.5 * sequence file + all other files. To decrease memory, you can use the PAF format instead of SAM and let Racon align the sequences on the go. Alternatively, you can subsample your dataset given the assembly size.

Best regards, Robert
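(Plugging the file sizes reported above into this estimate shows why the SAM-based run could not fit in 1 TB; a rough check, assuming decimal gigabytes.)

    # Back-of-the-envelope check of the estimate above:
    # 1.5 * sequence file + all other input files.
    GB = 1e9  # decimal gigabytes

    fastq = 369_668_281_386 / GB  # Illumina reads, ~370 GB
    sam   = 573_985_197_116 / GB  # alignment file, ~574 GB
    fasta = 1_649_234_789   / GB  # assembly, ~1.6 GB

    estimate = 1.5 * fastq + sam + fasta
    print(f"estimated peak memory: ~{estimate:.0f} GB")  # ~1130 GB > 1024 GB available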

000generic commented 4 years ago

Great - I will give the PAF format a try!

Regarding subsample - do you mean set the subsample reference length to the assembly length? So in my case, to subsample at coverage 50:

--subsample 1,649,234,789 50

Thank you!

rvaser commented 4 years ago

Yes, but leave out the commas: --subsample 1649234789 50.

000generic commented 4 years ago

Awesome - thanks again! Will try both - first PAF - and then if still needed, subsample.

000generic commented 4 years ago

Still no luck!

There is 1 TB of RAM. My reads are 370 GB, my FASTA is 2.5 GB, and my PAF file is 211 GB.

For minimap2, I mapped reads to a Racon ONT-polished assembly, supplying the Illumina paired-end reads as separate files. For Racon, I cat'd the two files together.

I supplied racon_wrapper's --split flag with 1.1 * the longest read length.

I created a bash file that contains: racon_wrapper reads.fq minimap2-reads-x-ont-polished-fasta.paf assembly_racon-ont-polished.fasta -t 45 --split 8483561

I ran things as:

bash bash-file > assembly_racon-illumina-polished.fasta &

and got

    [1] 15967
    (base) ::racon-pilon:
    [RaconWrapper::run] preparing data with rampler
    [RaconWrapper::run] total number of splits: 294
    [RaconWrapper::run] processing data with racon
    [racon::Polisher::initialize] loaded target sequences 0.031758 s
    terminate called after throwing an instance of 'std::bad_alloc'
      what():  std::bad_alloc

[1]+ Exit 1 bash bash-file > assembly_racon-illumina-polished.fasta

I stopped watching after memory usage passed 45% (and was still growing), but I'm assuming the memory was used up and this led to the 'std::bad_alloc'.

Any idea what is going on? I'll try again with --subsample, but I was hoping to avoid it, since you seem to suggest elsewhere that it may cause polishing to underperform.
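(One way to confirm that the bad_alloc really is memory exhaustion is to run the wrapper as a child process and read its peak RSS afterwards. A sketch assuming Linux, where ru_maxrss is reported in kilobytes, and assuming the wrapper waits on the racon processes it spawns; the command and file names are the ones from the invocation above.)

    # Run racon_wrapper and report the peak resident set size of its
    # (waited-for) child processes afterwards. Linux assumed.
    import resource
    import subprocess

    cmd = [
        "racon_wrapper", "reads.fq",
        "minimap2-reads-x-ont-polished-fasta.paf",
        "assembly_racon-ont-polished.fasta",
        "-t", "45", "--split", "8483561",
    ]

    with open("assembly_racon-illumina-polished.fasta", "w") as out:
        result = subprocess.run(cmd, stdout=out)

    peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    print(f"exit code {result.returncode}, peak RSS ~{peak_kb / 1e6:.1f} GB")

On Linux, /usr/bin/time -v reports the same peak without a script.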

000generic commented 4 years ago

It worked using paf and --subsample!

I ran subsample with coverage 50 - does this seem reasonable, and do you have a sense of what works well in general? Also, do you feel it generally produces suboptimal polishing? Would it be a good idea to do two rounds of polishing when using --subsample? Or is it worth continuing to work things out for --split?

Thank you!

rvaser commented 4 years ago

Can you please run head -n 1 <first.fastq> <second.fastq>? I want to see if everything worked as intended. Also, by 1.1*longest read length do you mean longest contig length? I guess one iteration should suffice, but you can check BUSCO scores and maybe run a second iteration (if it does not take too much time) to see if it helps.
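(For the second iteration, the round-1 output becomes the new target: re-map the reads to it and polish again. A sketch assuming minimap2 and racon_wrapper are on PATH; the -x sr preset and the file names are assumptions, not taken from this thread.)

    # Second polishing round: map reads to the round-1 polished assembly,
    # then run racon_wrapper on the new overlaps. Hypothetical file names.
    import subprocess

    reads    = "illumina-paradoxus.fq"
    assembly = "flye-assembly_racon-illumina-polish.fasta"  # round-1 output
    overlaps = "round2.paf"
    polished = "flye-assembly_racon-illumina-polish-2.fasta"

    with open(overlaps, "w") as paf:
        subprocess.run(["minimap2", "-x", "sr", "-t", "45", assembly, reads],
                       stdout=paf, check=True)

    with open(polished, "w") as out:
        subprocess.run(["racon_wrapper", reads, overlaps, assembly,
                        "-t", "45", "--subsample", "1649234789", "50"],
                       stdout=out, check=True)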

000generic commented 4 years ago

Here is head on the read files sent to minimap2:

    (base) ::racon-pilon: head -n 1 ../../../reads/illumina-paradoxus-1.fq
    @FCD05W8ACXX:6:1101:1703:1995#CGGGAGGT/1
    (base) ::racon-pilon: head -n 1 ../../../reads/illumina-paradoxus-2.fq
    @FCD05W8ACXX:6:1101:1703:1995#CGGGAGGT/2

My mistake - I meant 1.1 * the longest contig, not the longest read!

I'll run a second round of polishing, then run BUSCO on everything and see how things look.

Thank you!

rvaser commented 4 years ago

Everything looks fine. The only thing that bothers me is whether random subsampling of short reads will work as well as it does for long reads.

000generic commented 4 years ago

I wonder about the effects of subsampling too - but I still haven't been able to work out why --split isn't fixing what I think is a RAM issue, given that with PAF the data should be within the predicted RAM limits.

rvaser commented 4 years ago

Not sure either. We will probably have to overhaul Racon and perhaps reimplement some parts.