nanoporetech / medaka

Sequence correction provided by ONT Research
https://nanoporetech.com

Samtools memory error HPC #394

Closed steinbrl closed 1 year ago

steinbrl commented 1 year ago

Hi,

I am trying to migrate my pipeline to an HPC/SLURM cluster, but the medaka polishing step produces an error.

"samtools sort: couldn't allocate memory for bam_mem Alignment pipeline failed. Failed to run alignment of reads to draft."

The samtools problem occurs only inside medaka. If I run a samtools view | samtools sort pipe manually, it works flawlessly, with no memory issues. There are more than enough resources on the HPC: I ran the script with 64 cores and 256GB of memory. The long-read file is only 870MB, so running out of RAM should be impossible. The problem seems to be in medaka's internal samtools call. The script runs perfectly on a workstation (2x Xeon, 40 threads, 128GB RAM, Ubuntu) with less memory. Do you have any ideas?

Greetings,

Lars Steinbrück

cjw85 commented 1 year ago

Hi Lars,

I'm not sure what's going on here. The medaka_consensus script was only really ever intended as a convenience and to demonstrate the various steps to use medaka. If you are putting medaka into a large bioinformatics workflow I would suggest studying the script and putting each of its steps as a discrete task in your pipeline.

steinbrl commented 1 year ago

Hi,

I am not using the full medaka pipeline, only the consensus step, as a last polishing step after 4 rounds of racon. It is a discrete call...

medaka_consensus -i "$sample"_filtered_long.fastq -d racon_polish4.fasta -o medaka -t $cpu

This produces the error...

steinbrl commented 1 year ago

OK, the problem was the thread count (the -t argument). I tried it with 2 and 4 threads, and that works. That is a pity when you have a lot of computing power available...
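The workaround above, capping the thread count passed to medaka_consensus, can be sketched as follows. The thread does not confirm why high thread counts fail. One possible explanation, stated here as an assumption, is that the thread count is forwarded to samtools sort, whose -m option is a per-thread memory limit with a documented default of 768 MiB, so the total request grows with the number of threads. The helpers `capped_threads` and `sort_buffer_mib` are hypothetical and not part of medaka or samtools.

```shell
# Hypothetical helper: print the smaller of a requested thread count and a cap.
capped_threads() {
    local requested=$1 cap=$2
    if [ "$requested" -gt "$cap" ]; then
        echo "$cap"
    else
        echo "$requested"
    fi
}

# Hypothetical helper: rough upper bound (in MiB) on sort buffers, ASSUMING
# every thread may use up to the samtools sort -m default of 768 MiB.
sort_buffer_mib() {
    local threads=$1 per_thread_mib=${2:-768}
    echo $(( threads * per_thread_mib ))
}

cpu=64                                  # cores allocated to the job
threads=$(capped_threads "$cpu" 4)      # 4 is a value reported to work in this thread
echo "threads=$threads sort_buffer_mib=$(sort_buffer_mib "$threads")"

# The polishing call from this issue would then become:
# medaka_consensus -i "$sample"_filtered_long.fastq -d racon_polish4.fasta -o medaka -t "$threads"
```

Under the same assumption, 64 threads would correspond to up to 64 x 768 MiB = 49152 MiB of sort buffers, which may exceed what a SLURM job is actually allowed to allocate even on a node with 256GB installed.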

cjw85 commented 1 year ago

As stated above, if you are embedding medaka in a larger pipeline you should study the medaka_consensus script and extract the distinct steps into your own pipeline.
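A minimal sketch of what extracting the distinct steps might look like is given below. The subcommands and flags are written from memory of the medaka README (mini_align, medaka consensus, medaka stitch) and may differ between medaka versions, so they should be checked against the installed medaka_consensus script. The file names, thread counts, and the `run` dry-run helper are hypothetical.

```shell
# Sketch only: verify every command and flag against your installed medaka.
reads="${sample:-sample}_filtered_long.fastq"   # hypothetical input name
draft="racon_polish4.fasta"
outdir="medaka"
align_threads=4        # hypothetical per-step thread counts
consensus_threads=2

DRY_RUN=${DRY_RUN:-1}  # by default only print the commands

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 1: align reads to the draft (this is where samtools sort runs).
run mini_align -i "$reads" -r "$draft" -m -p "$outdir/calls_to_draft" -t "$align_threads"

# Step 2: run the consensus network over the alignments.
run medaka consensus "$outdir/calls_to_draft.bam" "$outdir/consensus_probs.hdf" --threads "$consensus_threads"

# Step 3: stitch the predictions into a polished assembly.
run medaka stitch "$outdir/consensus_probs.hdf" "$draft" "$outdir/consensus.fasta"
```

Splitting the steps this way lets each one become its own scheduler task with its own thread and memory request. Set DRY_RUN=0 to execute the commands instead of printing them.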

cjw85 commented 1 year ago

"a last polishing step after 4 rounds of racon"

There is no need to run racon before medaka.

steinbrl commented 1 year ago

I found this strategy in a paper about hybrid assembly of bacterial genomes. The authors benchmarked several workflows and assembly/polishing tools.

http://doi.org/10.1186/s12864-021-07767-z

cjw85 commented 1 year ago

Running racon four times has not been our recommended approach for a number of years. The original "4 rounds of racon" approach derives from it being the procedure used to train some of the early medaka models. The inference models in Medaka are now trained to correct the direct output from Flye.

Our recommended approach is therefore to use the most recent version of the Guppy basecaller, assemble with Flye, and then run medaka.