nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Basecalling slow? #90

Closed LauraVP1994 closed 1 year ago

LauraVP1994 commented 1 year ago

Hello!

I have sequenced 5 samples with barcoding on an R10 flow cell, resulting in ~284 GB of fast5 data. I want to analyse this data on our GPU servers (see screenshot).

[Screenshot: GPU server specifications, vm-rstudio-02 Remote Desktop session]

The script that we used (with SLURM) is:


#!/bin/bash
#SBATCH -J basecall_sarscov2_R10
#SBATCH --output=sbatch-%x-%j.out # j: job id, x: job name
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:1

# Regular simplex basecalling
srun dorado basecaller dna_r10.4.1_e8.2_260bps_sup@v4.0.0 fast5/ --device cuda:all --emit-moves > /simplex_basecalling/unmapped_reads_simplex.sam

# Identifying read pairs
srun duplex_tools pair --output_dir /pairs_from_bam /simplex_basecalling/unmapped_reads_simplex.sam

# Duplex calling for paired reads
srun dorado duplex dna_r10.4.1_e8.2_260bps_sup@v4.0.0 fast5/ --device cuda:all --pairs /pairs_from_bam/pair_ids_filtered.txt > /duplex_basecalling/unmapped_reads_duplex.sam

# Demultiplexing (can be done on simplex / or duplex reads)
srun guppy_barcoder --device cuda:all --compress_fastq --records_per_fastq 0 --input_path simplex_basecalling/ --save_path simplex_guppy_basecalling/ --trim_barcodes

srun guppy_barcoder --device cuda:all --compress_fastq --records_per_fastq 0 --input_path duplex_basecalling/ --save_path duplex_guppy_basecalling/ --trim_barcodes

However, after 24 hours of running we only have a SAM file of 4.986 GB for the simplex basecalling (coming from a FAST5 folder of 282 GB). In a previous run we had a SAM file of 4.894 GB for 14 GB of FAST5, so I'm a bit worried that this will take days/weeks to complete.

I'm thus hoping that someone can give me pointers on how to make this faster without losing the quality of the super-accuracy basecalling. We want to look at low-frequency variants, so we need the highest quality possible.

Thank you!

iiSeymour commented 1 year ago

Hey @LauraVP1994, one immediate thing that jumps out is that dorado is only running on one of the four GPUs here. Is this intentional? If not, I think you want to increase --gres from gpu:1.
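
For reference, a minimal sketch of the SBATCH changes that would hand the job all of the node's GPUs (gpu:4 is an assumption based on the screenshot showing four GPUs):

#SBATCH --ntasks=1            # keep a single task; dorado spreads work across GPUs itself
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:4          # request all four GPUs instead of one

# --device cuda:all then picks up every GPU SLURM has allocated to the job
srun dorado basecaller dna_r10.4.1_e8.2_260bps_sup@v4.0.0 fast5/ --device cuda:all --emit-moves > /simplex_basecalling/unmapped_reads_simplex.sam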

LauraVP1994 commented 1 year ago

Indeed I tried with one GPU, but that's because experienced bioinformaticians at my lab told me that if you use e.g. two GPUs, the run will likely only go ~20% faster. So then it seems better to run two jobs in parallel? However, their experience is mainly based on guppy, so I'm not sure whether this also holds for dorado.
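
As a sketch of the two-jobs-in-parallel idea: a SLURM job array could give each half of the data its own GPU, assuming the fast5 folder were split into two hypothetical subfolders fast5_part0/ and fast5_part1/ (placeholder names, not real paths from this run):

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:1
#SBATCH --array=0-1           # two array tasks, one GPU each

# hypothetical split of the input into two folders
PARTS=(fast5_part0 fast5_part1)

srun dorado basecaller dna_r10.4.1_e8.2_260bps_sup@v4.0.0 "${PARTS[$SLURM_ARRAY_TASK_ID]}" --device cuda:all --emit-moves > /simplex_basecalling/unmapped_reads_simplex_part${SLURM_ARRAY_TASK_ID}.sam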

Moreover, I don't know whether converting fast5 to pod5 would save enough basecalling time to be worth the time spent on the conversion?

iiSeymour commented 1 year ago

Better multi-GPU scaling is one of the benefits of dorado. Pod5 is highly recommended for maximum performance, yes.
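
For reference, a minimal conversion sketch using the pod5 Python package (paths are placeholders, and the subcommand layout has changed between pod5 releases, so check pod5 --help for your version):

# install the conversion tools
pip install pod5

# convert the fast5 files into a single pod5 file
# (adjust the glob if the fast5 files sit in nested barcode subfolders)
mkdir -p pod5
pod5 convert fast5 fast5/*.fast5 --output pod5/reads.pod5

# then point dorado at the pod5 folder instead of fast5/
srun dorado basecaller dna_r10.4.1_e8.2_260bps_sup@v4.0.0 pod5/ --device cuda:all --emit-moves > /simplex_basecalling/unmapped_reads_simplex.sam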

incoherentian commented 1 year ago

So then it seems better to run two jobs in parallel?

I know this isn't your main inquiry, but I just want to chime in that I get far more than a 20% boost using guppy with a second GPU (albeit on P100s, which dorado explicitly does not support; for me that means an instant efficiency loss waiting in a long SLURM queue for more modern GPU nodes instead of waiting a few seconds for P100 nodes, but I'll stop that tangential rant here!).

Just make sure --ntasks is not greater than 1 (as you already do for dorado), and guppy manages multiple GPUs fine. --cpus-per-task=2 can also help, since a single CPU can cause a brief bottleneck early in a run; I haven't tested this with dorado, though, so for testing I'd just throw ample cores at it to rule out an early CPU bottleneck. Your setting of 4 for dorado would definitely be more than enough for guppy.

Anyway, guppy does fine with multiple CUDA devices, but dorado really does do better: doubling the CUDA devices gave me roughly 1.99x performance. That's nuts.

In case it helps with your SBATCH jobs, here's what I threw together to achieve 1.99x performance from 2x GPUs (compared to the same with --gres=gpu:1):

#!/bin/bash
#SBATCH -J dorado011        # Job name
#SBATCH -o o.dorado.%J           # Job output file
#SBATCH -e e.dorado.%J           # Job error file
#SBATCH --ntasks=1          # number of parallel processes (tasks)
#SBATCH --cpus-per-task=8     # physical processor cores per task
#SBATCH -p gpu_v100          # selected queue
#SBATCH --gres=gpu:2         # number of gpus per node
#SBATCH --time=TT:TT:TT      # time limit
#SBATCH --account=NNNNNNN    # project account code

#Load defaults
set -eu
module purge
module load system
module load anaconda/2022.10

#Load dorado
#Note that this will only work for project folder assignees (BIKE possibly sans Brad)
source activate
conda activate $SLURM_JOB_NAME
export PATH=${PATH}:$(readlink -f /home/NNNNNNN/dorado-0.1.1/bin)

#Enter flow cell and sample ID
FLOWCELL=NNNNNNN
BASECALLER=$SLURM_JOB_NAME
SAMPLE_ID=barcode
SAMPLE_RANGE_LOWER=NN
SAMPLE_RANGE_UPPER=NN

#Change this if you don't want SUP accuracy
#Nodes assigned an sbatch do not necessarily have wider web access for downloading models;
#If newer models exist, download new model on a login node (dorado download --model $MODEL)
MODEL=/home/NNNNNNN/dorado-0.1.1/dna_r9.4.1_e8_sup@v3.3

#Set directories
WDPATH=/scratch/$USER/$FLOWCELL/
cd $WDPATH

#run dorado
for i in $(seq -w $SAMPLE_RANGE_LOWER 1 $SAMPLE_RANGE_UPPER)
do
    mkdir -p $WDPATH$SAMPLE_ID"$i"/$BASECALLER/$SLURM_JOBID/
    dorado basecaller $MODEL --emit-fastq --device cuda:all $WDPATH$SAMPLE_ID"$i" > $WDPATH$SAMPLE_ID"$i"/$BASECALLER/$SLURM_JOBID/$SAMPLE_ID"$i".fastq || echo "dorado error in i=$i"
done
(saved as dorado011x2gpu.sbatch)

The conda setup is a bit of a hack job, since I can't ask our ARCCA HPC team (who are responsible for the entire uni) to add SLURM modules for every piece of software I want to test, and it also works around the need to use pre-demultiplexed raw data; the SBATCH parameters are otherwise pretty similar to yours. Hope you get similar efficiency with the new version!