Stereo Duplex sloooooow

adbeggs commented 1 year ago

Hi all

I'm running the Stereo pipeline on both a V100 (our P24, fully updated) and an HPC A30 node. Both are considerably slower than the Guppy Duplex pipeline. Any suggestions? Ironically, the A30 at full tilt seems slower than the V100.

From the P24:

/data/software/dorado/bin/dorado duplex "/data/software/dorado/models/dna_r10.4.1_e8.2_400bps_sup@v4.0.0" pod5/ --pairs pairs_from_bam/pair_ids_filtered.txt | samtools view -b > duplex_dorado.bam

From our HPC:

#SBATCH --gres gpu:a30:1
#SBATCH --time 7-0:0:0
#SBATCH --tasks 20
module purge
module load bluebear
module load bear-apps/2021b
module load CUDA/11.4.1
module load SAMtools/1.15.1-GCC-11.2.0
export LD_LIBRARY_PATH=/rds/projects/b/beggsa-clinicalnanopore/software/dorado/lib:$LD_LIBRARY_PATH
/rds/projects/b/beggsa-clinicalnanopore/software/dorado/bin/dorado duplex /rds/projects/b/beggsa-clinicalnanopore/software/dorado/models/dna_r10.4.1_e8.2_400bps_sup@v4.0.0 pod5/ --pairs pairs_from_bam/pair_ids_filtered.txt | samtools view -h > duplexcalls.bam

Many thanks in advance!

Andrew

adbeggs commented 1 year ago

PS: At its current rate on the V100 it won't finish for 90 days! Guppy would usually take 4-5 days, depending on the volume of data.

vellamike commented 1 year ago

Hi Andrew, that seems odd, a few questions:

How much available RAM is there on the system?
What duplex pairing rates are you observing?

There is an edge case where Stereo will run slowly if follow-on rates are low, especially if you run out of RAM. I suspect this is what you are encountering. It's something we will fix early in the new year.

adbeggs commented 1 year ago

Hi Mike

The nodes have 500GB of system RAM, but the job wasn't being given the entire node. I have now set it to use the entire node, but it is still very, very slow. In fact, on our HPC dorado initiates but doesn't run. I might recompile from source to see if that makes any difference. Output is here:

CUDA/11.4.1
GCCcore/11.2.0
zlib/1.2.11-GCCcore-11.2.0
binutils/2.37-GCCcore-11.2.0
GCC/11.2.0
ncurses/6.2-GCCcore-11.2.0
zlib/1.2.11-GCCcore-11.2.0
bzip2/1.0.8-GCCcore-11.2.0
XZ/5.2.5-GCCcore-11.2.0
OpenSSL/1.1
cURL/7.78.0-GCCcore-11.2.0
SAMtools/1.15.1-GCC-11.2.0
[2022-12-28 14:14:22.917] [info] > Loading pairs file
[2022-12-28 14:14:22.939] [info] > Pairs file loaded
[2022-12-28 14:14:25.542] [warning] > warning: auto batchsize detection failed
[2022-12-28 14:14:27.389] [info] > Starting Stereo Duplex pipeline

It just sits there for hours and hours not doing anything. Duplex pairing rates on this library are 60%.

BW

Andrew

vellamike commented 1 year ago

That doesn’t match my theory, could you confirm if simplex calling works on this node? Could you also check if you are running out of RAM and falling back to swap memory?
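
A minimal sketch of the swap check, assuming a Linux node where /proc/meminfo is readable. The function name swap_used_kb is made up for illustration; it only reports how much swap is in use at the moment it runs, so it would need to be run while the duplex job is active.

```shell
# Report swap currently in use, in kB, from a meminfo-style file.
# Assumption: Linux /proc/meminfo layout ("SwapTotal: N kB", "SwapFree: N kB").
swap_used_kb() {
    awk '/^SwapTotal:/ {t = $2} /^SwapFree:/ {f = $2} END {print t - f}' "${1:-/proc/meminfo}"
}

# Only query the live system if the file exists (it will not on non-Linux hosts).
if [ -r /proc/meminfo ]; then
    echo "swap in use: $(swap_used_kb) kB"
fi
```

A value that keeps growing while dorado is running would point towards the host running out of RAM; a value near zero would not.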

vellamike commented 1 year ago

PS: How many CPU cores are available to the job on this node?

adbeggs commented 1 year ago

Hi Mike

Yes, simplex calling is working fine, calling very quickly as expected. There are 20 cores available on this node (it's an Icelake one). When I run it memory usage peaks at only 5G:

| Requested cpu=20,mem=400G,node=1,billing=20,gres/gpu=1 - 7-00:00:00 walltime
| Assigned to nodes bear-pg0103u14a
| Command /rds/projects/b/beggsa-clinicalnanopore/adb/NA12878/20221212_1633_3E_PAM86221_1ab2d60f/rundorado.slurm
| WorkDir /rds/projects/b/beggsa-clinicalnanopore/adb/NA12878/20221212_1633_3E_PAM86221_1ab2d60f
+--------------------------------------------------------------------------+
+--------------------------------------------------------------------------+
| Finished at Wed Dec 28 14:35:15 2022 for beggsa(8152) on the BlueBEAR Cluster
| Required (00:13.314 cputime, 5017850K memory used) - 00:01:29 walltime
| JobState COMPLETING - Reason None
| Exitcode 0:15
+--------------------------------------------------------------------------+

I terminated the job as it isn't doing anything...

adbeggs commented 1 year ago

Even on the P24 it is painfully slow, it has been running for 2 hours and has only managed to process 7200 reads!
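
For a rough sense of scale, 7200 reads in 2 hours is 3600 reads per hour, or about one read per second. The sketch below turns that rate into a remaining-time estimate. The total of 1,000,000 reads is a hypothetical figure chosen for illustration (the thread does not give the run's read count), and eta_days is a made-up helper name.

```shell
# eta_days TOTAL_READS PROCESSED_READS HOURS_ELAPSED
# Prints the estimated remaining run time in days, assuming the rate stays constant.
eta_days() {
    awk -v total="$1" -v done_reads="$2" -v hours="$3" 'BEGIN {
        rate = done_reads / hours                 # reads per hour so far
        printf "%.1f\n", (total - done_reads) / rate / 24
    }'
}

# Hypothetical 1,000,000-read run at the rate reported above (7200 reads in 2 hours).
eta_days 1000000 7200 2
```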

adbeggs commented 1 year ago

The only thing I can think of is that I am running it on a single, very large pod5 file (1100GB). Would that make a difference? It doesn't seem to for simplex.

vellamike commented 1 year ago

Ah, a very large pod5 is a relatively untested case and I can see several ways it would cause poor performance - luckily all quite fixable.

I will keep this issue open and get a fix to you in early Jan.

In the meantime could you demux the pod5 into smaller ones by channel ID and run stereo independently for each? This should be a best case scenario for performance with the present implementation.
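
A sketch of the "run stereo independently for each" part, assuming the large pod5 has already been split and that each group of channels sits in its own subdirectory under split_pod5/. That layout, the output file names, and the DRY_RUN switch are assumptions for illustration; the dorado and samtools invocation itself is the one from the first post. The splitting step is not shown.

```shell
# Assumed inputs (override via environment): model, pairs file, root of the split POD5s.
MODEL="${MODEL:-dna_r10.4.1_e8.2_400bps_sup@v4.0.0}"
PAIRS="${PAIRS:-pairs_from_bam/pair_ids_filtered.txt}"
SPLIT_ROOT="${SPLIT_ROOT:-split_pod5}"   # hypothetical layout: split_pod5/<group>/*.pod5
DRY_RUN="${DRY_RUN:-1}"                  # 1 = only print the commands, do not run them

# Build the duplex pipeline for one directory of POD5 files.
duplex_cmd() {
    local dir="${1%/}"
    local name
    name="$(basename "$dir")"
    printf 'dorado duplex "%s" "%s/" --pairs "%s" | samtools view -b > "duplex_%s.bam"\n' \
        "$MODEL" "$dir" "$PAIRS" "$name"
}

# Run (or print) one duplex job per subdirectory, one after another.
run_all() {
    local dir cmd
    for dir in "$SPLIT_ROOT"/*/; do
        [ -d "$dir" ] || continue        # the glob matched nothing
        cmd="$(duplex_cmd "$dir")"
        if [ "$DRY_RUN" = "1" ]; then
            echo "$cmd"
        else
            bash -c "$cmd"
        fi
    done
}

run_all
```

With DRY_RUN=1 the script only prints the commands it would run, which makes it easy to check paths before committing GPU time.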

incoherentian commented 1 year ago

#SBATCH --tasks 20

I can't explain the V100, but I think this SBATCH parameter is going to try loading 20x instances of Dorado, all of them trying to access the entire A30. What happens when you change this to the following?

#SBATCH --tasks 1

#SBATCH --cpus-per-task=20

incoherentian commented 1 year ago

What I was actually thinking was

#SBATCH --ntasks=1

#SBATCH --cpus-per-task=20
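
Put together with the GPU and time requests from the original job script, the header would look roughly like the sketch below. The report_slurm_layout helper is made up for illustration; it prints the SLURM_NTASKS and SLURM_CPUS_PER_TASK environment variables so the job log shows what was actually granted. Whether this changes the behaviour seen on the A30 is not established here.

```shell
#!/bin/bash
#SBATCH --gres gpu:a30:1
#SBATCH --time 7-0:0:0
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20

# Print the task/CPU layout SLURM exported to the job ("unset" outside SLURM).
report_slurm_layout() {
    echo "tasks=${SLURM_NTASKS:-unset} cpus_per_task=${SLURM_CPUS_PER_TASK:-unset}"
}

report_slurm_layout
# The module loads and the dorado duplex command from the original script would follow here.
```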

adbeggs commented 1 year ago

Update - it is a lot quicker with single POD5 files... interesting!

dithiii commented 1 year ago

Same issue here with slow stereo calling but it persists even when using multiple small pod5s. I'm using dorado 0.1.1, ubuntu 22.04. Simplex calling with the dna_r10.4.1_e8.2_400bps_fast@v4.0.0 model calls at 40,000 reads/s, but when I try stereo duplex, even with the "fast" model, it calls at 300 reads per minute.

Duplex tools claimed I had 18% duplex rate. Any fix?

Kirk3gaard commented 1 year ago

Hi

Would there be a speed benefit from using the sam file from the simplex super accuracy basecalling as input for dorado duplex calling (I think that was mentioned at NCM)? And if so how is that supplied?

I tried running dorado duplex dna_r10.4.1_e8.2_400bps_sup@v4.0.0 --pairs pairs_from_sam/pair_ids_filtered.txt sam_dir/ > duplex_orig.sam

However, it did not find any reads and just completed with 0 reads basecalled.

dorado duplex -h
Usage: dorado [-h] [--pairs VAR] [--emit-fastq] [--threads VAR] [--device VAR] [--batchsize VAR] [--chunksize VAR] [--overlap VAR] [--num_runners VAR] model reads

Positional arguments:
  model  Model
  reads  Reads in Pod5 format or BAM/SAM format for basespace.

vellamike commented 1 year ago

Hi @Kirk3gaard in sam_dir do you have pod5 files or a SAM file? Dorado Duplex calling requires the raw data in POD5 format, this is what reads in the help is referring to.

Kirk3gaard commented 1 year ago

Hi @vellamike, so the help text suggesting "BAM/SAM format for basespace." is not an option for speeding things up? Or even a real option anymore? I was just wondering how I get to the "duplex for free" scenario mentioned in the NCM presentation (see below) when I have already done simplex calling in super accuracy mode. (The RTX 4090 card basecalled our best PromethION run, ~200 Gbp, in 3 days with sup for simplex reads.) Reference: https://youtu.be/8DVMG7FEBys

vellamike commented 1 year ago

Ah, that is a hidden method for the eagle-eyed :)

This is a method which is very fast but works in sequence-space only so is less accurate, please run it like so:

duplex basespace /path/to/bam.bam --pairs /path/to/pairs.txt

This method is experimental - feedback welcome!

Kirk3gaard commented 1 year ago

Sneaky. Thanks a lot! Okay so the recommended way of getting the most out of a sequencing run (and the GPUs) at the moment is to

basecall all the pod5s with fast for getting the pairs
sort pod5 by channel ID (someone wrote a script for that?)
then run duplex calling with the sup model on the pairs using sorted pod5s
run simplex calling on the remaining reads with sup
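
For the last step, the "remaining reads" are the read IDs that do not appear anywhere in the pairs file. A minimal bash sketch of that set difference, assuming a pairs file with two read IDs per line and a plain text file listing every read ID in the run (one per line). Both helper names and the all_read_ids.txt file are hypothetical, and how the resulting list is handed to the basecaller is left open.

```shell
# paired_ids PAIRS_FILE: every read ID that occurs in any pair, sorted and unique.
paired_ids() {
    tr -s ' \t' '\n' < "$1" | grep -v '^$' | sort -u
}

# unpaired_ids ALL_IDS_FILE PAIRS_FILE: read IDs present in the run but in no pair.
# Uses bash process substitution; comm -23 keeps lines unique to the first input.
unpaired_ids() {
    comm -23 <(sort -u "$1") <(paired_ids "$2")
}

# Example (hypothetical file names):
#   unpaired_ids all_read_ids.txt pairs_from_bam/pair_ids_filtered.txt > simplex_only_ids.txt
```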

Looking forward to seeing this process simplified to output simplex and duplex with one command. I will give the basespace and pod5-based versions a try and see how long it takes.

vellamike commented 1 year ago

Hi @Kirk3gaard - yes, that is currently the best method. We are working on usability and performance improvements all the time and any feedback is very welcome.

vellamike commented 1 year ago

P.S. Sorting pod5 by channel ID is a "nice to have" but not crucial.

adbeggs commented 1 year ago

Hi @vellamike, still seeing this issue. I have single POD5 files, and fast calling with dorado on our A30 completes at 3e07 samples/s, but when I call duplex it just sits there saying "Starting stereo duplex pipeline".

I've checked and it has the whole A30 node available to it, so it shouldn't be running slowly. I am running it on Red Hat but can't see anything specific that might be causing the issue.

adbeggs commented 1 year ago

The whole run is teeny tiny: only 200k reads, but it is meant to be 40% duplex.

vellamike commented 1 year ago

Can you show me the Duplex command you are running?

Also, is your pairs file tab or space delimited? It needs to be space delimited, could you check this?
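
A quick way to check, sketched with standard tools. The two helper names are made up for illustration, and the converted file name in the usage comment is hypothetical.

```shell
# pairs_has_tabs FILE: succeeds (exit 0) if the file contains at least one tab character.
pairs_has_tabs() {
    grep -q "$(printf '\t')" "$1"
}

# pairs_to_space_delimited FILE: print the file with every tab replaced by a single space.
pairs_to_space_delimited() {
    tr '\t' ' ' < "$1"
}

# Example:
#   if pairs_has_tabs pair_ids_filtered.txt; then
#       pairs_to_space_delimited pair_ids_filtered.txt > pair_ids_filtered.space.txt
#   fi
```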

Kirk3gaard commented 1 year ago

"basespace" mode tried to load the entire BAM file into RAM before starting and died when it ran out of RAM. Maybe worth enabling a smarter way to avoid the need for massive memory.

I assume that only the two reads in the pair are needed to perform duplex calling, so it should be possible to load subsets of pairs without crashing. Enabling the use of fastq files as input might make it even more flexible for people to prepare subsets using existing tools in combination with the pair ID file.

vellamike commented 1 year ago

Hi @Kirk3gaard - that is indeed a problem with the current implementation of the Basespace method, especially for very large BAMs. Could you split your BAM by channel ID into multiple BAMs and run duplex on each?

Kirk3gaard commented 1 year ago

Tried running duplex with the pod5 files rather than basespace and it crashed after generating a sam file of the same size every time I tried. I looked through the syslog and it apparently runs well for some time and then suddenly runs out of memory.

"Out of memory: killed process 50831 (dorado)" "oom_reaper: reaped process 50831 (dorado)"

I would assume that it should be possible to run stereo duplex calling on a machine with 96 GB RAM and 24 GB GPU RAM as the software should not need to load all of the pod5 data into memory at once or whatever is causing this. Any hint as to what could be causing this?
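
For anyone wanting to confirm the same failure mode, a small filter for the kernel messages quoted above. The oom_events name is made up for illustration, and reading the kernel log may need elevated permissions on some systems.

```shell
# oom_events [FILE...]: print kernel log lines that mention the OOM killer.
# Reads the given files, or standard input when no file is given.
oom_events() {
    grep -Ei 'out of memory|oom_reaper|oom-kill' "$@"
}

# Examples:
#   dmesg | oom_events
#   journalctl -k | oom_events
#   oom_events /var/log/syslog
```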

vellamike commented 1 year ago

Hi Rasmus, right now the host memory consumption is governed in a complicated way by a few parameters:

Number of reads
Read length
Pairing rate
POD5 ordering

We have an upcoming release soon which significantly reduces the memory requirement on the host side for duplex. In the meantime, one thing you could do is demultiplex your pod5 by channel into multiple pod5s and run stereo on each independently.

vellamike commented 1 year ago

Hi @Kirk3gaard @adbeggs @incoherentian @dithiii ,

Version 0.2.1 of Dorado introduces big speed and RAM utilisation improvements to Duplex calling - could you try this?

Kirk3gaard commented 1 year ago

Should we test whether it runs without splitting reads by channel?

vellamike commented 1 year ago

Yes please - memory consumption is down quite a bit and this should work fine now.

adbeggs commented 1 year ago

I will try!

Kirk3gaard commented 1 year ago

It started nicely. Then processed 310600 reads before it got "Killed"

Commands used to run dorado and output:

MODELPATH="/home/ubuntu/Desktop/software/dorado-0.2.1-linux-x64/models"
MODEL="dna_r10.4.1_e8.2_400bps_sup@v4.1.0"
POD5DIR=pod5/

dorado duplex $MODELPATH/$MODEL --device "cuda:all" --min-qscore 25 --pairs pairs_from_sam/pair_ids_filtered.txt $POD5DIR/ > duplex_$MODEL.sam
[2023-02-23 15:52:55.097] [info] > Loading pairs file
[2023-02-23 15:52:55.400] [info] > Pairs file loaded
[2023-02-23 15:52:59.938] [info] > Starting Stereo Duplex pipeline
> Reads processed: 310600Killed

iiSeymour commented 1 year ago

Stereo performance improvements in https://github.com/nanoporetech/dorado/releases/tag/v0.2.2

nanoporetech / dorado

Stereo Duplex sloooooow #68