[Open] SBurnard opened this issue 2 months ago
@SBurnard
I am glad to see that you found this useful.
That --guppy_batchsize is a parameter that was used in some older Guppy versions to control a batch size inside the Guppy server. It was passed directly on to the Guppy server and, if I understand right, was useful for performance.
So this parameter got propagated from old scripts. For the Dorado server it is possible that this parameter was never used. In the latest ont-dorado-server, I guess ONT made a lot of breaking changes to the API and may simply have removed this parameter. @Psy-Fer any comments on this?
I will soon update the scripts to use the latest version of buttery-eel.
Hey,
Yeah, that option wasn't actually used in later versions of Guppy even when supplied, and the Dorado server has removed the option entirely.
Please update the script. You can probably remove --max_queued_reads 20000 too, as that is the default anyway.
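For example, the modcall command in the NCI script could then be trimmed to something like this (a sketch based on the command from the NCI modcall script quoted later in this thread; all other options left as-is):
/usr/bin/time -v buttery-eel -i ${MERGED_SLOW5} -o ${BASECALL_OUT}/reads.sam \
    --guppy_bin ${ONT_DORADO_PATH} --port ${PORT} --use_tcp --config ${MODEL} -x cuda:all \
    --slow5_threads 10 --slow5_batchsize 100 --procs 20 --call_mods || die "basecalling failed"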
James
Hi @hasindu2008 and James,
Thanks for your prompt responses. I had another comment and question regarding the NCI scripts.
Firstly, for choosing the appropriate model and config file, it would be helpful to have a description of where and how to identify the right option. It took me a little while to realise that the configs can be found in '/g/data/if89/apps/buttery-eel/0.5.1+dorado7.4.12/ont-dorado-server/data', and the config names don't exactly match the naming convention of the models. For example, for modbase calls the dorado model name contains '5mCG_5hmCG', whereas the config files used on NCI contain '5hmc_5mc_cg': the model 'dna_r10.4.1_e8.2_400bps_sup@v4.3.0_5mCG_5hmCG@v1' requires the config 'dna_r10.4.1_e8.2_400bps_5khz_modbases_5hmc_5mc_cg_sup.cfg'.
For someone who has been using and following the ONT information on dorado (which doesn't directly use config files), the altered naming convention for the model was slightly confusing.
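For anyone else hunting for the right config, one quick way is to list the configs shipped with the dorado-server build and filter by name (path as above; adjust for the module version you load):
# list the modified-base configs available in this ont-dorado-server build
ls /g/data/if89/apps/buttery-eel/0.5.1+dorado7.4.12/ont-dorado-server/data/ | grep modbases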
Secondly, does the new dorado server support alignment as well as basecalling? Dorado and slow5-dorado can perform both basecalling and alignment with minimap2 in the same command, whereas buttery-eel only mentions unaligned SAM files as output. I appreciate these can be aligned and converted to BAM afterwards; I can do this in one command with slow5-dorado, but that is obviously limited to a very old version of dorado. Is there a reason the full utility of dorado isn't available via buttery-eel? (Sorry if I've missed an obvious answer/reason for this.)
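For what it's worth, the workaround I have in mind is to realign the unaligned modbase SAM afterwards while carrying the MM/ML tags through; a rough sketch, assuming samtools and minimap2 are available and 'ref.fa' stands in for the reference:
# convert the unaligned SAM to FASTQ, keeping the modbase tags in the read comments,
# then align with minimap2 (-y copies those comments back into SAM tags) and sort
samtools fastq -@ 4 -T MM,ML ${BASECALL_OUT}/reads.sam \
    | minimap2 -ax map-ont -y -t 16 ref.fa - \
    | samtools sort -@ 4 -o ${BASECALL_OUT}/reads.aligned.bam -
samtools index ${BASECALL_OUT}/reads.aligned.bam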
@SBurnard I just pushed an update to the scripts so they reflect the latest version. Could you check if it works now? Also, I have moved the buttery-eel conversation to https://github.com/Psy-Fer/buttery-eel/issues/52, so please feel free to put your thoughts and comments there.
If you still face any issues with this updated script, please feel free to reopen. About the models, let us discuss on the buttery-eel repo. Thanks.
Hi @hasindu2008,
I've had a look at the updated NCI scripts, and with the latest 'buttery-eel/0.5.1+dorado7.4.12' the flags you've kept in should work.
However, I think it might be worth changing a couple of the parameters (particularly the threads) to make use of the NCI resources that are already requested. This would also help with clarity of how it can be further optimised for user needs on the NCI.
Focusing on the threads, the current script uses --slow5_threads 10 --procs 20. I had a look at the buttery-eel documentation, and I couldn't see how the total number of available threads/CPUs is split between --slow5_threads (for file reading) and --procs (worker threads for processing). Do these need to share the total number of available threads/cores?
If they should be split evenly across the threads/cores available on NCI, they could be set like so:
num_threads=${PBS_NCPUS}
slow5_read_threads=$(($num_threads/2))
slow5_worker_threads=$(($num_threads/2))
OR, if they need a 1:3 split:
num_threads=${PBS_NCPUS}
slow5_read_threads=$(($num_threads/4))
slow5_worker_threads=$(($num_threads/4*3))
Hey,
If you have a look at https://github.com/Psy-Fer/buttery-eel/blob/main/docs/thread_model.md you'll see a breakdown of how the thread model works.
A small amendment to the above proposed way of maximising the thread utilisation would be like so:
num_threads=$((${PBS_NCPUS}-5))
slow5_read_threads=$(($num_threads/2))
slow5_worker_threads=$(($num_threads/2))
where the -5 is to account for the following:
- 1 for reading. The slow5_threads are decompression threads; there is still 1 thread handling the actual data and pushing it into the queue.
- 1 for writing. This pulls from the output queue and writes the data to the appropriate files.
- 1 for running buttery-eel itself.
- 1-2 for guppy/dorado interprocess communication and operation of the server.
- (1) for a dummy client that makes sure everything is actually working (though it doesn't really use much, so I guess it could be ignored).
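Mapped onto the actual buttery-eel flags in the NCI PBS script, that would look roughly like this (the variable names are just illustrative):
# reserve ~5 cores for the reader, writer, buttery-eel itself and the dorado server
num_threads=$((${PBS_NCPUS}-5))
slow5_threads=$(($num_threads/2))
procs=$(($num_threads/2))
# then pass these instead of the hard-coded values, e.g.
#   --slow5_threads ${slow5_threads} --procs ${procs}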
Hope that helps.
James
The reason I was prompted to explore this was that I tried to run a full blow5 file, which was not going to complete within 48 hours: it had produced a ~130GB SAM file after 16 hours (~1 million reads processed). Meanwhile, after optimising the threads and batch size requested (and without requesting additional NCI resources), my latest run produced a ~200GB SAM file within 7-8 hours; in fact, it just finished after 23 hours, having processed 6,560,707 reads.
I tested four different runs initially with a 1.8GB blow5 file using:
i) --slow5_threads 10 --procs 20 --slow5_batchsize 100
ii) --slow5_threads 48 --procs 20 --slow5_batchsize 100
iii) --slow5_threads 10 --procs 20 --slow5_batchsize 4000
iv) --slow5_threads 48 --procs 20 --slow5_batchsize 4000
Logs:
i) CPU Time Used: 00:42:23, Memory Used: 17.98GB, GPU Utilisation: 149%, GPU Memory Used: 92.32GB, Walltime Used: 00:04:07
ii) CPU Time Used: 00:41:56, Memory Used: 18.49GB, GPU Utilisation: 183%, GPU Memory Used: 99.68GB, Walltime Used: 00:03:54
iii) CPU Time Used: 00:22:01, Memory Used: 23.37GB, GPU Utilisation: 132%, GPU Memory Used: 92.32GB, Walltime Used: 00:03:33
iv) CPU Time Used: 00:22:58, Memory Used: 23.49GB, GPU Utilisation: 216%, GPU Memory Used: 100.6GB, Walltime Used: 00:03:37
Increasing cores for slow5_threads alone produces a modest decrease in time and increased GPU usage (ii), but is probably limited by the number of reads read simultaneously (slow5_batchsize). Increasing slow5_batchsize to the default of 4,000 produces the greatest decrease in time (iii). Increasing slow5_threads as well (iv) raises GPU utilisation (132% to 216%) but takes a similar time; the lack of a further time decrease is likely due to the small number of reads in the test. Only 8,041 reads were processed, which is only just over 2*slow5_batchsize, so I anticipate this will help more with much larger files.
Nevertheless, even if people are testing with small files to begin with, it still shouldn't hurt to have these parameters already using the full capacity of the NCI resources requested? Then it's ready for both small and large runs. It would also save foolish people like myself from using up valuable KSU on incomplete runs...
Wow thank you @Psy-Fer for the very prompt response! Sorry I was busy finishing my above message and didn't see your speedy response!
It might just be because I'm not used to the standard GitHub layout (i.e. checking the docs folder), but I had assumed most descriptions would be in the top-level README. It might be worth linking to that doc from the README when referring to thread usage? Regardless, I'll make a mental note to double-check 'docs' folders in future.
I'll amend the scripts I'm using with your recommendations. Thanks.
@hasindu2008
When you have a chance, could you implement @Psy-Fer's recommended thread usage in the NCI scripts?
As NCI requires CPU cores to be requested in multiples of 12 for GPU jobs, it might be better to use:
num_threads=$((${PBS_NCPUS}-6))
slow5_read_threads=$(($num_threads/2))
slow5_worker_threads=$(($num_threads/2))
I just checked, and the integer division rounds down anyway (so it works out the same), but this is maybe a bit safer and makes it clearer how many cores to expect to be used.
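For example, with the 48 CPUs requested in these runs, both versions give the same split anyway:
echo $(( (48-5)/2 ))   # 21 (43/2, rounded down)
echo $(( (48-6)/2 ))   # 21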
Cheers, Sean
Hi @SBurnard
When we tested with Guppy+buttery-eel on NCI, the GPU utilisation was quite good for the current parameters in the script. With Dorado+buttery-eel this should have changed, especially with the new SUP models.
At the moment @kisarur who is at NCI is looking at optimisation of the parameters in this context.
Tagging him so that he sees this information too.
@SBurnard What model are you using and what is the size of the BLOW5 file?
Hi @hasindu2008
The model I'm using is from 'dna_r10.4.1_e8.2_400bps_5khz_modbases_5hmc_5mc_cg_sup.cfg'
The BLOW5 file is 1.3TB and contains ~6,500,000 reads.
The run I completed after changing the parameters to --slow5_threads 48 --procs 20 --slow5_batchsize 4000 finished with the following info in the logfile:
CPU Time Used: 327:26:19
Memory Used: 338.09GB
GPU Utilisation: 215%
GPU Memory Used: 92.35GB
Walltime Used: 23:50:52
Oh, the modbase models. So you could well be right; I don't think I benchmarked the modbase models much. Thanks for this.
Most welcome. I'm happy to help you, @hasindu2008, and @kisarur (couldn't tag him properly) with optimisation of the modbase models. Hopefully the information above already helps a bit!
Thanks for all your help thus far. :)
As tested by @kisarur, and as I briefly stated at https://github.com/Psy-Fer/buttery-eel/issues/55, this 100 batch size seems to have been valid for both Guppy and Dorado until this latest eel+dorado combination. I have updated the latest eel+dorado scripts to use a slow5 batch size of 4000 here: https://github.com/hasindu2008/nci-scripts/commit/e8454fc556b53f35e7b489adb4b11d0f69b275ba. For the past scripts, I have left it at 100, as 100 seems to have been a good value for those versions.
Thank you very much for taking the time to dig into this and report it. @Psy-Fer will in the meantime investigate the reason for this sudden change. I suspect it is because ONT's Dorado server-client approach was revamped in this recent release, but I am surprised that it affects the slow5 batch size.
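For anyone following along, the updated modcall command ends up roughly like this (a sketch; see the commit above for the exact diff, where only --slow5_batchsize moves from 100 to 4000 and the old --guppy_batchsize/--max_queued_reads flags are gone):
/usr/bin/time -v buttery-eel -i ${MERGED_SLOW5} -o ${BASECALL_OUT}/reads.sam \
    --guppy_bin ${ONT_DORADO_PATH} --port ${PORT} --use_tcp --config ${MODEL} -x cuda:all \
    --slow5_threads 10 --slow5_batchsize 4000 --procs 20 --call_mods || die "basecalling failed"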
You're most welcome.
For reference, I just completed another modcall run using the updated specs you've committed (slow5_threads=10, procs=20, slow5_batchsize=4000, with buttery-eel/0.5.1+dorado7.4.12).
Time for completion seems on par with the previous run I reported.
blow5 file size: 950GB
Total reads processed: 6,424,583
NCPUs Requested: 48
NCPUs Used: 48
CPU Time Used: 245:23:14
Memory Requested: 370.0GB
Memory Used: 236.53GB
NGPUs Requested: 4
GPU Utilisation: 160%
GPU Memory Used: 106.88GB
Walltime requested: 48:00:00
Walltime Used: 18:30:32
JobFS requested: 100.0MB
JobFS used: 0B
Hello, great - seems like the GPUs are being used effectively now, way better than it was before. This batch size matter seems to have appeared only from this newer version of buttery-eel+dorado server combination, as it was confirmed and explained in https://github.com/Psy-Fer/buttery-eel/issues/55#issuecomment-2415595930.
Also, another way to check the GPU usage is, while your job is running, to do a qstat -n, which will print the node your job is running on. Then you can ssh to that node and use watch -n0 nvidia-smi to observe the GPU utilisation. If you see all GPUs close to 100% for the majority of the time, the GPUs are being used effectively. If you see considerable gaps, there are more opportunities to optimise.
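As a concrete sketch (the job ID and node name are placeholders; substitute the values from your own job):
qstat -n <jobid>          # prints the node(s) the job is running on
ssh gadi-gpu-node         # hypothetical node name, taken from the qstat output
watch -n0 nvidia-smi      # continuously refreshes GPU utilisation; Ctrl-C to stop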
Thank you for creating the nci-scripts to enable buttery-eel execution.
While I was testing and adapting these scripts, I found that the use of '--guppy_batchsize' breaks execution when using the newest version, '/g/data/if89/apps/modulefiles/buttery-eel/0.5.1+dorado7.4.12'.
The example scripts you've shared (links below) use '0.4.2+dorado7.2.13' and include '--guppy_batchsize', which is either utilised or ignored there. Firstly, is this flag necessary, given that the command line also contains '--slow5_batchsize 100'? I can't see 'guppy_batchsize' in the buttery-eel GitHub documentation (it only mentions 'slow5_batchsize'). Secondly, if this flag is no longer supported by the latest buttery-eel version, it might be worth removing it so others don't run into this issue when they try a different version (and adding it to the 'breaking changes' in https://github.com/Psy-Fer/buttery-eel).
For reference, when I include '--guppy_batchsize 20000' using the latest buttery-eel '0.5.1+dorado7.4.12', I get the following error:
##############################################################
Loading /g/data/if89/apps/modulefiles/buttery-eel/0.5.1+dorado7.4.12
  Loading requirement: intel-mkl/2021.4.0 python3/3.10.4
trying port 5000
Using port 5000
Dorado Basecall Service Software, (C) Oxford Nanopore Technologies plc. Version 7.4.12+0e5e75c49, client-server API version 20.0.0
Use of this software is permitted solely under the terms of the end user license agreement (EULA). By running, copying or accessing this software, you are demonstrating your acceptance of the EULA. The EULA may be found in /g/data/if89/apps/buttery-eel/0.5.1+dorado7.4.12/ont-dorado-server/bin
Unexpected token '--guppy_batchsize' on command-line.
Unexpected token '20000' on command-line.
Try 'dorado_basecall_server --help' for more information.
##############################################################
Scripts referred to:
https://github.com/hasindu2008/nci-scripts/blob/main/modcall/buttery-eel-dorado.pbs.sh
/usr/bin/time -v buttery-eel -i ${MERGED_SLOW5} -o ${BASECALL_OUT}/reads.sam --guppy_bin ${ONT_DORADO_PATH} --port ${PORT} --use_tcp --config ${MODEL} -x cuda:all --guppy_batchsize 20000 --max_queued_reads 20000 --slow5_threads 10 --slow5_batchsize 100 --procs 20 --call_mods || die "basecalling failed"
https://github.com/hasindu2008/nci-scripts/blob/main/basecall/buttery-eel-dorado.pbs.sh
/usr/bin/time -v buttery-eel -i ${MERGED_SLOW5} -o ${BASECALL_OUT}/reads.fastq --guppy_bin ${ONT_DORADO_PATH} --port ${PORT} --use_tcp --config ${MODEL} -x cuda:all --guppy_batchsize 20000 --max_queued_reads 20000 --slow5_threads 10 --slow5_batchsize 100 --procs 20 || die "basecalling failed"