nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

NVidia RTX 6000 Ada generation benchmark #706

Open tnn111 opened 3 months ago

tnn111 commented 3 months ago

I've been asking for dorado benchmarks, so I thought I'd contribute one too. I just put together a system with an AMD 7975X CPU, ASUS WRX90E motherboard, DDR5 memory, a Sabrent 8TB SSD, and an NVIDIA RTX 6000 Ada (48 GB) card.

I ran dorado on ~23 GB of pod5 files. The command was

dorado basecaller --verbose sup,5mC_5hmC,6mA pod5

The result was

Basecalled @ Samples/s: 2.955935e+06

Compared to an NVIDIA A100 (40 GB) card with the same data:

Basecalled @ Samples/s: 3.756020e+06

So the RTX 6000 Ada is ~80% as fast as the A100 (2.955935e+06 / 3.756020e+06 ≈ 0.79).

Does anyone have a 4090 they could do benchmarks on using a substantial dataset? Right now, I'm inclined to build a liquid cooled system with 4 of the RTX 6000 Ada cards in it.

Thanks.

vellamike commented 3 months ago

Hello @tnn111,

When you compare to the A100, what benchmark are you using? Is it on the same system with only the GPU swapped out? Benchmarking is relatively complex because the balance of CPU, disk, and GPU performance all interact, especially with all-context modified basecalling.
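One quick way to sanity-check where the bottleneck sits during a run (a sketch only, assuming the NVIDIA driver's nvidia-smi utility is available) is to watch GPU utilisation while dorado is basecalling:

nvidia-smi dmon -s u -d 5

If the GPU sits well below 100% utilisation for long stretches, the CPU or disk is likely the limiting factor rather than the card itself.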

Mike

tnn111 commented 3 months ago

Hi Mike,

The A100 I compared against was an interactive node at the US DOE NERSC supercomputer facility. I agree that benchmarking is complex, but I think this is within the ±5% range, based on some experience with these sorts of measurements.

I do have an A100 sitting on my desk, though, and I intend to do a benchmark where it's a simple swap of the GPU with nothing else changed. The reason I haven't done it yet is that the A100 is passively cooled, and I'm not sure I can keep it cool enough in a standard desk-side workstation.

I wish ONT would make available benchmarks for different GPUs. I'm looking for something informative more than anything else. I'm just trying to pick the most affordable processing system for Nanopore data.

Thanks, Torben

ymcki commented 3 months ago

Speed at ~80% of the A100 is better than expected. We are getting some L40S cards soon and will do similar benchmarking.

Is it possible that you could also align HG002 reads to HG002v1.0.1 to compare performance between the two, like what I did here? https://github.com/nanoporetech/dorado/issues/702
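To illustrate the kind of comparison I mean, a rough sketch (filenames are illustrative and it assumes minimap2 and samtools are on the PATH; not necessarily the exact commands from #702):

minimap2 -ax map-ont -t 32 hg002v1.0.1.fasta hg002_calls.fastq | samtools sort -@ 8 -o hg002_calls.bam
samtools stats hg002_calls.bam | grep "error rate"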

Thx

billytcl commented 3 months ago

The L40S would be great! It is specced higher than the A100 on paper, but I believe the memory bandwidth is lower. Would love to see the side-by-side comparison.

vellamike commented 3 months ago

@tnn111

The reason I haven't done it yet is that the A100 is passively cooled and I'm not sure I can keep it cool enough in a standard desk side workstation.

I wouldn't advise doing this: it's very likely that the A100 will thermally throttle, and there is also a possibility that it will shut down as a protective measure.

I wish ONT would make available benchmarks for different GPUs. I'm looking for something informative more than anything else. I'm just trying to pick the most affordable processing system for Nanopore data.

We perform most benchmarks on the A100 towers that are shipped with our PromethION systems. The primary reason we don't do more extensive benchmarks on NVIDIA hardware is that the number of combinations of GPU, CPU, RAM, canonical basecall model (HAC/SUP), and modified base model (5mCG, 6mA, 5mC, etc.) is extremely large. I do understand that this can be a little frustrating when setting up your own system for basecalling.

In general, I would say that your RTX 6000 Ada numbers are roughly in line with my expectations.

samuelmontgomery commented 3 months ago

I have a PC with an i7-13700F, 64GB of DDR5 RAM, and 8TB of NVMe storage, with a 4090 24GB card.

Running a similar command (C:\dorado\bin\dorado.exe basecaller sup,6mA,5mC_5hmC pod5) on a 15 Gb dataset (~180 GB) gets Basecalled @ Samples/s: 3.871885e+06

I am not sure if this would apply to the A6000, but my samples/s is higher when tweaking the batch size to be slightly smaller than the auto-detected batch size on this card (auto: 832, manual setting: 640).
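For anyone wanting to try the same tweak, the manual value can be passed on the command line; a sketch, assuming the flag is --batchsize (check dorado basecaller --help for your version):

C:\dorado\bin\dorado.exe basecaller --batchsize 640 sup,6mA,5mC_5hmC pod5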

tnn111 commented 3 months ago

Hi Sam,

Thanks! I really appreciate that information. A while ago, Ryan Wick posted benchmarks for his Onion system, which were in line with this. Mike Vella gave a cautionary response, but based on what Ryan has done in the past, I assumed it was close to the truth. I was, however, concerned about the limited amount of data he used for his benchmark.

Your numbers imply that a 4090 is about the same as an A100 40 GB. I'd already decided to buy at least one of the 4090s and try it out, but now I'll likely go for two of them and make the system water-cooled up front.

I do not need to handle the output of a P24/48. All I need is to run SUP basecalling using a P2 Solo, and I think I can keep up with that using a couple of 4090s, or 6000 Ada cards. I'll probably do both because I have other things I want to do.

Thanks!

Torben


samuelmontgomery commented 3 months ago

I can confirm that one flow cell on the P2 can keep up with live basecalling in SUP; running both positions can keep up in HAC. That's using the version of Dorado in MinKNOW.

tnn111 commented 3 months ago

Thank you so much! That is what I needed to know.

I sincerely wish that ONT would be more transparent about these kinds of issues, but I realize that they want to sell towers. In the long run, not separating sequencing from processing is unfortunate.


billytcl commented 3 months ago

That’s really cool! I wonder if having a dual 4090 setup would speed it up?


tnn111 commented 3 months ago

Hi Billy,

I'd expect that you'd see a linear speedup. That's what I believe ONT has said for other GPUs, and I see no reason why this would be different.

I'm planning on buying two 4090s to test this out. Because each of them is 450W TDP, I'm going to do a water-cooled system with them, and I don't expect to be able to put more than two in one chassis. But I think that's enough for one P2 running SUP.

Thanks all for the help.

Torben


Psy-Fer commented 3 months ago

Hey,

I'm building some 2x 4090 builds at the moment. When I'm done I'll post some benchmarks ☺️

James

ymcki commented 2 months ago

My run time for the HG002 0429 dataset with the sup model is 12 hr on 4x L40S and 8 hr on 4x A100. For the hac model, it is 2.5 hr and 2 hr respectively.

So for hac the L40S is indeed running at ~80% of the A100, but for sup it is only ~67%. Is there something in sup that causes the slowdown?

DHmeduni commented 1 month ago

I would be interested in building a workstation, but was asking myself what hardware I would need for adaptive sampling on two flow cells simultaneously. Does anyone have any experience with an RTX GPU or an Ax000-series card?

Psy-Fer commented 1 month ago

I've built a system with 2x 4090 cards. Running software like dorado, dorado server, or my software wrapper buttery-eel goes quite well: 100% utilisation, and it flies along at ~1M reads every 12 min or so with hac.
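For the dual-card runs, dorado can be pointed at both GPUs via its device string; a minimal sketch (the output filename is illustrative, and cuda:all assumes both cards are visible to the driver):

dorado basecaller --device cuda:all hac pod5/ > calls.bam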

So if you are using readfish, which externalises the basecalling outside of MinKNOW, you will probably be fine.

If you are using basecalling internal to MinKNOW, for some reason I only get 50% utilisation of each card during basecalling through the GUI, and it makes everything half the speed. So I wouldn't trust it for real-time stuff.

This is all on Linux. Windows has its own quirks.

We also already do lots of ReadUntil using 3090 cards and the V100s in the PromethION. They all work really well using readfish.

I hope that info helps.

James

tnn111 commented 1 month ago

The most recent benchmark I have for the RTX 6000 Ada Generation (48 GB) shows it at ~85% of an NVIDIA A100 using a full PromethION run, SUP basecalling, and dorado 0.6. The A100 is at an HPC facility where I can't control everything, so there may be a little bias. The RTX 6000 Ada is in my own test system using a Threadripper 7975X and M.2 SSDs (Sabrent), along with 512 GB of RAM.

I’ve toyed with trying out the NVIDIA RTX 4090, but I’ve pretty much given up due to time constraints, concerns about the TDP and the memory/driver issues. I’d still love to see benchmarks with larger datasets. It’d be really great to have a collection of benchmarks to compare against.

We mostly use a P2 Solo for PromethION sequencing now and that’s a pretty decent match for a couple of 6000 cards. We use a Mac Studio to interface to the P2 Solo and then we transfer to a Linux system for basecalling with dorado. The Mac Studio isn’t the cheapest solution in terms of hardware, but it’s really fast and easy to set up and that matters.
