nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Benchmarking #239

Closed adRn-s closed 1 year ago

adRn-s commented 1 year ago

I had read this article, and it was very good. Thanks!

Now, I wonder about the Amazon storage setup used there. I assume it was 100% SSD. How much of a performance penalty would HDDs incur? I understand basecalling is a throughput-sensitive process, so wouldn't 6 HDDs in RAID0 be competitive with 1 SSD?

Any measurements or rules of thumb would be highly appreciated. At our institute, we are in the process of discussing potential server upgrades given the pressing requirements of long-read sequencing data. We may run some benchmarks of our own, so any comments on those would be greatly appreciated too. Thanks in advance!

PS. I will close the issue soon, I understand this is no software bug. Sorry for the noise.

iiSeymour commented 1 year ago

Hey @adRn-s

The tl;dr response is go for SSD; standard simplex basecalling is throughput-sensitive, but other analysis tasks (e.g. duplex basecalling) will suffer from the extra latency/seeking on HDD.

On throughput, there are a few considerations:

The performance reported in the article, in samples/s, is very roughly equivalent to the required read throughput in bytes/s, so the p4d.24xlarge results would require about 490 MB/s. With a few assumptions, 6 HDDs (at 100 MB/s each) in RAID0 ~= 1 SATA SSD at 550 MB/s, which could in theory keep up with HAC calling on 8x A100s. This assumes the disks are local to the node and that there are no other bottlenecks in the chain. That probably isn't a safe assumption, and the margin would be too close for comfort i.m.o.
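
For anyone who wants to plug in their own numbers, the back-of-envelope comparison above can be written out as a short Python sketch. It assumes the rough "1 sample ~= 1 byte" equivalence from the comment and ideal linear RAID0 scaling; the function names are made up for illustration and real arrays will usually deliver less.

```python
# Back-of-envelope storage check using the figures quoted in this thread.
# Assumptions (not measurements): ~1 byte read per sample, and RAID0
# throughput scaling linearly with the number of disks.

def required_read_mb_per_s(samples_per_s, bytes_per_sample=1.0):
    """Read throughput (MB/s) needed to feed a basecaller at the given rate."""
    return samples_per_s * bytes_per_sample / 1e6


def raid0_mb_per_s(n_disks, per_disk_mb_per_s):
    """Idealised sequential throughput (MB/s) of a RAID0 array."""
    return n_disks * per_disk_mb_per_s


need = required_read_mb_per_s(490e6)  # p4d.24xlarge figure from the thread
options = {
    "6x HDD in RAID0 (100 MB/s each)": raid0_mb_per_s(6, 100),
    "1x SATA SSD (550 MB/s)": 550.0,
}

for name, supply in options.items():
    headroom = supply / need  # values close to 1.0 leave little margin
    print(f"{name}: {supply:.0f} MB/s available, "
          f"{need:.0f} MB/s needed, headroom x{headroom:.2f}")
```

Both options land only slightly above the requirement, which is the "too close for comfort" point: any extra seeking, sharing of the disks, or other bottleneck removes the margin.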

billytcl commented 1 year ago

Just wanted to jump onto this ticket with another question: Would there be a significant slowdown if I were writing the basecalls to network storage vs on local disk? This is assuming I am basecalling off a pod5 on local SSD storage as input.

vellamike commented 1 year ago

@billytcl it depends on your network storage speed of course, but on most network storage you will likely be fine.

If you were basecalling with a HAC model at, say, 100 million samples per second, this would generate approximately 10 MB/s of data in BAM - unless your network storage is extremely slow it should easily be able to handle this.

The best way to test is to write to /dev/null and to your network storage and compare the difference.
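
If you want a rough number before running a full basecalling job, a minimal Python sketch along the same lines is to time sequential writes to a candidate path and compare against /dev/null. This is only a stand-in for the real test: it does not run dorado, the function name is hypothetical, and the file name and sizes are arbitrary.

```python
import os
import tempfile
import time


def write_throughput_mb_per_s(path, total_mb=64, chunk_mb=4):
    """Time sequential writes of total_mb MiB to path; return MiB/s."""
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    n_chunks = max(1, total_mb // chunk_mb)
    start = time.perf_counter()
    with open(path, "wb") as fh:
        for _ in range(n_chunks):
            fh.write(chunk)
        fh.flush()
        try:
            os.fsync(fh.fileno())  # push data to the device, not just the page cache
        except OSError:
            pass  # some special files (e.g. /dev/null) may reject fsync
    elapsed = max(time.perf_counter() - start, 1e-9)
    return n_chunks * chunk_mb / elapsed


# Demo against a temporary directory; point target_dir at your network
# mount instead to test it (that path is yours to supply).
with tempfile.TemporaryDirectory() as target_dir:
    target = os.path.join(target_dir, "write_test.bin")
    print(f"{target}: {write_throughput_mb_per_s(target, total_mb=16):.0f} MiB/s")

if os.path.exists("/dev/null"):
    baseline = write_throughput_mb_per_s("/dev/null", total_mb=16)
    print(f"/dev/null: {baseline:.0f} MiB/s")
```

If the figure for your network mount is comfortably above the roughly 10 MB/s estimated above, output writing is unlikely to be the bottleneck; the real comparison is still to run the basecaller against both destinations.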