laurentsimon opened this issue 1 year ago
BLAKE3 may be similar to what we've implemented - see comments in https://github.com/sigstore/sigstore-python/issues/1018#issuecomment-2111322094. We would need to package the Rust or C implementation via py bindings, I think. Is there an existing (cross-platform) package doing this?
There seems to be a package available https://pypi.org/project/blake3/
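For reference, a minimal sketch of what using that package could look like, assuming it exposes the usual hashlib-style interface (`blake3.blake3()` with `update()` / `hexdigest()`); this is not a proposal for the actual API:

```python
# Sketch only: hash a file with the PyPI `blake3` package, assuming a
# hashlib-compatible interface (constructor, update(), hexdigest()).
import blake3

def blake3_file_digest(path: str, read_size: int = 1 << 20) -> str:
    hasher = blake3.blake3()
    with open(path, "rb") as f:
        while chunk := f.read(read_size):  # stream the file in 1 MiB reads
            hasher.update(chunk)
    return hasher.hexdigest()
```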
One thing to be aware of is that blake2 is not a NIST hash, and blake3 seems to be based on a similar idea (a reduced number of rounds in the compression function). It also fixes the shard (chunk) size to 1 KiB, which we would likely want to be parameterizable in practice.
When we benchmarked model signing for different model sizes, disk performance became the bottleneck quite quickly, so my sense is that changing the hash function would have little to no impact.
(Graph is here if you're curious: signing_runtime.pdf.)
That being said, supporting different hash functions is good for the sake of modularity. But that seems like a different goal from what you're talking about here.
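For illustration, here is a hypothetical sketch of what "supporting different hash functions" could look like; the names are made up and this is not the library's actual API. The idea is that callers pass a hashlib-style factory, so SHA-256, BLAKE2, or BLAKE3 can be swapped without touching the sharding logic:

```python
# Hypothetical sketch of a pluggable hasher interface; not the project's real API.
import hashlib
from typing import Callable, Protocol

class Hasher(Protocol):
    def update(self, data: bytes) -> None: ...
    def hexdigest(self) -> str: ...

def hash_shard(data: bytes, hasher_factory: Callable[[], Hasher] = hashlib.sha256) -> str:
    """Hash one shard with whatever hash function the caller selected."""
    h = hasher_factory()
    h.update(data)
    return h.hexdigest()

# Usage: hash_shard(shard, hashlib.blake2b), or hash_shard(shard, blake3.blake3)
# if the blake3 package is installed.
```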
How do you know disk performance was the bottleneck? It could be that the number of vCPUs was the bottleneck, since we parallelize by dispatching computation on each vCPU. Is there another graph showing that disk IO was the bottleneck?
That's fair, we don't know for sure (for reference though the machine had 48 vCPUs). But even if the bottleneck is the number of vCPUs I still don't see why a different hash function would affect performance?
If the hash can compute each shard faster on each vCPU, we'd get a lower plateau (first part of the graph) and a slower linear increase (second part of the graph). Of course, memory also comes into play and could become the bottleneck too :) Wdyt?
One thing I forgot to mention: You can see on the graph that the results are very similar across M2 and M3 (and at the end identical), despite the fact that M3 has twice the number of cores. So this again suggests that the bottleneck is disk IO.
@mihaimaruseac has a graph about memory usage showing that it really never got that high (like 4-8 GB RAM if I remember correctly) so I think that unless we seriously increase the shard size this is unlikely to become a bottleneck.
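For context on the parallelism discussed above, here is a rough sketch (not the project's actual code) of sharded hashing: split the file into fixed-size shards and hash each shard in a separate worker process, one worker per vCPU by default. The shard size and hash function are placeholders.

```python
# Rough sketch of shard-per-worker hashing; shard size and hash are placeholders.
import hashlib
import os
from concurrent.futures import ProcessPoolExecutor

SHARD_SIZE = 1 << 30  # hypothetical 1 GiB shards

def _hash_shard(path: str, offset: int, length: int) -> bytes:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        f.seek(offset)
        h.update(f.read(length))  # a short read at EOF is fine
    return h.digest()

def hash_file_sharded(path: str) -> list[bytes]:
    size = os.path.getsize(path)
    # One future per shard; the default pool size is the number of vCPUs.
    with ProcessPoolExecutor() as pool:
        futures = [
            pool.submit(_hash_shard, path, offset, SHARD_SIZE)
            for offset in range(0, size, SHARD_SIZE)
        ]
        return [f.result() for f in futures]
```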
Here are the serial implementation numbers:
| size (bytes) | cold time (s) | mean time (s) | stddev (s) |
|---:|---:|---:|---:|
| 8192 | 0.00121 | 0.00023 | 0.00016 |
| 16384 | 0.00114 | 0.00046 | 0.00023 |
| 32768 | 0.00121 | 0.00019 | 0.00000 |
| 65536 | 0.00149 | 0.00037 | 0.00018 |
| 131072 | 0.00159 | 0.00044 | 0.00018 |
| 262144 | 0.00224 | 0.00054 | 0.00010 |
| 524288 | 0.00417 | 0.00095 | 0.00009 |
| 1048576 | 0.00659 | 0.00157 | 0.00011 |
| 2097152 | 0.00954 | 0.00288 | 0.00035 |
| 4194304 | 0.01621 | 0.00456 | 0.00008 |
| 8388608 | 0.03048 | 0.00810 | 0.00070 |
| 16777216 | 0.05604 | 0.01775 | 0.00066 |
| 33554432 | 0.10647 | 0.03329 | 0.00088 |
| 67108864 | 0.23366 | 0.06219 | 0.00093 |
| 134217728 | 0.40839 | 0.12064 | 0.00082 |
| 268435456 | 0.83653 | 0.23720 | 0.00126 |
| 536870912 | 1.68094 | 0.46663 | 0.00408 |
| 1073741824 | 3.43046 | 0.83006 | 0.00222 |
| 2147483648 | 6.42181 | 1.53563 | 0.00295 |
| 4294967296 | 12.99146 | 3.06277 | 0.00592 |
| 8589934592 | 26.38218 | 6.23468 | 0.01291 |
| 17179869184 | 52.69675 | 12.24823 | 0.01216 |
| 34359738368 | 104.40937 | 24.60314 | 0.02284 |
| 68719476736 | 209.93898 | 49.40612 | 0.08544 |
| 137438953472 | 413.53302 | 98.65482 | 0.10862 |
| 274877906944 | 835.76904 | 831.67746 | 2.00515 |
| 549755813888 | 1701.84040 | 1693.82841 | 6.66369 |
| 1099511627776 | 3359.77479 | 3334.71729 | 4.79792 |
Doubling the model size approximately doubles the time here. But now, let's see the optimized method:
| size (bytes) | cold time (s) | mean time (s) | stddev (s) |
|---:|---:|---:|---:|
| 8192 | 0.05316 | 0.05220 | 0.00049 |
| 16384 | 0.05487 | 0.05160 | 0.00094 |
| 32768 | 0.05497 | 0.05160 | 0.00060 |
| 65536 | 0.05344 | 0.05194 | 0.00048 |
| 131072 | 0.05247 | 0.05176 | 0.00051 |
| 262144 | 0.05273 | 0.05190 | 0.00065 |
| 524288 | 0.05627 | 0.05356 | 0.00069 |
| 1048576 | 0.05863 | 0.05388 | 0.00041 |
| 2097152 | 0.05962 | 0.05443 | 0.00133 |
| 4194304 | 0.06749 | 0.05538 | 0.00044 |
| 8388608 | 0.08282 | 0.05992 | 0.00061 |
| 16777216 | 0.10344 | 0.06661 | 0.00092 |
| 33554432 | 0.15928 | 0.08324 | 0.00070 |
| 67108864 | 0.25431 | 0.10588 | 0.00083 |
| 134217728 | 0.45653 | 0.15577 | 0.00079 |
| 268435456 | 0.88208 | 0.25272 | 0.00113 |
| 536870912 | 1.70203 | 0.45228 | 0.00157 |
| 1073741824 | 3.44713 | 0.83517 | 0.00257 |
| 2147483648 | 3.42376 | 0.84252 | 0.00330 |
| 4294967296 | 3.96692 | 0.87051 | 0.00325 |
| 8589934592 | 7.30940 | 0.91070 | 0.01196 |
| 17179869184 | 14.11854 | 1.10208 | 0.02410 |
| 34359738368 | 27.75890 | 1.37019 | 0.02251 |
| 68719476736 | 55.51175 | 2.83477 | 0.04802 |
| 137438953472 | 110.55379 | 5.02267 | 0.07931 |
| 274877906944 | 226.84960 | 219.26083 | 0.01333 |
| 549755813888 | 442.62415 | 441.70884 | 0.01044 |
| 1099511627776 | 878.59498 | 868.93802 | 6.38893 |
See the huge jump at the end; that is caused by disk reads. Confirmed by watching the process with `htop` during hashing: it was in the `D` (uninterruptible sleep) state for most of the time.
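(For anyone repeating this without eyeballing `htop`: the same check can be scripted. A small sketch that samples the process state from `/proc/<pid>/stat`, where `D` means uninterruptible sleep, i.e. blocked on I/O:)

```python
# Sketch: sample a process's scheduler state from /proc/<pid>/stat (Linux only).
# The state is the field right after the parenthesized command name; 'D' means
# uninterruptible sleep, typically waiting on disk I/O.
def proc_state(pid: int) -> str:
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    return stat.rsplit(")", 1)[1].split()[0]

# e.g. proc_state(os.getpid()) -> 'R', 'S', 'D', ...
```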
RAM usage was never more than 4 GB (including the Python interpreter, etc.).
Thanks. So after 150GB or so, it's better to store the model files on different SSDs, correct?
Also, for files < 100MB or so, naive hashing is faster (I suspect because there's no thread / process scheduling contention), correct?
I think so.
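If that holds up in the microbenchmarks, one option is a simple size-based dispatch. A hypothetical sketch (the threshold, function names, and digest combination are made up, not the library's API; `hash_file_sharded` is the hypothetical helper from the earlier sketch):

```python
# Hypothetical dispatch between serial and sharded hashing based on file size.
# The ~100 MB threshold is a guess from the discussion above and would need to
# be confirmed by the microbenchmarks.
import hashlib
import os

SMALL_FILE_THRESHOLD = 100 * 1024 * 1024  # ~100 MB, to be tuned

def hash_file(path: str) -> bytes:
    if os.path.getsize(path) < SMALL_FILE_THRESHOLD:
        # Small files: plain streaming hash, no worker pool overhead.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while chunk := f.read(1 << 20):
                h.update(chunk)
        return h.digest()
    # Large files: sharded/parallel approach, then combine the shard digests.
    return hashlib.sha256(b"".join(hash_file_sharded(path))).digest()
```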
I'm actually planning to add microbenchmarks as part of the API redesign. Will hopefully send a PR today (I need to do the same internally, so it depends on when I get the CL reviews).
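A minimal sketch of what such a microbenchmark could look like, using only the standard library (sizes, repeat counts, and the harness itself are placeholders, not the planned benchmark code):

```python
# Sketch of a microbenchmark: time hashing of synthetic files of increasing size.
import os
import statistics
import tempfile
import time

def benchmark(hash_fn, sizes=(1 << 20, 1 << 24, 1 << 28), repeats=5):
    for size in sizes:
        with tempfile.NamedTemporaryFile(delete=False) as f:
            f.write(os.urandom(size))
            path = f.name
        try:
            samples = []
            for _ in range(repeats):
                start = time.perf_counter()
                hash_fn(path)
                samples.append(time.perf_counter() - start)
            print(size, statistics.mean(samples), statistics.stdev(samples))
        finally:
            os.remove(path)
```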
A few other things to report in the graph / benchmark:

- System memory usage, e.g. via `/proc/slabinfo` or `/proc/meminfo` (I think `top` only reports process memory)?
- The performance for model sizes > 274877906944 is roughly the same as for the cold benchmarks. One hypothesis is that the page cache is full, which causes a high rate of cache misses. One workaround here would be to increase the RAM on the machine.

This would involve more benchmarking, so assigning to @spencerschrock.
E.g. https://www.blake2.net (reduced rounds), https://github.com/BLAKE3-team/BLAKE3