laurentsimon opened this issue 1 year ago
BLAKE3 may be similar to what we've implemented - see comments in https://github.com/sigstore/sigstore-python/issues/1018#issuecomment-2111322094. We would need to package the Rust or C implementation via py bindings, I think. Is there an existing (cross-platform) package doing this?
There seems to be a package available https://pypi.org/project/blake3/
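For reference, a minimal sketch of what using that package could look like, assuming it exposes the usual hashlib-style interface (`blake3.blake3()` with `update()` / `hexdigest()`); this is not a proposal for the actual API:

```python
# Sketch only: hash a file with the PyPI `blake3` package, assuming a
# hashlib-compatible interface (constructor, update(), hexdigest()).
import blake3

def blake3_file_digest(path: str, read_size: int = 1 << 20) -> str:
    hasher = blake3.blake3()
    with open(path, "rb") as f:
        while chunk := f.read(read_size):  # stream the file in 1 MiB reads
            hasher.update(chunk)
    return hasher.hexdigest()
```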
One thing to be aware of is that blake2 is not a NIST hash, and blake3 seems to be based on a similar idea (a reduced number of rounds in the compression function). It also fixes the shard (chunk) size to 1 KiB, which we would likely want to be parameterizable in practice.
When we benchmarked model signing for different model sizes, disk performance became the bottleneck quite quickly, so my sense is that changing the hash function would have little to no impact.
(Graph is here if you're curious: signing_runtime.pdf.)
That being said, supporting different hash functions is good for the sake of modularity. But that seems like a different goal from what you're talking about here.
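For illustration, here is a hypothetical sketch of what "supporting different hash functions" could look like; the names are made up and this is not the library's actual API. The idea is that callers pass a hashlib-style factory, so SHA-256, BLAKE2, or BLAKE3 can be swapped without touching the sharding logic:

```python
# Hypothetical sketch of a pluggable hasher interface; not the project's real API.
import hashlib
from typing import Callable, Protocol

class Hasher(Protocol):
    def update(self, data: bytes) -> None: ...
    def hexdigest(self) -> str: ...

def hash_shard(data: bytes, hasher_factory: Callable[[], Hasher] = hashlib.sha256) -> str:
    """Hash one shard with whatever hash function the caller selected."""
    h = hasher_factory()
    h.update(data)
    return h.hexdigest()

# Usage: hash_shard(shard, hashlib.blake2b), or hash_shard(shard, blake3.blake3)
# if the blake3 package is installed.
```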
How do you know disk performance was the bottleneck? It could be that the number of vCPUs was the bottleneck, since we parallelize by dispatching computation on each vCPU. Is there another graph showing that disk IO was the bottleneck?
That's fair, we don't know for sure (for reference though the machine had 48 vCPUs). But even if the bottleneck is the number of vCPUs I still don't see why a different hash function would affect performance?
If the hash can compute each shard faster on each vCPU, we'd get a lower plateau (first part of the graph) and a slower linear increase (second part of the graph). Of course, memory also comes into play and could become the bottleneck too :) Wdyt?
One thing I forgot to mention: You can see on the graph that the results are very similar across M2 and M3 (and at the end identical), despite the fact that M3 has twice the number of cores. So this again suggests that the bottleneck is disk IO.
@mihaimaruseac has a graph about memory usage showing that it really never got that high (like 4-8 GB RAM if I remember correctly) so I think that unless we seriously increase the shard size this is unlikely to become a bottleneck.
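For context on the parallelism discussed above, here is a rough sketch (not the project's actual code) of sharded hashing: split the file into fixed-size shards and hash each shard in a separate worker process, one worker per vCPU by default. The shard size and hash function are placeholders.

```python
# Rough sketch of shard-per-worker hashing; shard size and hash are placeholders.
import hashlib
import os
from concurrent.futures import ProcessPoolExecutor

SHARD_SIZE = 1 << 30  # hypothetical 1 GiB shards

def _hash_shard(path: str, offset: int, length: int) -> bytes:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        f.seek(offset)
        h.update(f.read(length))  # a short read at EOF is fine
    return h.digest()

def hash_file_sharded(path: str) -> list[bytes]:
    size = os.path.getsize(path)
    # One future per shard; the default pool size is the number of vCPUs.
    with ProcessPoolExecutor() as pool:
        futures = [
            pool.submit(_hash_shard, path, offset, SHARD_SIZE)
            for offset in range(0, size, SHARD_SIZE)
        ]
        return [f.result() for f in futures]
```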
Here are the serial implementation numbers:
| size (bytes) | cold time (s) | mean time (s) | stddev (s) |
|---:|---:|---:|---:|
| 8192 | 0.00121 | 0.00023 | 0.00016 |
| 16384 | 0.00114 | 0.00046 | 0.00023 |
| 32768 | 0.00121 | 0.00019 | 0.00000 |
| 65536 | 0.00149 | 0.00037 | 0.00018 |
| 131072 | 0.00159 | 0.00044 | 0.00018 |
| 262144 | 0.00224 | 0.00054 | 0.00010 |
| 524288 | 0.00417 | 0.00095 | 0.00009 |
| 1048576 | 0.00659 | 0.00157 | 0.00011 |
| 2097152 | 0.00954 | 0.00288 | 0.00035 |
| 4194304 | 0.01621 | 0.00456 | 0.00008 |
| 8388608 | 0.03048 | 0.00810 | 0.00070 |
| 16777216 | 0.05604 | 0.01775 | 0.00066 |
| 33554432 | 0.10647 | 0.03329 | 0.00088 |
| 67108864 | 0.23366 | 0.06219 | 0.00093 |
| 134217728 | 0.40839 | 0.12064 | 0.00082 |
| 268435456 | 0.83653 | 0.23720 | 0.00126 |
| 536870912 | 1.68094 | 0.46663 | 0.00408 |
| 1073741824 | 3.43046 | 0.83006 | 0.00222 |
| 2147483648 | 6.42181 | 1.53563 | 0.00295 |
| 4294967296 | 12.99146 | 3.06277 | 0.00592 |
| 8589934592 | 26.38218 | 6.23468 | 0.01291 |
| 17179869184 | 52.69675 | 12.24823 | 0.01216 |
| 34359738368 | 104.40937 | 24.60314 | 0.02284 |
| 68719476736 | 209.93898 | 49.40612 | 0.08544 |
| 137438953472 | 413.53302 | 98.65482 | 0.10862 |
| 274877906944 | 835.76904 | 831.67746 | 2.00515 |
| 549755813888 | 1701.84040 | 1693.82841 | 6.66369 |
| 1099511627776 | 3359.77479 | 3334.71729 | 4.79792 |
Doubling the model size approximately doubles the time here. But now, let's see the optimized method:
| size (bytes) | cold time (s) | mean time (s) | stddev (s) |
|---:|---:|---:|---:|
| 8192 | 0.05316 | 0.05220 | 0.00049 |
| 16384 | 0.05487 | 0.05160 | 0.00094 |
| 32768 | 0.05497 | 0.05160 | 0.00060 |
| 65536 | 0.05344 | 0.05194 | 0.00048 |
| 131072 | 0.05247 | 0.05176 | 0.00051 |
| 262144 | 0.05273 | 0.05190 | 0.00065 |
| 524288 | 0.05627 | 0.05356 | 0.00069 |
| 1048576 | 0.05863 | 0.05388 | 0.00041 |
| 2097152 | 0.05962 | 0.05443 | 0.00133 |
| 4194304 | 0.06749 | 0.05538 | 0.00044 |
| 8388608 | 0.08282 | 0.05992 | 0.00061 |
| 16777216 | 0.10344 | 0.06661 | 0.00092 |
| 33554432 | 0.15928 | 0.08324 | 0.00070 |
| 67108864 | 0.25431 | 0.10588 | 0.00083 |
| 134217728 | 0.45653 | 0.15577 | 0.00079 |
| 268435456 | 0.88208 | 0.25272 | 0.00113 |
| 536870912 | 1.70203 | 0.45228 | 0.00157 |
| 1073741824 | 3.44713 | 0.83517 | 0.00257 |
| 2147483648 | 3.42376 | 0.84252 | 0.00330 |
| 4294967296 | 3.96692 | 0.87051 | 0.00325 |
| 8589934592 | 7.30940 | 0.91070 | 0.01196 |
| 17179869184 | 14.11854 | 1.10208 | 0.02410 |
| 34359738368 | 27.75890 | 1.37019 | 0.02251 |
| 68719476736 | 55.51175 | 2.83477 | 0.04802 |
| 137438953472 | 110.55379 | 5.02267 | 0.07931 |
| 274877906944 | 226.84960 | 219.26083 | 0.01333 |
| 549755813888 | 442.62415 | 441.70884 | 0.01044 |
| 1099511627776 | 878.59498 | 868.93802 | 6.38893 |
See the huge jump at the end; that is caused by disk reads. Confirmed by watching the process with `htop` during hashing: it was in the `D` (uninterruptible sleep) state for most of the time.
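(For anyone repeating this without eyeballing `htop`: the same check can be scripted. A small sketch that samples the process state from `/proc/<pid>/stat`, where `D` means uninterruptible sleep, i.e. blocked on I/O:)

```python
# Sketch: sample a process's scheduler state from /proc/<pid>/stat (Linux only).
# The state is the field right after the parenthesized command name; 'D' means
# uninterruptible sleep, typically waiting on disk I/O.
def proc_state(pid: int) -> str:
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    return stat.rsplit(")", 1)[1].split()[0]

# e.g. proc_state(os.getpid()) -> 'R', 'S', 'D', ...
```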
RAM usage was never more than 4 GB (including the Python interpreter, etc.).
Thanks. So after 150GB or so, it's better to store the model files on different SSDs, correct?
Also, for files < 100MB or so, naive hashing is faster (I suspect because there's no thread / process scheduling contention), correct?
I think so.
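If that holds up in the microbenchmarks, one option is a simple size-based dispatch. A hypothetical sketch (the threshold, function names, and digest combination are made up, not the library's API; `hash_file_sharded` is the hypothetical helper from the earlier sketch):

```python
# Hypothetical dispatch between serial and sharded hashing based on file size.
# The ~100 MB threshold is a guess from the discussion above and would need to
# be confirmed by the microbenchmarks.
import hashlib
import os

SMALL_FILE_THRESHOLD = 100 * 1024 * 1024  # ~100 MB, to be tuned

def hash_file(path: str) -> bytes:
    if os.path.getsize(path) < SMALL_FILE_THRESHOLD:
        # Small files: plain streaming hash, no worker pool overhead.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while chunk := f.read(1 << 20):
                h.update(chunk)
        return h.digest()
    # Large files: sharded/parallel approach, then combine the shard digests.
    return hashlib.sha256(b"".join(hash_file_sharded(path))).digest()
```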
I'm actually planning to add microbenchmarks as part of the API redesign. Will hopefully send a PR today (I need to do the same internally, so it depends on when I get the CL reviews).
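A minimal sketch of what such a microbenchmark could look like, using only the standard library (sizes, repeat counts, and the harness itself are placeholders, not the planned benchmark code):

```python
# Sketch of a microbenchmark: time hashing of synthetic files of increasing size.
import os
import statistics
import tempfile
import time

def benchmark(hash_fn, sizes=(1 << 20, 1 << 24, 1 << 28), repeats=5):
    for size in sizes:
        with tempfile.NamedTemporaryFile(delete=False) as f:
            f.write(os.urandom(size))
            path = f.name
        try:
            samples = []
            for _ in range(repeats):
                start = time.perf_counter()
                hash_fn(path)
                samples.append(time.perf_counter() - start)
            print(size, statistics.mean(samples), statistics.stdev(samples))
        finally:
            os.remove(path)
```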
A few other things to report in the graph / benchmark:

- System memory usage, e.g. via `/proc/slabinfo` or `/proc/meminfo` (I think `top` only reports process memory)?
- The performance for model sizes > 274877906944 is roughly the same as for the cold benchmarks. One hypothesis is that the page cache is full, which causes a high rate of cache misses. One workaround here would be to increase the RAM on the machine.

This would involve more benchmarking, so assigning to @spencerschrock.
E.g. https://www.blake2.net (reduced rounds), https://github.com/BLAKE3-team/BLAKE3