relab / hotstuff

Use protobuf types instead of hotstuff-defined Go types with translation layers #43

Open meling opened 2 years ago

meling commented 2 years ago

It would be nice to avoid translating between protobuf types and hotstuff-defined types, such as those in hotstuffpb and the corresponding translation functions. Such translations slow things down, add memory allocation overhead, and are a source of errors in the translation code.
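To make the overhead concrete, here is a minimal sketch of what a translation layer costs per message. The types `wireBlock` and `Block` and the function `blockFromWire` are hypothetical stand-ins invented for this illustration; they are not the real hotstuffpb or hotstuff types. The sketch only assumes that each received message is copied field by field into a second struct.

```go
package main

import (
	"fmt"
	"testing"
)

// wireBlock is a hypothetical stand-in for a generated protobuf message.
// It is NOT the real hotstuffpb type.
type wireBlock struct {
	Parent  []byte
	Command []byte
	View    uint64
}

// Block is a hypothetical stand-in for a hotstuff-defined type.
type Block struct {
	parent  [32]byte
	command string
	view    uint64
}

// blockFromWire copies every field into a freshly allocated Block.
// This is the kind of per-message translation the issue wants to avoid.
func blockFromWire(w *wireBlock) *Block {
	b := &Block{view: w.View, command: string(w.Command)}
	copy(b.parent[:], w.Parent)
	return b
}

// viewOf reads directly from the wire type, with no translation step.
func viewOf(w *wireBlock) uint64 { return w.View }

// Package-level sinks keep the compiler from optimizing the work away.
var (
	sinkBlock *Block
	sinkView  uint64
)

// allocsWithTranslation reports average heap allocations per translated message.
func allocsWithTranslation(w *wireBlock) float64 {
	return testing.AllocsPerRun(1000, func() { sinkBlock = blockFromWire(w) })
}

// allocsWithoutTranslation reports average heap allocations when the wire type is used as-is.
func allocsWithoutTranslation(w *wireBlock) float64 {
	return testing.AllocsPerRun(1000, func() { sinkView = viewOf(w) })
}

func main() {
	w := &wireBlock{
		Parent:  make([]byte, 32),
		Command: []byte("some client command payload"),
		View:    7,
	}
	fmt.Println("allocs per message with translation:   ", allocsWithTranslation(w))
	fmt.Println("allocs per message without translation:", allocsWithoutTranslation(w))
}
```

In this toy setup every translated message pays for at least one extra heap allocation, while reading the wire type directly allocates nothing. Whether that matters in practice depends on message rates and on what else dominates the profile.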

johningve commented 2 years ago

Looking at some profiles, it doesn't seem like the conversions are slowing things down much.

Here's a memory profile:

File: hotstuff
Type: alloc_space
Time: Apr 16, 2022 at 12:03pm (CEST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 652.07MB, 57.91% of 1126.06MB total
Dropped 174 nodes (cum <= 5.63MB)
Showing top 10 nodes out of 173
      flat  flat%   sum%        cum   cum%
  156.01MB 13.85% 13.85%   156.01MB 13.85%  math/big.nat.make
  107.02MB  9.50% 23.36%   212.02MB 18.83%  github.com/relab/hotstuff/crypto/ecdsa.ThresholdSignature.ToBytes
   82.50MB  7.33% 30.69%    82.50MB  7.33%  math/big.(*Int).Bytes (inline)
      78MB  6.93% 37.61%   138.51MB 12.30%  github.com/relab/hotstuff/crypto/ecdsa.Signature.ToBytes
   57.50MB  5.11% 42.72%    57.50MB  5.11%  context.WithCancel
   53.50MB  4.75% 47.47%    53.50MB  4.75%  reflect.New
   35.51MB  3.15% 50.62%    35.51MB  3.15%  google.golang.org/protobuf/proto.MarshalOptions.marshal
   34.51MB  3.06% 53.69%   212.03MB 18.83%  github.com/relab/hotstuff/crypto.(*cache).VerifyThresholdSignature
      24MB  2.13% 55.82%       24MB  2.13%  google.golang.org/protobuf/internal/impl.consumeBytesNoZero
   23.51MB  2.09% 57.91%    24.01MB  2.13%  google.golang.org/grpc.(*parser).recvMsg

memprofile.pb.gz

Here's a cpu profile:

File: hotstuff
Type: cpu
Time: Apr 16, 2022 at 12:01pm (CEST)
Duration: 70.65s, Total samples = 74.83s (105.91%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 43400ms, 58.00% of 74830ms total
Dropped 770 nodes (cum <= 374.15ms)
Showing top 10 nodes out of 217
      flat  flat%   sum%        cum   cum%
   13490ms 18.03% 18.03%    15130ms 20.22%  syscall.Syscall
   13130ms 17.55% 35.57%    13130ms 17.55%  runtime.futex
    6400ms  8.55% 44.13%     6400ms  8.55%  p256MulInternal
    3090ms  4.13% 48.26%     3090ms  4.13%  p256SqrInternal
    1710ms  2.29% 50.54%     1710ms  2.29%  runtime.epollwait
    1450ms  1.94% 52.48%     1450ms  1.94%  crypto/elliptic.p256OrdSqr
    1390ms  1.86% 54.34%     6610ms  8.83%  crypto/elliptic.p256PointDoubleAsm
     990ms  1.32% 55.66%      990ms  1.32%  crypto/elliptic.p256Sqr
     890ms  1.19% 56.85%     3540ms  4.73%  runtime.mallocgc
     860ms  1.15% 58.00%      860ms  1.15%  runtime.nextFreeFast (inline)
(pprof) top -cum
Showing nodes accounting for 13.95s, 18.64% of 74.83s total
Dropped 770 nodes (cum <= 0.37s)
Showing top 10 nodes out of 217
      flat  flat%   sum%        cum   cum%
     0.04s 0.053% 0.053%     17.25s 23.05%  github.com/relab/hotstuff/crypto.(*cache).Verify
         0     0% 0.053%     15.77s 21.07%  github.com/relab/hotstuff/crypto/ecdsa.(*ecdsaCrypto).Verify
     0.02s 0.027%  0.08%     15.75s 21.05%  crypto/ecdsa.Verify
         0     0%  0.08%     15.73s 21.02%  crypto/ecdsa.verify (inline)
     0.02s 0.027%  0.11%     15.73s 21.02%  crypto/ecdsa.verifyGeneric
     0.08s  0.11%  0.21%     15.16s 20.26%  runtime.mcall
    13.49s 18.03% 18.24%     15.13s 20.22%  syscall.Syscall
     0.04s 0.053% 18.29%     15.06s 20.13%  internal/poll.ignoringEINTRIO
     0.20s  0.27% 18.56%     14.54s 19.43%  runtime.schedule
     0.06s  0.08% 18.64%     14.11s 18.86%  github.com/relab/hotstuff/crypto/ecdsa.(*ecdsaCrypto).VerifyThresholdSignature.func1

cpuprofile.pb.gz

hanish520 commented 2 years ago

While running experiments to find the highest throughput the implementation can achieve, I looked at the CPU and memory profiles of the replicas at the current maximum of 300 kops. The replicas appear to spend considerable time in GC: around 11% after I set GOGC to 2000, down from close to 16% before that.
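
As a rough sketch of the GOGC tuning mentioned here: the same setting can be applied from inside the process with `debug.SetGCPercent`, and the runtime exposes its own estimate of GC cost through `runtime.MemStats.GCCPUFraction`. The helper names below are hypothetical, and `GCCPUFraction` is measured differently from the GC share read off a CPU profile, so the two numbers should not be expected to match.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

// setGCPercent applies a GOGC-style target at run time and returns the
// previous setting. Equivalent in effect to starting with GOGC=<percent>.
func setGCPercent(percent int) int {
	return debug.SetGCPercent(percent)
}

// gcCPUFraction returns the runtime's estimate of the fraction of available
// CPU time spent in GC since the program started.
func gcCPUFraction() float64 {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	return ms.GCCPUFraction
}

func main() {
	prev := setGCPercent(2000) // let the heap grow 20x between collections
	fmt.Println("previous GC percent:", prev)

	// Placeholder workload that produces garbage.
	for i := 0; i < 1000000; i++ {
		_ = make([]byte, 128)
	}
	fmt.Printf("GC CPU fraction: %.4f\n", gcCPUFraction())
}
```

A higher GC percent trades memory for fewer collections, so it reduces GC time without removing the allocations that cause it.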

Screenshot 2022-04-19 at 02 56 18

In the memory profile, a significant majority of the allocations come from translating protobuf types into hotstuff-defined structures.

Screenshot 2022-04-19 at 02 57 16

meling commented 2 years ago

I guess this is the evidence we needed of the performance impact these translations are having. I haven't studied this in depth yet, but would it be a large change to remove them? It would be interesting to see if we can boost the throughput even more without these translations.