riverrun / pbkdf2_elixir

Pbkdf2 password hashing for Elixir

Performance implications of PBKDF2-HMAC-SHA* in Erlang #11

Open NelsonVides opened 3 years ago

NelsonVides commented 3 years ago

I think there are some problems with the implementation of pbkdf2 in this codebase. Bear with me for a second, I'm going to get technical.

TL;DR

I propose a NIF-based implementation of the PBKDF2 algorithm that is 7.5x faster, consumes constant memory regardless of the iteration count (Erlang/Elixir implementations will always consume memory linearly in the iteration count), has better GC patterns and better latency, and is well tested with unit tests and property-based testing, and used in production systems.

Intro

I faced this problem when introducing higher iteration counts and more hashes for the SCRAM protocol in MongooseIM: we noticed that the load suddenly increased worryingly. When going from 4096 iterations for SHA-1 to 10k iterations for each of the five SHA-1 and SHA-2 variants, memory consumption went mad and user registration became very heavy on the server. The problem is the same one you have here: the code makes the same naive mistake.

Memory consumption

PBKDF2 is a mechanism designed to be CPU-demanding and not parallelisable. It is not designed to be memory-consuming, nor memory-unpredictable, unlike derivation functions such as argon2 or bcrypt. The problem with implementing it in a language without mutation, like Erlang, is that every iteration creates a new HMAC-sized block to compute over. Even if the implementation is tail-recursive and the stack doesn't grow, the runtime will not GC the binaries created on each iteration until the whole call-stack has been fully exited. And by then, the caller's heap will have grown very large and won't need a fullsweep GC for quite a while (the BEAM has very lazy mechanisms for GC-ing a single process), hence possibly keeping lots and lots of wasted memory around for a long time. Allocating and deallocating so much memory so rapidly also costs a lot of CPU.
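To make the naive shape concrete, here is a minimal sketch of the PBKDF2 inner loop as typically written (Python rather than Erlang/Elixir, purely for illustration, and only computing the first output block): every iteration calls HMAC from scratch and allocates fresh intermediate buffers, which is exactly what a straightforward Erlang translation allocates on the process heap.

```python
import hashlib
import hmac

def pbkdf2_naive_block(password: bytes, salt: bytes, iterations: int) -> bytes:
    """First output block of PBKDF2-HMAC-SHA-256, written the naive way."""
    # U_1 = PRF(password, salt || INT(1)); each hmac.new() re-hashes the key
    # pads from scratch, and every iteration allocates fresh buffers.
    u = hmac.new(password, salt + (1).to_bytes(4, "big"), "sha256").digest()
    result = u
    for _ in range(iterations - 1):
        u = hmac.new(password, u, "sha256").digest()      # U_j = PRF(password, U_{j-1})
        result = bytes(a ^ b for a, b in zip(result, u))  # T_1 = U_1 xor ... xor U_c
    return result

# Sanity check against the stdlib (OpenSSL-backed) implementation:
assert pbkdf2_naive_block(b"password", b"salt", 1000) == \
    hashlib.pbkdf2_hmac("sha256", b"password", b"salt", 1000, 32)
```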

CPU Efficiency

PBKDF2 relies on a pseudo-random function, which is almost always HMAC, and HMAC relies on a one-way hash function, which is almost always SHA-1 or some SHA-2. What is implemented here is therefore not strictly PBKDF2, but rather PBKDF2-HMAC-SHA*. Each HMAC then hashes over data spanning two hash blocks of memory, which by the Merkle-Damgård construction means SHA runs two compression functions; but HMAC runs an inner and an outer layer, hence four compression functions per HMAC. However, two of those four are always computed over the original password XORed with a static value, so they can be cached. Having to run n iterations of PBKDF2, you'd naively run 4n compression functions, when in reality you only need 2 + 2n: this HMAC optimisation is even mentioned in its own RFC here.
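The caching trick can be sketched as follows (again a Python illustration, not the actual C code): the ipad and opad compressions depend only on the key, so they are computed once and their states cloned on every iteration, leaving only two compression functions per iteration.

```python
import hashlib

def pbkdf2_cached_block(password: bytes, salt: bytes, iterations: int) -> bytes:
    """First output block of PBKDF2-HMAC-SHA-256, with the HMAC key pads
    hashed once and reused: 2 + 2n compressions instead of 4n."""
    block_size = 64  # SHA-256 compression-function block size in bytes
    if len(password) > block_size:
        password = hashlib.sha256(password).digest()
    key = password.ljust(block_size, b"\x00")
    # These two compressions depend only on the key, so run them just once.
    inner = hashlib.sha256(bytes(b ^ 0x36 for b in key))  # cached ipad state
    outer = hashlib.sha256(bytes(b ^ 0x5C for b in key))  # cached opad state

    def prf(msg: bytes) -> bytes:
        i, o = inner.copy(), outer.copy()  # clone cached states, no re-hashing
        i.update(msg)
        o.update(i.digest())
        return o.digest()

    u = prf(salt + (1).to_bytes(4, "big"))  # U_1 = PRF(salt || INT(1))
    result = u
    for _ in range(iterations - 1):
        u = prf(u)                                        # U_j = PRF(U_{j-1})
        result = bytes(a ^ b for a, b in zip(result, u))  # xor-accumulate
    return result

# Produces the same output as the stdlib implementation:
assert pbkdf2_cached_block(b"password", b"salt", 4096) == \
    hashlib.pbkdf2_hmac("sha256", b"password", b"salt", 4096, 32)
```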

A solution

When facing this problem in MongooseIM, I took a very good open-source C project that solves this, extended it to implement all the SHA-2 hashes, made it time-slice the BEAM way so that it is scheduled correctly and respects the latency properties of the BEAM, and put it all in a separate repo, where over time I also implemented the whole SCRAM protocol. Now I've made a PR making it possible to extend the number of output blocks for PBKDF2 (it's actually not that useful, since PBKDF2 isn't much better when you increase the output size, but it makes the library strictly compliant with PBKDF2 and hopefully allows you to migrate to the C implementation 😉).

Some benchmarks:

Using benchee with this code

Benchee.run(
  %{
    "fast_scram   " =>
      fn {password, salt, it_count} ->
        :fast_scram.hi(:sha256, password, salt, it_count)
      end,
    "pbkdf2_elixir" =>
      fn {password, salt, it_count} ->
        Pbkdf2.Base.hash_password(password, salt, rounds: it_count, digest: :sha256)
      end
  },
  inputs: %{
    "1. 8" => :bench_erl.pbkdf2_input(8),
    "2. 512" => :bench_erl.pbkdf2_input(512),
    "3. 4096" => :bench_erl.pbkdf2_input(4096),
    "4. 10000" => :bench_erl.pbkdf2_input(10000),
    "5. 160000" => :bench_erl.pbkdf2_input(160000),
    "6. 500000" => :bench_erl.pbkdf2_input(500000)
  },
  parallel: 12,
  time: 5,
  memory_time: 5
)

where input is generated like:

pbkdf2_input(ItCount) ->
    Password = base64:encode(crypto:strong_rand_bytes(10)),
    Salt = base64:encode(crypto:strong_rand_bytes(16)),
    {Password, Salt, ItCount}.

The results, for the default iteration count of 160k that your code uses:

##### With input 5. 160000 #####
Name                    ips        average  deviation         median         99th %
fast_scram             8.93      111.98 ms     ±0.82%      111.81 ms      119.01 ms
pbkdf2_elixir          1.19      840.62 ms     ±1.10%      837.97 ms      881.12 ms

Comparison:
fast_scram             8.93
pbkdf2_elixir          1.19 - 7.51x slower +728.65 ms

Memory usage statistics:

Name             Memory usage
fast_scram         0.00007 MB
pbkdf2_elixir        21.83 MB - 317871.56x memory usage +21.83 MB
##### With input 6. 500000 #####
Name                    ips        average  deviation         median         99th %
fast_scram             2.85         0.35 s     ±0.47%         0.35 s         0.36 s
pbkdf2_elixir          0.38         2.62 s     ±0.88%         2.62 s         2.70 s

Comparison:
fast_scram             2.85
pbkdf2_elixir          0.38 - 7.47x slower +2.27 s

Memory usage statistics:

Name             Memory usage
fast_scram         0.00007 MB
pbkdf2_elixir        68.21 MB - 993325.56x memory usage +68.21 MB

That is, around 7 times faster, and memory consumption is constant regardless of the iteration count, while the Erlang/Elixir code grows memory linearly with the iterations.

And it isn't even as fast as it could be, because of the timeslicing, which has better latency properties but introduces overhead: it was around twice as fast before the timeslicing was introduced. Nevertheless, I'd argue that on the BEAM, latency matters more than raw throughput. Needless to say, this solution also has better latency in extreme cases: when heaps get too big, GC is run on dirty schedulers, and the NIF code doesn't make the heap grow, so it doesn't introduce that risk.


Disclaimer: I'm the author of fast_scram, and I'll also be covering everything explained here, plus much more about SCRAM and PBKDF2, here: https://codesync.global/speaker/nelson-vides/ 😬

riverrun commented 3 years ago

Thanks for raising this issue. I will try to have a look at it in more depth soon.

ewildgoose commented 2 years ago

Would it be preferable to migrate towards Rust or Zig implementations of NIFs?

a) they are both relatively "safer" languages to implement things in
b) good cross-compile compatibility (Zig especially)
c) good support, at least within Elixir (Zigler/Rustler)

NelsonVides commented 2 years ago

> Would it be preferable to migrate towards Rust or Zig implementations of NIFs?
>
> a) they are both relatively "safer" languages to implement things in
> b) good cross-compile compatibility (Zig especially)
> c) good support, at least within Elixir (Zigler/Rustler)

I'd first of all raise a concern about build pipelines. I have plenty of projects in Erlang where I'd love to integrate Rust and/or Elixir, but that would mean adding new tooling to the build scripts, and in some environments that can add a lot of complexity. And I imagine this pbkdf2 library aims to be a very simple library with no implicit dependencies.

There's also a new function from OTP: since OTP 24.2, the crypto app exposes OpenSSL's PBKDF2 API (crypto:pbkdf2_hmac/5). Essentially different from the C code I had, this one runs OpenSSL code, which was slightly slower than my NIF (on newer OpenSSL at least); but my NIF runs on a regular scheduler with timeslicing, which adds overhead, so I don't know which one is faster now; I haven't tested it. Nevertheless, the OTP one has one huge advantage: no new dependencies are added. The problem is then those less common systems where OpenSSL is not used.

So a separate library with a Rust implementation could also help. I think it all comes down to supply-chain decisions: whether new dependencies are allowed, whether OpenSSL is preferred, whether an older version of OTP is used, etc.

What is sure to be the worst option by far is keeping this code in Erlang. That only makes sense when pure Erlang is required for whatever supply-chain constraints, and the function is called infrequently and/or with low iteration counts.

ChristopheBelpaire commented 8 months ago

Hello! I noticed the same problem: on my t2.medium instance it takes something like 3 seconds to check the password. I changed it to fast_pbkdf2 and it is much faster! Could it be a good idea to delegate the hashing to fast_pbkdf2?