Closed: christianjgreen closed this issue 6 years ago.
The password hashing functions are designed to be slow - the usual recommendation is for a single hash to take at least 0.5 s. Using that for a back-of-the-envelope calculation, just computing the 7.5k hashes on an 8-core machine should take 7500 * 0.5 / 8 / 60 =~ 8 minutes, so 15 minutes sounds very realistic in that case.
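A quick check of that arithmetic, using only the numbers already quoted above:

```elixir
# 7,500 hashes at ~0.5 s each, spread across 8 cores, expressed in minutes
iex> 7_500 * 0.5 / 8 / 60
7.8125
```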
The problem is that the dirty schedulers were saturated during that time, so other work inside the VM that needs to run on dirty schedulers (particularly large garbage collections, and potentially other things) couldn't execute, leading to the lockup. That's a common starvation issue with a thread pool - I'm not sure much can be done about it on the library side without rewriting the NIFs completely so that they can yield in the middle of computation.
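For context, the size of that pool (which defaults to the number of logical cores) can be inspected with `:erlang.system_info/1`; the output below is illustrative for an 8-core machine:

```elixir
# Size of the dirty CPU scheduler pool that these hashing NIFs compete for.
iex> :erlang.system_info(:dirty_cpu_schedulers_online)
8
# Normal schedulers keep running Elixir code, but any other work that needs
# a dirty scheduler has to wait behind the queued hash calls.
```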
A solution might be to implement some rate-limiting for the login endpoint or hashing calls in particular.
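One possible shape for that rate limiting, purely as a sketch and not part of the library: a small GenServer that caps how many hash/verify calls run at once and rejects the rest, so the dirty CPU schedulers are never fully saturated. The module name, the limit policy, and the use of `Argon2.verify_pass/2` in the docstring are assumptions for illustration.

```elixir
defmodule MyApp.HashLimiter do
  @moduledoc """
  Hypothetical sketch: caps the number of password hashes running at once
  so a burst of logins cannot saturate every dirty CPU scheduler.
  Calls over the limit are rejected immediately instead of queueing.
  """
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  @doc "Runs `fun` (e.g. `fn -> Argon2.verify_pass(pw, hash) end`) if capacity allows."
  def run(fun) do
    case GenServer.call(__MODULE__, :acquire) do
      :ok ->
        try do
          # The hash runs in the caller's process, not in the GenServer,
          # so the limiter itself never blocks on the NIF.
          {:ok, fun.()}
        after
          GenServer.cast(__MODULE__, :release)
        end

      :rejected ->
        {:error, :too_many_requests}
    end
  end

  @impl true
  def init(:ok) do
    # Leave at least one dirty CPU scheduler free for GC and other dirty work.
    limit = max(:erlang.system_info(:dirty_cpu_schedulers_online) - 1, 1)
    {:ok, %{in_flight: 0, limit: limit}}
  end

  @impl true
  def handle_call(:acquire, _from, %{in_flight: n, limit: limit} = state) when n < limit,
    do: {:reply, :ok, %{state | in_flight: n + 1}}

  def handle_call(:acquire, _from, state), do: {:reply, :rejected, state}

  @impl true
  def handle_cast(:release, %{in_flight: n} = state),
    do: {:noreply, %{state | in_flight: max(n - 1, 0)}}
end
```

A login endpoint could call `MyApp.HashLimiter.run(fn -> Argon2.verify_pass(password, stored_hash) end)` and map `{:error, :too_many_requests}` to an HTTP 429. Note that a slot leaks if the caller crashes between acquire and release, so a production version would monitor callers or use an existing pooling/rate-limiting library.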
thanks @michalmuskala <3
@Arthien thanks for raising the issue, and @michalmuskala thanks for your explanation. Although there is not much I can do about this situation, it's good to know about some of the issues people are facing.
We have been using argon2 in production for a while and have started to see an increase in the number of concurrent users attempting to log in during a push notification. We received about 7500 password hash requests on an 8-core PM over the course of 10 seconds. This locked up our API, which was not able to recover for ~15 minutes. Even though the hash requests were timed out, it seems they still stayed queued on the dirty schedulers.
Here is an example test we ran: at 16 hashes per second, our server was brought down after 30 seconds.
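The original test code isn't shown here, so the following is only a rough sketch of the kind of load described (16 hashes per second for 30 seconds), assuming `Argon2.hash_pwd_salt/1` from argon2_elixir; the module and function names are made up for illustration:

```elixir
defmodule HashLoadTest do
  # Spawns `per_second` hash requests every second for `seconds` seconds.
  def run(seconds \\ 30, per_second \\ 16) do
    for _ <- 1..seconds do
      for _ <- 1..per_second do
        # Each call ends up queued on a dirty CPU scheduler.
        spawn(fn -> Argon2.hash_pwd_salt("password") end)
      end

      Process.sleep(1_000)
    end

    :ok
  end
end
```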
Do you have any insight on how we can increase performance, or at least keep the NIF from locking up the entire VM?
Running OTP 20.2, Elixir 1.6.2, on a Heroku PM dedicated instance.