Closed: aeneasr closed this issue 3 years ago
Additionally:
Recovering seems not to be an option: https://stackoverflow.com/questions/30577308/golang-cannot-recover-from-out-of-memory-crash
I guess the argon2 hasher should then have a queue with a length of max-concurrent? That is the only way I see to prevent OOM.
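A minimal sketch of such a queue in Go, using a buffered channel as a counting semaphore; `maxConcurrent` and `hashPassword` are illustrative names, not Kratos APIs:

```go
package main

import (
	"fmt"
	"sync"
)

// At most maxConcurrent hashes run at once; any request beyond that
// blocks (queues) instead of allocating more Argon2 memory.
const maxConcurrent = 2

var sem = make(chan struct{}, maxConcurrent)

// hashPassword stands in for the expensive Argon2 computation.
func hashPassword(pw string) string {
	sem <- struct{}{}        // acquire a slot (blocks when the queue is full)
	defer func() { <-sem }() // release the slot
	return "hash(" + pw + ")" // placeholder for the real argon2 call
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			fmt.Println(hashPassword(fmt.Sprintf("pw%d", i)))
		}(i)
	}
	wg.Wait()
}
```

Requests beyond `maxConcurrent` block on the channel instead of allocating Argon2 memory, which bounds peak memory at roughly `maxConcurrent` times the memory parameter.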
(Plots elided: sequential vs. parallel execution.)

So serializing argon2 will definitely prevent OOM, but it might result in some wait time for individual requests. The solution is to allow a certain number of concurrent operations while queueing everything above that. The purpose of the CLI helper is to find the configuration values, so it should consider the number of operations that Kratos should be able to handle concurrently. This depends heavily on the application and its usage patterns, so we cannot make assumptions about the distribution or the normal/max/average load.
In a real-world deployment, how many requests must be supported concurrently is a statistical problem. It depends on the rate of requests (#/minute) and the execution time of a single request. The execution time depends linearly on memory, iterations, and the number of concurrent executions: (in the plot, XC means X concurrent executions, 1C appears twice, and "med" means median)
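That linear dependence can be captured in a rough cost model; this is a sketch only, and the machine constant and core count are made-up values you would measure on your own hardware:

```go
package main

import "fmt"

// estimateHashTime is an illustrative model: Argon2 time grows linearly
// with memory (KiB) and iterations, and concurrent hashes beyond the
// available cores serialize, multiplying the wall-clock time.
func estimateHashTime(memoryKiB, iterations, concurrent, cores int) float64 {
	const kNsPerKiB = 2.0 // hypothetical ns per KiB per iteration; measure this
	rounds := (concurrent + cores - 1) / cores // ceil(concurrent / cores)
	return kNsPerKiB * float64(memoryKiB) * float64(iterations) * float64(rounds)
}

func main() {
	// Doubling iterations doubles the estimate; so does doubling memory.
	fmt.Println(estimateHashTime(128*1024, 1, 1, 4))
	fmt.Println(estimateHashTime(128*1024, 2, 1, 4))
}
```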
With concurrency comes a higher deviation (probably because of scheduling/resource allocation). The parameters therefore have to be chosen with respect to the statistically relevant events. Compare the following plots; both use the same hashing parameters, designed for a compute time of ~0.1s.
For 256 requests/min the standard deviation is quite high, resulting in very unpredictable latency. The maximum compute time of over 1.5s is also far above the desired 0.1s, resulting in bad UX.
In the case of 32 requests per minute the standard deviation is very low and both min and max values are perfectly in the range of the desired time.
Here are some more stats without visual plots. Note that these were taken on a machine that was running other applications as well.
| Metric             | Run 1         | Run 2          | Run 3         | Run 4          | Run 5         |
|--------------------|---------------|----------------|---------------|----------------|---------------|
| Total sample time  | 55.675696486s | 1m0.107102303s | 59.443645821s | 1m0.087153479s | 59.857562916s |
| Median             | 181.180305ms  | 174.04383ms    | 195.619608ms  | 215.543949ms   | 322.990349ms  |
| Standard deviation | 51.922335ms   | 61.851463ms    | 85.098722ms   | 83.693541ms    | 264.79309ms   |
| Min                | 120.312156ms  | 102.905238ms   | 103.864426ms  | 103.639096ms   | 102.390406ms  |
| Max                | 342.119787ms  | 373.156979ms   | 566.401984ms  | 518.921044ms   | 1.653100701s  |
| Memory used        | 1.10GB        | 1.10GB         | 1.88GB        | 2.13GB         | 4.20GB        |
So the goal has to be to find values that result in a low standard deviation while meeting the requirements for acceptable min/max times and memory at the expected rate of login requests. Everything above that expected rate should be queued to prevent OOM as much as possible.
Here is some example data from trying to find the best values for ~0.1s at 64 req/min on my machine. This is what I want users to do, tuning to their own requirements.
More memory than I would like to dedicate (1 iteration; 512MB memory):

```
TOTAL              59.628672241s
MEDIAN             412.544673ms
STANDARD DEVIATION 210.574957ms
MIN                225.377607ms
MAX                1.109833309s
MEMORY USED        4.20GB
```

Too much CPU usage (10 iterations; 128MB memory):

```
TOTAL              1m0.265162692s
MEDIAN             889.290351ms
STANDARD DEVIATION 481.599671ms
MIN                462.723141ms
MAX                2.537232562s
MEMORY USED        1.62GB
```

A seemingly good configuration (2 iterations; 128MB memory):

```
TOTAL              59.285180675s
MEDIAN             173.700229ms
STANDARD DEVIATION 57.711584ms
MIN                101.515008ms
MAX                366.054783ms
MEMORY USED        731.98MB
```
Sweet!
Is your feature request related to a problem? Please describe.
A `--max-concurrent` argument which defines the maximum number of concurrent password-hashing processes. You could use goroutines to dispatch the concurrent subroutines. Since this is the maximum, there should also be something that checks for the low and average (or median) number of concurrent users. The deviation from the target time should not be too high for low, mid, or high numbers of concurrent users. In general, I think system memory / 2 (to leave room for other processes) / number of concurrent users should be the maximum memory allowed to be requested.

- [ ] `config.schema.json` needs a description for the config fields and sane defaults
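The memory rule of thumb above can be sketched as follows; the function name is illustrative:

```go
package main

import "fmt"

// maxMemoryPerHash implements the rule of thumb from the request:
// half the system memory (leaving room for other processes), split
// across the expected number of concurrent hashes.
func maxMemoryPerHash(systemMemoryMiB, concurrentUsers int) int {
	return systemMemoryMiB / 2 / concurrentUsers
}

func main() {
	// e.g. an 8 GiB machine with 16 concurrent logins
	fmt.Println(maxMemoryPerHash(8192, 16), "MiB per hash")
}
```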
Here's what I wrote:
Manual Configuration
You may also choose the Argon2 parameters manually.
:::note
Please keep in mind that your host machine is probably doing more than just computing Argon2 hashes, so choose these parameters wisely. Also keep in mind that ORY Kratos will probably compute several hashes in parallel, depending on how many concurrent logins or registrations you have.
:::
To configure Argon2, edit the ORY Kratos configuration file:
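As a sketch, the Argon2 section of a Kratos configuration might look roughly like this; the exact keys, units, and defaults depend on the Kratos version, so consult `config.schema.json`:

```yaml
hashers:
  argon2:
    iterations: 2
    parallelism: 1
    memory: 131072   # assumed here to be KiB (128 MiB); units vary by version
    salt_length: 16
    key_length: 32
```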