openwall / john

John the Ripper jumbo - advanced offline password cracker, which supports hundreds of hash and cipher types, and runs on many operating systems, CPUs, GPUs, and even some FPGAs
https://www.openwall.com/john/

Argon2: Reintroduce the memory (de)allocation optimization #5558

Closed by solardiz 2 weeks ago

solardiz commented 3 weeks ago

With #5557, we switched from preserving and reusing Argon2's memory allocation (unless it needs to increase in size) to allocating and freeing for every hash computed. We need to reintroduce our optimization.

As I wrote in https://github.com/openwall/john/issues/2738#issuecomment-343739923:

Regarding "use of preallocated memory", I realized this can be implemented on top of current upstream code as-is using its memory (de)allocation callbacks, which we'll override. In our deallocator, we'll save the pointer and size and mark them as free in our own per-thread variables instead of calling free(). In our allocator, we'll check if the size is same as or smaller than the saved size and if our previous allocation is marked free (in our own Boolean variable) and if so reuse the old allocation, or realloc() as necessary, or error out if the previous allocation is not marked free (shouldn't happen). Better yet, we should reuse (y)escrypt's alloc_region() in there, simplifying some of those checks (alloc_region() already has similar checks) and taking advantage of its explicit huge pages support. We'd only call free_region() from the format's done().

solardiz commented 3 weeks ago

@magnumripper Would you like to take care of this?

magnumripper commented 2 weeks ago

I naively thought that allocation stuff was a changed upstream detail rather than a performance issue. As for memory wiping, I disabled all of it that I could find.

I might take a stab at fixing this, but I'm not sure when. Perhaps just re-introducing the alloc stuff we had before would be easiest? Although for future upstream updates, callbacks would be much nicer.

solardiz commented 2 weeks ago

Perhaps just re-introducing the alloc stuff we had before would be easiest?

It could be. I'm not sure what's easiest. Please feel free to do it either way.

magnumripper commented 2 weeks ago

FWIW, in my very naive first PoC implementation of argon2 for keepass-opencl (basically using the same code as-is), I got away with simply giving each global work item a fixed chunk of memory, as large as the largest m KiB among all loaded salts, out of a single large buffer allocated only at format init. There are no small or repeated allocs/frees during a single hash calculation, so the GPU side's alloc just returns the correct pointer and the free does nothing.
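
As a plain-C illustration of that partitioning (names and the init hook are made up, not the actual keepass-opencl code; on the device the index would come from get_global_id(0)):

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Single pool created at format init, sized for the worst case:
 * global work size times the largest m_cost (in KiB) among loaded salts. */
static uint8_t *pool;
static size_t slice_bytes;

static int pool_init(size_t global_work_size, size_t max_m_cost_kib)
{
	slice_bytes = max_m_cost_kib * 1024;
	pool = malloc(global_work_size * slice_bytes);
	return pool != NULL;
}

/* The per-hash "allocator" is pure pointer arithmetic per work item... */
static uint8_t *slice_for(size_t work_item)
{
	return pool + work_item * slice_bytes;
}

/* ...and the matching "free" is a no-op; the pool lives until the format's done(). */
static void slice_free(uint8_t *memory)
{
	(void)memory;
}
```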

Edit: And yes, I'll be using Alain's code before it gets anywhere near a PR - current speed is useless.

magnumripper commented 2 weeks ago

I noticed the old code did sc_threads = omp_autotune() and then allocated the memory list according to that, which means the actual number of threads is multiplied by OMP_SCALE. I can't see why that would be needed, or how indices larger than the thread count would even be addressed. OpenCL is different: there I do have to allocate memory for each global work item, but those all execute virtually in parallel, whereas in this case they don't, right?

BTW, the memory[THREAD_NUMBER % sc_threads] indexing in the old code has me scratching my head too. That will only ever evaluate to things like "3 % 32", right? So what is the point of that modulo?
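
For reference, a hedged reconstruction of the pattern in question; none of this is the actual old code, and the THREAD_NUMBER definition, the OMP_SCALE value, and the omp_autotune() prototype are assumptions:

```c
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#define THREAD_NUMBER omp_get_thread_num() /* assumption about the old macro */
#else
#define THREAD_NUMBER 0
#endif

#define OMP_SCALE 8 /* example value only */

typedef struct { void *aligned; size_t aligned_size; } region_t; /* illustrative */

struct fmt_main;
extern int omp_autotune(struct fmt_main *self, int scale); /* assumed prototype */

static region_t *memory;
static int sc_threads;

static void init(struct fmt_main *self)
{
	/* Per the comment above, this yields threads * OMP_SCALE entries... */
	sc_threads = omp_autotune(self, OMP_SCALE);
	memory = calloc(sc_threads, sizeof(region_t));
}

static void per_candidate_work(void)
{
	/* ...yet lookups use the real thread number, so with e.g. 4 threads and
	 * OMP_SCALE 8 the index is only ever 0..3 % 32: the modulo never wraps
	 * and slots 4..31 go unused, which is what the question above is about. */
	region_t *r = &memory[THREAD_NUMBER % sc_threads];
	(void)r;
}
```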