ralna / spral

Sparse Parallel Robust Algorithms Library
https://ralna.github.io/spral/

Large minimum allocation size in AppendAlloc for SSIDS factor data #119

Closed: mjacobse closed this issue 1 year ago

mjacobse commented 1 year ago

The AppendAlloc allocator that allocates memory for storing the factor data keeps a Pool of Pages of memory. These pages have a minimum size of 8MB:

https://github.com/ralna/spral/blob/70c42191a553756a5fd672e38ad185015c55ab87/src/ssids/cpu/AppendAlloc.hxx#L67-L72
https://github.com/ralna/spral/blob/70c42191a553756a5fd672e38ad185015c55ab87/src/ssids/cpu/AppendAlloc.hxx#L87-L91

So no matter how little memory is requested next, if it does not fit on the current top page, an additional 8MB is allocated, and even zeroed:

https://github.com/ralna/spral/blob/70c42191a553756a5fd672e38ad185015c55ab87/src/ssids/cpu/AppendAlloc.hxx#L18-L20

I saw this cause an unexpectedly large number of page faults for tiny matrices, completely dominating the runtime.
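To make the pattern concrete, here is a much-simplified sketch of the mechanism as I understand it (names mirror AppendAlloc.hxx, but the details are illustrative rather than the real code):

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

// Illustration of the pattern described above (not the actual SSIDS code):
// every new Page is at least PAGE_SIZE bytes and is fully zeroed by calloc
// on construction, no matter how small the request that triggered it.
static const std::size_t PAGE_SIZE = 8 * 1024 * 1024; // 8MB minimum

struct Page {
   explicit Page(std::size_t sz)
   : mem(std::calloc(sz > PAGE_SIZE ? sz : PAGE_SIZE, 1)), // zeroes the whole page
     next(static_cast<char*>(mem)),
     remaining(sz > PAGE_SIZE ? sz : PAGE_SIZE)
   {
      if(!mem) throw std::bad_alloc();
   }
   ~Page() { std::free(mem); }
   // Bump-pointer allocation within the page; returns nullptr when full,
   // at which point the pool constructs another (>= 8MB) page.
   void* allocate(std::size_t sz) {
      if(sz > remaining) return nullptr;
      void* ptr = next;
      next += sz;
      remaining -= sz;
      return ptr;
   }
   void* mem;
   char* next;
   std::size_t remaining;
};
```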

Initially I considered replacing the calloc on construction of a Page with a normal malloc and only memsetting the actually used parts to 0 in Page::allocate. That did help.
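In terms of the sketch above, that change would look roughly like this (PageZeroOnUse is a hypothetical name; this illustrates the idea, not the actual patch):

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <new>

// Variant of the sketch above: plain malloc on construction, with zeroing
// deferred to the bytes that allocate() actually hands out.
struct PageZeroOnUse {
   static constexpr std::size_t PAGE_SIZE = 8 * 1024 * 1024;
   explicit PageZeroOnUse(std::size_t sz)
   : mem(std::malloc(sz > PAGE_SIZE ? sz : PAGE_SIZE)), // no upfront zeroing
     next(static_cast<char*>(mem)),
     remaining(sz > PAGE_SIZE ? sz : PAGE_SIZE)
   {
      if(!mem) throw std::bad_alloc();
   }
   ~PageZeroOnUse() { std::free(mem); }
   void* allocate(std::size_t sz) {
      if(sz > remaining) return nullptr;
      void* ptr = next;
      next += sz;
      remaining -= sz;
      std::memset(ptr, 0, sz); // zero only the part actually used
      return ptr;
   }
   void* mem;
   char* next;
   std::size_t remaining;
};
```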

But then I wondered what the purpose of this minimum PAGE_SIZE actually is. From the naming, it seems like the idea might be to always request a whole memory page/virtual page from the operating system at once. If so, without knowing much about memory pages, I don't believe the current implementation achieves this: the fixed constant of 8MB might not be the actual page size, and malloc does not seem well-equipped for this task; something like mmap would probably have to be used (see this stackoverflow question). Maybe the idea is just to reduce the number of calls to malloc in general. But for that to be worth it, do typical matrices really have so many factor nodes that need separately allocated blocks of memory?
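As an aside (a hedged illustration, not something SSIDS currently does), the actual page size can be queried at runtime on POSIX systems:

```cpp
#include <unistd.h>
#include <cstdio>

int main() {
   // The real OS page size: typically 4KB on x86-64 Linux and 16KB on
   // Apple Silicon, far below the hard-coded 8MB constant.
   long page_size = sysconf(_SC_PAGESIZE);
   std::printf("OS page size: %ld bytes\n", page_size);
   return 0;
}
```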

I ended up running benchmarks with PAGE_SIZE = 8MB vs. PAGE_SIZE = 0MB. I used google/benchmark with this source as runner and ran it on all real symmetric matrices from the SuiteSparse Matrix Collection that did not run into a 2min timeout. Parameters:

The plots below show the relative runtime of the PAGE_SIZE = 0MB version in % for each example. The bars compare the fastest measurement across all repetitions, whereas the scattered dots compare all repetitions pairwise. The matrices are ordered from left to right by ascending fastest measured runtime. More plots of the separate groups can be found here, and results for the same benchmark run with 4 threads here.

Simply put, for small matrices, PAGE_SIZE = 0 was unsurprisingly often much faster for me. That improvement shrinks as the matrices become larger. But even for larger matrices, there does not seem to be a clear advantage to PAGE_SIZE = 8MB; it looks like noise around the baseline to me.

I understand that SSIDS is intended especially for really large matrices and that performance on small ones is probably less of a concern (if at all). But given these observations, it seems to me that removing the overallocation might improve performance on small matrices for free, without affecting performance on larger ones. But perhaps I am missing something. Is the benchmark flawed in some way? Is there a benefit to overallocating once the matrices become really large? Are there platforms with a different malloc implementation that would benefit? Or certain hardware?

I'd be happy to hear your thoughts; sorry for the lengthy description.

[Figures: benchmark_comparison_Boeing, benchmark_comparison_all]

jfowkes commented 1 year ago

Many thanks for the detailed investigation @mjacobse. It seems to me that this is trying to second-guess the hardware by assuming a minimum page size of 8MB. While that may well have been sensible ten years ago, I see no reason why it should be now, and I am generally very much against second-guessing the hardware; that is what the OS is for.

From your extensive benchmarks I can see no downside to allowing 0MB page sizes; indeed the improvement is quite substantial for smaller matrices, so I would suggest switching to that as the default. @tyronerees, your thoughts on this? Also, does MA97 do something similar with page sizes? If so, this could maybe explain the recent memory issues we were seeing.

jhogg41 commented 1 year ago

I don't think PAGE_SIZE was ever meant to correspond to OS page size (which is probably something like 4-16KB depending on the OS). I suspect on systems that support unix-style mmap, you can probably get more performance by replacing this whole mechanism with something that just grows memory space using mmap (which may even zero things for you depending on the exact OS).
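For illustration, a minimal sketch of that unix-style approach might look as follows; this assumes a POSIX system and illustrates the idea only, not an actual SSIDS implementation:

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

int main() {
   // Anonymous private mappings are zero-filled by the kernel, so no
   // calloc/memset is needed for freshly mapped memory.
   const std::size_t len = std::size_t(1) << 20; // 1 MiB
   void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
   if(p == MAP_FAILED) { std::perror("mmap"); return 1; }
   // ... carve factor-data allocations out of [p, p+len); map further
   // regions (or mremap on Linux) when the space runs out ...
   munmap(p, len);
   return 0;
}
```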

Jonathan.


jfowkes commented 1 year ago

Many thanks @jhogg41 for the clarification! @mjacobse what are your thoughts on an mmap replacement implementation?

mjacobse commented 1 year ago

That would be interesting to try, I think, but not really trivial to implement, I'd imagine? I have personally never worked with mmap or any memory API lower-level than malloc. It would be platform-dependent code, with a potential parallel implementation for Windows, I suppose. I think the fallback of just using standard malloc and letting its implementation deal with it should probably stay in any case. And as mentioned, I get the impression that not doing manual overallocation might be a more natural default for that fallback.
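For completeness, a sketch of what the Windows-side parallel might look like, assuming VirtualAlloc as the counterpart of mmap (a hypothetical illustration, not a worked-out design):

```cpp
#include <windows.h>
#include <cstdio>

int main() {
   // Pages committed by VirtualAlloc are guaranteed by the OS to be
   // zero-initialized, much like anonymous mmap on POSIX.
   const SIZE_T len = 1 << 20; // 1 MiB
   void* p = VirtualAlloc(nullptr, len, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
   if(!p) { std::printf("VirtualAlloc failed: %lu\n", GetLastError()); return 1; }
   // ... hand out sub-allocations from [p, p+len) ...
   VirtualFree(p, 0, MEM_RELEASE);
   return 0;
}
```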

jfowkes commented 1 year ago

Yes, agreed, that feels like an enhancement to the existing codebase. I propose we switch to PAGE_SIZE = 0MB for now.