vermaseren / form

The FORM project for symbolic manipulation of very big expressions
GNU General Public License v3.0

Reallocate large+small buffer after each module #537

Open jodavies opened 3 months ago

jodavies commented 3 months ago

This is a test of a request from @msloechner , which makes FORM's resident set size better track its actual memory usage. This comes on top of the split allocations branch, since it only deals with the largest buffer. Since FORM allocates its buffers and keeps them for the whole run, the apparent memory usage remains constant after a "peak". If the OS is under memory pressure it will swap pages, but they might be pages that FORM actually isn't using any more.

There seem to be two ways to do it: free and re-allocate the buffer, or call madvise with MADV_DONTNEED. Both have the same effect and about the same performance, but they cost more than 10% in runtime compared to doing nothing (roughly 13-14% in the benchmark below). This shouldn't be done by default, and maybe never for "every module". (However, the longer each module takes, with lots of disk access during sorting etc., the smaller the performance impact will be.)
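The two approaches can be sketched in C roughly as follows. This is a minimal illustration, not the code from the commits: the `sort_buffer` struct and the `buffer_*` functions are hypothetical names, and it calls `mmap`/`munmap` directly rather than FORM's own allocation routines. It assumes Linux, where `MADV_DONTNEED` on a private anonymous mapping drops the resident pages and later reads see zero-filled memory.

```c
#define _DEFAULT_SOURCE          /* for madvise() and MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical stand-in for one large buffer; not FORM's actual structure. */
typedef struct {
    char  *base;
    size_t size;
} sort_buffer;

/* Map an anonymous, zero-filled region (page-aligned by construction). */
int buffer_init(sort_buffer *b, size_t size) {
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;
    b->base = p;
    b->size = size;
    return 0;
}

/* Variant 1: free and re-allocate. The address may change, so any pointer
   into the old buffer must be refreshed by the caller. */
int buffer_release_realloc(sort_buffer *b) {
    assert(b->base != NULL);
    size_t size = b->size;
    if (munmap(b->base, size) != 0) return -1;
    b->base = NULL;
    return buffer_init(b, size);
}

/* Variant 2: keep the mapping and its address, drop the resident pages. */
int buffer_release_madvise(sort_buffer *b) {
    assert(b->base != NULL);
    return madvise(b->base, b->size, MADV_DONTNEED);
}
```

In both variants the old contents are gone afterwards, so either call is only valid at a point where nothing in the buffer is still needed, such as the end of a module.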

The options could be a preprocessor statement like #reallocatesort, which must be specified when the user really wants this (say, after a heavy module), or maybe just On reallocatesort; to enable it for every module that follows. Any thoughts?

@msloechner , can you test this and see if it helps your multiple concurrent jobs scenario? I could also see it being useful when running FORM on cluster machines which kill jobs based on RSS: both of these commits have a lower peak RSS in my tests compared to the original behaviour.

Here is a small MINCER performance test for orig (unmodified form), split (checking that the split allocations have no impact), realloc (the first commit) and madv (the second):

Benchmark 1: nice -n -10 ../bin/tform-test-SR-orig -w12 calcdia.frm > calcdia.log1
  Time (mean ± σ):     16.686 s ±  0.216 s    [User: 152.647 s, System: 1.269 s]
  Range (min … max):   16.252 s … 16.909 s    8 runs

Benchmark 2: nice -n -10 ../bin/tform-test-SR-split -w12 calcdia.frm > calcdia.log2
  Time (mean ± σ):     16.718 s ±  0.320 s    [User: 152.869 s, System: 1.197 s]
  Range (min … max):   16.166 s … 17.113 s    8 runs

Benchmark 3: nice -n -10 ../bin/tform-test-SR-realloc -w12 calcdia.frm > calcdia.log3
  Time (mean ± σ):     18.932 s ±  0.154 s    [User: 156.316 s, System: 6.020 s]
  Range (min … max):   18.730 s … 19.200 s    8 runs

Benchmark 4: nice -n -10 ../bin/tform-test-SR-madv -w12 calcdia.frm > calcdia.log4
  Time (mean ± σ):     18.942 s ±  0.248 s    [User: 157.591 s, System: 5.870 s]
  Range (min … max):   18.585 s … 19.308 s    8 runs

Summary
  nice -n -10 ../bin/tform-test-SR-orig -w12 calcdia.frm > calcdia.log1 ran
    1.00 ± 0.02 times faster than nice -n -10 ../bin/tform-test-SR-split -w12 calcdia.frm > calcdia.log2
    1.13 ± 0.02 times faster than nice -n -10 ../bin/tform-test-SR-realloc -w12 calcdia.frm > calcdia.log3
    1.14 ± 0.02 times faster than nice -n -10 ../bin/tform-test-SR-madv -w12 calcdia.frm > calcdia.log4

Here is the RSS profile of this test: [image: rss]

msloechner commented 3 months ago

Thanks a lot, @jodavies. This indeed looks promising. I will look into it when I find a bit of time.

jodavies commented 3 months ago

In case it is useful to see how your own scripts behave, I logged the RSS during the run with something like

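# Start the run in the background and remember its PID.
../bin/tform-test-SR-split -w8 calcdia.frm > calcdia.log1 &
pid=$!
# Sample the RSS (in KiB, as reported by ps) every 0.2 s until the process exits.
while kill -0 "$pid" 2>/dev/null; do
   ps --pid "$pid" -o rss=
   sleep 0.2
done > tform-rss-orig.dat

jodavies commented 3 months ago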

I added an On/Off switch for this. I think it is better than just a preprocessor instruction, since the option can easily be enabled for multiple modules (nice when calling procedures etc.) and can still be used for a "single module" (by turning it off again in the next).

I think the "free and reallocate" version is simpler in that we don't need to worry about aligned memory allocations and whether or not this works in Windows.
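For context on the alignment concern: `madvise` requires the start address to be a multiple of the page size, and on Linux it fails with `EINVAL` otherwise. A rough sketch of what the `madvise` variant would need on a POSIX system is below. All function names are hypothetical and none of this is taken from the commits. It says nothing about Windows, which has no `madvise`.

```c
#define _DEFAULT_SOURCE          /* for madvise() and posix_memalign() on glibc */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* System page size in bytes. */
size_t page_size(void) { return (size_t)sysconf(_SC_PAGESIZE); }

/* Round n up to the next multiple of the page size. */
size_t round_up_to_page(size_t n) {
    size_t page = page_size();
    return (n + page - 1) / page * page;
}

/* Allocate at least n bytes, starting on a page boundary and padded to a
   whole number of pages. Release with free(). Returns NULL on failure. */
void *alloc_page_aligned(size_t n) {
    void *p = NULL;
    if (posix_memalign(&p, page_size(), round_up_to_page(n)) != 0) return NULL;
    return p;
}

/* Non-zero if p lies on a page boundary. */
int is_page_aligned(const void *p) {
    return (uintptr_t)p % page_size() == 0;
}

/* Thin wrapper around madvise: 0 on success, -1 on error (errno is set). */
int drop_pages(void *p, size_t n) {
    return madvise(p, n, MADV_DONTNEED);
}
```

With a plain `malloc`/`free` pair none of these helpers are needed, which is the simplification the "free and reallocate" version offers.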

Edit: I don't know what is going on with valgrind on the runners these days. I see a lot of:

valgrind:  Fatal error at startup: a function redirection
valgrind:  which is mandatory for this platform-tool combination
valgrind:  cannot be set up.  Details of the redirection are:
valgrind:  
valgrind:  A must-be-redirected function
valgrind:  whose name matches the pattern:      strlen
valgrind:  in an object with soname matching:   ld-linux-x86-64.so.2
valgrind:  was not found whilst processing
valgrind:  symbols from the object with soname: ld-linux-x86-64.so.2

coveralls commented 3 months ago

Coverage Status

coverage: 49.981% (-0.02%) from 49.999% when pulling 6a5c6d5d3a50a6c7fbe89618d971fc5b7966b92f on jodavies:sort-realloc into 83e3d4185efca2e5938c665a6df9d67d6d9492ca on vermaseren:master.

msloechner commented 3 months ago

I've run an example for my use case, and it looks quite suitable: I don't see any significant loss in performance in terms of CPU time, while benefiting from the reduction in resident set size.

For me this improvement will matter mainly in (single-threaded) form, for jobs where the time spent in single modules greatly exceeds the time needed for sorting and allocating memory, but which do not benefit much from tform parallelisation. I also see the benefit for tform, as soon as it is able to combine a lot of single terms in a .sort.

Note that in the example below, the ratio between the runtimes after and before the peak in memory consumption is rather low. In other cases (which would take too long to benchmark) the ratio might be around 5-10. [image: RSS]

msloechner commented 3 months ago

@jodavies It would be great to also have a preprocessor command #sortreallocate that is active only for a single module. Could you possibly add this?

jodavies commented 3 months ago

Do you strongly prefer that to doing

On sortreallocate;
...
.sort
Off sortreallocate;

?

Of course it is straightforward to have both options, so I can add a #sortreallocate too.

msloechner commented 3 months ago

I think it would be a nice feature when you're dynamically generating code with the preprocessor and happen to not have access to the previous .sort statement (but don't want to sort right now). What do you think?

coveralls commented 3 months ago

Coverage Status

coverage: 49.967% (-0.03%) from 49.999% when pulling 5302a8d6eb54dcfde75682645b4a4a5ee9fccd0b on jodavies:sort-realloc into 83e3d4185efca2e5938c665a6df9d67d6d9492ca on vermaseren:master.