jeremyomer / KidneyExchange.jl


Timings change when solving more than one instance in a session #19

Closed WPettersson closed 6 months ago

WPettersson commented 7 months ago

I'm running some benchmarks of my own, and I found an oddity when using the KEPTestBP.jl file, or indeed just calling solve_with_BP().

If I start a Julia session and solve just the 00036-00000001 instance using KEPTestBP.jl with a cycle length of 3 and a chain length of 2, I am told that on my system this takes 4.04s in total, with 1.7s spent in Process_Node.
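For reference, the single-solve setup looks roughly like this (a sketch assembled from the snippets further down; the instance name and the cycle/chain limits are the ones just mentioned):

using KidneyExchange
using TimerOutputs

filename = "00036-00000001"  # PrefLib instance
cycle_limit = 3              # maximum cycle length
chain_limit = 2              # maximum chain length

bp_params = BP_params()
bp_params.optimizer = "Gurobi"
timer = TimerOutput()
max_time = 7200.0

bp_status, graph_info, subgraph_info = solve_with_BP(filename, cycle_limit, chain_limit, bp_params, timer, max_time)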

Output when solving just 00036-00000001

─────────────────────────────────────────────────────────────────────────────
                                     Time                    Allocations      
                            ───────────────────────   ────────────────────────
      Tot / % measured:          4.04s /  97.9%            345MiB /  98.5%    

 Section            ncalls     time    %tot     avg     alloc    %tot      avg
 ─────────────────────────────────────────────────────────────────────────────
 B&P                     1    2.90s   73.3%   2.90s    292MiB   85.8%   292MiB
   Process_Node          1    1.70s   42.9%   1.70s    225MiB   66.1%   225MiB
     opt_master          4    831ms   21.0%   208ms    108MiB   31.9%  27.1MiB
     Bellman-Ford        6    188ms    4.8%  31.4ms   12.5MiB    3.7%  2.08MiB
     IP_master           2   12.6ms    0.3%  6.29ms    574KiB    0.2%   287KiB
 Preprocessing           1    536ms   13.5%   536ms   34.4MiB   10.1%  34.4MiB
 Parser                  1    522ms   13.2%   522ms   13.8MiB    4.1%  13.8MiB
 ─────────────────────────────────────────────────────────────────────────────

However, if I create KEPTestBP-repeat.jl which just duplicates the call to solve_with_BP()

(just a snippet of the file)

bp_params = BP_params()
bp_params.optimizer = "Gurobi"
timer = TimerOutput()
max_time = 7200.0

bp_status, graph_info, subgraph_info = solve_with_BP(filename, cycle_limit, chain_limit, bp_params, timer, max_time)
bp_status, graph_info, subgraph_info = solve_with_BP(filename, cycle_limit, chain_limit, bp_params, timer, max_time)

t_total = TimerOutputs.time(timer["Parser"]) + TimerOutputs.time(timer["Preprocessing"]) + TimerOutputs.time(timer["B&P"])

and then run this file and look only at the timing detail for the second run, I now see a total of 3.75ms, with only 841μs spent in Process_Node. Somehow this second run is significantly faster. Further testing has shown that it is always the first run that is slow, and all subsequent runs are much faster across the board. I have also tested this on two different computers in case my particular setup was the cause.

 ─────────────────────────────────────────────────────────────────────────────
                                     Time                    Allocations      
                            ───────────────────────   ────────────────────────
      Tot / % measured:         3.75ms /  96.2%            415KiB /  97.0%    

 Section            ncalls     time    %tot     avg     alloc    %tot      avg
 ─────────────────────────────────────────────────────────────────────────────
 B&P                     1   3.09ms   85.6%  3.09ms    260KiB   64.7%   260KiB
   Process_Node          1    841μs   23.3%   841μs    109KiB   27.0%   109KiB
     opt_master          4    466μs   12.9%   117μs   29.4KiB    7.3%  7.35KiB
     IP_master           2   10.2μs    0.3%  5.08μs   2.75KiB    0.7%  1.38KiB
     Bellman-Ford        6   8.16μs    0.2%  1.36μs   3.97KiB    1.0%     677B
 Parser                  1    429μs   11.9%   429μs    118KiB   29.3%   118KiB
 Preprocessing           1   90.8μs    2.5%  90.8μs   24.5KiB    6.1%  24.5KiB
 ─────────────────────────────────────────────────────────────────────────────
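As an aside, if the shared timer accumulates sections across calls, one way to get a report covering only the second run is to reset it between the two calls with TimerOutputs.reset_timer!; a sketch:

solve_with_BP(filename, cycle_limit, chain_limit, bp_params, timer, max_time)  # first run, pays the compilation cost

TimerOutputs.reset_timer!(timer)  # drop everything recorded so far

solve_with_BP(filename, cycle_limit, chain_limit, bp_params, timer, max_time)  # second run only
show(timer)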

I decided to see what happens if I solve two different instances: is the second instance also fast, or is it slow? To test this, I downloaded the instance 00036-00000002 from PrefLib and used the following

(just a snippet of the file)

bp_params = BP_params()
bp_params.optimizer = "Gurobi"
timer = TimerOutput()
max_time = 7200.0

bp_status, graph_info, subgraph_info = solve_with_BP("00036-00000002", cycle_limit, chain_limit, bp_params, timer, max_time)
bp_status, graph_info, subgraph_info = solve_with_BP(filename, cycle_limit, chain_limit, bp_params, timer, max_time)

t_total = TimerOutputs.time(timer["Parser"]) + TimerOutputs.time(timer["Preprocessing"]) + TimerOutputs.time(timer["B&P"])

The idea here was to solve instance 00036-00000002 to "warm up" (not warm-start in the MIP sense) and see how long the 00036-00000001 instance took to solve. As it turns out, even in this scenario the 00036-00000001 instance only took ~4ms to solve (see below), while the 00036-00000002 instance took around 4 seconds.

 ─────────────────────────────────────────────────────────────────────────────
                                     Time                    Allocations      
                            ───────────────────────   ────────────────────────
      Tot / % measured:         4.11ms /  96.7%            415KiB /  97.0%    

 Section            ncalls     time    %tot     avg     alloc    %tot      avg
 ─────────────────────────────────────────────────────────────────────────────
 B&P                     1   3.55ms   89.1%  3.55ms    260KiB   64.7%   260KiB
   Process_Node          1   1.14ms   28.8%  1.14ms    109KiB   27.0%   109KiB
     opt_master          4    657μs   16.5%   164μs   29.4KiB    7.3%  7.35KiB
     IP_master           2   14.0μs    0.4%  6.98μs   2.75KiB    0.7%  1.38KiB
     Bellman-Ford        6   11.6μs    0.3%  1.93μs   3.97KiB    1.0%     677B
 Parser                  1    346μs    8.7%   346μs    118KiB   29.3%   118KiB
 Preprocessing           1   87.1μs    2.2%  87.1μs   24.5KiB    6.1%  24.5KiB
 ─────────────────────────────────────────────────────────────────────────────

At first I suspected the timers weren't being properly reset, but simply watching my terminal output confirms that the first instance to be solved really does take a lot longer, and that the second one is blazingly fast. This makes me wonder whether there is some caching that Julia does which isn't being taken into account. However, I'm no Julia expert (this is the first piece of Julia code I've ever used), so I wanted to check whether there's anything obvious I'm missing.

WPettersson commented 7 months ago

I've done some searching and found a precompilation tutorial for Julia - https://julialang.org/blog/2021/01/precompile_tutorial/ - which may be useful for KidneyExchange.jl. For now, I'll move forward with the slow approach of solving a simple instance before the instance I actually want to time.
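Following that tutorial, the package could also ship explicit precompile directives for the hot entry point; a rough sketch, where the argument types are only my guess based on the calls shown earlier in this issue:

# inside the KidneyExchange module, after the function definitions
# (argument types assumed from the snippets above)
precompile(solve_with_BP, (String, Int, Int, BP_params, TimerOutput, Float64))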

pnavaro commented 7 months ago

If you use the following line, you can see why

@time bp_status, graph_info, subgraph_info = solve_with_BP(filename, cycle_limit, chain_limit, bp_params, timer, max_time);

4.452510 seconds (3.35 M allocations: 226.498 MiB, 1.72% gc time, 99.53% compilation time: 4% of which was recompilation)

@time bp_status, graph_info, subgraph_info = solve_with_BP(filename, cycle_limit, chain_limit, bp_params, timer, max_time);

0.010202 seconds (11.83 k allocations: 516.062 KiB)

I think it is good practice to solve a small case to warm up. I will try to dive into https://julialang.github.io/PrecompileTools.jl/stable
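For reference, a minimal sketch of what a PrecompileTools workload block might look like inside the package; the bundled instance path, the limits, and the time limit are placeholders, and the workload would need an optimizer that is usable at precompile time:

module KidneyExchange

using PrecompileTools
using TimerOutputs

# ... existing package code defining BP_params, solve_with_BP, etc. ...

@setup_workload begin
    # hypothetical small instance shipped with the package
    instance = joinpath(@__DIR__, "..", "test", "instances", "tiny")
    @compile_workload begin
        solve_with_BP(instance, 3, 2, BP_params(), TimerOutput(), 10.0)
    end
end

end # module

On recent Julia versions the code compiled in the workload is cached when the package itself is precompiled, so the first call in a fresh session should already be much faster.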

pnavaro commented 6 months ago

Coming back to this after some thought: solving a small case beforehand does not really change anything, because the compilation time does not depend on the problem size. Instead, you can create a system image that includes the package, following the approach explained in this short video: https://www.youtube.com/watch?app=desktop&v=_3vJSBk0Bls
I will add the procedure to the documentation.
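A sketch of what that sysimage step could look like with PackageCompiler (the file names here are placeholders):

using PackageCompiler

# warmup.jl should call solve_with_BP on a small instance so that the
# corresponding methods get compiled into the image
create_sysimage(["KidneyExchange"];
                sysimage_path = "kep_sysimage.so",
                precompile_execution_file = "warmup.jl")

Starting Julia with julia --sysimage kep_sysimage.so then avoids paying the compilation cost on the first solve.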

pnavaro commented 5 months ago

Dear @WPettersson, I added documentation on creating a sysimage with PackageCompiler to speed up the computation:

https://jeremyomer.github.io/KidneyExchange.jl/dev/sys_image/