Open ozmorph opened 4 years ago
Sounds good! I like having a check as an alternative.
I am surprised too that it did not affect run-time. It could also be that the paging algorithm and Windows allocators are good enough to hide this cost. I'm going to try to replicate the CI binaries on my Windows system and analyze the performance versus my Linux runs to try to determine where the real difference lies.
Good to note about running only 2 tests at once on the CI.
@matt-gretton-dann Using the exact same compiler that GitHub Actions is using, I get similar performance on my local machine to Clang 9 on WSL Ubuntu 20.04. This led me down a rabbit hole, but I think your earlier theory that all of the `fprintf()` calls are responsible for the slowdown is probably correct.
It appears that I/O performance is a common issue for GitHub Actions runners, especially on Windows, and this project is not the only one affected. That would also explain why Windows performance is so drastically different from Linux and Mac.
To conclude, reducing memory usage will not by itself make the regression tests faster, but I think it would provide the extra headroom necessary for tracking more variables in additional simulations in the future.
> This led me down a rabbit hole, but I think your past theory that all of the `fprintf()` calls are responsible for the slowdown is probably correct.
I have a WiP branch which enables us to control output verbosity - and so that should show whether I am right or not.
### Objective
Reduce RAM usage in the regression tests so that GitHub Actions runners can run all 4 tests without paging memory.
### Background
The size of the `Results` struct in `Model.h` is nearly 805 KB on Clang 9 for Linux. Accounting for an array of `P.NumSamples` elements (720 in the regression tests), `TSMeanE` and `TSVarE` each require an allocation of around 580 MB. As a result, a single regression test uses over 1.1 GB of RAM for these two arrays alone; when all 4 tests are run in parallel, they consume 4.4 GB of RAM.
### Rationale
As noted in #388, the default GitHub Actions runners have only 7 GB of total RAM and a 14 GB temporary storage device allocated to them. For Linux environments the swap space is set to 4 GB, and on Windows it is most likely set to 2 GB.

Local runs using heaptrack indicate that the average heap size of each `CovidSim` run is about 3.5 GB. Accounting for the fact that the US regression tests run for a fraction of the time that the UK tests do, let's assume that the regression tests use 7 GB of RAM on average and a maximum of 14 GB. Windows Server 2019 requires 512 MB of RAM to run, so if 7 GB is the total amount of RAM given to the runners, this does not leave much room for the operating system itself. I therefore surmise that `CovidSim` is hitting paged memory for at least the first half of a `ctest -j6 -V` run.
### Solution
This pull request removes the `TSMeanE` and `TSVarE` arrays from the code, since they have not been used in the final output since the initial squash of commits back in April. The regression tests still pass.

### Known Concerns
If this code is still used on the super-computers or outside of the regression tests listed on GitHub, then please close this PR and I will open a new PR that documents its use so that someone doesn't attempt to propose this change in the future.
### Alternative/Additional Solution
The memory requirements of the `Results` struct's arrays could also be reduced by using heap-allocated C-style arrays or `std::vector`s instead of static `double` arrays of `MAX_ADUNITS` size (`MAX_ADUNITS` is set to `3200`, but the regression tests only use 1 or 2 elements). However, this would require far deeper and more extensive changes to the code base than this pull request makes.