[OSB] Summary of findings (umbrella issue)

We have completed our review of Health GPS as part of the Open Source Booster programme.

In general, the code appears to be well written and clear and it follows a modern C++ standard, which should make it less bugprone and more portable. There is also good quality technical documentation accompanying it.

We do however have some specific suggestions which we have outlined below. As we only had a single day to look at the code, we were not able to take a detailed look at, for example, performance, so we may be barking up the wrong tree in places. Sorry in advance if so! Hopefully it will be useful for you to have an outsider's perspective on how it was to work with the code and some of the issues will help steer you in the direction of areas for improvement.

More detail is available in the issues we raised, but we have summarised our findings below.

Documentation

There is a decent amount of technical documentation, which, although I didn’t have a chance to look at it in detail, seems to be of good quality. It was rather difficult to find, though, and I had to go through our email chain to find the link again, so I’d recommend sticking a link to it in the README for people who stumble across the GitHub repository.

Although in general the code appears to be clearly written, there is a real lack of in-code documentation describing what individual classes, methods etc. do as well as individual blocks within them, which makes it difficult to follow. As a rule, all public methods etc. should have a comment describing them, preferably in a format that doxygen can parse, so that you can automatically generate API documentation for your users (and possibly publish it on GitHub pages).

See issues:

[x] #68
[x] #69
[x] #70
[x] #71

Problems building the code

It took us some time to build the code properly, probably because we were had subtly different developer setups from @israel-vieira. We’ve opened a few issues, but this is why it can be useful a) to get colleagues to try to build your code and b) to test lots of platforms with your CI (see below).

Specifically, we came across issues building in Visual Studio 2019 (and 2022) on Windows 10 - so MSVC (v142 and v143 toolsets) - with unresolved external symbols in the benchmarking code. We circumvented this by commenting out the benchmarking/testing build option in CMakeLists.txt.

See issues:

[x] #79
[x] #82

CI and code hygiene

We had one error that prevented building the code on gcc v12, the latest version. In general, it is useful to use CI to test that your code works with the latest version of gcc, in addition to whatever you’re using personally, partly so that you don’t have issues further down the line and partly because new versions of gcc often bring new warnings that can help identify problems with your code.

We also tried building your code with clang and there were some minor errors and warnings produced. It is also worth having clang in your CI test setup as it can identify problems not detected by gcc or MSVC. It is also the default compiler on MacOS, so in fixing support for clang, you should make it easier for Mac users. We have also recommended adding support for MacOS, as this should be trivial – likely not involving any changes to the code itself – and would extend the reach of your project.

By targeting more compilers, you can have more assurance that your code is really standards compliant.

Finally, we also recommend a static analysis tool called clang-tidy, which can identify bugs and stylistic issues in C++ code. It could even be integrated with your CI system.

See issues:

[x] #73
[x] #74
[x] #75
[x] #76
[x] #89

Performance

As a language, C++ offers good performance out of the box. However, when writing high-performance code, there are extra considerations needed to make sure you’re getting the most from your hardware. The things we’ve suggested basically fall under the categories of either “vectorisation” and “parallelisation” and we’ll address them separately.

(Note that we didn’t have time to delve into the code in detail, so these might be problems that you’re already aware of, but based on some benchmarking and a cursory examination of the code, these are the areas we think could probably most benefit from improvement.)

Vectorisation

CPUs are at their most efficient when trundling through blocks of numbers that are contiguous in memory, for various reasons. So, for example, if you want to compute the mean of 10,000 numbers, in C++ the best data structure to represent this is almost certainly a std::vector, which allocates memory contiguously. In general, there seemed to be a surprisingly high usage of std::maps in the code base and these data structures are much slower to work with because of the additional guarantees they offer. One guarantee is that the keys are kept sorted even as elements are added to the container, which means that they are slower for many operations than std::unsorted_map, so probably in many cases you could swap your std::maps for std::unsorted_maps to receive an easy performance boost. Ideally, though, you want as much of your data as possible to be contiguous in memory (even if it means filling empty entries with NaNs etc.), so if your CPU is jumping around in memory from one element of a map to the next, that will be a performance bottleneck.

In order to take advantage of modern SIMD instructions (e.g. AVX2 and AVX512) which are important for performance in HPC use cases, you often need to tell the compiler to enable this support if you’re using a bog-standard gcc (compilers on HPC systems often have this enabled anyway), so this is worth bearing in mind and perhaps adding to the documentation.

It seems that there is already an issue open for trialling Apache Arrow (#14), which keeps data in contiguous chunks of memory and may be a useful change for other reasons. We have also suggested Eigen C++ as an alternative which would probably offer similar benefits.

See issues:

[ ] #83
[ ] #88
[x] #77

Parallelisation

When we tried to benchmark the software on a four-core laptop running Linux, CPU utilisation didn’t rise above 25%, which suggested that only one CPU core was in use. Further investigation revealed that the source of the problem was that while parallel versions of STL algorithms were being used (specifically std::for_each) these were not actually running in parallel, because, unfortunately, with gcc in order to get the parallel algorithms to work properly you have to link your program against the TBB library. (This issue is worth fixing in of itself but you definitely want to make sure that you are getting proper parallelisation when running the models on HPC.)

After fixing this bug, CPU utilisation rose to ~30%, which is still pretty low; generally, for an HPC use case you expect to have close to 100% utilisation (I suspect, maybe wrongly, that your problem is a CPU-bound one). This suggests that CPU cores were left waiting around, possibly waiting for IO, but also possibly because they were sitting idle waiting for std::mutexes to become unlocked. It is important to remember that calling lock() on a std::mutex is very cheap when that mutex is unlocked, but expensive when it is locked, so having many CPUs trying to access a locked mutex will be inefficient. Also note that lock contention becomes more of a problem the more threads you have, so this is particularly a problem for running your code on HPC clusters with many CPUs.

The only way of verifying whether or not this is actually the source of the problem is by running benchmarks, but hopefully this is a good starting point.

See issues:

[x] #78
[ ] #80

Resource leaks and debugging

On Windows, we ran some external tools to identify any programming errors (memory related or otherwise).

Deleaker is a standalone application that helps analyse programming errors that are unique to C++ which lead to resource leaks. Happily, Deleaker reported no leaks with the Health-GPS codebase. Interestingly, there were several leaks detected from some external modules which Deleaker excludes from its reporting by default. These are heap memory leaks specific to my system, but also some known leaks from external modules.

Dr. Memory is a memory monitoring tool capable of identifying memory-related programming errors. Dr. Memory takes quite a long time to run, even on the small example data set in the repository. The resulting output is quite extensive and would take more time to fully investigate its findings, but ultimately it identified 33 unique errors. Information for how to configure compiler-specific parameters for Dr. Memory can be found here: https://drmemory.org/page_prep.html.

Pull requests

We also opened a couple of pull requests:

[x] #85
[x] #86

Thanks for your support in using your software and good luck with it!

Yours, Ryan and Alex (@TinyMarsh and @alexdewar)

imperialCHEPI / healthgps