mrc-ide / covid-sim

This is the COVID-19 CovidSim microsimulation model developed by the MRC Centre for Global Infectious Disease Analysis hosted at Imperial College, London.
GNU General Public License v3.0

The codebase is deterministic. #161

Open tomgreen66 opened 4 years ago

tomgreen66 commented 4 years ago

I was interested to see how parallelism was performed in the code, so I ran it through Intel Inspector using the Intel Compiler. A few things are highlighted as race conditions, mainly in SetupModel. For example:

Read a value from here https://github.com/mrc-ide/covid-sim/blob/60da1ecff99001758499292b4751e4fdbb92cfd2/src/SetupModel.cpp#L2222

And decrement it here: https://github.com/mrc-ide/covid-sim/blob/60da1ecff99001758499292b4751e4fdbb92cfd2/src/SetupModel.cpp#L2248

ct seems to be private in the OpenMP loop, but since it's a pointer, I believe the pointer itself is protected between threads while the data it points to could be the same. Is that correct? I have a few more issues highlighted, but wanted to check that I understand this one correctly first.

weshinsley commented 4 years ago

So, as you'll have seen, the parallelism is all done with very straightforward OpenMP sections, which most C/C++ compilers (MS, Intel, G++) will handle.

The crucial thing about 2248 is that it's inside a `#pragma omp critical` section, which is therefore thread-safe, but not thread-predictable: the results can differ based on which thread gets there first. So this is a race condition, but a very obvious one that we're content with, because:

1. It is thread-safe - it's not a collision.
2. We do want this code to be parallelised for performance.
3. We don't really mind which thread gets there first; the differences caused by this will sometimes cause no difference, or sometimes a very slightly different arrangement of people to workplaces when the population is initialised. The simulation is stochastic anyway (i.e., we generate many different realisations using different random seeds), and the difference caused here by the race merely produces a different, equally valid, realisation.
4. This only occurs when initialising the synthetic population network; after that is done, subsequent runs can use the cached network file, and are deterministic from then on (for a given random seed).
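The pattern being discussed can be illustrated with a toy sketch (this is not the actual covid-sim code; the struct and function names are made up for illustration). Each thread holds a *private* pointer `ct`, but several private pointers may alias the same shared cell; the write is inside a critical section, so it is thread-safe, but the order of decrements depends on thread scheduling:

```cpp
#include <vector>

// Toy model of the SetupModel.cpp pattern: a shared cell with a
// susceptible count, decremented by threads that each hold their own
// (private) pointer to it.
struct Cell { int S; };

// Try to assign one person per loop iteration from the cell each index
// maps to.  The read of ct->S is unprotected (the reported race); the
// decrement is inside a critical section, so it is safe but its order
// across threads is scheduling-dependent.
int assign_people(std::vector<Cell>& cells, const std::vector<int>& cell_of_place)
{
    int assigned = 0;
#pragma omp parallel for reduction(+:assigned)
    for (int i = 0; i < (int)cell_of_place.size(); i++) {
        Cell* ct = &cells[cell_of_place[i]];   // private pointer, shared target
        int m = ct->S;                         // unprotected read
        if (m > 0) {
#pragma omp critical
            {
                // re-check inside the critical section, since m may be stale
                if (ct->S > 0) { ct->S--; assigned++; }
            }
        }
    }
    return assigned;
}
```

Whichever thread enters the critical section first "wins" that person, so the assignment differs run to run, but the total count is conserved.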

For Intel Inspector, you need to run on a (very) large memory machine so that you can use the "Locate Deadlocks and Data Races" option with Scope "Extremely thorough". In particular, this turns off the default 4-byte variable size and uses 1-byte granularity instead. If you don't do that, you get many false positives (e.g., when a pair of 2-byte shorts next to each other in memory are safely altered by different threads, but would appear to be a problem if you only considered 4 bytes at a time).

Those settings will locate the three examples of the data race above - all within safe critical sections in that part of SetupModel.cpp - and nothing else. It will take several hours and several hundred gigabytes of RAM, but that buys an extremely rigorous survey of the code.

weshinsley commented 4 years ago

(Look for the text "Memory access byte granularity: 1 byte" in the "Details" section of text, in the Intel Inspector analysis)

cjgdev commented 4 years ago

Initially this looked a bit peculiar to me too, but I think I'm with you now. Considering how minimal the effort level is to do it, and the fact that it will not impact performance in a meaningful way, is it not worth just wrapping the reads of ct->S in a critical section the same way that the writes are, if for no other reason than to stop the tools from emitting false positives?

weshinsley commented 4 years ago

@tomgreen66 - yes, you're right - ct itself is thread-local on 2204, but different places (schools/workplaces) can have people that live in the same cell (2218-2221); hence the thread-local ct on different threads can point to the same thing in memory, and ct->S and ct->susceptible can be the same memory on different threads. The critical sections surrounding writes to these keep things safe, but not deterministic in terms of order. But that's not a problem.

@cjgdev - The races are genuine, not false positives; the warnings and races still happen with your patch, as thread entrances to the critical sections at 2225 and 2240 will still interleave with the change to ct->S on 2249. So there's no real benefit in making the reads critical, unless the reads and writes were in the same critical section - which would end up making this section basically single-threaded.
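The interleaving point can be sketched in a toy function (hypothetical name, not covid-sim code): even with the read in its *own* critical section, another thread can update the shared value in the gap between the read and the write, so the read-then-act sequence is still racy. Only putting read and write in one critical section removes the race - at which point the loop serialises.

```cpp
// Attempt to "take" one unit from a shared counter.  Both the read and
// the write are individually protected, but the snapshot m is already
// stale the moment the first critical section is left, so the outcome
// still depends on thread timing.  The re-check inside the second
// critical section is what keeps the counter from going negative.
int racy_take(int* shared)
{
    int took = 0, m;
#pragma omp critical
    { m = *shared; }                 // protected read... but now stale
    if (m > 0) {
        // <-- another thread may decrement *shared right here
#pragma omp critical
        { if (*shared > 0) { (*shared)--; took = 1; } }
    }
    return took;
}
```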

Fundamentally, this is just the nature of this part of the algorithm; we are looping through places, with threads grabbing people from cells, to consider assigning to those places.

cjgdev commented 4 years ago

Ok yep, that makes sense. I agree that making the whole section single-threaded would not be a step in the right direction.

weshinsley commented 4 years ago

Thanks, both, for looking at this, and the constructive comments by the way.

cjgdev commented 4 years ago

Happy to help, like @tomgreen66 I'm still ramping up on the code, and more eyes looking doesn't always find new things, but this discussion has been useful.

tomgreen66 commented 4 years ago

Thanks for the detailed discussion. I am a bit of a purist when it comes to OpenMP and like to have at least an option for repeatable results when using multiple threads - otherwise, how would you know whether a code change elsewhere impacts multi-threaded results? In a previous job, OpenMP code was even reproducible between single- and multi-threaded runs by avoiding reductions on floating point. Maybe a cppdef could be used to highlight areas of code that are not repeatable with multiple threads and give the option to remove OpenMP for those parts, so that at least multi-threaded tests could be run and checked against known results.

My limited understanding of stochastic processes is that they should still abide by some given laws/rules, and using thread races to add "noise" is not ideal.

For background: I was looking at setting up some Intel Inspector training for our university HPC service, spotted that OpenMP was used in this code, and like to learn the approaches researchers take, so I thought it would be a relevant example for users.

weshinsley commented 4 years ago

A couple of thoughts:

Not all genres of algorithm can be parallelised while remaining deterministic. The algorithm we are talking about here in the covid-sim is in the "optimisation" genre of algorithm, where the order of steps contributes to the uniqueness of the answer in a way that cannot be factored out or avoided. (Unlike simple floating-point reductions, where a linear step of sorting the array of numbers before summing them will do the trick - at some cost). For algorithms where deterministic parallelisation is not possible, it's common to validate determinism only in single-threaded operation - which is what we do with our regression tests on each commit.
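The sort-before-summing trick mentioned above can be sketched briefly (hypothetical function name). Floating-point addition is not associative, so a parallel reduction can give run-to-run differences depending on how partial sums are combined; sorting first fixes the summation order, making the result reproducible at the cost of a serial sort:

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Reproducible floating-point sum: sort to fix the order of additions,
// then accumulate serially.  Any two calls with the same multiset of
// values produce bit-identical results, regardless of input order.
double deterministic_sum(std::vector<double> v)
{
    std::sort(v.begin(), v.end());                    // fixed summation order
    return std::accumulate(v.begin(), v.end(), 0.0);  // serial sum
}
```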

Multi-threaded testing is a harder problem, and Intel Inspector - when properly configured - does the job very well, so I can happily recommend it for the training you have in mind. We do a fresh exhaustive test on covid-sim whenever changes are pushed that concern threaded sections. Proper configuration, unfortunately, requires maximum resources and overheads, and anything less than that tends to return a lot of false positives - e.g., reported memory leaks that are demonstrably not there in hello-world type examples. Usually, though, true errors will be detected alongside the false reports, so if you know what to look for, the cheaper analyses of Inspector are still useful.

We could, as you suggest, put a #ifdef around that loop and allow a configuration where a deterministic network can be created, while keeping the rest of the code parallel, but I don't think in practice it would ever get used. That part of the code is only run very rarely to create a first network, and then all runs after save time by loading the cached network. It's important to state that we are not using a thread race to "add noise" - the resulting networks from a single or multi-threaded run are all equally statistically valid, and there are plenty of laws/rules concerning how people are assigned to places. It is appealing to the purist (I include myself as one!) to have everything totally deterministic throughout, but having it take minutes instead of hours to generate different, but statistically equivalent outcomes is an important consideration to weigh up.
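The `#ifdef` idea discussed above might look something like this (the macro and function names here are hypothetical, just to show the shape): compiling with `-DDETERMINISTIC_SETUP` would drop the parallel pragma on the network-initialisation loop only, leaving the rest of the code threaded.

```cpp
// Sketch of an opt-in deterministic build of the setup loop.  With
// DETERMINISTIC_SETUP defined, the pragma is compiled out and the loop
// runs serially (and hence reproducibly); otherwise it is parallel.
int build_network(int n_places)
{
    int assigned = 0;
#ifndef DETERMINISTIC_SETUP
#pragma omp parallel for reduction(+:assigned)
#endif
    for (int tp = 0; tp < n_places; tp++) {
        assigned++;   // stand-in for the real people-to-places assignment
    }
    return assigned;
}
```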

Finally, it's important to remember that stochastic runs use RNGs; and parallel stochastic runs will have an RNG per thread. So determinism will only hold when comparing runs generated with the same number of threads.
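One common way to set up an RNG per thread is to give each thread its own independently seeded stream (this is just an illustrative sketch using the standard library, not the model's actual RNG). Each thread then draws only from its own stream, so results reproduce for a fixed seed *and* thread count, but changing the thread count changes which stream serves which iteration:

```cpp
#include <random>
#include <vector>

// Build one RNG per thread, each with a distinct seed derived from a
// base seed.  Thread t always uses rngs[t], so for a fixed base seed
// and thread count, every thread's stream of draws is reproducible.
std::vector<std::mt19937> make_thread_rngs(unsigned base_seed, int n_threads)
{
    std::vector<std::mt19937> rngs;
    for (int t = 0; t < n_threads; t++)
        rngs.emplace_back(base_seed + t);   // distinct stream per thread
    return rngs;
}
```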

tomgreen66 commented 4 years ago

This is where knowing the code helps, so that the impact of this can be ignored. I will experiment more, but if multiple threads do not give repeatable results, then I feel a regression test is missing for code changes that occur within OpenMP blocks. There is also another one, which I assume can be ignored, at: Write https://github.com/mrc-ide/covid-sim/blob/60da1ecff99001758499292b4751e4fdbb92cfd2/src/SetupModel.cpp#L2249 and Read https://github.com/mrc-ide/covid-sim/blob/60da1ecff99001758499292b4751e4fdbb92cfd2/src/SetupModel.cpp#L2226

I also think the critical sections could possibly be made atomic to speed things up a bit, since the sections are just incrementing values. If the code isn't used much, then it's not a big deal.

It's great to see the code, and I find it a real learning experience to read the code and the approaches taken. I don't get to see much OpenMP code, so it's good to stretch that part of my brain. Thanks for the discussion.

weshinsley commented 4 years ago

(EDIT on 11th May): This is a key comment on the issue, listing the single area where thread races can cause different but equivalent stochastic instances of the initialisation network. Single-threading the loop surrounding these lines, or using a previous initialisation file, is enough to ensure complete determinism (single- or multi-threaded, for a given thread count, seed and architecture) throughout the rest of the codebase.

Line numbers refer to https://github.com/mrc-ide/covid-sim/commit/60da1ecff99001758499292b4751e4fdbb92cfd2 (END EDIT)

Yep, that's basically the same race... there are three in all, with most order permutations occurring in a 32-core run:

1. Lines 2222, 2240 and 2248, regarding ct->S
2. Lines 2226, 2230, 2249 and 2258, regarding ct->susceptible and sometimes Places[tp]
3. Line 2251, regarding Places[tp]

`#pragma omp atomic` needs a bit of careful thought, as it only works with very particular statements. 2171, 2248 and 2251 are potential candidates - but I need to check the semantics of the atomic very carefully.

It would be trivial if this were a thread-shared variable being incremented, but I'm not 100% sure whether this will be ok on thread-local variables that point to the same memory from different threads. My gut feeling is that critical will be safer - but I will make a note of this and look into it further; as you say, it might give a little performance improvement.
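For what it's worth, a minimal sketch of the atomic alternative (hypothetical function, not covid-sim code): `omp atomic` does protect a simple update made through a pointer, because atomicity applies to the memory location being updated, not to the variable name holding the address. It only covers single read-modify-write statements, though, so the multi-step check-then-decrement logic in SetupModel.cpp would still need `critical`:

```cpp
// Decrement a shared int n times, with each iteration going through a
// thread-private pointer to the same location.  The atomic construct
// protects the update of the pointed-to int, so the decrements do not
// collide even though each thread has its own copy of p.
void decrement_all(int* shared, int n)
{
#pragma omp parallel for
    for (int i = 0; i < n; i++) {
        int* p = shared;   // thread-private pointer, shared target
#pragma omp atomic
        (*p)--;            // atomic read-modify-write of *p
    }
}
```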

weshinsley commented 4 years ago

I also agree with you on having multi-threaded regression tests; in practice that's quite hard to automate, since provoking the races requires a lot of time and resources. So for now that remains a manual step we do, rather than an automatic regression test.

tomgreen66 commented 4 years ago

Thanks. Just to update: I did some timings with 2 threads and OpenMP removed from that section of code, and I don't see a performance drop (it might actually improve, since data races can also harm performance). I also see no change in the regression-test output when run twice with the change, so it looks like the output is no longer sensitive to multiple threads where it was before. See code changes in omp_branch.

weshinsley commented 4 years ago

Yes - the scaling is certainly not linear over threads, and there will be a big overhead for small thread counts. With a much larger country and 32 cores (our usual config), the benefits in real time will stand out more. If I ever get time, I will benchmark this some more. But again, remember this is only done once...

tomgreen66 commented 4 years ago

Thanks for taking the time to look. It would be good to see the impact of removing OpenMP on the bigger models.
I still need to be convinced that the data race provides an equally valid realisation. I might have the ranges wrong here, but will try to explain my reasoning: m is calculated as a random value between 0 and ct->S-1 (assuming ranf has range [0,1)); then a variable s is calculated with a possibly different value of ct->S, and that value of s is used to decide whether to decrement another possibly different ct->S and store it in l, followed by a check that m is less than l. If any assumptions or relationships such as:

Feynstein commented 4 years ago

The way we usually handle problems with thread repeatability is implementing a thread-safe parallel `for` statement - something that can easily be used across the whole code for safe thread access, using various multi-threading-safe methods that I'm not very familiar with, but I know it can be done. That way you normalise the use of multi-threading, and everything neatly does its job. https://docs.microsoft.com/en-us/dotnet/standard/parallel-programming/potential-pitfalls-in-data-and-task-parallelism

chanhosuh commented 4 years ago

This is a very useful discussion. Might I suggest pinning it at the top for more visibility? (GitHub allows three pinned issues). The non-determinism criticisms have gotten a lot of public attention.

weshinsley commented 4 years ago

Good suggestion, thank you @chanhosuh, I've done so.

weshinsley commented 4 years ago

Due to likely attention on this thread, I am locking it to keep it a tidy reference.

weshinsley commented 4 years ago

FURTHER UPDATE: Since misunderstanding and misuse of this issue continues, PR #272 makes the above section single-threaded, causing no significant change to results (just a different stochastic instance of the network at first initialisation), but determinism throughout the entire codebase.

weshinsley commented 4 years ago

Apologies for renaming this thread again, but hope the contributors to this really helpful discussion will understand why.