Closed: chryswoods closed this issue 4 years ago
Have confirmed that I don't see a deadlock on OS X when running with 1 process and 12 threads, or with 2 processes and 12 threads each. Next to check Ubuntu...
Took a while to install Ubuntu in a VM and get everything ready. Can now run MetaWards. Will debug this tomorrow when my brain is sufficiently fresh for parallel debugging ;-)
OK, I can reproduce this on Ubuntu 20. The code deadlocks when multiple demographics are used and several multiprocessing jobs each run with more than one thread. I haven't seen this on any other OS. A minimal reproduction is to take the "redblue.json" demographics file from tests/data/redblue.json and run:
metawards -d ncov -D redblue.json -a ExtraSeedsLondon --nsteps 20 --repeats 2 --nprocs 2 --nthreads 2
This deadlocks, while the same command with --nthreads 1 or with --nprocs 1 works without issue. Equally, keeping --nprocs 2 --nthreads 2 but removing -D redblue.json works without issue. This narrows the error down to the worker code that specialises the demographics.
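The four cases above can be checked in one go with a small driver script. This is only a sketch: the flags are the ones quoted in this thread, while the helper names (`build_command`, `hangs`, `run_matrix`) and the 300 second timeout are hypothetical choices, and it assumes `metawards` is on the PATH and `redblue.json` is in the current directory.

```python
import subprocess


def build_command(nprocs, nthreads, demographics=None):
    """Build the metawards command line quoted in this thread."""
    cmd = ["metawards", "-d", "ncov"]
    if demographics is not None:
        cmd += ["-D", demographics]
    cmd += ["-a", "ExtraSeedsLondon", "--nsteps", "20", "--repeats", "2",
            "--nprocs", str(nprocs), "--nthreads", str(nthreads)]
    return cmd


def hangs(cmd, timeout=300):
    """Return True if cmd has not finished after `timeout` seconds.

    The timeout is an assumed upper bound for a healthy run, not a
    measured value. subprocess.run kills the direct child on timeout.
    """
    try:
        subprocess.run(cmd, timeout=timeout,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    except subprocess.TimeoutExpired:
        return True
    return False


# (nprocs, nthreads, demographics) for the four cases described above
CASES = [
    (2, 2, "redblue.json"),  # reported to deadlock on Linux
    (2, 1, "redblue.json"),  # reported to work
    (1, 2, "redblue.json"),  # reported to work
    (2, 2, None),            # reported to work
]


def run_matrix():
    """Run every case and report which ones time out."""
    for nprocs, nthreads, demographics in CASES:
        cmd = build_command(nprocs, nthreads, demographics)
        status = "HANG" if hangs(cmd) else "finished"
        print(" ".join(cmd), "->", status)
```

Calling `run_matrix()` prints one line per case. Note that on a timeout only the top-level process is killed, so stuck worker processes may need cleaning up by hand.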
Confirmed this is a general Linux issue, and not limited to Ubuntu 20
Fixing in https://github.com/metawards/MetaWards/tree/fix-issue-117. Now pinned down to somewhere in demographics.specialise when it is run from the prepare_worker function in _worker.py
This looks like a known multiprocessing + OpenMP bug when using libgomp on Linux. The issue is that if OpenMP is called before multiprocessing forks, the fork messes up the OpenMP variables used to control the thread pool. I think I've only seen this with the demographics code because specialising is the first place where OpenMP is used before forking across runs...
https://github.com/pytorch/pytorch/issues/17199 https://bugs.python.org/issue8713
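The mechanism behind this class of bug is that fork() copies only the calling thread into the child, while any thread-pool bookkeeping held in memory is copied as-is. The sketch below illustrates this with ordinary Python threads rather than OpenMP threads, so it is an analogy and not a reproduction of the libgomp problem. The function name `threads_after_fork` is hypothetical, and `os.fork` means it only runs on Unix-like systems.

```python
import os
import threading


def threads_after_fork():
    """Return (thread count in parent, thread count in forked child)."""
    stop = threading.Event()
    # A background thread standing in for a pre-existing worker pool
    worker = threading.Thread(target=stop.wait, daemon=True)
    worker.start()
    parent_count = threading.active_count()

    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: only the thread that called fork() exists here
        os.close(read_fd)
        os.write(write_fd, str(threading.active_count()).encode())
        os._exit(0)

    # Parent: collect the child's answer, then clean up
    os.close(write_fd)
    child_count = int(os.read(read_fd, 16).decode())
    os.close(read_fd)
    os.waitpid(pid, 0)
    stop.set()
    worker.join()
    return parent_count, child_count


if __name__ == "__main__":
    parent, child = threads_after_fork()
    print(f"threads in parent: {parent}, threads in child: {child}")
```

The child sees a single thread even though the parent had at least two. A runtime whose state still says "my worker threads exist" can then wait forever on threads that were never copied. Common ways around this are to avoid using OpenMP before the fork, or to start workers with a non-fork start method such as `multiprocessing.get_context("spawn")`; this is not necessarily the fix that was applied here.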
Applied a fix that works on Linux. Now need to make sure that this hasn't broken Mac or Windows...
Can confirm that this fixes the problem on Linux without breaking Windows or OS X. Closing, as this is now merged into devel and will be part of the next release (1.1.1 or 1.2.0 depending on what comes up)
Describe the bug: TJ's found a deadlock when running on Ubuntu 20.04
Additional context: Problem discussed via email with chryswoods. He will try to reproduce locally in an Ubuntu VM and will update progress here. If the fix is quick, then this justifies a 1.1.1 release.