metawards / MetaWards

MetaWards disease metapopulation analysis and modelling software. Professional geographical SIR model with a flexible plugin architecture to support complex scenario modelling
https://metawards.org
GNU General Public License v3.0

[BUG] - Deadlock in OpenMP on Ubuntu 20.04 #117

Closed chryswoods closed 4 years ago

chryswoods commented 4 years ago

Describe the bug

TJ has found a deadlock when running on Ubuntu 20.04.

To Reproduce

Steps to reproduce the behavior:

  1. Run metawards using the input files he has provided, with nthreads > 1 and nprocs > 1
  2. MetaWards will hang during the model runs, with processor usage dropping to 100%

Expected behavior

Expect MetaWards to complete the run and not hang ;-)

Additional context

Problem discussed via email with chryswoods. He will try to reproduce it locally in an Ubuntu VM and will update progress here. If the fix is quick, then this justifies a 1.1.1 release.

chryswoods commented 4 years ago

Have confirmed that I don't see a deadlock on OS X when running with 1 process and 12 threads, or with 2 processes of 12 threads each. Next to check Ubuntu...

chryswoods commented 4 years ago

Took a while to install Ubuntu in a VM and get everything ready. Can now run MetaWards. Will debug this tomorrow when my brain is sufficiently fresh for parallel debugging ;-)

chryswoods commented 4 years ago

Ok, I can reproduce this on Ubuntu 20. The code deadlocks if multiple demographics are used when running multiple multiprocessing jobs with more than one thread per run. I haven't seen this on any other OS. A minimal example that reproduces it is to use the "redblue.json" demographics file from tests/data/redblue.json and then run:

metawards -d ncov -D redblue.json -a ExtraSeedsLondon --nsteps 20 --repeats 2 --nprocs 2 --nthreads 2 

This will deadlock, whereas using --nthreads 1 or using --nprocs 1 works without issue. Equally, using --nprocs 2 --nthreads 2 but removing the -D redblue.json works without issue. This pinpoints the error to the worker code that specialises the demographics.
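The four combinations above can be checked mechanically by running each one under a timeout and treating a timeout as a hang. This is only a sketch: the helper names `build_command`, `hangs` and `check_matrix` are made up for illustration, the 300 second limit is an arbitrary assumption, and only the command-line flags are taken from the command above.

```python
import subprocess


def build_command(nprocs, nthreads, demographics="redblue.json"):
    """Build the metawards command line used in the minimal example."""
    cmd = ["metawards", "-d", "ncov", "-a", "ExtraSeedsLondon",
           "--nsteps", "20", "--repeats", "2",
           "--nprocs", str(nprocs), "--nthreads", str(nthreads)]
    if demographics is not None:
        # The deadlock was only seen when a demographics file is supplied
        cmd += ["-D", demographics]
    return cmd


def hangs(cmd, timeout):
    """Return True if cmd is still running after `timeout` seconds."""
    try:
        subprocess.run(cmd, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL,
                       timeout=timeout, check=False)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child it started before re-raising
        return True
    return False


def check_matrix(timeout=300):
    """Run the four cases described above and report which ones hang."""
    cases = [(2, 2, "redblue.json"),  # reported to deadlock on Linux
             (2, 1, "redblue.json"),  # reported to work
             (1, 2, "redblue.json"),  # reported to work
             (2, 2, None)]            # reported to work
    return {case: hangs(build_command(*case), timeout) for case in cases}
```

Calling `check_matrix()` from the directory containing redblue.json returns a dictionary mapping each case to True (timed out) or False (finished). A timeout only suggests a hang, so the limit needs to be comfortably longer than a normal run, and any worker processes left behind by a hung run may need to be killed by hand.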

chryswoods commented 4 years ago

Confirmed that this is a general Linux issue, and not limited to Ubuntu 20.

chryswoods commented 4 years ago

Fixing in https://github.com/metawards/MetaWards/tree/fix-issue-117. Have now pinned this down to somewhere in demographics.specialise when it is called from _worker.py's prepare_worker function.

chryswoods commented 4 years ago

This is looking like a known multiprocessing + OpenMP bug when using libgomp on Linux. The issue is that if OpenMP is called before multiprocessing forks, then the fork will mess up the OpenMP variables used to control the thread pool. I think I've only seen this when using the demographics code because specialising is the first time that OpenMP is used before the forking across runs...

https://github.com/pytorch/pytorch/issues/17199
https://bugs.python.org/issue8713
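One common way to sidestep this class of problem is to start workers with the "spawn" method, so each worker is a fresh interpreter and does not inherit OpenMP state through fork(). The sketch below illustrates that workaround only; it is not necessarily the fix applied in the fix-issue-117 branch, and `make_context`, `run_model` and `run_repeats` are hypothetical names.

```python
import multiprocessing as mp


def make_context(method="spawn"):
    """Return a multiprocessing context that does not rely on a bare fork().

    With "spawn", each worker is a new Python process, so it cannot inherit
    a thread pool that OpenMP created in the parent.
    """
    return mp.get_context(method)


def run_model(seed):
    """Stand-in for one model run (hypothetical placeholder)."""
    return seed * 2


def run_repeats(seeds, nprocs):
    """Map the model runs over a pool created from the spawn context."""
    ctx = make_context()
    with ctx.Pool(processes=nprocs) as pool:
        return pool.map(run_model, seeds)
```

When using "spawn", the call to `run_repeats` has to sit behind an `if __name__ == "__main__":` guard in the launching script, because each worker re-imports the main module. Spawned workers are also slower to start than forked ones, which is the usual trade-off of this approach.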

chryswoods commented 4 years ago

Applied a fix that works on Linux. Now need to make sure that this hasn't broken Mac or Windows...

chryswoods commented 4 years ago

Can confirm that this fixes the problem on Linux without breaking Windows or OS X. Closing, as this is now merged into devel and will be part of the next release (1.1.1 or 1.2.0 depending on what comes up).