Open samnemo opened 7 months ago
This is the MPI version reported by mpiexec --version: mpiexec (OpenRTE) 4.0.2
In optuna_parallel.py, the nrniv jobs seemed to be going to sleep or getting suspended. Putting a quit() in the right place seemed to allow the nrniv processes started by the later mpiexec call to start properly:
jobString = f"""#!/bin/bash
echo '{paramLabels}'
echo '{candidate}'
nrniv -python -c 'from neuron import h;soma = h.Section(name="soma");h.psection();quit()'
echo $?
mpiexec -n 48 nrniv -python -c 'from neuron import h;h.nrnmpi_init();pc=h.ParallelContext();print(pc.id())'
echo $?
{command}
"""
Describe the bug
When I run an Optuna batch optimization with the A1 model, mpiexec has trouble running the nrniv processes for the simulation. NetPyNE does not check the return code of the subprocess Popen call, so it waits indefinitely for output that is never produced. It seems that the nrniv processes might get started but are put to sleep immediately.
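To illustrate the missing check, here is a minimal sketch of launching a job with a return-code check and a timeout, so that a stalled or failed job raises an error instead of blocking forever. It is not NetPyNE's code: `run_job` and `timeout_s` are hypothetical names chosen for this example, and only standard-library `subprocess` calls are used.

```python
import subprocess


def run_job(cmd, timeout_s):
    """Run cmd (a list of arguments) and return (returncode, stdout, stderr).

    Hypothetical helper, not part of NetPyNE. Raises RuntimeError if the
    job does not finish within timeout_s seconds.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,  # decode output as str (Python 3.7+)
    )
    try:
        # communicate() waits for the process and drains both pipes
        out, err = proc.communicate(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The job stalled: kill it and collect whatever it printed
        proc.kill()
        out, err = proc.communicate()
        raise RuntimeError(f"job timed out after {timeout_s}s: {cmd}")
    return proc.returncode, out, err
```

A caller could then treat any nonzero return code as a failed trial, for example `rc, out, err = run_job(["bash", "job.sh"], 3600)` followed by a check on `rc`. The script name and timeout here are placeholders.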
This is using conda on Ubuntu with the following relevant packages:
python: Python 3.7.6 (default, Jan 8 2020, 19:59:22) [GCC 7.3.0] :: Anaconda, Inc. on linux
Reproducing the bug
Steps to reproduce the behavior:
1. Go to the A1 repo/branch here: https://github.com/NathanKlineInstitute/A1/tree/samn
2. Run python batch.py
Expected behavior
I expected mpiexec to start the nrniv processes properly, but nrniv fails to start. Running the mpiexec command directly runs the simulations properly, but when the same command is launched through batch.py (a NetPyNE batch with Optuna), nrniv does not start properly.
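One way to confirm whether the nrniv processes are sleeping or actually stopped is to read their state from /proc on Linux. The sketch below is a diagnostic aid only, not a fix; `parse_proc_state` and `process_state` are hypothetical helper names, and the approach assumes a Linux /proc filesystem like the Ubuntu system described above.

```python
# Common single-letter process states reported in /proc/<pid>/stat
STATE_NAMES = {
    "R": "running",
    "S": "sleeping (interruptible)",
    "D": "uninterruptible sleep",
    "T": "stopped by a signal",
    "Z": "zombie",
}


def parse_proc_state(stat_line):
    """Return the state letter from the contents of /proc/<pid>/stat.

    The second field (the command name) is wrapped in parentheses and may
    itself contain spaces or ')', so split on the last ')' first.
    """
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def process_state(pid):
    """Read and parse the state of a live process (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_proc_state(f.read())
```

Calling `process_state(pid)` on each nrniv PID while the batch appears hung would show whether the processes are in state S (waiting on something) or T (stopped by a signal), which narrows down where to look next.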
System information
See above
Additional context
Check with samn or James C for more details on reproducing the bug