jonescompneurolab / hnn-core

Simulation and optimization of neural circuits for MEG/EEG source estimates
https://jonescompneurolab.github.io/hnn-core/
BSD 3-Clause "New" or "Revised" License

MPI timing out waiting for child process #774

Open darcywaller opened 1 month ago

darcywaller commented 1 month ago

Hi team, I'm encountering an issue where a simulation with MPIBackend gets stuck somewhere when I try to simulate dipoles with a Network I adapted (i.e., not one of the default networks in hnn-core). MPIBackend works fine in the same environment and Jupyter notebook with the example from the documentation, and simulating with the custom network also works fine until I try to use MPIBackend. Any advice on troubleshooting? I can't upload an example notebook here, but I can provide the full code as needed.
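For reference, the pattern I'm using follows the documentation example; a minimal sketch is below (`build_custom_net()` is just a stand-in for how I construct my adapted Network, not a real function):

```python
from hnn_core import MPIBackend, jones_2009_model, simulate_dipole

# This works: the default network from the docs, simulated under MPIBackend.
net_default = jones_2009_model()
with MPIBackend(n_procs=2, mpi_cmd='mpiexec'):
    dpls_default = simulate_dipole(net_default, tstop=300, n_trials=1)

# This hangs and then times out: my adapted Network under the same backend.
# build_custom_net() is a placeholder for my custom network construction.
net_custom = build_custom_net()
with MPIBackend(n_procs=2, mpi_cmd='mpiexec'):
    dpls_custom = simulate_dipole(net_custom, tstop=300, n_trials=2)
```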

Full text of the error message:

/oscar/home/ddiesbur/new_mpi_hnn_core/lib64/python3.9/site-packages/hnn_core/parallel_backends.py:195: UserWarning: Timeout exceeded while waiting for child process output. Terminating...
  warn("Timeout exceeded while waiting for child process output."

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[10], line 4
      2 with MPIBackend(n_procs=2, mpi_cmd='mpiexec'):
      3     print("Running simulation with loaded Failed stop parameters")
----> 4     FS_dpls_yesmpi = simulate_dipole(FS_net, tstop=300, n_trials=2)
      6 for dpl in FS_dpls_yesmpi:
      7     dpl.scale(125).smooth(30)

File /oscar/home/ddiesbur/new_mpi_hnn_core/lib64/python3.9/site-packages/hnn_core/dipole.py:100, in simulate_dipole(net, tstop, dt, n_trials, record_vsec, record_isec, postproc)
     95 if postproc:
     96     warnings.warn('The postproc-argument is deprecated and will be removed'
     97                   ' in a future release of hnn-core. Please define '
     98                   'smoothing and scaling explicitly using Dipole methods.',
     99                   DeprecationWarning)
--> 100 dpls = _BACKEND.simulate(net, tstop, dt, n_trials, postproc)
    102 return dpls

File /oscar/home/ddiesbur/new_mpi_hnn_core/lib64/python3.9/site-packages/hnn_core/parallel_backends.py:717, in MPIBackend.simulate(self, net, tstop, dt, n_trials, postproc)
    712 print(f"MPI will run {n_trials} trial(s) sequentially by "
    713       f"distributing network neurons over {self.n_procs} processes.")
    715 env = _get_mpi_env()
--> 717 self.proc, sim_data = run_subprocess(
    718     command=self.mpi_cmd, obj=[net, tstop, dt, n_trials], timeout=30,
    719     proc_queue=self.proc_queue, env=env, cwd=os.getcwd(),
    720     universal_newlines=True)
    722 dpls = _gather_trial_data(sim_data, net, n_trials, postproc)
    723 return dpls

File /oscar/home/ddiesbur/new_mpi_hnn_core/lib64/python3.9/site-packages/hnn_core/parallel_backends.py:233, in run_subprocess(command, obj, timeout, proc_queue, *args, **kwargs)
    229     warn("Could not kill python subprocess: PID %d" % proc.pid)
    231 if not proc.returncode == 0:
    232     # simulation failed with a numeric return code
--> 233     raise RuntimeError("MPI simulation failed. Return code: %d" %
    234                        proc.returncode)
    236 child_data = _process_child_data(proc_data_bytes, data_len)
    238 # clean up the queue

RuntimeError: MPI simulation failed. Return code: 143

rythorpe commented 1 month ago

I'm guessing something about your modified Network, or the simulated data it outputs, is significantly larger than for one of the default networks?

Try increasing the timeout in _get_data_from_child_err to 0.05. I've had to do this in the past after scaling up the size of the network.
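For context, that timeout governs how long the parent waits on each poll of the child's output before deciding the child has hung; a simplified illustration of the general pattern (not the actual hnn-core code) is below. Note that the return code 143 in your traceback is 128 + SIGTERM, i.e. the parent killed the child after giving up.

```python
# Simplified illustration of polling a child process's stderr file descriptor
# with a small select() timeout (not the actual hnn-core implementation).
# A larger per-poll timeout gives a big, slow simulation more time to emit
# output before the parent concludes the child is stuck and terminates it.
import os
import select

def poll_child_stderr(stderr_fd, poll_timeout=0.05):
    """Return any bytes the child wrote to stderr within poll_timeout seconds."""
    ready, _, _ = select.select([stderr_fd], [], [], poll_timeout)
    if ready:
        return os.read(stderr_fd, 4096)  # data arrived inside the window
    return b''  # nothing yet; the caller decides whether to keep waiting
```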

darcywaller commented 1 month ago

Ah, thanks for the recommendation @rythorpe. It's possible that's the case, since I get more spikes over a longer time period from the frontal ERP models. I tried 0.05 and then 0.1 for that timeout setting, but it's still generating the same error, albeit more slowly.

jasmainak commented 1 month ago

Just to be sure this is not a memory error, could you try on a computer with more RAM?

MPIBackend is notoriously difficult to debug ... could you add print statements to check how far the execution gets and at what point it fails?

darcywaller commented 1 month ago

OK, I have an update on this - thank you both for your recommendations. I've linked a code snippet (gist below) that, when run, reproduces the error I'm getting (at least in my MPI environment on OSCAR). @jasmainak, @rythorpe and I tested this a bit in person last week and determined the following:

After adding print statements to parallel_backends.py and mpi_child.py, I've determined that:

Any recommendations for determining what in the Network object file or Network-related communication in MPI is the problem here? https://gist.github.com/darcywaller/a08389cbae826144a19c87e89d1f3f2d

jasmainak commented 1 month ago

@darcywaller unfortunately I won't have time to dig into this.

But perhaps you might want to drill down further in _read_net to understand what is happening? How much of the data is received? Try with 1 core first ... just force the MPIBackend with 1 core to understand if that works ... then try with 2 cores to see if both cores get the net object. You can put if conditions with self.rank == 0, self.rank == 1 etc to test what is happening in specific cores.
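Something like the following (a sketch using mpi4py directly to illustrate the idea; inside mpi_child.py you would use the rank already stored on the simulation object rather than querying MPI yourself):

```python
# Rank-conditional debug prints: every MPI process runs the same code, so gating
# prints on the rank reveals which cores actually reach a given point and which
# ones received the net object. Illustration only, not the hnn-core code itself.
from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()

if rank == 0:
    print("[rank 0] entered _read_net")
elif rank == 1:
    print("[rank 1] entered _read_net")
```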

Under the hood, the net object is serialized (made into a string) using pickle protocol so it can be broadcast to the other cores and then unpickled in the receiving cores. Blake added a string before and after the object "@end_of_net" to recognize that the end of the serialized object ... and extract it using regular expression. Do you get the entire string including "@end_of_net" on the other end? Perhaps something in your new network is interfering with the regular expression from working correctly ... ? You can print out the serialized object etc. ...
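To make that concrete, the pattern is roughly the following. This is a simplified sketch: the "@end_of_net" marker matches the description above, but details such as the base64 step are illustrative assumptions rather than the exact hnn-core implementation.

```python
# Sketch of serialize-with-end-marker-and-extract: pickle the object, append a
# sentinel, and on the receiving end pull out everything before the sentinel with
# a regular expression before unpickling. Illustrative only, not hnn-core's code.
import base64
import pickle
import re

END_MARKER = '@end_of_net'

def serialize_net(net):
    """Pickle + encode an object as text and terminate it with the end marker."""
    return base64.b64encode(pickle.dumps(net)).decode('ascii') + END_MARKER

def deserialize_net(stream_text):
    """Extract everything before the end marker and unpickle it."""
    match = re.match(r'(.*)' + re.escape(END_MARKER), stream_text, re.DOTALL)
    if match is None:
        raise ValueError('end-of-net marker not found; transfer may be truncated')
    return pickle.loads(base64.b64decode(match.group(1)))

# Round trip: any picklable object should survive intact.
assert deserialize_net(serialize_net({'n_cells': 270})) == {'n_cells': 270}
```

If the marker never shows up on the child side, the transfer was cut off; if it does show up but unpickling fails, the problem is in the payload itself.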

jasmainak commented 1 month ago

Also, as a general comment, it would be helpful to have direct links to the code ... you can click on a line and then click "copy permalink" on GitHub, e.g., _read_net

gtdang commented 1 month ago

It seems like this is the same issue that's happening with our GitHub Linux runners for pytest. #780

The Ubuntu runners are stalling for about 6 hours before being canceled. The Mac runners are working fine. OSCAR uses Red Hat, so maybe there's something up with OpenMPI and Linux right now... I'll check whether there have been recent updates to any of our MPI dependencies.

rythorpe commented 1 month ago

I'll try to dig into this soon @darcywaller, but it might be a week or two before I can sit down to debug this properly.

@gtdang I suspect this is a different issue than the one you're referencing because the one @darcywaller encountered still times out. Happy to be wrong though....

darcywaller commented 1 month ago

@rythorpe No problem, totally understand. I'm on vacation till 6/4 anyway, but I'm happy to help by trying some of @jasmainak's new suggestions when I'm back.

darcywaller commented 3 weeks ago

Update - some more troubleshooting determined that MPI was having trouble unpickling the network, even though the pickle and its beginning and end markers were intact. @ntolley and @rythorpe suggested that the partial functions I was using in the network's cells might be causing the issue when they are defined in the notebook rather than within the hnn-core code itself, so I added the network as a default network on my own hnn-core branch. When importing that network from hnn-core instead, MPI now works, so presumably that was indeed the issue.
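For anyone who hits the same thing, here's a minimal illustration of the pickle behavior involved (standard-library behavior, nothing hnn-core-specific; `my_custom_weight` is a made-up stand-in for the kind of function I had defined in the notebook):

```python
# functools.partial objects pickle their wrapped function *by reference*, i.e. as
# "module.qualname", not by value. A function defined in a notebook cell lives in
# that notebook's __main__, so a separate interpreter (such as the mpiexec-launched
# child) has no way to import it, and unpickling fails there.
import pickle
from functools import partial

def my_custom_weight(t, tau=5.0):   # imagine this defined in a notebook cell
    return t / tau

blob = pickle.dumps(partial(my_custom_weight, tau=2.0))
print(b'my_custom_weight' in blob)  # True: only the name travels, not the code

# Unpickling `blob` in another process works only if that process can import the
# same name from the same module, which is why moving the function into an
# importable module (e.g. into hnn-core itself) fixed it.
```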

rythorpe commented 3 weeks ago

Oh nice, glad you got it working. I'm guessing there's a security feature in pickle that allows callables to be unpickled only if they originate within local source code and/or a submodule of the parent library. Perhaps the best fix for this bug would be to remove all callables from the cell templates. @jasmainak @ntolley any thoughts? Maybe we can tackle this sometime this week?