darcywaller opened 1 month ago
I'm guessing something about your modified Network, or the simulated data it's outputting, is significantly larger than for one of the default networks? Try increasing the timeout in `_get_data_from_child_err` to 0.05. I've had to do this in the past after scaling up the size of the network.
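For context, the kind of read loop such a timeout governs can be sketched as follows. This is a minimal POSIX sketch with illustrative names, not the actual `_get_data_from_child_err` implementation:

```python
import os
import select

# Hedged sketch: poll a file descriptor (standing in for the child's stderr)
# and stop once no new data arrives within `timeout` seconds.
def read_with_timeout(fd, timeout=0.05):
    """Read from `fd` until no data arrives within `timeout` seconds."""
    chunks = []
    while True:
        ready, _, _ = select.select([fd], [], [], timeout)
        if not ready:
            break  # nothing arrived within `timeout`: give up waiting
        data = os.read(fd, 4096)
        if not data:
            break  # EOF: the writer closed its end of the pipe
        chunks.append(data)
    return b''.join(chunks)

# Exercise the loop with a plain pipe standing in for the child process.
r, w = os.pipe()
os.write(w, b'spike data...')
os.close(w)
received = read_with_timeout(r)
os.close(r)
```

A larger network means more simulated data to stream back, so a too-small timeout can declare the child dead while it is still writing.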
Ah, thanks for the recommendation @rythorpe. It's possible that's the case, because I get more spikes, and over a longer time period, from the frontal ERP models. I tried 0.05 and then 0.1 for that timeout setting, but it's still generating the same error, albeit more slowly.
Just to be sure this is not a memory error, could you try on a computer with more RAM?
MPIBackend is notoriously difficult to debug ... could you add print statements to check how far the execution gets and at what point it fails?
OK, I have an update on this, thank you both for your recommendations. I've linked a code snippet that, when run, reproduces the error I'm getting (at least in my MPI environment on OSCAR): https://gist.github.com/darcywaller/a08389cbae826144a19c87e89d1f3f2d

@jasmainak, @rythorpe and I tested this a bit in person last week, and after adding print statements to `parallel_backends.py` and `mpi_child.py` we narrowed things down. Any recommendations for determining what in the Network object, or in the Network-related communication in MPI, is the problem here?
@darcywaller unfortunately I won't have time to dig into this.
But perhaps you might want to drill down further in `_read_net` to understand what is happening? How much of the data is received? Try with 1 core first ... just force the MPIBackend with 1 core to understand if that works ... then try with 2 cores to see if both cores get the `net` object. You can put `if` conditions with `self.rank == 0`, `self.rank == 1`, etc., to test what is happening on specific cores.
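The per-rank checks suggested above follow a standard MPI debugging pattern, sketched below. This is a hedged, standalone sketch: it uses `mpi4py` only if it is installed, and outside an `mpiexec` launch a single process simply reports rank 0.

```python
# Hedged sketch of rank-conditional debug prints. Under a real MPI launch
# (e.g. `mpiexec -n 2 python script.py`) each process sees its own rank;
# run as a plain script, there is one process with rank 0.
try:
    from mpi4py import MPI
    rank = MPI.COMM_WORLD.Get_rank()
except ImportError:
    rank = 0  # no mpi4py available: behave like a single-process run

payload = b'...serialized net...'  # placeholder for the broadcast data
if rank == 0:
    print(f'[rank 0] got {len(payload)} bytes')
if rank == 1:
    print(f'[rank 1] got {len(payload)} bytes')
```

Guarding prints by rank keeps the output of each core separate, so you can see exactly which core stops receiving data.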
Under the hood, the `net` object is serialized (made into a string) using the pickle protocol so it can be broadcast to the other cores, and then unpickled on the receiving cores. Blake added a marker string, "@end_of_net", after the object to mark the end of the serialized object ... and it is extracted using a regular expression. Do you get the entire string, including "@end_of_net", on the other end? Perhaps something in your new network is interfering with the regular expression working correctly ...? You can print out the serialized object, etc. ...
Also, as a general comment, it would be helpful to have direct links to the code ... you can click on a line and then click "copy permalink" on GitHub, e.g., for `_read_net`.
It seems like this is the same issue that is happening with our GitHub Linux runners for pytest (#780). The Ubuntu runners are stalling for about 6 hours before being canceled, while the Mac runners are working fine. Oscar uses Red Hat, so maybe there's something up with OpenMPI and Linux right now... I'll check if there have been recent updates to any of our MPI dependencies.
I'll try to dig into this soon @darcywaller, but it might be a week or two before I can sit down to debug this properly.
@gtdang I suspect this is a different issue than the one you're referencing because the one @darcywaller encountered still times out. Happy to be wrong though....
@rythorpe No problem, totally understand. I'm on vacation till 6/4 anyway, but am happy to help by starting to try some of @jasmainak's new suggestions when back.
Update: some more troubleshooting determined that MPI was having trouble unpickling the network, even though the pickle and its beginning and end markers were intact. @ntolley and @rythorpe suggested that the partial functions I was using in the network cells might be causing this issue when defined in the notebook rather than within the hnn-core code, so I added the network as a default network on my own hnn-core branch. When importing that network from hnn-core instead, MPI now works, so presumably that was indeed the issue.
Oh nice, glad you got it working. I'm guessing there's a security feature in `pickle` that allows callables to be unpickled only if they originate within local source code and/or a submodule of the parent library. Perhaps the best fix for this bug would be to remove all callables from cell templates. @jasmainak @ntolley any thoughts? Maybe we can tackle this this week?
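For what it's worth, this matches how `pickle` handles callables in general: functions are serialized by reference (module plus qualified name), not by code, so the receiving process must be able to import them. A callable defined ad hoc in a notebook lives in `__main__`, which a freshly spawned MPI child cannot import. A minimal sketch of both behaviors (illustrative names, assuming this runs as a script):

```python
import pickle
from functools import partial

def scale(x, factor):
    # module-level function: pickled by reference, not by code
    return x * factor

# A partial over an importable function round-trips within one process...
p = pickle.loads(pickle.dumps(partial(scale, factor=2.0)))
assert p(3.0) == 6.0

# ...but a callable that can't be looked up by name (e.g. a lambda)
# fails at pickling time. A notebook-defined function pickles, but its
# reference to __main__ breaks when unpickled in a separate MPI child.
try:
    pickle.dumps(lambda x: x)
    lambda_failed = False
except Exception:
    lambda_failed = True
assert lambda_failed
```

Moving the partials into an hnn-core submodule makes them importable by name on every rank, which is consistent with the fix that worked here.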
Hi team, I'm encountering an issue where simulation with MPIBackend gets caught up somewhere when I try to simulate dipoles with a Network I adapted (i.e., one that isn't one of the default networks in hnn-core). MPIBackend works fine in the same environment and Jupyter notebook with the example from the documentation, and simulating with the custom network also works fine until I try to use MPIBackend. Any advice on troubleshooting? I can't upload an example notebook here but can provide the full code as needed.
Full text of the error message:
```
/oscar/home/ddiesbur/new_mpi_hnn_core/lib64/python3.9/site-packages/hnn_core/parallel_backends.py:195: UserWarning: Timeout exceeded while waiting for child process output. Terminating...
  warn("Timeout exceeded while waiting for child process output."

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[10], line 4
      2 with MPIBackend(n_procs=2, mpi_cmd='mpiexec'):
      3     print("Running simulation with loaded Failed stop parameters")
----> 4     FS_dpls_yesmpi = simulate_dipole(FS_net, tstop=300, n_trials=2)
      6 for dpl in FS_dpls_yesmpi:
      7     dpl.scale(125).smooth(30)

File /oscar/home/ddiesbur/new_mpi_hnn_core/lib64/python3.9/site-packages/hnn_core/dipole.py:100, in simulate_dipole(net, tstop, dt, n_trials, record_vsec, record_isec, postproc)
     95 if postproc:
     96     warnings.warn('The postproc-argument is deprecated and will be removed'
     97                   ' in a future release of hnn-core. Please define '
     98                   'smoothing and scaling explicitly using Dipole methods.',
     99                   DeprecationWarning)
--> 100 dpls = _BACKEND.simulate(net, tstop, dt, n_trials, postproc)
    102 return dpls

File /oscar/home/ddiesbur/new_mpi_hnn_core/lib64/python3.9/site-packages/hnn_core/parallel_backends.py:717, in MPIBackend.simulate(self, net, tstop, dt, n_trials, postproc)
    712 print(f"MPI will run {n_trials} trial(s) sequentially by "
    713       f"distributing network neurons over {self.n_procs} processes.")
    715 env = _get_mpi_env()
--> 717 self.proc, sim_data = run_subprocess(
    718     command=self.mpi_cmd, obj=[net, tstop, dt, n_trials], timeout=30,
    719     proc_queue=self.proc_queue, env=env, cwd=os.getcwd(),
    720     universal_newlines=True)
    722 dpls = _gather_trial_data(sim_data, net, n_trials, postproc)
    723 return dpls

File /oscar/home/ddiesbur/new_mpi_hnn_core/lib64/python3.9/site-packages/hnn_core/parallel_backends.py:233, in run_subprocess(command, obj, timeout, proc_queue, *args, **kwargs)
    229     warn("Could not kill python subprocess: PID %d" % proc.pid)
    231 if not proc.returncode == 0:
    232     # simulation failed with a numeric return code
--> 233     raise RuntimeError("MPI simulation failed. Return code: %d" %
    234                        proc.returncode)
    236 child_data = _process_child_data(proc_data_bytes, data_len)
    238 # clean up the queue

RuntimeError: MPI simulation failed. Return code: 143
```