jonescompneurolab / hnn-core

Simulation and optimization of neural circuits for MEG/EEG source estimates
https://jonescompneurolab.github.io/hnn-core/
BSD 3-Clause "New" or "Revised" License

Problems using non-'soma' values for record_isec argument in simulate_dipole() #684

Closed darcywaller closed 10 months ago

darcywaller commented 10 months ago

Hi all, I'm encountering a series of issues when trying to use the optional record_isec argument in simulate_dipole() to return current (isec) values from non-soma segments.

1) Running SS_dpls = simulate_dipole(SS_net, tstop=500, n_trials=5, record_isec='soma') works, but setting record_isec equal to 'all' crashes the kernel of the Jupyter notebook I'm running it in. I understand isec has more values than vsec (i.e., vectors for ALL synaptic currents rather than just voltage), and this doesn't happen with just 1 trial, so I assume this is a matter of memory overload.

2) However, I don't seem to be able to save outputs from a different segment by name (e.g., 'basal_1') to prevent this - doing so raises "Invalid value for the 'record_isec' parameter. Allowed values are 'all', 'soma', and False, but got 'basal_1' instead."

3) Also, when simulating one trial with record_isec='all', the segment keys in SS_net.cell_response.isec[gid] only include 'soma', and none of the other segment names.

I'm particularly interested in saving the GABAb current from apical_1, so let me know if there's another way I should go about saving and accessing current values in other segments for one or multiple trials.
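For reference, a minimal sketch of the three calls above (using the stock jones_2009_model() network here as a stand-in for my SS_net):

    from hnn_core import jones_2009_model, simulate_dipole

    net = jones_2009_model()

    # 1) works: record somatic currents only
    dpls = simulate_dipole(net, tstop=500, n_trials=5, record_isec='soma')

    # 1) record_isec='all' runs with n_trials=1, but crashes the kernel with 5 trials
    dpls = simulate_dipole(net, tstop=500, n_trials=1, record_isec='all')

    # 2) raises ValueError: only 'all', 'soma', and False are accepted
    dpls = simulate_dipole(net, tstop=500, n_trials=5, record_isec='basal_1')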

jasmainak commented 10 months ago

what you are requesting is probably going to require some combination of bugfix + enhancement.

Here is the diff of the original pull request by @ntolley: https://github.com/jonescompneurolab/hnn-core/pull/502/files. Can you see from there what the issue might be?

darcywaller commented 10 months ago

Sure. I'm happy to find a better way to extract and save that variable per trial to work around the memory issue, as long as we fix the bug that's preventing it from being returned on single trials.

Given that SS_dpls = simulate_dipole(SS_net, tstop=500, n_trials=1, record_vsec='all') also returns a SS_net.cell_response.vsec dictionary with only 'soma' and no other segments, could the problem arise from line 551, section_names = list(self.sections.keys()), referencing sections instead of the _nrn_sections defined in line 313? In general, it looks like the only key getting iterated and saved for both isec and vsec when recording 'all' is still 'soma'.

darcywaller commented 10 months ago

Never mind - I was testing with low gid indices, so I was checking for sections that don't exist in the interneurons rather than in the pyramidal cells. Sorry for the confusion. What remains is just an enhancement request regarding the memory issue, so I will close the issue.
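In case anyone hits the same confusion, a quick diagnostic along these lines (mirroring the indexing I used above; the exact vsec/isec layout may differ across hnn-core versions) shows which sections were actually recorded for a pyramidal cell:

    # list the gid ranges per cell type, then pick a pyramidal cell gid
    print(SS_net.gid_ranges)
    pyr_gid = list(SS_net.gid_ranges['L5_pyramidal'])[0]

    # for a pyramidal cell this should list the soma plus the dendritic sections
    print(SS_net.cell_response.vsec[pyr_gid].keys())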

jasmainak commented 10 months ago

@darcywaller glad you could figure it out. The code is written in a pretty general way ... so if you want to implement the feature to record from specific segments (instead of all), it might not be that difficult, especially with some local help from @ntolley!

ntolley commented 10 months ago

We chatted about some strategies to deal with the memory issues!

Now I'm remembering why I decided not to implement passing something like record_isec='apical_tuft_ampa' as an option. Since this only applies to certain cells, the logic would have to be a little more complex under the hood.

But given the potential memory issues, it might be worth revisiting if you want something more targeted (I'm sure there's an elegant way to write up that enhancement :wink: )
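In the meantime, one possible stopgap after a single-trial run with record_isec='all' is to filter the recording post hoc and keep only what you need. A rough sketch (the 'apical_1' and 'gabab' keys and the isec layout here are assumptions, not an existing API):

    # keep only the GABA-B current in 'apical_1' of L5 pyramidal cells,
    # then discard the rest of the recording before the next trial
    gabab_currents = {}
    for gid in SS_net.gid_ranges['L5_pyramidal']:
        sec_currents = SS_net.cell_response.isec[gid].get('apical_1', {})
        if 'gabab' in sec_currents:
            gabab_currents[gid] = sec_currents['gabab']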

rythorpe commented 10 months ago

Which ParallelBackend are you using @darcywaller? I'm just trying to diagnose whether it's a memory issue or a process timeout issue.

darcywaller commented 10 months ago

@rythorpe good point. I was originally using MPIBackend with 4 cores, which gave the error "MPI simulation failed. Return code: 137". Then, simulating 5 trials with record_isec='all' and no parallel backend was what crashed my Jupyter kernel.
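For completeness, the two invocations were along these lines:

    from hnn_core import simulate_dipole, MPIBackend

    # failed with "MPI simulation failed. Return code: 137"
    with MPIBackend(n_procs=4):
        SS_dpls = simulate_dipole(SS_net, tstop=500, n_trials=5, record_isec='all')

    # crashed the Jupyter kernel when run without any parallel backend
    SS_dpls = simulate_dipole(SS_net, tstop=500, n_trials=5, record_isec='all')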

jasmainak commented 10 months ago

Does JoblibBackend throw the same error? It might be slower but worth checking ...

Also, can you share the full traceback?

darcywaller commented 10 months ago

JoblibBackend also gets killed; traceback below:

Running simulation with loaded Successful stop parameters
Joblib will run 5 trial(s) in parallel by distributing trials over 4 jobs.


TerminatedWorkerError                     Traceback (most recent call last)
/tmp/ipykernel_42875/3558491913.py in <module>
      4 with JoblibBackend(n_jobs=4): # takes a little while to initialize!
      5     print("Running simulation with loaded Successful stop parameters")
----> 6 SS_dpls = simulate_dipole(SS_net, tstop=500, n_trials=5, record_isec='all')
      7 
      8 for dpl in SS_dpls:

/oscar/home/ddiesbur/hnn-core/lib/python3.7/site-packages/hnn_core/dipole.py in simulate_dipole(net, tstop, dt, n_trials, record_vsec, record_isec, postproc)
     98                  'smoothing and scaling explicitly using Dipole methods.',
     99                  DeprecationWarning)
--> 100     dpls = _BACKEND.simulate(net, tstop, dt, n_trials, postproc)
    101 
    102     return dpls

/oscar/home/ddiesbur/hnn-core/lib/python3.7/site-packages/hnn_core/parallel_backends.py in simulate(self, net, tstop, dt, n_trials, postproc)
    557         parallel, myfunc = self._parallel_func(_simulate_single_trial)
    558         sim_data = parallel(myfunc(net, tstop, dt, trial_idx) for
--> 559                             trial_idx in range(n_trials))
    560 
    561         dpls = _gather_trial_data(sim_data, net=net, n_trials=n_trials,

/oscar/home/ddiesbur/hnn-core/lib/python3.7/site-packages/joblib/parallel.py in __call__(self, iterable)
   1096 
   1097             with self._backend.retrieval_context():
-> 1098                 self.retrieve()
   1099             # Make sure that we get a last message telling us we are done
   1100             elapsed_time = time.time() - self._start_time

/oscar/home/ddiesbur/hnn-core/lib/python3.7/site-packages/joblib/parallel.py in retrieve(self)
    973             try:
    974                 if getattr(self._backend, 'supports_timeout', False):
--> 975                     self._output.extend(job.get(timeout=self.timeout))
    976                 else:
    977                     self._output.extend(job.get())

/oscar/home/ddiesbur/hnn-core/lib/python3.7/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
    565         AsyncResults.get from multiprocessing."""
    566         try:
--> 567             return future.result(timeout=timeout)
    568         except CfTimeoutError as e:
    569             raise TimeoutError from e

/gpfs/runtime/opt/python/3.7.4/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
    433                 raise CancelledError()
    434             elif self._state == FINISHED:
--> 435                 return self.__get_result()
    436             else:
    437                 raise TimeoutError()

/gpfs/runtime/opt/python/3.7.4/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

The exit codes of the workers are {SIGKILL(-9)}

Here's the traceback with MPIBackend():

Running simulation with loaded Successful stop parameters
MPI will run 5 trial(s) sequentially by distributing network neurons over 4 processes.
numprocs=4

A process has executed an operation involving a call to the "fork()" system call to create a child process. Open MPI is currently operating in a condition that could result in memory corruption or other system errors; your job may hang, crash, or produce silent data corruption. The use of fork() (or system() or other calls that create child processes) is strongly discouraged.

The process that invoked fork was:

Local host: [[22323,1],0] (PID 43009)

If you are absolutely sure that your application will successfully and correctly survive a call to fork(), you may disable this warning by setting the mpi_warn_on_fork MCA parameter to 0.

Loading custom mechanism files from /oscar/home/ddiesbur/hnn-core/lib/python3.7/site-packages/hnn_core/mod/x86_64/libnrnmech.so Building the NEURON model Loading custom mechanism files from /oscar/home/ddiesbur/hnn-core/lib/python3.7/site-packages/hnn_core/mod/x86_64/libnrnmech.so Loading custom mechanism files from /oscar/home/ddiesbur/hnn-core/lib/python3.7/site-packages/hnn_core/mod/x86_64/libnrnmech.so Loading custom mechanism files from /oscar/home/ddiesbur/hnn-core/lib/python3.7/site-packages/hnn_core/mod/x86_64/libnrnmech.so [Done] [node1821.oscar.ccv.brown.edu:43003] 3 more processes have sent help message help-opal-runtime.txt / opal_init:warn-fork [node1821.oscar.ccv.brown.edu:43003] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages Trial 1: 0.03 ms... Trial 1: 10.0 ms... Trial 1: 20.0 ms... Trial 1: 30.0 ms... Trial 1: 40.0 ms... Trial 1: 50.0 ms... Trial 1: 60.0 ms... Trial 1: 70.0 ms... Trial 1: 80.0 ms... Trial 1: 90.0 ms... Trial 1: 100.0 ms... Trial 1: 110.0 ms... Trial 1: 120.0 ms... Trial 1: 130.0 ms... Trial 1: 140.0 ms... Trial 1: 150.0 ms... Trial 1: 160.0 ms... Trial 1: 170.0 ms... Trial 1: 180.0 ms... Trial 1: 190.0 ms... Trial 1: 200.0 ms... Trial 1: 210.0 ms... Trial 1: 220.0 ms... Trial 1: 230.0 ms... Trial 1: 240.0 ms... Trial 1: 250.0 ms... Trial 1: 260.0 ms... Trial 1: 270.0 ms... Trial 1: 280.0 ms... Trial 1: 290.0 ms... Trial 1: 300.0 ms... Trial 1: 310.0 ms... Trial 1: 320.0 ms... Trial 1: 330.0 ms... Trial 1: 340.0 ms... Trial 1: 350.0 ms... Trial 1: 360.0 ms... Trial 1: 370.0 ms... Trial 1: 380.0 ms... Trial 1: 390.0 ms... Trial 1: 400.0 ms... Trial 1: 410.0 ms... Trial 1: 420.0 ms... Trial 1: 430.0 ms... Trial 1: 440.0 ms... Trial 1: 450.0 ms... Trial 1: 460.0 ms... Trial 1: 470.0 ms... Trial 1: 480.0 ms... Trial 1: 490.0 ms... Building the NEURON model [Done] Trial 2: 0.03 ms... Trial 2: 10.0 ms... Trial 2: 20.0 ms... Trial 2: 30.0 ms... Trial 2: 40.0 ms... Trial 2: 50.0 ms... Trial 2: 60.0 ms... Trial 2: 70.0 ms... Trial 2: 80.0 ms... Trial 2: 90.0 ms... Trial 2: 100.0 ms... Trial 2: 110.0 ms... Trial 2: 120.0 ms... Trial 2: 130.0 ms... Trial 2: 140.0 ms... Trial 2: 150.0 ms... Trial 2: 160.0 ms... Trial 2: 170.0 ms... Trial 2: 180.0 ms... Trial 2: 190.0 ms... Trial 2: 200.0 ms... Trial 2: 210.0 ms... Trial 2: 220.0 ms... Trial 2: 230.0 ms... Trial 2: 240.0 ms... Trial 2: 250.0 ms... Trial 2: 260.0 ms... Trial 2: 270.0 ms... Trial 2: 280.0 ms... Trial 2: 290.0 ms... Trial 2: 300.0 ms... Trial 2: 310.0 ms... Trial 2: 320.0 ms... Trial 2: 330.0 ms... Trial 2: 340.0 ms... Trial 2: 350.0 ms... Trial 2: 360.0 ms... Trial 2: 370.0 ms... Trial 2: 380.0 ms... Trial 2: 390.0 ms... Trial 2: 400.0 ms... Trial 2: 410.0 ms... Trial 2: 420.0 ms... Trial 2: 430.0 ms... Trial 2: 440.0 ms... Trial 2: 450.0 ms... Trial 2: 460.0 ms... Trial 2: 470.0 ms... Trial 2: 480.0 ms... Trial 2: 490.0 ms...

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.


mpiexec noticed that process rank 0 with PID 43009 on node node1821 exited on signal 9 (Killed).

/oscar/home/ddiesbur/hnn-core/lib/python3.7/site-packages/hnn_core/parallel_backends.py:167: UserWarning: Child process failed unexpectedly
  warn("Child process failed unexpectedly")


RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_42875/3071245237.py in <module>
      4 with MPIBackend(n_procs=4): # takes a little while to initialize!
      5     print("Running simulation with loaded Successful stop parameters")
----> 6 SS_dpls = simulate_dipole(SS_net, tstop=500, n_trials=5, record_isec='all')
      7 
      8 for dpl in SS_dpls:

/oscar/home/ddiesbur/hnn-core/lib/python3.7/site-packages/hnn_core/dipole.py in simulate_dipole(net, tstop, dt, n_trials, record_vsec, record_isec, postproc)
     98                  'smoothing and scaling explicitly using Dipole methods.',
     99                  DeprecationWarning)
--> 100     dpls = _BACKEND.simulate(net, tstop, dt, n_trials, postproc)
    101 
    102     return dpls

/oscar/home/ddiesbur/hnn-core/lib/python3.7/site-packages/hnn_core/parallel_backends.py in simulate(self, net, tstop, dt, n_trials, postproc)
    718             command=self.mpi_cmd, obj=[net, tstop, dt, n_trials], timeout=30,
    719             proc_queue=self.proc_queue, env=env, cwd=os.getcwd(),
--> 720             universal_newlines=True)
    721 
    722         dpls = _gather_trial_data(sim_data, net, n_trials, postproc)

/oscar/home/ddiesbur/hnn-core/lib/python3.7/site-packages/hnn_core/parallel_backends.py in run_subprocess(command, obj, timeout, proc_queue, *args, **kwargs)
    232         # simulation failed with a numeric return code
    233         raise RuntimeError("MPI simulation failed. Return code: %d" %
--> 234                            proc.returncode)
    235 
    236     child_data = _process_child_data(proc_data_bytes, data_len)

RuntimeError: MPI simulation failed. Return code: 137

rythorpe commented 10 months ago

That definitely looks like a memory issue. I'm assuming the same thing happens when running a single trial with 1 job using JoblibBackend?

darcywaller commented 10 months ago

Actually, no - apparently with JoblibBackend(n_jobs=1), SS_dpls = simulate_dipole(SS_net, tstop=500, n_trials=1, record_isec='all') completes without an error.

jasmainak commented 10 months ago

How much RAM are you working with? joblib definitely clones the data, so the more cores you use, the more memory you are going to consume ... it might just scale linearly.
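If memory really is the bottleneck, one memory-bounded workaround (just a sketch, assuming a single trial with record_isec='all' fits in RAM, as your n_trials=1 run suggests) is to simulate trials one at a time and write each result to disk before starting the next:

    import pickle
    from hnn_core import simulate_dipole, JoblibBackend

    for trial in range(5):
        with JoblibBackend(n_jobs=1):
            dpl = simulate_dipole(SS_net, tstop=500, n_trials=1, record_isec='all')
        # write the dipole and the recorded currents to disk, so only one
        # trial's recordings stay in memory at a time
        with open(f'trial_{trial:02d}.pkl', 'wb') as f:
            pickle.dump((dpl, SS_net.cell_response.isec), f)
        # note: the drive seeds may need to be varied between iterations so the
        # five runs are not identical realizations of the same trial

This trades some bookkeeping (and possibly seed management) for a flat memory footprint.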