neuronsimulator / nrn

NEURON Simulator
http://nrn.readthedocs.io

C++/Python/HOC: NEURON error handling improvements #1857

Open alexsavulescu opened 2 years ago

alexsavulescu commented 2 years ago

Check which combinations of C++/Python/HOC errors in serial/MPI modes are correctly handled and propagated.

Some improvements have been made recently:

Another improvement has been proposed and is pending:

There is a way of executing Python inline in HOC; does that work?

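        objref py_obj  // declare at top level; inside a proc, use localobj instead
        if (!nrnpython("from neuron import coreneuron")) {
            execerror("Python not available, can not import coreneuron module\n")
        }
        py_obj = new PythonObject()
        py_obj.coreneuron.enable = 1
        py_obj.coreneuron.gpu = use_coreneuron_gpu  // use_coreneuron_gpu is set by the surrounding code (not shown)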

Do errors here get propagated correctly? For example, gpu is routed via a Python setter (https://github.com/neuronsimulator/nrn/blob/f242cc8d4ddefb74f71f7f613a4b7b4043c2b641/share/lib/python/neuron/coreneuron.py#L61-L75) that could throw.
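Whether an error reaches HOC depends on the setter actually raising. The coreneuron.py traceback further down uses assert, which Python skips under python -O. A minimal sketch of an explicit check follows; the class is hypothetical, and the accepted values (0, 1, 2) are an illustrative assumption, not read from coreneuron.py.

```python
class CoreNEURONConfigSketch:
    """Hypothetical stand-in for a config object with a validated property."""

    # Assumption: illustrative set of accepted values, not taken from NEURON.
    _VALID_CELL_PERMUTE = (0, 1, 2)

    def __init__(self):
        self._cell_permute = 1

    @property
    def cell_permute(self):
        return self._cell_permute

    @cell_permute.setter
    def cell_permute(self, value):
        # Raise explicitly instead of using assert: asserts are stripped
        # under python -O, which would silently accept bad values.
        if value not in self._VALID_CELL_PERMUTE:
            raise ValueError(
                f"cell_permute must be one of {self._VALID_CELL_PERMUTE}, got {value!r}"
            )
        self._cell_permute = value


cfg = CoreNEURONConfigSketch()
cfg.cell_permute = 2  # accepted
try:
    cfg.cell_permute = 10  # rejected, value stays 2
except ValueError as err:
    rejection = str(err)  # keep the message; err is cleared after the block
    print("rejected:", rejection)
```

Whatever exception type is used, the HOC side only learns that the assignment failed ("Assignment to PythonObject failed" in the logs below); the Python traceback printed just before it carries the actual reason.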

WeinaJi commented 2 years ago

Test script

>cat err_nrnpython.hoc
if (!nrnpython("from neuron import coreneuron")) {
    execerror("Python not available, can not import coreneuron module\n")
}
objref py_obj
py_obj = new PythonObject()
py_obj.coreneuron.enable = 1
py_obj.coreneuron.cell_permute = 10

Running it in serial:

>nrniv err_nrnpython.hoc
Traceback (most recent call last):
  File "/gpfs/bbp.cscs.ch/home/weji/workdir/nrn/build/install/lib/python/neuron/coreneuron.py", line 111, in cell_permute
    assert value in self._valid_cell_permute()
AssertionError
nrniv: Assignment to PythonObject failed
 in err_nrnpython.hoc near line 7
 py_obj.coreneuron.cell_permute = 10
                                    ^
>echo $?
1

In MPI:

>mpirun -n 2 nrniv err_nrnpython.hoc
Traceback (most recent call last):
  File "/gpfs/bbp.cscs.ch/home/weji/workdir/nrn/build/install/lib/python/neuron/coreneuron.py", line 111, in cell_permute
    assert value in self._valid_cell_permute()
AssertionError
/gpfs/bbp.cscs.ch/home/weji/workdir/nrn/build/install/bin/nrniv: Assignment to PythonObject failed
 in err_nrnpython.hoc near line 7
 py_obj.coreneuron.cell_permute = 10
                                    ^
Traceback (most recent call last):
  File "/gpfs/bbp.cscs.ch/home/weji/workdir/nrn/build/install/lib/python/neuron/coreneuron.py", line 111, in cell_permute
    assert value in self._valid_cell_permute()
AssertionError
/gpfs/bbp.cscs.ch/home/weji/workdir/nrn/build/install/bin/nrniv: Assignment to PythonObject failed
 in err_nrnpython.hoc near line 7
 py_obj.coreneuron.cell_permute = 10
                                    ^
srun: error: r1i4n21: tasks 0-1: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=561639.1
weji@r1i4n21:~/workdir/nrn/build>echo $?
1
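The manual echo $? checks above can be automated. A minimal sketch using Python's subprocess module follows; the helper name is hypothetical, and it is not part of NEURON's test suite. It is self-checked against the Python interpreter so it runs without NEURON installed.

```python
import subprocess
import sys


def check_fails_with(cmd, expected_text, timeout=300):
    """Run cmd and require that it fails and mentions expected_text.

    Returns (returncode, combined stdout/stderr). Raises AssertionError
    explicitly (not via assert) so the check survives python -O.
    """
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge streams: error text may go to either
        text=True,
        timeout=timeout,
    )
    if proc.returncode == 0:
        raise AssertionError(f"{cmd!r} exited with 0; the error was not propagated")
    if expected_text not in proc.stdout:
        raise AssertionError(f"{expected_text!r} not found in output of {cmd!r}")
    return proc.returncode, proc.stdout


# Self-check against the Python interpreter itself (no NEURON needed):
code, out = check_fails_with([sys.executable, "-c", "1/0"], "ZeroDivisionError")
```

Against NEURON, the checks from this thread would look like check_fails_with(["nrniv", "err_nrnpython.hoc"], "Assignment to PythonObject failed"), or the same call with ["mpirun", "-n", "2", "nrniv", "err_nrnpython.hoc"] under MPI; the launcher (mpirun or srun) depends on the cluster.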
WeinaJi commented 2 years ago

PR #1871 addresses this so that nrniv -c also exits with a proper (non-zero) status on errors. With the fix, in serial:

>nrniv -nogui -nobanner -c "1/0"
nrniv: division by zero
 near line 1
 1/0
    ^
nrniv: arg not valid statement: 1/0
 near line 0
 ^
>echo $?
1

In MPI:

>mpirun -n 1 nrniv -nogui -nobanner -c "1/0"
/gpfs/bbp.cscs.ch/home/weji/workdir/nrn/build/install/bin/nrniv: division by zero
 near line 1
 1/0
    ^
/gpfs/bbp.cscs.ch/home/weji/workdir/nrn/build/install/bin/nrniv: arg not valid statement: 1/0
 near line 0
 ^
srun: error: r1i5n4: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=566087.0

>echo $?
1
WeinaJi commented 2 years ago

Case 3: cross-check the Python session after PR #1871:

>python
Python 3.9.7 (default, Jan 10 2022, 21:17:49) 
[GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from neuron import h
Warning: no DISPLAY environment variable.
--No graphics will be displayed.
>>> h(1/0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ZeroDivisionError: division by zero
>>> h.sqrt(-1)
NEURON: sqrt argument out of domain
 near line 0
 objref hoc_obj_[2]
                   ^
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: hocobj_call error
>>> 

The HOC errors don't terminate the Python session, which is the correct behavior.
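One caveat on the first check in case 3: in h(1/0), Python evaluates the argument 1/0 before h is ever called, so that ZeroDivisionError comes from Python, not from the HOC interpreter. Passing the statement as a string, h("1/0"), hands it to HOC instead. A small sketch of the evaluation order, using a hypothetical stand-in for h that records every call it receives:

```python
calls = []


def fake_h(stmt):
    """Hypothetical stand-in for neuron.h: records each call it receives."""
    calls.append(stmt)
    return 1


try:
    fake_h(1/0)  # the argument is evaluated first, so this raises before fake_h runs
except ZeroDivisionError:
    pass

fake_h("1/0")  # a string is passed through unevaluated

print(calls)  # ['1/0']: only the string form ever reached the callee
```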

Helveg commented 2 years ago

@WeinaJi I just stumbled onto another case; maybe your PR already fixes it:

>>> h.List("A").count()
NEURON: A is not a template name
 near line 0
 objref hoc_obj_[2]
                   ^
        List("A")
Segmentation fault

And:

>>> h.List(h.NetStim())
bad stack access: expecting (double); really (Object *)
NEURON: interpreter stack type error
 near line 0
 objref hoc_obj_[2]
                   ^
        List(...)
Segmentation fault
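Crashes like these take down the whole interpreter, so they cannot be caught with try/except in the same process. A minimal sketch for regression-testing them is to run each snippet in a child interpreter and inspect the exit status; on POSIX, subprocess reports death by signal N as returncode -N. The helper name is hypothetical, and the checks below use plain Python rather than NEURON.

```python
import signal
import subprocess
import sys


def classify_snippet(code, timeout=60):
    """Run a Python snippet in a fresh interpreter and classify the outcome.

    Returns "ok", "python-error", or "signal:<NAME>" (signals are POSIX only).
    """
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if proc.returncode == 0:
        return "ok"
    if proc.returncode < 0:
        # Negative return code: the child was killed by signal -returncode.
        try:
            return "signal:" + signal.Signals(-proc.returncode).name
        except ValueError:
            return f"signal:{-proc.returncode}"
    return "python-error"
```

For example, classify_snippet("from neuron import h; h.List('A').count()") would return "signal:SIGSEGV" for a crash like the one reported above, and "python-error" if the HOC error were instead surfaced as a Python exception.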
Helveg commented 2 years ago

@WeinaJi here are the disappearing MPI errors that I mentioned during the closing remarks:

Simulated 4521.0/8000.0ms. 15605.28s elapsed. Simulated tick in 3.35. Avg tick 3.4517s
Simulated 4522.0/8000.0ms. 15608.63s elapsed. Simulated tick in 3.35. Avg tick 3.4517s
Rank 0 [Fri Jun 24 01:22:23 2022] [c6-0c2s5n0] application called MPI_Abort(MPI_COMM_WORLD, 0) - process 0
srun: error: nid01300: task 0: Aborted (core dumped)

Could NEURON always try to report WHY it is aborting?

nrnhines commented 2 years ago

I can't find any paths to MPI_Abort in NEURON that don't have error messages. I wonder if the MPI_Abort is being called from CoreNEURON. Or if the message is getting lost somehow because the MPI_Abort shuts things down too quickly. Can you add a print statement to nrn/external/coreneuron/coreneuron/mpi/lib/nrnmpi.cpp and src/nrnmpi/nrnmpi.cpp above the MPI_Abort call? That might narrow the search.

Regarding "Aborted (core dumped)": did you get a core dump or a stack trace?
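A minimal sketch of the suggestion above (report the reason right before aborting), written in Python for illustration only; NEURON's actual call sites are the C++ files named above. It assumes the message could be lost if it is still buffered when the abort tears the process down, so it flushes stderr first. The helper name and the abort callable are hypothetical (with mpi4py, a communicator's Abort method would fit that slot); replacing an error code of 0, as seen in the log above, with 1 is a design choice of this sketch, not NEURON behavior.

```python
import io
import sys


def abort_with_reason(reason, abort, rank=None, errorcode=1, stream=None):
    """Report why we are aborting, flush, then call the supplied abort function.

    abort: callable taking an int error code (hypothetical injection point).
    """
    stream = sys.stderr if stream is None else stream
    # Design choice (assumption): never abort with 0, which reads as success.
    if errorcode == 0:
        errorcode = 1
    prefix = f"[rank {rank}] " if rank is not None else ""
    stream.write(f"{prefix}aborting (error code {errorcode}): {reason}\n")
    # Flush before aborting so the message is not left sitting in a buffer.
    stream.flush()
    abort(errorcode)


# Usage with a fake abort that only records the error code it receives:
log = io.StringIO()
recorded = []
abort_with_reason("example reason", recorded.append, rank=0, errorcode=0, stream=log)
```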

Helveg commented 2 years ago

As the MPI_Abort issue might be slightly more complicated, I've continued the report in #1881.

@WeinaJi I'd still be interested to see if the segmentation faults after HOC errors could be addressed; not segfaulting is an "error handling improvement" haha 😉

WeinaJi commented 2 years ago

Hello @Helveg,

Thanks for reporting the error cases. The segfaults should be fixed by PR #1917 (https://github.com/neuronsimulator/nrn/pull/1917). Please let me know if that works for you.