rwth-i6 / sisyphus

A Workflow Manager in Python
Mozilla Public License 2.0

Crash after user interrupt #164

Open albertz opened 10 months ago

albertz commented 10 months ago

Sometimes, but not always (maybe 20% of the cases?), when I hit Ctrl+C, I get this crash:

^C[2023-12-18 18:53:21,090] INFO: Got user interrupt signal stop engine and exit                                                                        [2023-12-18 18:53:21,090] WARNING: Main thread exit. Still running non-daemon threads: {<LocalEngine(Thread-1, started 140176269506112)>}               
[2023-12-18 18:53:21,665] ERROR: Exception in thread <DummyProcess(Thread-12 (worker), started daemon 140175636158016)>:                                [2023-12-18 18:53:21,666] ERROR: Exception in thread <DummyProcess(Thread-18 (worker), started daemon 140175107679808)>:                                
[2023-12-18 18:53:21,734] ERROR: Exception in thread <DummyProcess(Thread-14 (worker), started daemon 140175619372608)>:                                [2023-12-18 18:53:21,734] ERROR: Exception in thread <DummyProcess(Thread-7 (worker), started daemon 140176156243520)>:                                 
[2023-12-18 18:53:21,734] ERROR: Exception in thread <DummyProcess(Thread-6 (worker), started daemon 140176164636224)>:                                 [2023-12-18 18:53:21,776] ERROR: Exception in thread <DummyProcess(Thread-15 (worker), started daemon 140175610979904)>:                                
[2023-12-18 18:53:21,817] ERROR: Exception in thread <DummyProcess(Thread-3 (worker), started daemon 140176189814336)>:                                 
[2023-12-18 18:53:21,858] ERROR: Exception in thread <DummyProcess(Thread-9 (worker), started daemon 140176139458112)>:                                 
[2023-12-18 18:53:21,858] ERROR: Exception in thread <DummyProcess(Thread-4 (worker), started daemon 140176181421632)>:                                 [2023-12-18 18:53:21,859] ERROR: Exception in thread <DummyProcess(Thread-13 (worker), started daemon 140175627765312)>:
EXCEPTION
Traceback (most recent call last):
(Exclude vars because we are exiting.) 
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/tools.py", line 311, in default_handle_exception_interrupt_main_thread.<locals>.wrapped_func
EXCEPTION
Traceback (most recent call last):
[2023-12-18 18:53:21,859] ERROR: Exception in thread <DummyProcess(Thread-11 (worker), started daemon 140175644550720)>:
EXCEPTION
Traceback (most recent call last):
EXCEPTION
    line: return func(*args, **kwargs)
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/graph.py", line 570, in SISGraph.for_all_nodes.<locals>.runner_helper
    line: runner(path.creator)
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/graph.py", line 547, in SISGraph.for_all_nodes.<locals>.runner
EXCEPTION
(Exclude vars because we are exiting.) 
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/tools.py", line 311, in default_handle_exception_interrupt_main_thread.<locals>.wrapped_func
    line: return func(*args, **kwargs)
EXCEPTION
EXCEPTION
Traceback (most recent call last):
Traceback (most recent call last):
(Exclude vars because we are exiting.) 
EXCEPTION
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/graph.py", line 570, in SISGraph.for_all_nodes.<locals>.runner_helper
    line: runner(path.creator)
EXCEPTION
Traceback (most recent call last):
Traceback (most recent call last):
(Exclude vars because we are exiting.) 
(Exclude vars because we are exiting.) 
...
    line: self._check_running()                                                                                                                         
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/pool.py", line 353, in Pool._check_running                     
    line: raise ValueError("Pool not running")                                                                                                          ValueError: Pool not running                                                                                                                            
    line: self._check_running()                                                                                                                           File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/pool.py", line 353, in Pool._check_running                     
Exception ignored in atexit callback: <function shutdown at 0x7f7d659ae5c0>                                                                             
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/pool.py", line 458, in Pool.apply_async                        
    line: self._check_running()                                                                                                                         
EXCEPTION                                                                                                                                               
Traceback (most recent call last):                                                                                                                      
EXCEPTION                                                                                                                                               
Traceback (most recent call last):                                                                                                                      
(Exclude vars because we are exiting.)                                                                                                                  
    line: raise ValueError("Pool not running")                                                                                                          
  File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/pool.py", line 353, in Pool._check_running                     
Exception ignored in sys.unraisablehook: <built-in function unraisablehook>                                                                             (Exclude vars because we are exiting.)                                                                                                                  
  File "/u/zeyer/setups/combined/2021-05-31/tools/sisyphus/sisyphus/tools.py", line 311, in default_handle_exception_interrupt_main_thread.<locals>.wrapped_func                                                                                                                                                
KeyboardInterrupt                                                                                                                                       Fatal Python error: _enter_buffered_busy: could not acquire lock for <_io.BufferedWriter name='<stderr>'> at interpreter shutdown, possibly due to daemon threads                                                                                                                                               
Python runtime state: finalizing (tstate=0x00007f7d668932d8)                                                                                            

Current thread 0x00007f7d66080000 (most recent call first):                                                                                             
  <no Python frame>                                                                                                                                     

Extension modules: psutil._psutil_linux, psutil._psutil_posix, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, h5py._errors, h5py.defs, h5py._objects, h5py.h5, h5py.utils, h5py.h5t, h5py.h5s, h5py.h5ac, h5py.h5p, h5py.h5r, h5py._proxy, h5py._conv, h5py.h5z, h5py.h5a, h5py.h5d, h5py.h5ds, h5py.h5g, h5py.h5i, h5py.h5f, h5py.h5fd, h5py.h5pl, h5py.h5o, h5py.h5l, h5py._selector, markupsafe._speedups, _cffi_backend (total: 41)
fish: Job 2, '/work/tools/users/zeyer/py-envs…' terminated by signal SIGABRT (Abort)

albertz commented 10 months ago

The scrambled output indicates that many processes were stopped by SIGINT at the same time.

critias commented 9 months ago

The graph computations use a ThreadPool (https://github.com/rwth-i6/sisyphus/blob/master/sisyphus/graph.py#L232C12-L232C12). I guess you get this output if you hit Ctrl-C while these computations are running. The problem might go away if you set gs.GRAPH_WORKER=1, but you would then also lose the multithreading speed-up, which matters if your filesystem has a higher latency.
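
To illustrate the "Pool not running" part of the traceback, here is a minimal standalone sketch using only the standard library. It is not sisyphus code: the function name submit_after_shutdown is hypothetical, and it is only an assumption that the pool in graph.py is stopped in a comparable way while workers still try to submit tasks. It shows that apply_async on a pool that has already been terminated raises the same ValueError that appears in the log.

```python
from multiprocessing.pool import ThreadPool


def submit_after_shutdown() -> str:
    """Return the error message raised when work is submitted to a stopped pool.

    Hypothetical helper for illustration only, not part of sisyphus.
    """
    pool = ThreadPool(4)
    # Simulate the shutdown path: the pool is stopped while other code
    # still holds a reference to it.
    pool.terminate()
    pool.join()
    try:
        # A late submission, as a still-running worker might attempt.
        pool.apply_async(print, ("late task",))
    except ValueError as exc:
        return str(exc)
    return "no error"


if __name__ == "__main__":
    print(submit_after_shutdown())  # prints "Pool not running"
```

This only reproduces the ValueError. The final SIGABRT in the log comes from the interpreter failing to acquire the stderr lock at shutdown, which this sketch does not trigger.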

albertz commented 9 months ago

Are you saying that GRAPH_WORKER=1 is always better anyway, and that we can remove the old code which handles GRAPH_WORKER>1?

I'm not looking for workarounds. Also, I could simply ignore this message.

I am reporting this because I think it's bad if the process crashes with "terminated by signal SIGABRT", and maybe this should be investigated further.

critias commented 9 months ago

No, I'm not saying GRAPH_WORKER=1 is better. It's just a workaround, which in most cases makes sisyphus slower.