radical-collaboration / hpc-workflows

NSF16514 EarthCube Project - Award Number:1639694

OSError: [Errno 116] Stale file handle #148

Closed wjlei1990 closed 2 years ago

wjlei1990 commented 3 years ago

Hi,

I have 4 large jobs waiting in the queue on Summit. They have been sitting in the queue for a while, due to their large job size.

I noticed that 3 of the 4 jobs kept showing the error message below, repeatedly.

--- Logging error ---
Traceback (most recent call last):
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/logging/__init__.py", line 1029, in emit
    self.flush()
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/logging/__init__.py", line 1009, in flush
    self.stream.flush()
OSError: [Errno 116] Stale file handle
Call stack:
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/pilot/utils/component.py", line 939, in run
    ret = self._cb()
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/pilot/pilot_manager.py", line 264, in _pilot_heartbeat_cb
    self._pilot_send_hb()
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/pilot/pilot_manager.py", line 407, in _pilot_send_hb
    self._session._dbs.pilot_command('heartbeat', {'pmgr': self._uid}, pid)
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/pilot/db/database.py", line 244, in pilot_command
    self._log.debug('insert cmd: %s %s %s', pids, cmd, arg)
Message: 'insert cmd: %s %s %s'
Arguments: (None, 'heartbeat', {'pmgr': 'pmgr.0000'})
andre-merzky commented 3 years ago

Do you have an estimate of how long the jobs were sitting in the queue? We have not seen that error mode before; it looks like the Python logging layer loses access to the log file handles :-/ I'll check if we missed some documented limits.
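
For reference, errno 116 is ESTALE: the network filesystem has invalidated a file handle that the process still holds open, so every later write or flush on that stream fails. Below is a minimal sketch of a logging handler that reopens its file once when that happens; StaleSafeFileHandler is a hypothetical name, not part of RADICAL-Pilot or EnTK, and it assumes the log file is managed by a standard logging.FileHandler.

import errno
import logging

class StaleSafeFileHandler(logging.FileHandler):
    """FileHandler that reopens its log file once if the handle goes stale."""

    def emit(self, record):
        if self.stream is None:              # honour delay=True
            self.stream = self._open()
        try:
            msg = self.format(record)
        except Exception:
            self.handleError(record)
            return
        try:
            self.stream.write(msg + self.terminator)
            self.flush()
        except OSError as e:
            if e.errno != errno.ESTALE:      # not a stale handle: default handling
                self.handleError(record)
                return
            try:
                # errno 116: the filesystem dropped our open handle (e.g. after
                # an NFS failover or remount); reopen the path and retry once.
                self.close()
                self.stream = self._open()
                self.stream.write(msg + self.terminator)
                self.flush()
            except Exception:
                self.handleError(record)
        except Exception:
            self.handleError(record)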

wjlei1990 commented 3 years ago

Hi @andre, about 3-4 days.

Now the Python script has been killed:

--- Logging error ---
Traceback (most recent call last):
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/logging/__init__.py", line 1029, in emit
    self.flush()
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/logging/__init__.py", line 1009, in flush
    self.stream.flush()
OSError: [Errno 116] Stale file handle
Call stack:
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/pilot/utils/component.py", line 939, in run
    ret = self._cb()
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/pilot/pilot_manager.py", line 264, in _pilot_heartbeat_cb
    self._pilot_send_hb()
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/pilot/pilot_manager.py", line 407, in _pilot_send_hb
    self._session._dbs.pilot_command('heartbeat', {'pmgr': self._uid}, pid)
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/pilot/db/database.py", line 244, in pilot_command
    self._log.debug('insert cmd: %s %s %s', pids, cmd, arg)
Message: 'insert cmd: %s %s %s'
Arguments: (None, 'heartbeat', {'pmgr': 'pmgr.0000'})
run.bash: line 6: 106270 Killed                  python run_entk.py perturb_0.0125

However, the job still exists in the Summit LSF job queue. Should I cancel the job manually?

If I leave it there, will it run properly once it starts?

wjlei1990 commented 3 years ago

Now there is a job running on Summit... but this job seems to be doing nothing, not computing anything...

I am going to kill it manually...

But I am waiting for the EnTK team to comment further!

andre-merzky commented 3 years ago

Sorry for the late response!

Indeed, the pilot job will start, but EnTK delays the submission of tasks until it learns about the job startup, so if the client side is gone, then no tasks get submitted to the pilot :-/
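
To illustrate the point, here is a minimal sketch based on the public EnTK examples (not on your run_entk.py; the resource description values are placeholders). The client process that calls amgr.run() has to stay alive while the LSF job waits in the queue, because that call is what dispatches tasks to the pilot once it starts.

from radical.entk import Pipeline, Stage, Task, AppManager

p = Pipeline()
s = Stage()
t = Task()
t.executable = '/bin/date'            # trivial placeholder workload
s.add_tasks(t)
p.add_stages(s)

amgr = AppManager()                   # client-side workflow manager
amgr.resource_desc = {
    'resource': 'ornl.summit',        # placeholder resource description
    'queue'   : 'batch',
    'walltime': 30,
    'cpus'    : 168,
    'project' : 'ABC123',             # hypothetical allocation
}
amgr.workflow = [p]

# run() blocks in the client process; if this process dies while the LSF job
# is still queued, the pilot will start later but never receive any tasks.
amgr.run()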

I did not find anything about logging handler timeouts in Python, and I have no idea what happens there. Am I reading your post correctly that one of the 4 jobs succeeded?

wjlei1990 commented 3 years ago

Actually, I killed them all manually, since I felt they would not survive...

So I have no idea whether they would have succeeded or not.

andre-merzky commented 3 years ago

I am really sorry, but I have no idea yet what to do here or how to debug it. That is not supposed to happen.

One question though: did all jobs fail on the same log message, or on random ones? If you don't know, would you mind submitting a couple of small but long-running tests to check? Maybe the log error is a red herring and the actual error lies in nearby code, not in the logging...