radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

Task stuck in the `agent_staging_output` #3211

Closed mtitov closed 3 months ago

mtitov commented 4 months ago

When the logs were checked, the task had already been finished for about 1 hour, and the last tracing event the agent_staging_output module recorded for this task was staging_stderr_start (it didn't progress further):

$ tail -n 3 ./agent_staging_output.0000.prof
1720657683.6917439,staging_stdout_start,agent_staging_output.0000,Thread-1,m1.i0.s2.mldocking.0.0000,,
1720657683.7218523,staging_stdout_stop,agent_staging_output.0000,Thread-1,m1.i0.s2.mldocking.0.0000,,
1720657683.7219148,staging_stderr_start,agent_staging_output.0000,Thread-1,m1.i0.s2.mldocking.0.0000,,
$ ll ./agent_staging_output.0000.*
-rw-r--r-- 1 matitov chm155       0 Jul 10 16:11 ./agent_staging_output.0000.err
-rw-r--r-- 1 matitov chm155    7554 Jul 10 20:28 ./agent_staging_output.0000.log
-rw-r--r-- 1 matitov chm155       0 Jul 10 16:11 ./agent_staging_output.0000.out
-rw-r--r-- 1 matitov chm155 8655581 Jul 10 20:28 ./agent_staging_output.0000.prof

The task's stderr file is relatively large (about 15 MB), although the stdout file, which is of similar size (about 13 MB), seems to have been processed just fine:

$ ll m1.i0.s2.mldocking.0.0000/
total 49608
-rw-r--r-- 1 matitov chm155 15227965 Jul 10 20:27 m1.i0.s2.mldocking.0.0000.err
-rwxr--r-- 1 matitov chm155     2443 Jul 10 16:56 m1.i0.s2.mldocking.0.0000.exec.sh
-rw-r--r-- 1 matitov chm155     2187 Jul 10 16:56 m1.i0.s2.mldocking.0.0000.launch.out
-rwxr--r-- 1 matitov chm155     3411 Jul 10 16:56 m1.i0.s2.mldocking.0.0000.launch.sh
-rw-r--r-- 1 matitov chm155 13081219 Jul 10 20:28 m1.i0.s2.mldocking.0.0000.out
-rw-r--r-- 1 matitov chm155    69660 Jul 10 20:28 m1.i0.s2.mldocking.0.0000.prof

and the task wasn't reported back to the TMGR.

mtitov commented 4 months ago

Ah, sorry, I didn't check the log itself at first, but it has an error message:

1720645075.369 : agent_staging_output.0000 : 62621 : 140737144416000 : DEBUG    : put bulk TMGR_STAGING_OUTPUT_PENDING: 1: agent_collecting_queue
1720657683.690 : agent_staging_output.0000 : 62621 : 140737144416000 : DEBUG    : advance bulk: 1 [False, True, AGENT_STAGING_OUTPUT]
1720657683.747 : agent_staging_output.0000 : 62621 : 140737144416000 : ERROR    : staging prep error
Traceback (most recent call last):
  File "/ccs/proj/chm155/IMPECCABLE/miniconda/envs/rct/lib/python3.9/site-packages/radical/pilot/agent/staging_output/default.py", line 82, in work
    self._handle_task_stdio(task)
  File "/ccs/proj/chm155/IMPECCABLE/miniconda/envs/rct/lib/python3.9/site-packages/radical/pilot/agent/staging_output/default.py", line 181, in _handle_task_stdio
    for line in stderr_f.readlines():
  File "/ccs/proj/chm155/IMPECCABLE/miniconda/envs/rct/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 4080-4081: invalid continuation byte
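
To narrow down which bytes in the stderr file trigger this error, something like the following sketch could be used. Note that the positions in the traceback above (4080-4081) are most likely relative to the chunk the incremental decoder was processing, not to the start of the file. The helper names `first_invalid_utf8` and `first_invalid_utf8_in_file` are hypothetical and not part of RADICAL-Pilot; the sketch reads the whole file into memory, which should be acceptable for a file of about 15 MB.

```python
def first_invalid_utf8(data: bytes, context: int = 16):
    """Return (offset, surrounding_bytes) of the first invalid UTF-8
    sequence in `data`, or None if `data` decodes cleanly."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError as e:
        # e.start / e.end are byte offsets into `data`
        lo = max(0, e.start - context)
        hi = e.end + context
        return e.start, data[lo:hi]
    return None


def first_invalid_utf8_in_file(path: str, context: int = 16):
    """Same check for a file on disk (hypothetical helper, reads it whole)."""
    with open(path, 'rb') as f:
        return first_invalid_utf8(f.read(), context)
```

Calling `first_invalid_utf8_in_file('m1.i0.s2.mldocking.0.0000/m1.i0.s2.mldocking.0.0000.err')` would then give the absolute byte offset and a few surrounding bytes to inspect.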

andre-merzky commented 4 months ago

There are two issues here: (a) the task should be failed on that error, and (b) the error should not happen. Can you please attach the task's stderr file?
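
Both points can be illustrated with a minimal sketch. This is not the actual RADICAL-Pilot code or fix. It assumes a task is a plain dict, and the helper names `read_text_lossy` and `handle_task_stderr` as well as the `'FAILED'` state string are hypothetical. For (b), the file is opened with `errors='replace'`, so undecodable bytes become U+FFFD instead of raising `UnicodeDecodeError`. For (a), any remaining I/O error marks the task as failed rather than leaving it stuck.

```python
def read_text_lossy(path):
    """Read a text file line by line, replacing undecodable bytes
    with U+FFFD instead of raising UnicodeDecodeError."""
    with open(path, 'r', encoding='utf-8', errors='replace') as f:
        return f.readlines()


def handle_task_stderr(task, path):
    """Hypothetical handler: attach stderr to the task dict, or mark
    the task as failed if the file cannot be read at all."""
    try:
        task['stderr'] = ''.join(read_text_lossy(path))
    except OSError as e:
        # assumption: a failed task carries a state and the error text
        task['state'] = 'FAILED'
        task['exception'] = repr(e)
    return task
```

Whether replacing bytes is acceptable depends on how the stderr content is used later; `errors='backslashreplace'` is an alternative that keeps the original byte values visible as escape sequences.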