Open rptaylor opened 4 years ago
Thanks for the report. Yes, $1 should be the signal name, but it looks like it was missing. Can you check what signal was actually sent in this case?
kill returned the error "invalid signal specification" so I would guess it failed and did not send any signal at all.
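A minimal sketch of how that error can arise, assuming the failing line has the shape `kill -s $sig $pid` with the variables unquoted (the actual line 260 is not reproduced here; `forward_signal` is a hypothetical helper). When the signal name is empty, bash's builtin `kill` reads the PID as the signal name, rejects it, and returns a non-zero status without signalling any process:

```shell
#!/usr/bin/env bash
# Assumption: the failing line has the shape `kill -s $sig $pid`, unquoted.
# The real line 260 of runpilot2-wrapper.sh is not reproduced here.

forward_signal() {
  local sig="$1" pid="$2"
  # Deliberately unquoted: an empty $sig disappears during word splitting,
  # so bash sees `kill -s 31239` and parses the PID as the signal name.
  kill -s $sig $pid
}

# Empty signal name; the PID is the one from the wrapper stderr quoted in this thread.
# No signal is sent: kill rejects the arguments before targeting any process.
err=$(forward_signal "" 31239 2>&1)
status=$?
echo "exit status: $status"
echo "message: $err"
```

If the variables were quoted instead, the empty string itself would be reported as the invalid signal, so a PID appearing in the message is consistent with an empty, unquoted $1.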
Sure but this is only trapped by the wrapper when a signal is sent. Any record from the batch system logs?
It's hard to tell what signal a process would have received; I don't think it's logged. But the job was killed one way or another.
Have you seen this happen before?
Hmm, I found another case (there are probably a lot): https://bigpanda.cern.ch/job?pandaid=4488870845
The .out terminates abruptly at 16:58: https://aipanda024.cern.ch/condor_logs_2/19-09-20_07/grid.15641138.2.out
The .err: https://aipanda024.cern.ch/condor_logs_2/19-09-20_07/grid.15641138.2.err
2019-09-20 11:03:27 UTC [wrapper] ==== wrapper stderr BEGIN ====
./runpilot2-wrapper.sh: line 260: kill: 31239: invalid signal specification
In this case I found messages in the logs of the same worker node (presumably the same time, in another time zone):
Sep 20 09:58:20 hermes-kvm002.westgrid.uvic.ca root[34410]: killing hermes-kvm002 31239
Sep 20 09:58:20 hermes-kvm002.westgrid.uvic.ca root[34410]: killing hermes-kvm002 31239
Sep 20 09:58:20 hermes-kvm002.westgrid.uvic.ca root[34447]: KILLING hermes-kvm002 31239
Checking the scripts, I see it first sent a plain kill and then kill -9 to process 31239.
Seems like somehow the trap_handler function is not being called as expected? Not sure if this interferes with the subsequent cleanup and termination of the job.
Hi Ryan, it seems this has always been broken, since the trap command doesn't pass the signal name to the handler as an argument. The fix looks like this: https://gist.github.com/ptrlv/c5740b176578e4cbf7a85426419fc1bf
Perhaps you can check this before I deploy in the wrapper. Thanks!
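For readers following along, here is a rough sketch of the general technique. It is not the content of the gist: bash does not pass the signal name to a trap handler, so one common approach is to register a separate trap per signal and pass the name explicitly. `trap_handler` is the name used in this thread; `setup_traps`, `child_pid`, and `last_signal` are hypothetical names introduced for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the deployed wrapper code.

trap_handler() {
  local sig="$1"          # signal name, supplied by the trap string below
  last_signal="$sig"      # hypothetical variable, recorded for inspection
  echo "[wrapper] caught SIG${sig}" >&2
  # Forward the signal to the payload if one is running.
  # child_pid is an assumed variable holding the payload's PID.
  if [ -n "${child_pid:-}" ]; then
    kill -s "$sig" "$child_pid" 2>/dev/null
  fi
}

setup_traps() {
  local sig
  for sig in "$@"; do
    # Bake the signal name into the command string so the handler gets it as $1.
    trap "trap_handler $sig" "$sig"
  done
}

setup_traps TERM INT USR1
```

With traps installed this way, sending e.g. SIGUSR1 to the wrapper process runs `trap_handler USR1`, so $1 is never empty inside the handler.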
Hi @ptrlv
Take a look at https://aipanda023.cern.ch/condor_logs_2/19-09-18_17/grid.4106118.3.err
It looks like something went wrong with the kill statement on this line: https://github.com/ptrlv/adc/blob/5826b770cebcc3f166e2c38f15e226801201fb34/runpilot2-wrapper.sh#L260
Is 11243 supposed to be the PID rather than the signal? Or maybe the signal ($1) was empty?
This job failed https://bigpanda.cern.ch/job?pandaid=4487072979
taskbuffer, 300: The worker was finished while the job was running : None
Not sure if that is a result of or related to this.
Thanks!