ptrlv / adc

Misc tools for ADC

line 260: kill: 11243: invalid signal specification #13

Open rptaylor opened 4 years ago

rptaylor commented 4 years ago

Hi @ptrlv

Take a look at https://aipanda023.cern.ch/condor_logs_2/19-09-18_17/grid.4106118.3.err

2019-09-18 17:40:58 UTC [wrapper] ==== wrapper stderr BEGIN ====
./runpilot2-wrapper.sh: line 260: kill: 11243: invalid signal specification

It looks like something went wrong with the kill statement on this line: https://github.com/ptrlv/adc/blob/5826b770cebcc3f166e2c38f15e226801201fb34/runpilot2-wrapper.sh#L260

Is 11243 supposed to be the PID rather than the signal? Or maybe the signal ($1) was empty?

This job failed: https://bigpanda.cern.ch/job?pandaid=4487072979
The error was: "taskbuffer, 300: The worker was finished while the job was running : None"

I'm not sure whether that failure is caused by this error or otherwise related to it.

Thanks!

ptrlv commented 4 years ago

Thanks for the report. Yes, $1 should be the signal name, but it looks like it was missing. Can you check what signal was actually sent in this case?

rptaylor commented 4 years ago

kill returned the error "invalid signal specification" so I would guess it failed and did not send any signal at all.
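The failure mode is easy to reproduce in isolation. The sketch below assumes the handler forwards the signal with something like `kill -s $1 <pid>`; the function name `forward_signal` is hypothetical and the real line 260 may differ. When `$1` is empty and unquoted, bash reads the PID as the signal name, rejects it, and sends nothing.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the wrapper's handler: forward signal $1 to a PID.
# The PID 11243 is taken from the log above; no signal ever reaches it here,
# because kill fails while parsing its arguments.
forward_signal() {
    # $1 is left unquoted on purpose, mirroring the suspected bug.
    # With no argument this expands to: kill -s 11243
    kill -s $1 11243
}

# Call the handler with no arguments, as a trap that does not pass
# the signal name would.
err=$(forward_signal 2>&1)
status=$?

echo "exit status: $status"
echo "$err"
```

In bash the captured message ends with "kill: 11243: invalid signal specification" and the exit status is non-zero, which matches the guess that no signal was forwarded.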

ptrlv commented 4 years ago

Sure, but the wrapper only reaches this line from its trap, which fires when a signal is received, so some signal must have been sent to the wrapper. Is there any record in the batch system logs?

rptaylor commented 4 years ago

It's hard to tell what signal a process would have received; I don't think it's logged. But the job was killed one way or another.

Have you seen this happen before?

rptaylor commented 4 years ago

Hmm, I found another case (there are probably a lot): https://bigpanda.cern.ch/job?pandaid=4488870845
The .out terminates abruptly at 16:58: https://aipanda024.cern.ch/condor_logs_2/19-09-20_07/grid.15641138.2.out
The .err: https://aipanda024.cern.ch/condor_logs_2/19-09-20_07/grid.15641138.2.err

2019-09-20 11:03:27 UTC [wrapper] ==== wrapper stderr BEGIN ====
./runpilot2-wrapper.sh: line 260: kill: 31239: invalid signal specification

In this case I found a message in the system log on the same worker node (presumably the same time in another time zone):

Sep 20 09:58:20 hermes-kvm002.westgrid.uvic.ca root[34410]: killing hermes-kvm002 31239
Sep 20 09:58:20 hermes-kvm002.westgrid.uvic.ca root[34410]: killing hermes-kvm002 31239
Sep 20 09:58:20 hermes-kvm002.westgrid.uvic.ca root[34447]: KILLING hermes-kvm002 31239

Checking the scripts, I see it first sent a plain kill and then kill -9 to process 31239.

Seems like somehow the trap_handler function is not being called as expected? Not sure if this interferes with the subsequent cleanup and termination of the job.

ptrlv commented 4 years ago

Hi Ryan, it seems this has always been broken, since the trap command doesn't pass the signal to the handler as an argument. The fix looks like this: https://gist.github.com/ptrlv/c5740b176578e4cbf7a85426419fc1bf
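The gist is not reproduced here. As a rough sketch of the general idea, and not necessarily what the gist does, one common bash pattern is to register a separate trap per signal so that each one calls the handler with its own name as `$1`. The handler body, the `received` variable, and the choice of signals are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical handler: records and prints the signal name it was given.
# A real wrapper would forward it instead, e.g. kill -s "$1" "$child_pid".
trap_handler() {
    received="$1"
    echo "trapped SIG$1"
}

# bash does not pass the signal to the trap command by itself, so register
# one trap per signal and bake the name into the command string.
for sig in TERM INT USR1; do
    trap "trap_handler $sig" "$sig"
done

# Send ourselves a signal to exercise the handler.
kill -s USR1 $$
```

With this arrangement `$1` is never empty inside the handler, so a quoted `kill -s "$1"` always receives a valid signal name.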

Perhaps you can check this before I deploy in the wrapper. Thanks!