radiasoft / sirepo

Sirepo is a framework for scientific cloud computing. Try it out!
https://sirepo.com
Apache License 2.0

webcon: softIoc not exiting on alpha #1898

Closed by moellep 4 years ago

moellep commented 4 years ago

The EPICS process isn't stopped correctly: after the job is canceled, it remains running on alpha. The problem didn't appear on dev, probably because dev uses a different job runner.

To reproduce, start EPICS on the webcon controls tab and then stop it. check-sirepo-processes.pl shows it remains running in the "canceled" status:

/srv/sirepo/db/user/q3szgY1H/webcon/NR8K6Ftp/epicsServerAnimation
 Sep 19 15:51 canceled [29238]
$ ps axl | grep 29238
0  1000 14991 44763  20   0 112708   960 pipe_w S+   pts/0      0:00 grep 29238
4  1000 29238 24594  20   0 1716048 112316 poll_s S  ?          0:16 /home/vagrant/.pyenv/versions/2.7.16/envs/py2/bin/python2.7 /home/vagrant/.pyenv/versions/py2/bin/sirepo webcon run-background /srv/sirepo/db/user/q3szgY1H/webcon/NR8K6Ftp/epicsServerAnimation
0  1000 30073 29238  20   0   6704  1624 do_wai S    ?          0:00 /bin/sh -c softIoc epics-boot.cmd > epics.log
0  1000 30074 29238  20   0      0     0 do_exi Z    ?          0:00 [camonitor] <defunct>

robnagler commented 4 years ago

camonitor isn't being reaped (it's a zombie). With the new job execution system, we'll be able to address this better.

How long does camonitor need to run? Just for the simulation run?

moellep commented 4 years ago

Yes, camonitor should only run while the softIoc process is running, so that it can write the updates to a log file.

Bad code is here:

https://github.com/radiasoft/sirepo/blob/master/sirepo/pkcli/webcon.py#L185
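
For illustration only (this is not the code at the link above): a minimal sketch of the intended lifetime, where a monitor process runs only while a server process runs and is always waited on so it never lingers as `<defunct>`. The helper name `run_with_monitor` and its parameters are hypothetical, and the sketch assumes Python 3 (`wait(timeout=...)` is not available in the 2.7 interpreter shown in the `ps` output).

```python
import subprocess


def run_with_monitor(server_cmd, monitor_cmd, timeout=5):
    """Run server_cmd and keep monitor_cmd alive only while it runs.

    Hypothetical helper. Returns (server_returncode, monitor_returncode).
    """
    server = subprocess.Popen(server_cmd)
    monitor = None
    try:
        monitor = subprocess.Popen(monitor_cmd)
        # Block until the server exits on its own or is stopped.
        server.wait()
    finally:
        # Stop whatever is still running, then wait() on it so the
        # kernel can drop its process table entry (no zombie).
        for proc in (monitor, server):
            if proc is None or proc.poll() is not None:
                continue
            proc.terminate()
            try:
                proc.wait(timeout=timeout)
            except subprocess.TimeoutExpired:
                proc.kill()
                proc.wait()
    return server.returncode, monitor.returncode
```

The point of the `finally` block is that every child that was started is both signaled and waited on, whichever way the server ends.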

robnagler commented 4 years ago

Definitely don't want it grabbing SIGTERM. This should be centrally managed.

The new job exec stuff is coming along nicely and it'll account for this, especially on prod. I think it's ok basically the way it is, because softIoc has to talk to camonitor, so they should be in the same job. We should just make sure they are in the same process group (possibly with a manual fork/exec).
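
A rough sketch of the process-group idea, under stated assumptions: the helper names `start_in_new_group` and `stop_group` are hypothetical, and the code assumes Python 3 on POSIX, where `start_new_session=True` makes `subprocess` call `setsid()` between fork and exec (on Python 2.7 the usual equivalent is `preexec_fn=os.setsid`). It is not how sirepo's job runner is implemented.

```python
import os
import signal
import subprocess


def start_in_new_group(cmd):
    """Start a shell command as leader of its own process group.

    Hypothetical helper. Everything the shell spawns inherits the
    group, so the whole job can be signaled at once.
    """
    return subprocess.Popen(cmd, shell=True, start_new_session=True)


def stop_group(proc, timeout=5):
    """Send SIGTERM to the whole group, then reap the leader."""
    pgid = proc.pid  # after setsid(), the leader's pgid equals its pid
    try:
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        pass  # the group has already exited
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Escalate if the group ignored SIGTERM.
        os.killpg(pgid, signal.SIGKILL)
        proc.wait()
    return proc.returncode
```

With this shape, the command from the `ps` listing (`softIoc epics-boot.cmd > epics.log`) and anything started alongside it in the same shell would receive the signal together, and no individual child needs its own SIGTERM handler.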

robnagler commented 4 years ago

job_driver.Local (dev) uses process groups. Docker shuts down the container. Not an issue for now.