pcdshub / IocManager

pyqt5 + pyca-based EPICS IOC Manager
https://confluence.slac.stanford.edu/display/PCDS/IOC+Manager+for+Users

IOCs failing to boot #4

Closed klauer closed 11 months ago

klauer commented 3 years ago

Symptom

@@@ @@@ @@@ @@@ @@@
@@@ Received a sigChild for process 15300. The process was killed by signal 9
@@@ Current time: Wed Jun  9 16:15:00 2021
@@@ Child process is shutting down, a new one will be restarted shortly
@@@ ^R or ^X restarts the child, ^Q quits the server
@@@ Restarting child "ioc-mfx-tfs-lens"
@@@    (as /reg/g/pcds/pyps/config/mfx/iocmanager/startProc)
@@@ The PID of new child "ioc-mfx-tfs-lens" is: 15734
@@@ @@@ @@@ @@@ @@@

and nothing happens, even waiting a long time. No output except for what procServ dumps out.

I've seen this happen multiple times, and things usually recover at some point (30 mins or more?).

Guess

I think some of the config-related tools can create problems. My guess is that the lock acquisition and our poor, finicky network filesystems create hostile environments for these tools:

$ python /reg/g/pcds/pyps/config/mfx/iocmanager/getDirectory.py ioc-mfx-tfs-lens mfx

The above locks the terminal, blocking Ctrl-C and even Ctrl-Z for the longest time. 3 minutes later, and still no luck getting the IOC directory.

strace

Let's see what strace shows... Oh, now it's back to working, so strace output is useless.

Suspicious code?

I'd guess it hangs here: https://github.com/pcdshub/IocManager/blob/39a284dbf41d84f743d78e994654f160c1337383/utils.py#L490

Any thoughts on the above or where else it might be, @mcb64?

mcb64 commented 3 years ago

I think this is a very accurate assessment. However, I'm not sure what to do here. If we don't lock, it's unlikely we'd be starting at the same time as a file update, but...

The basic problem is that Linux file locking in an NFS environment is notoriously unreliable. And we're not using a real DB here... just a text file.
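
For illustration, one way to keep a lock attempt from hanging forever is to poll a non-blocking lock until a deadline passes. This is only a minimal sketch, not the code in utils.py: the helper name `try_lock` and its parameters are hypothetical, and it assumes an flock-style advisory lock. It would not help if the filesystem call itself stalls on an unresponsive NFS server.

```python
import errno
import fcntl
import time


def try_lock(fileobj, timeout=5.0, poll=0.1):
    """Try to take an exclusive advisory lock, giving up after `timeout` seconds.

    Hypothetical helper: returns True if the lock was acquired, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # LOCK_NB makes flock fail immediately instead of blocking.
            fcntl.flock(fileobj, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return True
        except OSError as exc:
            # Only "lock is held by someone else" is retried; other errors propagate.
            if exc.errno not in (errno.EAGAIN, errno.EACCES):
                raise
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

A caller could then decide what to do when `try_lock` returns False, for example read the file anyway or exit with an error, rather than waiting indefinitely.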

That said... perhaps the solution is to switch to mongoDB or something similar and let it take care of the concurrent operations? (We're basically just reading the text file into a dictionary. No reason we couldn't do something similar in a mongoDB, even if we just dumped the entire hutch configuration into one document/record/whatever the hell they are called.)

klauer commented 3 years ago

I agree, I think long-term a database is probably the right way to go. For now, I suppose the current implementation is about as safe as it can get, just dealing with a configuration file that can be truncated-in-place and rewritten.
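
As a point of comparison with truncate-in-place, a common alternative is to write the new contents to a temporary file in the same directory and rename it over the original, so a reader sees either the old file or the new one. The following is a sketch under assumptions, not what IocManager does: the function name `write_config_atomically` is hypothetical, and the atomicity of the rename is the local POSIX guarantee, which would need to be verified on the NFS mounts in question.

```python
import os
import shutil
import tempfile


def write_config_atomically(path, text):
    """Replace `path` with `text` via a temp file and rename (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-cfg-")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(text)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to storage before renaming
        if os.path.exists(path):
            # mkstemp creates the file with mode 0600; keep the original permissions.
            shutil.copymode(path, tmp_path)
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

With this pattern a reader that opens the file without a lock should never observe a half-written configuration, which is the case the lock is currently guarding against.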

As to the IOC boot procedures, I wonder if the lock really matters, though. Let's say the config file was being written, and getDirectory races to try to read the file. It fails, and the boot procedure fails, and (hopefully) procServ restarts in a few seconds or so.
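
The fail-and-let-procServ-restart idea could look roughly like the sketch below: read the file without any lock, and return a nonzero status if the read, the parse, or the lookup fails. Everything here is an assumption for illustration. `parse_lines` is a stand-in parser for a made-up "name directory" format, not the real configuration format, and `get_directory_unlocked` is a hypothetical name, not the actual getDirectory.py.

```python
import sys


def parse_lines(text):
    """Stand-in parser (NOT the real IocManager format): one 'name directory' pair per line."""
    entries = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        parts = line.split()
        if len(parts) != 2:
            # A half-written line is treated as a parse failure.
            raise ValueError(f"line {lineno}: expected 'name directory'")
        name, directory = parts
        entries[name] = directory
    return entries


def get_directory_unlocked(config_path, ioc_name, parse=parse_lines):
    """Print the IOC directory and return 0, or return 1 on any failure."""
    try:
        with open(config_path) as f:
            entries = parse(f.read())
        directory = entries[ioc_name]
    except (OSError, ValueError, KeyError) as exc:
        # Fail fast; the supervising process is expected to retry the boot.
        print(f"config lookup failed: {exc!r}", file=sys.stderr)
        return 1
    print(directory)
    return 0
```

A script built this way would pass the return value to `sys.exit`, so a read that races with a rewrite turns into a quick failed boot attempt instead of a long wait on the lock.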

I have a feeling I'm biased here; others would likely say that just waiting a few minutes is preferable if it means the IOC boots right the first time!

mcb64 commented 3 years ago

Yeah, it's probably safe enough to just remove it.

This whole 'commithost' thing is for the birds though. It would be good to just use a db somewhere and let it deal with everything.

mcb64 commented 3 years ago

Incidentally, this doesn't seem to be the problem. I was running into this on ioc-xpp-rec03, and so I commented out the locking. Nope.

klauer commented 3 years ago

Huh, darn ☹️ Did getDirectory when called manually fail to run for that machine? strace on that perhaps?

(It has to be network file system something I feel, but I've been wrong many-a-time before...)

mcb64 commented 3 years ago

Yeah, it seems to be failing in general. I agree, it's a file system problem... maybe the exclusive locking is screwing things up worse than I thought? --Mike

klauer commented 11 months ago

This was due to file system issues. Closing as "cannot reproduce"