mendix / m2ee-tools

m2ee, the Mendix runtime helper tools for GNU/Linux

Fix hanging client actions when runtime does not respond. #26

Closed knorrie closed 5 years ago

knorrie commented 6 years ago

Also see https://github.com/mendix/m2ee-tools/pull/12

Moving over from there:

"Currently, if someone damages their running app so badly that it keeps accepting connections and requests but never answers them, you're also quite out of luck with the CLI. Starting m2ee will just hang on the first runtime_status call, and you won't get a prompt to try to stop the app.

I'd propose to change the timeout=None in client.request(action, params, timeout) and set it to a sane low default that should be sufficient for almost all admin actions, which should return instantly. The other ones which can take longer, like start and stop, already define their own custom timeouts."

knorrie commented 6 years ago

Hm, timeouts are nasty.

So, what are the actual current problems we have when the admin port accepts connections, but just never sends anything back at all?

  1. When starting the CLI, it just hangs, and you can't do anything because the 'status' command which is triggered immediately hangs. (Note to self: move this from the constructor to the calling code in main, since status shouldn't be done in one-off mode.)
  2. Nagios checks keep hanging indefinitely and pile up in the OS.

When you're at the CLI prompt, several other actions will also seem to hang (show_critical_log_messages etc.), but when you're done waiting, a quick Ctrl-C aborts them. So that's not a big problem.

Calls from the Cloud Portal in Mendix Cloud v3 already have timeouts on different layers. When the admin port does not respond in time, the complete subprocess which is doing the request is simply killed, so nothing keeps hanging around. This could be improved by pushing the timeout further down, which would improve error reporting. But that's not urgent.
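For illustration, the "kill the whole subprocess" style of timeout could look roughly like the sketch below. How the Cloud Portal actually implements this is not described here; run_with_deadline is a hypothetical helper name, and the child command is just a placeholder for "something that does the admin request".

```python
import subprocess
import sys


def run_with_deadline(argv, deadline):
    """Run argv; return its exit code, or None if it exceeded the deadline.

    Hypothetical helper. subprocess.run() kills the child and waits for it
    when the timeout expires, so nothing is left hanging around.
    """
    try:
        done = subprocess.run(argv, capture_output=True, timeout=deadline)
    except subprocess.TimeoutExpired:
        return None
    return done.returncode


# Placeholder child that "hangs", standing in for a request to an
# unresponsive admin port.
hanging_child = [sys.executable, "-c", "import time; time.sleep(30)"]
quick_child = [sys.executable, "-c", "pass"]
```

The downside mentioned above shows up here too: the caller only learns that the deadline passed, not which call was stuck or why.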

I think my proposal to set a general timeout on things is bad. I'm going to do it differently. The caller has to be able to specify a timeout when wanted, but by default there shouldn't be one. This means adding a timeout param with default None to all helper functions in client.py.

Afterwards, just specify it for the first status call, and in the nagios plugin, and then it's good for now. The same case was already solved for munin in b139dd2dd6 with a specific timeout for the statistics calls.
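A minimal sketch of what passing an optional timeout through the helper functions could look like. This is not the actual client.py: the AdminClient class, the injected transport callable and the about helper are hypothetical names for illustration. Only runtime_status and the request(action, params, timeout) shape come from the discussion above.

```python
class AdminClient:
    """Hypothetical stand-in for the admin API client (not the real client.py)."""

    def __init__(self, transport):
        # transport is an assumed callable(action, params, timeout) that would
        # do the actual HTTP call; injected so the sketch stays self-contained.
        self._transport = transport

    def request(self, action, params=None, timeout=None):
        # timeout=None means "wait indefinitely", which stays the default.
        return self._transport(action, params or {}, timeout)

    def runtime_status(self, timeout=None):
        return self.request("runtime_status", timeout=timeout)

    def about(self, timeout=None):
        # Hypothetical second helper, to show that each one just forwards it.
        return self.request("about", timeout=timeout)


calls = []


def recording_transport(action, params, timeout):
    # Fake transport that only records what it was asked to do.
    calls.append((action, timeout))
    return {"result": 0}


client = AdminClient(recording_transport)
client.runtime_status(timeout=3)  # e.g. the first status call at CLI startup
client.about()                    # everything else keeps the no-timeout default
```

With this shape the decision stays with the caller: the interactive CLI and the nagios plugin can pass a small value, while other callers are unaffected.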

knorrie commented 6 years ago

By the way, how to test without having to cause gc overhead limits and other breakage?

socat -ly TCP-LISTEN:31337,reuseaddr,fork 'EXEC:sleep 31337'

And then change the admin port setting in the config to that port. :)
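If socat is not at hand, roughly the same "accepts connections but never answers" behaviour can be faked in Python. This is only a sketch under assumptions: start_silent_listener and probe are hypothetical names, the request bytes are a dummy, and the probe is a bare socket client rather than the m2ee client.

```python
import socket
import threading
import time


def start_silent_listener(host="127.0.0.1", port=0):
    """Accept TCP connections but never send a byte back.

    port=0 lets the OS pick a free port; the chosen port is returned.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen()
    held = []  # keep accepted sockets referenced so they stay open

    def accept_forever():
        while True:
            try:
                conn, _ = server.accept()
            except OSError:
                return  # listening socket was closed
            held.append(conn)

    threading.Thread(target=accept_forever, daemon=True).start()
    return server, server.getsockname()[1]


def probe(port, timeout):
    """Return True if the peer sends anything within `timeout` seconds."""
    with socket.create_connection(("127.0.0.1", port), timeout=timeout) as sock:
        sock.sendall(b"POST / HTTP/1.0\r\n\r\n")  # dummy request
        try:
            return bool(sock.recv(1))
        except socket.timeout:
            return False


server, port = start_silent_listener()
started = time.monotonic()
answered = probe(port, timeout=0.5)
elapsed = time.monotonic() - started
```

Without the timeout argument the recv call would block indefinitely, which is the hang described at the top of this issue.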

knorrie commented 6 years ago

Implemented in develop branch.

knorrie commented 6 years ago

Doing a stop is also possible now when the app is completely borked:

-$ ./m2ee.py -y stop
WARNING: Admin API not available, so not able to use shutdown action.
INFO: Waiting for the JVM process to disappear...
WARNING: The application process seems not to respond to any command or signal.
INFO: Waiting for the JVM process to disappear...
INFO: The JVM process has been destroyed.

knorrie commented 5 years ago

Included in v7.2-rc1