hauleth / erlang-systemd

systemd utilities for Erlang applications

Plug example fixes #24

Closed · elucid closed this 3 years ago

elucid commented 3 years ago

Hello, and thank you very much for your work on this library. I was looking for a simple way to manage no-downtime Elixir deploys with systemd and found your library.

I have a few small changes to suggest to the plug example: the first looks like you forgot to change the name of the socket service in plug.service when you updated from the previous example. The second just changes how releases are copied into the service directory so that you can make install-rel more than once.

As an aside, I've put together an example of using socket activation for deploying a Phoenix app. I would open a pull request here, but it's probably too many files. Would you be interested instead in a pull request to add some Phoenix integration code snippets to README.md?

hauleth commented 3 years ago

I have a few small changes to suggest to the plug example: the first looks like you forgot to change the name of the socket service in plug.service when you updated from the previous example.

No, I meant epmd.socket, which is the socket on which EPMD listens. In my example I assumed that there is an OS installation of EPMD that provides that service for you.
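For reference, a distribution-provided epmd.socket usually looks roughly like this (a sketch only; the unit your OS actually ships will differ, especially in the listen addresses):

```ini
# Sketch of an OS-provided epmd.socket (illustrative; check what your distribution actually installs)
[Unit]
Description=Erlang Port Mapper Daemon Activation Socket

[Socket]
# EPMD listens on TCP port 4369; distributions usually bind it to loopback only
ListenStream=127.0.0.1:4369
Accept=false

[Install]
WantedBy=sockets.target
```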

The second just changes how releases are copied into the service directory so that you can make install-rel more than once.

Thanks, that indeed seems useful.

I would open a pull request here, but it's probably too many files.

If you strip out all the Phoenix boilerplate that is not relevant to the example (Ecto setup, migrations, tests, etc.), then I think we can add it as an example directly here. You can publish such a repo and I will try to merge it on my own if you want.

hauleth commented 3 years ago

💚 💙 💜 💛 ❤️

elucid commented 3 years ago

Okay, thanks for clarifying. I was a little confused because my setup didn't have an epmd.socket service, since I'm not using an OS Erlang install.

Completely unrelated: would you be able to clarify how restarts are supposed to work in the plug example? If I install the services via make start and run systemctl restart plug.service while I am making many requests using e.g. Apache Bench, I get a handful of "Connection reset by peer" errors. I would have expected all of the requests to succeed: requests being handled by the old service should be allowed to complete before it is shut down, and requests made before the new service comes up should be queued in the systemd socket buffer. The latter seems to be occurring because I can see that my longest request is reported at around 4000ms if there is a restart during the benchmark, whereas it is typically only 6ms when no restart is happening. So it would seem that some of the requests being handled at the time of the restart are dropped.
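Concretely, the test looks roughly like this (URL and ab flags are from my setup and may need adjusting):

```shell
# run a benchmark against the socket-activated plug example
ab -r -n 10000 -c 10 http://localhost/

# ...and while ab is still running, in a second shell:
systemctl restart plug.service
```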

I've set up a similar systemd socket activation test with a Ruby webserver which supports having file descriptors passed in, and if I restart during a test, there are no failed requests.

Should the plug service be restarted in a different way? Is it not receiving the correct signal from systemd to shut down?

hauleth commented 3 years ago

The latter seems to be occurring because I can see that my longest request is reported at around 4000ms if there is a restart during the benchmark, whereas it is typically only 6ms when no restart is happening. So it would seem that some of the requests being handled at the time of the restart are dropped.

I've set up a similar systemd socket activation test with a Ruby webserver which supports having file descriptors passed in, and if I restart during a test, there are no failed requests.

Should the plug service be restarted in a different way? Is it not receiving the correct signal from systemd to shut down?

For the time being this is due to how Erlang handles sockets. Right now Erlang handles each socket by starting a separate process (called a port). That process is closed on each restart, which causes the sockets to fail and need to be restarted by systemd. With OTP 24 it should probably start to get better with the new socket module, which will make all sockets be handled by the VM itself instead of by a separate child process. This also makes "storing" sockets an impossible task for the time being, as Erlang does not provide a way to get the FD of a socket via gen_{tcp,udp} (gen_sctp support is even worse). So maybe in the future it will be easier to make restarts of the application truly zero-downtime, but we need to wait for updates in the BEAM itself before that can really happen. I am trying to find a way to make it work as-is, but it will probably take some time (help welcome though).

I tried running the examples with -kernel inet_backend socket, but for now it seems that there are no messages at all (it seems like the passed file descriptor is not being picked up).
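If someone wants to reproduce that experiment: the flag goes either on the erl command line or into the release's vm.args (rel/vm.args.eex if the example is built as a Mix release); nothing below is specific to this library:

```
# directly on the command line:
erl -kernel inet_backend socket

# or as a line in vm.args / rel/vm.args.eex:
-kernel inet_backend socket
```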

hauleth commented 3 years ago

@elucid I have created https://github.com/erlang/otp/issues/4680 to check whether the socket-based implementation will fix that problem.

elucid commented 3 years ago

@hauleth Thanks for the clarification. I will have to spend more time reading source to understand what is going on with the socket handling. My understanding of this is weak.

Last night I spent a bunch of time trying to make sense of the role connection draining plays here. In my tests hundreds of requests are initiated and successfully completed on the old Elixir process after the draining begins. I would have expected the server to stop accepting new connections and wait for all pending requests to complete before finally shutting down.

hauleth commented 3 years ago

Last night I spent a bunch of time trying to make sense of the role connection draining plays here. In my tests hundreds of requests are initiated and successfully completed on the old Elixir process after the draining begins.

Hmm, weird, as that shouldn't be the case. Can you provide me with an MVCE for such a test? I have been testing this while working on the systemd library via:

  1. Firing curl localhost/slow
  2. Initiating shutdown via sudo systemctl stop plug
  3. Firing curl localhost

Then the localhost/slow request finished successfully while the localhost request patiently waited for the restart.

EDIT: Ideally, open a separate issue for this and I will look into it.

hauleth commented 3 years ago

@elucid could you share the Ruby example of an application that works with restarts?

elucid commented 3 years ago

@hauleth sure: https://github.com/elucid/systemd-puma-example

Sorry, I have been meaning to boil down an MVCE for you but have had a busy week.

^ This should run out of the box, but you will probably have to modify WorkingDirectory, User, and Group inside systemd/puma.service to match your local setup. I just had this simple setup where systemd uses the files in my checkout, so that I could easily make modifications and restarts to test the effect of various changes.

The readme should more or less cover what you need to run. Here is the test I was doing that led me to conclude that restarts are causing requests to drop:

Puma:

  1. test the endpoint (e.g. curl localhost:3333/hello to make sure it is up)
  2. start an ab run to make a lot of requests (e.g. ab -r -n 10000 -c 10 http://localhost:3333/hello)
  3. while the ab run is still going, run systemctl restart puma.service in another shell. You might need to increase the number of requests to give yourself enough time to do the restart before the test ends. I did the test so many times that I got pretty fast at triggering the restarts during the run :)

Elixir: same as above but with http://localhost/

With the puma service, you can see that the restart causes a slight delay, ~200ms, on a few requests, but no requests fail.

With the plug service (or a similar test with a Phoenix application), the restart delay is much longer (~4-4.5s) and there are almost always failed requests. If you remove Plug.Cowboy.Drainer you still get about the same number of failed requests, but the restart is faster (~3.5s).

When I left off trying to figure this out, I had added logging statements to plug_cowboy's drainer code. By looking at the request logs I could see that during a test, after request draining was initiated, the old Elixir pid would still serve many hundreds of requests. I'm not sure what to make of this. I would have thought that after the restart was triggered, the old application would stop servicing new requests, but apparently that is not the case. It is possible that the requests being serviced at the time the restart is triggered do complete without being terminated, but that some of the requests flooding in after the restart are brutally killed. I'm really not sure.
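For context, the drainer I was logging is normally wired into the supervision tree roughly like this (a sketch with placeholder module names and a plain TCP port rather than the systemd-activated socket, so not the example's exact setup):

```elixir
# Sketch of a typical plug_cowboy drainer setup (module names are placeholders).
defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      # Cowboy listener; with these options the listener ref is MyApp.Router.HTTP
      {Plug.Cowboy, scheme: :http, plug: MyApp.Router, options: [port: 4000]},
      # Placed after the listener so it is terminated first on shutdown:
      # it suspends the listener (no new connections are accepted) and waits
      # up to :shutdown milliseconds for in-flight requests to finish.
      {Plug.Cowboy.Drainer, refs: [MyApp.Router.HTTP], shutdown: 10_000}
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end
```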

hauleth commented 3 years ago

@elucid I cannot find the reason for the problem. With OTP 24 and the socket backend for gen_tcp it works better, but I still cannot figure out how to stop connections from being dropped during restart. I will probably need to ask on the systemd mailing list, as I am unable to find a solution right now.

elucid commented 3 years ago

@hauleth do you think it is related to systemd, or to the way that connection draining is implemented?

hauleth commented 3 years ago

@elucid I do not think that it is related to systemd; I think it is somehow related to the startup time of the Erlang VM, but it should be possible to work around it somehow. Also, draining shouldn't really be a problem there but rather part of the solution, as non-accepted requests should wait in the open socket's queue. So it is more about looking for a workaround than a bug in systemd.