phoenixframework / flame

Fleeting Lambda Application for Modular Execution: auto scale parts of your existing app with a single function call
MIT License
945 stars · 41 forks

Preventing Restarts #25

Closed · cadebward closed this 7 months ago

cadebward commented 8 months ago

Hello! Thanks for all you guys do in the Phoenix community!

Issue

It seems as though a deployment (on Fly, at least) sends a SIGTERM to the FLAME node, which causes that machine to spin down along with the other, non-FLAME nodes. Imagine starting a heavy ffmpeg process, only to have it spun down 10 seconds later without a chance to finish.

In my testing, it does not appear to respect the shutdown_timeout when the platform issues a shutdown.

To test this, I ran the following code:

defmodule FFMPEG do
  def start_test_stream do
    FLAME.cast(fn ->
      System.cmd("ffmpeg", ~w(-t 01:00:00 -re -f lavfi -i testsrc -f flv URL_GOES_HERE))
    end)
  end
end

iex> FFMPEG.start_test_stream()

This spins up a test stream and sends it to the given URL. It was a quick and dirty way to get ffmpeg into a long-running process (1 hour) so I would have time to trigger and observe a deploy.

In this test I also set the following options:

      {
        FLAME.Pool,
        name: FFMpegRunner,
        min: 0,
        max: 5,
        max_concurrency: 1,
        single_use: false,
        timeout: :timer.hours(1),
        boot_timeout: :timer.minutes(1),
        shutdown_timeout: :timer.hours(1),
        idle_shutdown_after: 10_000
      }

Potential Solutions

I'm open to any ideas.

chrismccord commented 8 months ago

Check the Termination docs: https://github.com/phoenixframework/flame/blob/3a03d21db06f099e811db1e06d24f4b037b513ba/lib/flame.ex#L154-L161

You can configure the terminator's shutdown timeout. Currently it has to be a global config, but since we only allow a 1:1 named pool, we should be able to pass the pool's shutdown_timeout to the child. For now you can configure it globally:

config :flame, :terminator, shutdown_timeout: :timer.hours(1)

Note that you can also pass it to the pool as an env: argument, and then reference the env var in your runtime.exs to get per-pool termination configuration:

config :flame, :terminator, shutdown_timeout: String.to_integer(System.get_env("FLAME_SHUTDOWN_TIMEOUT") || "10000")
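For reference, a sketch of what the pool side of that could look like. It assumes the Fly backend's env: option is used to set the variable on the spawned runner machines; the value and option placement here are illustrative, so check your backend's docs for how it forwards env vars:

{
  FLAME.Pool,
  name: FFMpegRunner,
  min: 0,
  max: 5,
  max_concurrency: 1,
  # illustrative: set the env var that the runtime.exs line above reads
  backend: {FLAME.FlyBackend, env: %{"FLAME_SHUTDOWN_TIMEOUT" => "3600000"}}
}

The runner then boots with FLAME_SHUTDOWN_TIMEOUT set, and the runtime.exs snippet above turns it into that pool's terminator shutdown_timeout.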

Thanks!

cadebward commented 8 months ago

Thanks for the response! I did have that globally configured. It still seems as though the deployment immediately shuts down the FLAME node. I'll do some more testing to be sure.

chrismccord commented 8 months ago

Let me know! The started FLAME machines are completely independent of the app machines (Fly infra-wise), so a deployment won't tear them down on the Fly side. So the only possibility is that the parent node going down triggers the children to terminate per the regular monitoring flow, but the configured termination time is either ignored or there is a bug in the terminator causing an early exit.

chrismccord commented 8 months ago

@cadebward I believe I found a bug where we don't honor shutdown timeout in all shutdown cases. Can you try main and report back? I'll be able to take a closer look in a bit. Thanks!

cadebward commented 8 months ago

Hmm... something else is going on. After testing this out through a few deploys, it seems as though the FLAME node does not get swept up in a deployment. A new machine comes up, old machines spin down. The FLAME node is not included in these logs. In this example, my FLAME node is 17816409ad1408.

Updating existing machines in 'app-name' with bluegreen strategy

Cleanup Previous Deployment
  No hanging machines from a failed previous deployment

Creating green machines
  Created machine 3d8d74e3a74138 [app]

Waiting for all green machines to start
  Machine 3d8d74e3a74138 [app] - started

Waiting for all green machines to be healthy
  Machine 3d8d74e3a74138 [app] - 1/1 passing

Marking green machines as ready
  Machine 3d8d74e3a74138 [app] now ready

Waiting before cordoning all blue machines
  Machine 17816472fdd268 [app] cordoned

Waiting before stopping all blue machines

Stopping all blue machines

Waiting for all blue machines to stop
  Machine 17816472fdd268 [app] - stopped

Destroying all blue machines
  Machine 17816472fdd268 [app] destroyed

Deployment Complete

That all seems as expected. Then out of nowhere these logs show up indicating that the FLAME node has been sent a SIGTERM and is shutting down.

2024-03-08T23:38:45Z app[17816409ad1408] dfw [info] INFO Main child exited normally with code: 0
2024-03-08T23:38:45Z app[17816409ad1408] dfw [info] INFO Starting clean up.
2024-03-08T23:38:45Z app[17816409ad1408] dfw [info] WARN hallpass exited, pid: 315, status: signal: 15 (SIGTERM)
2024-03-08T23:38:45Z app[17816409ad1408] dfw [info]2024/03/08 23:38:45 listening on [fdaa:0:44f2:a7b:241:18ab:d029:2]:22 (DNS: [fdaa::3]:53)
[flv @ 0x56420e644f00] Failed to update header with correct duration.12.2kbits/s speed=   1x    e=00:04:59.32 bitrate= 212.7kbits/s speed=   1x
2024-03-08T23:38:45Z app[17816409ad1408] dfw [info][flv @ 0x56420e644f00] Failed to update header with correct filesize.
2024-03-08T23:38:45Z app[17816409ad1408] dfw [info]frame= 7880 fps= 25 q=8.6 Lsize=    8161kB time=00:05:15.16 bitrate= 212.1kbits/s speed=   1x
2024-03-08T23:38:45Z app[17816409ad1408] dfw [info]video:8038kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 1.534419%
2024-03-08T23:38:45Z app[17816409ad1408] dfw [info]Exiting normally, received signal 15.
2024-03-08T23:38:46Z app[17816409ad1408] dfw [info][  320.770011] reboot: Restarting system
2024-03-08T23:38:46Z runner[17816409ad1408] dfw [info]machine restart policy set to 'no', not restarting
[ no more logs for 17816409ad1408 after this point]

I am not sure who is sending that signal.

cadebward commented 8 months ago

Does the flame pool need to be included in the supervision tree of flame nodes?

chrismccord commented 8 months ago

Sorry, I got sidetracked assuming it was a config issue. Currently, processes spawned via cast are not directly supervised, so they are terminated by the VM as soon as the system is shut down, but you can trap exits to avoid being nuked (with caveats):

    FLAME.cast(fn ->
      Process.flag(:trap_exit, true)
      System.cmd("ffmpeg", ~w(-t 01:00:00 -re -f lavfi -i testsrc -f flv URL_GOES_HERE))
    end)

The issue here is that we don't control the full lifecycle of the cast; we only monitor it from the terminator. FLAME.place_child, where we own the spawned child, is a better fit. FLAME.place_child will tear down placed processes gracefully up to the shutdown_timeout, but there's a catch: by default we link to the caller, so you need to explicitly unlink or trap exits for long-term workloads where the remote caller goes away:

  {:ok, pid} = FLAME.place_child(MyRunner, {Task, fn ->
    System.cmd("ffmpeg", ~w(-t 01:00:00 -re -f lavfi -i testsrc -f flv URL_GOES_HERE))
  end})
  Process.unlink(pid) # or trap exit on placed child

All that said, we can definitely make this better. Casts should be properly monitored, and shutdown should be blocked up to shutdown_timeout just like for placed children, without you having to worry about it. We could also introduce a :link option on place_child, but I need to think more about it.

chrismccord commented 8 months ago

Does the flame pool need to be included in the supervision tree of flame nodes?

no

chrismccord commented 8 months ago

The added nuance here (for myself and others) is that we do a lot to avoid extra copying as part of the messaging/spawning of FLAME work, so easy de facto defaults that take care of this on the user's behalf for cast would require an extra copy, and it only matters for long-running work, which is probably better suited to placing a worker on another node. Still, I understand just wanting to run a function somewhere without setting up a worker. Having cast trap exits by default, with the terminator force-killing the casts at the end of shutdown_timeout, may be the best option.

cadebward commented 8 months ago

I deployed this code:

{:ok, pid} = FLAME.place_child(MyRunner, {Task, fn ->
    System.cmd("ffmpeg", ~w(-t 01:00:00 -re -f lavfi -i testsrc -f flv URL_GOES_HERE))
end})
Process.unlink(pid)

I then started the process and ran fly apps restart app-name. The FLAME node immediately shut down :sob:

I then deployed this change instead:

FLAME.place_child(MyRunner, {Task, fn ->
  Process.flag(:trap_exit, true)
  System.cmd("ffmpeg", ~w(-t 01:00:00 -re -f lavfi -i testsrc -f flv URL_GOES_HERE))
end})

The result was the exact same.

The started flame machines are completely independent of the app machines (fly infra wise) so a deployment won't tear them down on the fly side.

I'm wondering about this. How can we double-check that this is the case? I don't know anything about computers, but I have a hunch Fly is still forcing this machine offline. A few seconds after I trigger the restart I see these logs:

2024-03-11T17:06:11Z app[784e774a20ed78] dfw [info] INFO Main child exited normally with code: 0
2024-03-11T17:06:11Z app[784e774a20ed78] dfw [info] INFO Starting clean up.
2024-03-11T17:06:11Z app[784e774a20ed78] dfw [info] WARN hallpass exited, pid: 315, status: signal: 15 (SIGTERM)
2024-03-11T17:06:12Z app[784e774a20ed78] dfw [info][   63.496255] reboot: Restarting system
2024-03-11T17:06:12Z runner[784e774a20ed78] dfw [info]machine restart policy set to 'no', not restarting

cadebward commented 8 months ago

Correction: I did not notice the subtle differences between your two code snippets. I tried this code:

    FLAME.cast(fn ->
      Process.flag(:trap_exit, true)
      System.cmd("ffmpeg", ~w(-t 01:00:00 -re -f lavfi -i testsrc -f flv URL_GOES_HERE))
    end)

And it DOES seem to be working as expected!

UPDATE: I also cancelled the ffmpeg process and the machine spun down as expected.

chrismccord commented 8 months ago

Nice. Correction to my previous comment: casts with trap_exit won't risk hanging the VM, since it's still the terminator awaiting the calls/casts/place_childs to exit within shutdown_timeout. So provided you have Process.flag(:trap_exit, true) in your cast, the terminator will allow the system shutdown to continue after the shutdown timeout.

chrismccord commented 7 months ago

0.1.12 supports an explicit link: false option for FLAME.cast|call|place_child, so you don't need to unlink/trap yourself. Thanks!
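For anyone landing here later, a minimal sketch of that option in use, assuming the FFMpegRunner pool name from the config earlier in this thread (opts go in the trailing keyword list):

# link: false means the placed child is not linked to the remote caller,
# so it survives the caller going away and is still torn down gracefully
# by the terminator within shutdown_timeout.
{:ok, _pid} =
  FLAME.place_child(FFMpegRunner, {Task, fn ->
    System.cmd("ffmpeg", ~w(-t 01:00:00 -re -f lavfi -i testsrc -f flv URL_GOES_HERE))
  end}, link: false)

# The same option applies to cast (and call):
FLAME.cast(FFMpegRunner, fn ->
  System.cmd("ffmpeg", ~w(-t 01:00:00 -re -f lavfi -i testsrc -f flv URL_GOES_HERE))
end, link: false)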