Closed cadebward closed 7 months ago
Check the Termination docs:
https://github.com/phoenixframework/flame/blob/3a03d21db06f099e811db1e06d24f4b037b513ba/lib/flame.ex#L154-L161
You can configure the terminator's shutdown timeout. It currently needs to be a global config, but since we only allow a 1:1 named pool we should be able to pass the pool's shutdown_timeout to the child. For now you can configure it globally:
config :flame, :terminator, shutdown_timeout: :timer.hours(1)
Note, you can also pass it to the pool as an env: argument, and then in your runtime.exs you could reference the env var to get per-pool termination configuration:
config :flame, :terminator, shutdown_timeout: String.to_integer(System.get_env("FLAME_SHUTDOWN_TIMEOUT") || "10000")
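For example, a rough sketch of the pool side, assuming a pool named MyRunner in your supervision tree (whether env: belongs on the pool or on its backend tuple may depend on your FLAME version; the min/max values and the env var name are placeholders):

{FLAME.Pool,
 name: MyRunner,
 min: 0,
 max: 2,
 # forwarded to the runner machine's environment; read back in runtime.exs as above
 env: %{"FLAME_SHUTDOWN_TIMEOUT" => "3600000"}}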
Thanks!
Thanks for the response! I did have that globally configured. It still seems as though the deployment immediately shuts down the FLAME node. I'll do some more testing to be sure.
Let me know! The started flame machines are completely independent of the app machines (fly infra wise), so a deployment won't tear them down on the fly side. So the only possibility is that the parent node going down triggers the children to terminate per the regular monitoring flow, but the configured termination time is either ignored or there is a bug in the terminator causing an early exit.
@cadebward I believe I found a bug where we don't honor shutdown timeout in all shutdown cases. Can you try main and report back? I'll be able to take a closer look in a bit. Thanks!
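If it helps, a sketch of pointing the dep at main in mix.exs while testing:

# in mix.exs deps: track the main branch instead of a Hex release
{:flame, github: "phoenixframework/flame", branch: "main"}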
Hmm... something else is going on. After testing this out through a few deploys, it seems as though the FLAME node does not get swept up in a deployment. A new machine comes up, old machines spin down. The FLAME node is not included in these logs. In this example, my FLAME node is 17816409ad1408.
Updating existing machines in 'app-name' with bluegreen strategy
Cleanup Previous Deployment
No hanging machines from a failed previous deployment
Creating green machines
Created machine 3d8d74e3a74138 [app]
Waiting for all green machines to start
Machine 3d8d74e3a74138 [app] - started
Waiting for all green machines to be healthy
Machine 3d8d74e3a74138 [app] - 1/1 passing
Marking green machines as ready
Machine 3d8d74e3a74138 [app] now ready
Waiting before cordoning all blue machines
Machine 17816472fdd268 [app] cordoned
Waiting before stopping all blue machines
Stopping all blue machines
Waiting for all blue machines to stop
Machine 17816472fdd268 [app] - stopped
Destroying all blue machines
Machine 17816472fdd268 [app] destroyed
Deployment Complete
That all seems as expected. Then out of nowhere these logs show up, indicating that the FLAME node has been sent a SIGTERM and is shutting down.
2024-03-08T23:38:45Z app[17816409ad1408] dfw [info] INFO Main child exited normally with code: 0
2024-03-08T23:38:45Z app[17816409ad1408] dfw [info] INFO Starting clean up.
2024-03-08T23:38:45Z app[17816409ad1408] dfw [info] WARN hallpass exited, pid: 315, status: signal: 15 (SIGTERM)
2024-03-08T23:38:45Z app[17816409ad1408] dfw [info]2024/03/08 23:38:45 listening on [fdaa:0:44f2:a7b:241:18ab:d029:2]:22 (DNS: [fdaa::3]:53)
2024-03-08T23:38:45Z app[17816409ad1408] dfw [info][flv @ 0x56420e644f00] Failed to update header with correct duration.
2024-03-08T23:38:45Z app[17816409ad1408] dfw [info][flv @ 0x56420e644f00] Failed to update header with correct filesize.
2024-03-08T23:38:45Z app[17816409ad1408] dfw [info]frame= 7880 fps= 25 q=8.6 Lsize= 8161kB time=00:05:15.16 bitrate= 212.1kbits/s speed= 1x
2024-03-08T23:38:45Z app[17816409ad1408] dfw [info]video:8038kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 1.534419%
2024-03-08T23:38:45Z app[17816409ad1408] dfw [info]Exiting normally, received signal 15.
2024-03-08T23:38:46Z app[17816409ad1408] dfw [info][ 320.770011] reboot: Restarting system
2024-03-08T23:38:46Z runner[17816409ad1408] dfw [info]machine restart policy set to 'no', not restarting
[ no more logs for 17816409ad1408 after this point]
I am not sure who is sending that signal.
Does the flame pool need to be included in the supervision tree of flame nodes?
Sorry, I got sidetracked assuming it was a config issue. Currently, processes spawned via cast are not directly supervised, so they are terminated by the VM as soon as the system is shut down, but you can trap exits to avoid being nuked (with caveats):
FLAME.cast(MyRunner, fn ->
Process.flag(:trap_exit, true)
System.cmd("ffmpeg", ~w(-t 01:00:00 -re -f lavfi -i testsrc -f flv URL_GOES_HERE))
end)
The issue here is we don't control the full lifecycle of the cast; we only monitor it from the terminator. FLAME.place_child, which owns the spawned child, is a better fit. FLAME.place_child will tear down placed processes gracefully up to the shutdown_timeout, but there's a catch. By default we link to the caller, so you need to explicitly unlink or trap exits for long-running workloads where the remote caller goes away:
{:ok, pid} = FLAME.place_child(MyRunner, {Task, fn ->
System.cmd("ffmpeg", ~w(-t 01:00:00 -re -f lavfi -i testsrc -f flv URL_GOES_HERE))
end})
Process.unlink(pid) # or trap exit on placed child
All that said, we can definitely make this better. Casts should be properly monitored and shutdown blocked up to shutdown_timeout just like placed children, without you having to worry about it. We could also introduce a :link option on place_child, but I need to think more about it.
Does the flame pool need to be included in the supervision tree of flame nodes?
no
The added nuance here (for myself and others) is that we do a lot to avoid extra copying as part of the messaging/spawning of flame work, so easy de facto defaults that take care of this on behalf of the user for cast would require an extra copy, and it only matters in the case of long-running work, which is probably better suited to placing a worker on another node. Still, I understand just wanting to run a function somewhere and not setting up a worker. Having cast trap exits by default and the terminator force-kill the casts at the end of shutdown_timeout may be the best option.
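In the meantime, a minimal sketch of doing this by hand today, with a hypothetical cast_trapping/2 helper (not part of the FLAME API) that wraps the function so the spawned process traps exits before the work starts:

# hypothetical helper, not FLAME API: the terminator then waits up to
# shutdown_timeout for the trapped process before letting shutdown proceed
def cast_trapping(pool, fun) do
  FLAME.cast(pool, fn ->
    Process.flag(:trap_exit, true)
    fun.()
  end)
end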
I deployed this code:
{:ok, pid} = FLAME.place_child(MyRunner, {Task, fn ->
System.cmd("ffmpeg", ~w(-t 01:00:00 -re -f lavfi -i testsrc -f flv URL_GOES_HERE))
end})
Process.unlink(pid)
I then started the process and ran fly apps restart app-name. The FLAME node immediately shut down :sob:
I then deployed this change instead:
FLAME.place_child(MyRunner, {Task, fn ->
Process.flag(:trap_exit, true)
System.cmd("ffmpeg", ~w(-t 01:00:00 -re -f lavfi -i testsrc -f flv URL_GOES_HERE))
end})
The result was the exact same.
The started flame machines are completely independent of the app machines (fly infra wise) so a deployment won't tear them down on the fly side.
I'm wondering about this. How can we double-check that this is the case? I don't know anything about computers, but I have a hunch fly is still forcing this machine offline. A few seconds after I trigger the restart I see these logs:
2024-03-11T17:06:11Z app[784e774a20ed78] dfw [info] INFO Main child exited normally with code: 0
2024-03-11T17:06:11Z app[784e774a20ed78] dfw [info] INFO Starting clean up.
2024-03-11T17:06:11Z app[784e774a20ed78] dfw [info] WARN hallpass exited, pid: 315, status: signal: 15 (SIGTERM)
2024-03-11T17:06:12Z app[784e774a20ed78] dfw [info][ 63.496255] reboot: Restarting system
2024-03-11T17:06:12Z runner[784e774a20ed78] dfw [info]machine restart policy set to 'no', not restarting
Correction: I did not notice the subtle differences between your two code snippets. I tried this code:
FLAME.cast(MyRunner, fn ->
Process.flag(:trap_exit, true)
System.cmd("ffmpeg", ~w(-t 01:00:00 -re -f lavfi -i testsrc -f flv URL_GOES_HERE))
end)
And it DOES seem to be working as expected!
UPDATE: I also cancelled the ffmpeg process and the machine spun down as expected.
Nice. Correction to my previous comment: casts with trap exit won't risk hanging the VM, as it's still the terminator awaiting the calls/casts/place_child's to exit within shutdown_timeout. So provided you have Process.flag(:trap_exit, true) in your cast, the terminator will allow the system shutdown to continue after the shutdown timeout.
0.1.12 supports explicit link: false to FLAME.cast|call|place_child so you don't need to unlink/trap yourself. Thanks!
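For example, a sketch using the same ffmpeg command from above (MyRunner and URL_GOES_HERE are placeholders):

# link: false keeps the placed child alive even if the remote caller goes away
{:ok, _pid} = FLAME.place_child(MyRunner, {Task, fn ->
  System.cmd("ffmpeg", ~w(-t 01:00:00 -re -f lavfi -i testsrc -f flv URL_GOES_HERE))
end}, link: false)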
Hello! Thanks for all you guys do in the phoenix community!
Issue
It seems as though a deployment (in fly, at least) sends a SIGTERM to the FLAME node, which causes the machine to spin down along with the other non-flame nodes. Imagine starting a heavy ffmpeg process, only to get spun down 10 seconds later without a chance to finish. In my testing, it does not appear to respect the shutdown_timeout when the platform issues a shutdown. To test this, I ran the following code:
This spins up a test stream and sends it to the given URL. This was a quick and dirty way to get ffmpeg into a long-running process (1 hour) so I could have time to trigger and observe a deploy. In this test I also set the following options:
Potential Solutions
For fly_backend specifically, I wonder if we could adjust the POST to the machines API to spin up a node as part of another app? That way when application_1 gets a new deployment, the FLAME node inside of application_2 is not affected. (I haven't tested this yet; no idea if it's even plausible.) I'm open to any ideas.