`start_async` doesn't allow passing options to the `Task.Supervisor`

phoenixframework / phoenix_live_view

Rich, real-time user experiences with server-rendered HTML

https://hex.pm/packages/phoenix_live_view

MIT License

6.22k stars 931 forks source link

`start_async` doesn't allow passing options to the `Task.Supervisor` #3304

Closed probably-not closed 4 months ago

probably-not commented 4 months ago

Environment

Elixir version (elixir -v): 1.16.3
Phoenix version (mix deps): 1.7.12
Phoenix LiveView version (mix deps): 0.20.14
Operating system: macOS 14.5
Browsers you attempted to reproduce this bug on (the more the merrier): All (it's happening on the server not the browser)
Does the problem persist after removing "assets/node_modules" and trying again? Yes/no: Yes

Actual behavior

While Phoenix.LiveView.start_async/4 allows passing a :supervisor option to select a Task.Supervisor to run the task under, it does not allow passing any of the Task.Supervisor.start_child/3 options to be passed to the Task.Supervisor, meaning that we cannot configure things like the timeout of the task (which defaults to 5000 milliseconds). This makes start_async unusable for any task that may take more than 5000 milliseconds.

The Task.Supervisor docs do say that in Task.Supervisor.start_link we can pass in :restart and :shutdown options - however, doing this is deprecated in favor of passing the options in Task.Supervisor.start_child.

Expected behavior

Using Phoenix.LiveView.start_async/4 should allow passing options to the supervisor that we want to run against, to allow things like controlling the restart strategy and the shutdown strategy.

SteffenDE commented 4 months ago

This makes start_async unusable for any task that may take more than 5000 milliseconds.

I believe you are probably misunderstanding the shutdown option of Task Supervisors. Taken the following snippet

children = [
  {Task.Supervisor, name: MyApp.TaskSupervisor}
]

Supervisor.start_link(children, strategy: :one_for_one)

Task.Supervisor.start_child(MyApp.TaskSupervisor, fn ->
  Process.sleep(10000)
  IO.puts("Slept 10s, all good!")
end)

Process.sleep(:infinity)

you can see that if run (e.g. save it as task.exs and run elixir task.exs) the task completes successfully. The shutdown option of the Task.Supervisor controls how it handles termination of running tasks (-> using Task.Supervisor.terminate_child/2). By default, it sends an exit signal and gives the task 5 seconds to cleanup before killing it, but this can be controlled using the :shutdown option.

Here's an example:

children = [
  {Task.Supervisor, name: MyApp.TaskSupervisor}
]

Supervisor.start_link(children, strategy: :one_for_one)

{:ok, pid} = Task.Supervisor.start_child(MyApp.TaskSupervisor, fn ->
  Process.flag(:trap_exit, true)

  receive do
    {:EXIT, _from, _reason} ->
      IO.puts("Exit received, cleaning up!")
      Process.sleep(2000)
      IO.puts("bye!")
  end
end)

IO.puts("task started!")
Process.sleep(1000)
Task.Supervisor.terminate_child(MyApp.TaskSupervisor, pid)

Process.sleep(:infinity)

probably-not commented 4 months ago

My mistake, you are absolutely right, I did misunderstand the shutdown value and definitely misrepresented the underlying issue.

However, the underlying issue of not being able to pass in the standard start_child options still holds true - we can't affect the actual Task start configurations which includes both the shutdown and the restart values, both of which can have big implications of how the task runs and what tasks I can run using the LiveView.Async module.

For example, the shutdown value allows the user control over gracefully shutting down the task by ensuring that we have enough time to run whatever cleanup is necessary before the task is exited as you showed. I do think that is the same level of issue - if I have a task that I know can take more than 5 seconds, I need to ensure that the task can complete gracefully.

In the same vein, making sure that I can control the restart value is important based on the task type, whether I want it transient or temporary.

Not being able to pass in the standard options for a task supervisor limits (personally I would say severely but that's subjective because it depends on the tasks you want to run) the types of tasks that I can do in start_async.

I can edit the original issue text to reflect the correct underlying issue if you'd like.

josevalim commented 4 months ago

The restart option wont work as desired. For example, Task.Supervisor.async does not expose it either, because there is context exchanged between parent and child, which is not replayed on restart.

I am also skeptical of the shutdown option being useful. The child and parent are linked. During shutdown, the LiveView will be terminated before, causing the child to crash before it exits. You could trap exits, but that now means you can leak tasks once their parent LiveView terminates.

so overall, I don’t think the features will work as you expect or be free of trade-offs. I think it is fair to say that, if you want them, you can use a regular task and process communication, instead of backing those trade-offs inside the async mechanism.

probably-not commented 4 months ago

I think it is fair to say that, if you want them, you can use a regular task and process communication, instead of backing those trade-offs inside the async mechanism.

That's fair, it's what I've been doing this whole time 😅 I've been delving into LiveView and trying to use built in LV functionalities but I'm not afraid to get my hands dirty 😄

I am also skeptical of the shutdown option being useful. The child and parent are linked. During shutdown, the LiveView will be terminated before, causing the child to crash before it exits. You could trap exits, but that now means you can leak tasks once their parent LiveView terminates.

Doesn't the LiveView terminating before depend on the application supervision tree ordering? I've always been under the impression that if I place my Task.Supervisor at the end of the tree, it would terminate first, before its callers, so it would be able to handle graceful shutdowns... maybe I've misunderstood this as well?

In any case, I'm okay with implementing myself, all good! I would say however that the task types that are recommended for the start_async and the LiveView.Async functionalities should be documented, so that it's clear what to use it for and where you should reach for the base OTP process facilities.

As a very naive example (just to illustrate a task not suitable for this), based on this, I shouldn't use start_async for a task that downloads a file, assigns something based on that, and deletes the file after reading it, because if I don't have a graceful shutdown mechanism then I could end up with files being downloaded and not being deleted, and filling up my disk space.

josevalim commented 4 months ago

That's fair, it's what I've been doing this whole time 😅 I've been delving into LiveView and trying to use built in LV functionalities but I'm not afraid to get my hands dirty 😄

Fantastic!

Doesn't the LiveView terminating before depend on the application supervision tree ordering?

You got it right but if you start the supervisor after the endpoint, there is a race condition where you can start receiving requests and the task supervisor is not yet running, and then you will get crashes.

I shouldn't use start_async for a task that downloads a file, assigns something based on that, and deletes the file after reading it, because if I don't have a graceful shutdown mechanism then I could end up with files being downloaded and not being deleted, and filling up my disk space.

In such cases, I would prefer to have another process that monitors the process that downloads the file and removes the file once the process concluded. This way you have a process only for cleaning up which should be more impervious to bugs. :) This also means you can continue using the async functionality and deal only with clean up separately!

probably-not commented 4 months ago

You got it right but if you start the supervisor after the endpoint, there is a race condition where you can start receiving requests and the task supervisor is not yet running, and then you will get crashes.

True, although this is pretty easily mitigated by a health check. In previous applications that I've written, I've ensured that my healthiness endpoint would only return healthy after all of what I wanted to start was started - same solution, just putting a small process that switches a boolean flag only at the end of the tree, so until it starts the health endpoint returns an unhealthy status.

In such cases, I would prefer to have another process that monitors the process that downloads the file and removes the file once the process concluded.

Definitely agree!

Like I said though, this is a very naive example just to showcase that there are certain types of tasks that should simply not be done within the LiveView.Async functionalities, and I think the types of tasks should be reflected in the documentation to make it clear to newcomers who may be more junior in regards to architecting an application correctly. Many of the newcomers to LiveView don't know what Process.monitor does, or the difference between linking and monitoring, or even what a Supervisor or a GenServer even are. LiveView is getting a lot of press recently and in various communities I've been seeing people who are brand new to Elixir diving into LiveView without learning the basics of OTP and what a process is and just hoping for LiveView to work magically. Including information about what LiveView.Async is actually doing and what kind of tasks it should be used for should be included in the documentation to ensure that people don't get in to these types of issues.

josevalim commented 4 months ago

More docs are always welcome!

It seems at this point we diverged considerably from the original issue, so what do you propose ?

probably-not commented 4 months ago

Right, getting back on track to the issue here -

I propose that we should allow passing specific Task.Supervisor.start_child opts to be passed in to the LiveView.Async functions. Based on our exchange we've established that there may be a use-case for the :shutdown value (but not the :restart value), since there are ways to ensure the shutdown order, and that does give the ability to do graceful shutdown handling for async tasks from a LiveView.

However, I completely get if the direction here is to push people to not do this and to prefer doing only specific types of tasks via LiveView.Async.

With that being said, if the decision is to not allow passing opts - is there any reason to allow passing a :supervisor at all? The defaults of the Task.Supervisor essentially give the same behavior as no Task.Supervisor, so I'm not sure I understand the decision to allow passing a Task.Supervisor if we aren't going to allow passing opts. My feeling is that this simply confuses/complicates the usage... but to be fair, that's a subjective opinion.

And of course - regardless of any of the decisions - I do propose that the docs should be a lot more explicit here about what kind of tasks should be run with LiveView.Async's facilities, versus tasks that should prefer using standard OTP facilities and standard message passing to communicate with the LiveView. If we add support for passing the :shutdown option in, then the docs should be explicit about all that you've mentioned (the linking and shutdown order and possible race condition and mitigation tactics, etc).

josevalim commented 4 months ago

When it comes to options, I am a :-1: on the proposal. Yes, you can use shutdown if:

You define your task supervisor after the endpoint
You enable a health check
You trap exits in each task

And, even after all of these changes, you can still "leak processes" once a LiveView terminates and there is still no guarantee of cleanup after crashes. There are too many pitfalls to encourage the behaviour you described. Overall, the async/await convention in Elixir is about short-lived tasks that send a response back, and LiveView does not change that. We should revisit if there are practical use cases.

is there any reason to allow passing a :supervisor at all?

Using different supervisors can be helpful for monitoring/instrumentation. For example, you can measure the number of children to know which kind of workloads are running. Sometimes you also need to drain supervisors in tests to avoid races and breaking the apart gives you more control.

With all of this said, feel free to send a pull request that sketches improvements to the docs, that's probably the best way to move this forward, as I believe you already have suggestions in mind. I will be glad to review and suggest improvements.

probably-not commented 4 months ago

Makes sense!

I'm going to close this issue out for now, I'll submit a PR with documentation improvements based on what we spoke about soon.

Thanks!