livebook-dev / livebook

Automate code & data workflows with interactive Elixir notebooks
https://livebook.dev
Apache License 2.0

Teams: Unable to deploy app to app server #2679

Closed zachallaun closed 1 week ago

zachallaun commented 1 week ago

Environment

Current behavior

Giving Livebook Teams deployment a try, but having trouble deploying an app to a running app server. Here are the steps I've followed:

image

Expected behavior

The app should be deployed to the app server. 🙂

josevalim commented 1 week ago

Thanks! Do you have anything on the terminal logs? :)

zachallaun commented 1 week ago

Unfortunately I don't. Nothing is logged after the "[Livebook] Application running at..." message. Is there a flag/env var I can use to increase verbosity?

jonatanklosko commented 1 week ago

@zachallaun please give it another try, it should work now :)

zachallaun commented 1 week ago

@jonatanklosko Yep, looks to be working! Thanks!

zachallaun commented 1 week ago

Hmm, so the deploy succeeded but the app has been stuck as "preparing" for ~20 minutes now.

image

Possibly related, I'm seeing this back in my Fly logs:

2024-06-25T17:11:42Z app[2874902a1ee428] iad [info][Livebook] Application running at http://localhost:8080/
2024-06-25T17:14:38Z app[2874902a1ee428] iad [info]17:14:38.445 [error] GenServer {Livebook.HubsRegistry, "team-zachallaun"} terminating
2024-06-25T17:14:38Z app[2874902a1ee428] iad [info]** (FunctionClauseError) no function clause matching in Livebook.Hubs.Broadcasts.hub_connection_failed/2
2024-06-25T17:14:38Z app[2874902a1ee428] iad [info]    (livebook 0.13.0) lib/livebook/hubs/broadcasts.ex:102: Livebook.Hubs.Broadcasts.hub_connection_failed("team-zachallaun", %Mint.TransportError{reason: :closed})
2024-06-25T17:14:38Z app[2874902a1ee428] iad [info]    (livebook 0.13.0) lib/livebook/hubs/team_client.ex:259: Livebook.Hubs.TeamClient.handle_info/2
2024-06-25T17:14:38Z app[2874902a1ee428] iad [info]    (stdlib 6.0) gen_server.erl:2173: :gen_server.try_handle_info/3
2024-06-25T17:14:38Z app[2874902a1ee428] iad [info]    (stdlib 6.0) gen_server.erl:2261: :gen_server.handle_msg/6
2024-06-25T17:14:38Z app[2874902a1ee428] iad [info]    (stdlib 6.0) proc_lib.erl:329: :proc_lib.init_p_do_apply/3
2024-06-25T17:14:38Z app[2874902a1ee428] iad [info]Last message: {:connection_error, %Mint.TransportError{reason: :closed}}

zachallaun commented 1 week ago

Note: Navigating to https://MY_APP.fly.dev/apps/partsbase-csv issues a 302 redirect to https://MY_APP.fly.dev.

jonatanklosko commented 1 week ago

@zachallaun if it's stuck at "preparing", most likely Mix.install/2 ran out of memory (OOM). There is currently a bug where the OOM makes the deployment process stuck forever on the app server instance. You can bump memory and restart the server. I will work on a fix.

zachallaun commented 1 week ago

@jonatanklosko Thanks for the suggestion! I was previously running on 1gb but scaled to 4gb to be sure; unfortunately, it seems to still be stalled. I can't do anything in that deployed Livebook, like open a new notebook, but I think that's because it's an app server in read-only mode...?

jonatanklosko commented 1 week ago

@zachallaun oh that's weird, I could reproduce the stalled deployment, but the rest of the Livebook should be operational, like starting a new session. When opening a new notebook, are you getting an error, timeout, or something else? Anything curious in the Fly logs for that?

zachallaun commented 1 week ago

@jonatanklosko Getting a lot of these in the logs:

2024-06-25T18:40:48Z app[2874902a1ee428] iad [info] WARN Reaped child process with pid: 610 and signal: SIGUSR1, core dumped? false

My Fly.io metrics page shows memory usage averaging around 300MB, but I know that doesn't always capture the memory spikes that lead to OOM issues.

jonatanklosko commented 1 week ago

If you deactivate the app on Livebook Teams and restart the machine, does it become operational?

zachallaun commented 1 week ago

It seems like it doesn't.

I'm going to try recreating the app completely and see if I can reproduce. This was an existing Livebook deployment (0.12.1) that I upgraded to 0.13.0 and set the various secrets for in order for it to be an app server. Perhaps there are some leftover gremlins in the bits that are causing mischief 😈

zachallaun commented 1 week ago

Okay, so I'm not sure what the issue was, but creating and connecting a completely new app server deployment seemed to work (and that's using shared-cpu-1x and 1gb). I'll compare the various deployment configs and will share if I figure out what was causing the issue.

zachallaun commented 1 week ago

Okay, so the fly.livebook.toml that I was using included the following env vars:

[env]
  ELIXIR_ERL_OPTIONS = '-proto_dist inet6_tcp'
  LIVEBOOK_DATA_PATH = '/data'
  LIVEBOOK_HOME = '/data'
  LIVEBOOK_IP = '::'
  LIVEBOOK_ROOT_PATH = '/data'
  PORT = '8080'

Deleting the entire [env] block and re-deploying seems to fix things, so I suppose at least one of those vars changed with 0.13.0 in a way that Livebook didn't like!

All seems to be well now.

josevalim commented 1 week ago

The issue was ELIXIR_ERL_OPTIONS = '-proto_dist inet6_tcp'. Do you know what could possibly be setting that?

jonatanklosko commented 1 week ago

Yeah, it is ELIXIR_ERL_OPTIONS. This specific env var is no longer passed to the runtime, so with that configuration Livebook would start with proto dist ipv6, while runtimes would start with proto dist ipv4, so connecting to the runtime would always time out. This also explains why "New notebook" would be stuck: the session tries to start the runtime upfront, and in this case it blocks until it times out.

zachallaun commented 1 week ago

Got it.

It was set that way based on these docs on Fly. Perhaps y'all can coordinate with some folks there to get those updated before 0.13 is widely announced. (And maybe worth making a note of it in the 0.13 CHANGELOG?)

josevalim commented 1 week ago

PR already sent to Fly!

hugobarauna commented 1 week ago

@zachallaun Docs on Fly are updated now. Thanks!