@doughsay Thanks for this detailed report! Could you help me by enabling debug logging and pasting the log here?
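For reference, turning it on is roughly this config (a sketch that assumes your app and scheduler names; adjust to your project):
# config/config.exs — assumed names, enables Quantum's verbose debug output
config :quantum_swarm, QuantumSwarm.Scheduler,
  debug_logging: true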
@maennchen debug logging is enabled in my example repo, so following the steps in my readme should give you detailed output. It's very noisy, but here's an example output:
steps taken:
docker-compose up --build
(runs one node; PING messages show up every 3 seconds as expected)
docker-compose up -d --scale web=3
(boots up 2 more nodes; usually causes the scheduler to die and the PING messages to stop)
@doughsay Thanks, this confirms my assumption. The handoff of the ExecutionBroadcaster is broken. I'll have a look and create a PR. Thanks again for the detailed report!
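For context, Swarm moves a registered process to another node via a small set of handoff messages that the process has to handle. A minimal sketch following Swarm's documented conventions (an illustrative worker, not Quantum's actual module):
defmodule HandoffExample.Worker do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  def init(opts), do: {:ok, opts}

  # Swarm asks the process for its state before moving it to another node
  def handle_call({:swarm, :begin_handoff}, _from, state) do
    {:reply, {:resume, state}, state}
  end

  # The handed-off state arrives in the freshly started process on the target node
  def handle_cast({:swarm, :end_handoff, handed_off_state}, _state) do
    {:noreply, handed_off_state}
  end

  # Two registrations collided (e.g. after a netsplit); keep one of the states
  def handle_cast({:swarm, :resolve_conflict, _other_state}, state) do
    {:noreply, state}
  end

  # Swarm tells the old copy to shut down once the handoff is complete
  def handle_info({:swarm, :die}, state) do
    {:stop, :shutdown, state}
  end
end
If any of these steps fails, the process can come back on the new node without its previous state, which is what seems to happen here.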
@doughsay I think I found and solved the issue. I'm a little low on time though, which is why I couldn't test it. Would you mind testing again with #369?
@maennchen thanks a lot for looking into this. Your branch is definitely better, but I was still able to cause a few error states. It's really hard to reproduce these because it just involves randomly scaling up and down to various cluster sizes.
I was able to cause two bad end states, one when scaling down and one when scaling up. The two errors were:
I think I caused this one by quickly scaling from 10 nodes down to 1:
22:27:47.695 [error] GenServer QuantumSwarm.Scheduler.ExecutorSupervisor terminating
** (stop) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
Last message: {:DOWN, #Reference<0.2651763341.768868358.35813>, :process, #PID<32727.1361.0>, :noproc}
State: %ConsumerSupervisor{args: %Quantum.ExecutorSupervisor.InitOpts{cluster_task_supervisor_registry_reference: QuantumSwarm.Scheduler.ClusterTaskSupervisorRegistry, debug_logging: true, execution_broadcaster_reference: {:via, :swarm, QuantumSwarm.Scheduler.ExecutionBroadcaster}, task_registry_reference: {:via, :swarm, QuantumSwarm.Scheduler.TaskRegistry}, task_supervisor_reference: QuantumSwarm.Scheduler.Task.Supervisor}, children: %{}, max_restarts: 3, max_seconds: 5, mod: Quantum.ExecutorSupervisor, name: QuantumSwarm.Scheduler.ExecutorSupervisor, producers: %{}, restarting: 0, restarts: [], strategy: :one_for_one, template: {Quantum.Executor, {Quantum.Executor, :start_link, [%Quantum.Executor.StartOpts{cluster_task_supervisor_registry_reference: QuantumSwarm.Scheduler.ClusterTaskSupervisorRegistry, debug_logging: true, task_registry_reference: {:via, :swarm, QuantumSwarm.Scheduler.TaskRegistry}, task_supervisor_reference: QuantumSwarm.Scheduler.Task.Supervisor}]}, :temporary, 5000, :worker, [Quantum.Executor]}}
I think I quickly scaled from 1 to 10 nodes to cause this one:
22:32:26.455 [error] GenServer QuantumSwarm.Scheduler.ExecutionBroadcaster terminating
** (FunctionClauseError) no function clause matching in Quantum.ExecutionBroadcaster.handle_info/2
(quantum) lib/quantum/execution_broadcaster.ex:139: Quantum.ExecutionBroadcaster.handle_info({:swarm, :die}, %Quantum.ExecutionBroadcaster.State{debug_logging: true, jobs: [], scheduler: QuantumSwarm.Scheduler, storage: Quantum.Storage.Noop, time: ~N[2018-08-24 22:32:26.264193], timer: nil})
(gen_stage) lib/gen_stage.ex:2022: GenStage.noreply_callback/3
(stdlib) gen_server.erl:637: :gen_server.try_dispatch/4
(stdlib) gen_server.erl:711: :gen_server.handle_msg/6
(stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Last message: {:swarm, :die}
State: %Quantum.ExecutionBroadcaster.State{debug_logging: true, jobs: [], scheduler: QuantumSwarm.Scheduler, storage: Quantum.Storage.Noop, time: ~N[2018-08-24 22:32:26.264193], timer: nil}
That said, I don't care that much about being able to spin up 10 nodes all at once, or shut 10 nodes down really quickly. We do slow rolling deploys in production, waiting 30 seconds between each new node coming up and each old node shutting down.
So I tried going slowly: scaling up 1 node at a time until reaching 10 nodes total, then reversing and shutting down 1 node at a time. Unfortunately I was still able to produce an error that left the scheduler not running. Below I've pasted the logs of a 6-node cluster with 1 node shutting down, which left the scheduler stopped and no jobs running at the end:
https://pastebin.com/mZdCh9Qp (I wasn't able to paste the logs here as they were too big for GitHub)
OK, after much more tinkering and reading the logs closely, it appears that sometimes during a node shutdown the state fails to be handed off to another running node. The ExecutionBroadcaster does look like it eventually gets restarted, but without any jobs registered. Shouldn't it re-add the jobs from the static config in cases like that? It may also just be a coincidence, but I see it most often when going from 6 nodes to 5 by shutting one down.
See the pastebin logs from my previous comment; is my understanding correct that the Scheduler does eventually restart, but comes back empty with no jobs registered?
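To illustrate what I mean by re-adding from the static config, something roughly like this (purely a sketch of the idea, not a patch; configured_jobs/0 is a hypothetical helper that reads the jobs declared in config):
# Sketch only: if the handoff arrives without any jobs, fall back to the
# statically configured ones instead of starting empty.
def handle_cast({:swarm, :end_handoff, handoff_state}, state) do
  jobs =
    case handoff_state do
      %{jobs: jobs} when jobs != [] -> jobs
      _ -> configured_jobs()
    end

  {:noreply, [], %{state | jobs: jobs}}
end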
Overall though, your branch is much more stable and works 90% of the time. I'm just worried that the rolling deployment strategy we use, which shuts down old nodes one at a time, will trigger this scenario.
@doughsay It's clear that this must work. The hard part is that there's no testing strategy to ensure that it really works.
I'm going to take a look again and hopefully fix everything.
I'm going to be busy over the weekend, so the earliest I'll have time to fix it is Monday.
I also think that your error makes sense. The handle_info callback of the ExecutionBroadcaster does not check whether the timer in the state is nil, so Process.cancel_timer/1 is sometimes called with nil. This could also lead to the ExecutorSupervisor crashing, since the ExecutionBroadcaster it depends on crashed.
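Roughly the kind of guard that was missing — a simplified illustration, not the actual module:
# Simplified sketch: only cancel the timer when one is actually running,
# so a nil timer no longer crashes the {:swarm, :die} handler.
def handle_info({:swarm, :die}, %State{timer: nil} = state) do
  {:stop, :shutdown, state}
end

def handle_info({:swarm, :die}, %State{timer: timer} = state) do
  Process.cancel_timer(timer)
  {:stop, :shutdown, state}
end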
This one error was actually so easy to fix that I quickly implemented it. The PR is updated. You're welcome to test again. I'm going to have a look myself as well as soon as I have time again.
Thank you @maennchen! I'll have a look again on Monday as well.
I may have some more log output to help with this issue.
I had a docker-compose cluster of 2 nodes running and working (web_1 and web_2), then I attempted to scale to 3 nodes. The new node (web_3) timed out during boot and crashed. It left the cluster in a bad state and quantum was no longer running jobs. Here's the output:
web_2 | 23:40:45.001 [debug] [:"quantum_swarm_umbrella@172.23.0.3"][Elixir.Quantum.Executor] Task for job #Reference<32733.1346946828.911998984.31975> started on node :"quantum_swarm_umbrella@172.23.0.2"
web_1 | 23:40:45.003 [debug] [:"quantum_swarm_umbrella@172.23.0.2"][Elixir.Quantum.Executor] Execute started for job #Reference<0.1346946828.911998984.31975>
web_1 | 23:40:45.003 [info] PING!
web_1 | 23:40:45.003 [debug] [:"quantum_swarm_umbrella@172.23.0.2"][Elixir.Quantum.Executor] Execution ended for job #Reference<0.1346946828.911998984.31975>, which yielded result: :ok
web_1 | 23:40:48.000 [debug] [:"quantum_swarm_umbrella@172.23.0.2"][Elixir.Quantum.ExecutionBroadcaster] Scheduling job for execution #Reference<0.1346946828.911998984.31975>
web_1 | 23:40:48.000 [debug] [:"quantum_swarm_umbrella@172.23.0.2"][Elixir.Quantum.ExecutionBroadcaster] Continuing Execution Broadcasting at -576460538276 (2018-08-27T23:40:51)
web_2 | 23:40:48.011 [debug] [:"quantum_swarm_umbrella@172.23.0.3"][Elixir.Quantum.Executor] Task for job #Reference<32733.1346946828.911998984.31975> started on node :"quantum_swarm_umbrella@172.23.0.3"
web_2 | 23:40:48.012 [debug] [:"quantum_swarm_umbrella@172.23.0.3"][Elixir.Quantum.Executor] Execute started for job #Reference<32733.1346946828.911998984.31975>
web_2 | 23:40:48.012 [info] PING!
web_2 | 23:40:48.013 [debug] [:"quantum_swarm_umbrella@172.23.0.3"][Elixir.Quantum.Executor] Execution ended for job #Reference<32733.1346946828.911998984.31975>, which yielded result: :ok
web_2 | 23:40:48.069 [warn] [libcluster:dns] unable to connect to :"quantum_swarm_umbrella@172.23.0.4"
web_1 | 23:40:48.078 [warn] [libcluster:dns] unable to connect to :"quantum_swarm_umbrella@172.23.0.4"
web_3 | 23:40:49.403 [info] [swarm on quantum_swarm_umbrella@172.23.0.4] [tracker:init] started
web_1 | 23:40:51.000 [debug] [:"quantum_swarm_umbrella@172.23.0.2"][Elixir.Quantum.ExecutionBroadcaster] Scheduling job for execution #Reference<0.1346946828.911998984.31975>
web_1 | 23:40:51.001 [debug] [:"quantum_swarm_umbrella@172.23.0.2"][Elixir.Quantum.ExecutionBroadcaster] Continuing Execution Broadcasting at -576460535276 (2018-08-27T23:40:54)
web_2 | 23:40:51.001 [debug] [:"quantum_swarm_umbrella@172.23.0.3"][Elixir.Quantum.Executor] Task for job #Reference<32733.1346946828.911998984.31975> started on node :"quantum_swarm_umbrella@172.23.0.2"
web_1 | 23:40:51.003 [debug] [:"quantum_swarm_umbrella@172.23.0.2"][Elixir.Quantum.Executor] Execute started for job #Reference<0.1346946828.911998984.31975>
web_1 | 23:40:51.003 [info] PING!
web_1 | 23:40:51.003 [debug] [:"quantum_swarm_umbrella@172.23.0.2"][Elixir.Quantum.Executor] Execution ended for job #Reference<0.1346946828.911998984.31975>, which yielded result: :ok
web_2 | 23:40:53.049 [info] [libcluster:dns] connected to :"quantum_swarm_umbrella@172.23.0.4"
web_2 | 23:40:53.051 [info] [swarm on quantum_swarm_umbrella@172.23.0.3] [tracker:ensure_swarm_started_on_remote_node] nodeup quantum_swarm_umbrella@172.23.0.4
web_2 | 23:40:53.052 [debug] [swarm on quantum_swarm_umbrella@172.23.0.3] [tracker:handle_topology_change] topology change (nodeup for quantum_swarm_umbrella@172.23.0.4)
web_2 | 23:40:53.052 [info] [swarm on quantum_swarm_umbrella@172.23.0.3] [tracker:handle_topology_change] topology change complete
web_3 | 23:40:53.051 [info] [swarm on quantum_swarm_umbrella@172.23.0.4] [tracker:ensure_swarm_started_on_remote_node] nodeup quantum_swarm_umbrella@172.23.0.3
web_1 | 23:40:53.060 [info] [libcluster:dns] connected to :"quantum_swarm_umbrella@172.23.0.4"
web_3 | 23:40:53.061 [info] [swarm on quantum_swarm_umbrella@172.23.0.4] [tracker:ensure_swarm_started_on_remote_node] nodeup quantum_swarm_umbrella@172.23.0.2
web_1 | 23:40:53.061 [info] [swarm on quantum_swarm_umbrella@172.23.0.2] [tracker:ensure_swarm_started_on_remote_node] nodeup quantum_swarm_umbrella@172.23.0.4
web_1 | 23:40:53.062 [debug] [swarm on quantum_swarm_umbrella@172.23.0.2] [tracker:handle_topology_change] topology change (nodeup for quantum_swarm_umbrella@172.23.0.4)
web_1 | 23:40:53.062 [debug] [swarm on quantum_swarm_umbrella@172.23.0.2] [tracker:handle_topology_change] #PID<0.1384.0> belongs on quantum_swarm_umbrella@172.23.0.4
web_1 | 23:40:53.062 [info] [:"quantum_swarm_umbrella@172.23.0.2"][Elixir.Quantum.TaskRegistry] Handing of state to other cluster node
web_1 | 23:40:53.062 [debug] [swarm on quantum_swarm_umbrella@172.23.0.2] [tracker:handle_topology_change] QuantumSwarm.Scheduler.TaskRegistry has requested to be resumed
web_2 | 23:40:53.063 [debug] [swarm on quantum_swarm_umbrella@172.23.0.3] [tracker:handle_replica_event] replica event: untrack #PID<32733.1384.0>
web_2 | 23:40:53.070 [debug] [swarm on quantum_swarm_umbrella@172.23.0.3] [tracker:handle_replica_event] replica event: untrack #PID<32733.1385.0>
web_2 | 23:40:53.075 [info] GenStage consumer QuantumSwarm.Scheduler.ExecutorSupervisor is stopping after receiving cancel from producer #PID<32733.1386.0> with reason: :shutdown
web_2 |
web_2 | 23:40:53.078 [debug] [swarm on quantum_swarm_umbrella@172.23.0.3] [tracker:handle_monitor] QuantumSwarm.Scheduler.ExecutionBroadcaster is down: :shutdown
web_1 | 23:40:53.069 [debug] [swarm on quantum_swarm_umbrella@172.23.0.2] [tracker:handle_topology_change] sending handoff for QuantumSwarm.Scheduler.TaskRegistry to quantum_swarm_umbrella@172.23.0.4
web_2 | 23:40:53.079 [info] GenStage consumer QuantumSwarm.Scheduler.ExecutorSupervisor is stopping after receiving cancel from producer #PID<32733.1386.0> with reason: :noproc
web_2 |
web_2 | 23:40:53.079 [error] GenServer QuantumSwarm.Scheduler.ExecutorSupervisor terminating
web_2 | ** (stop) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
web_2 | Last message: {:DOWN, #Reference<0.3736789155.3865575430.239252>, :process, #PID<32733.1386.0>, :noproc}
web_2 | State: %ConsumerSupervisor{args: %Quantum.ExecutorSupervisor.InitOpts{cluster_task_supervisor_registry_reference: QuantumSwarm.Scheduler.ClusterTaskSupervisorRegistry, debug_logging: true, execution_broadcaster_reference: {:via, :swarm, QuantumSwarm.Scheduler.ExecutionBroadcaster}, task_registry_reference: {:via, :swarm, QuantumSwarm.Scheduler.TaskRegistry}, task_supervisor_reference: QuantumSwarm.Scheduler.Task.Supervisor}, children: %{}, max_restarts: 3, max_seconds: 5, mod: Quantum.ExecutorSupervisor, name: QuantumSwarm.Scheduler.ExecutorSupervisor, producers: %{}, restarting: 0, restarts: [], strategy: :one_for_one, template: {Quantum.Executor, {Quantum.Executor, :start_link, [%Quantum.Executor.StartOpts{cluster_task_supervisor_registry_reference: QuantumSwarm.Scheduler.ClusterTaskSupervisorRegistry, debug_logging: true, task_registry_reference: {:via, :swarm, QuantumSwarm.Scheduler.TaskRegistry}, task_supervisor_reference: QuantumSwarm.Scheduler.Task.Supervisor}]}, :temporary, 5000, :worker, [Quantum.Executor]}}
web_2 | 23:40:53.080 [debug] [swarm on quantum_swarm_umbrella@172.23.0.3] [tracker:handle_replica_event] replica event: untrack #PID<32733.1386.0>
web_2 | 23:40:53.080 [debug] [:"quantum_swarm_umbrella@172.23.0.3"][Elixir.Quantum.ExecutionBroadcaster] Unknown last execution time, using now
web_1 | 23:40:53.069 [debug] [swarm on quantum_swarm_umbrella@172.23.0.2] [tracker:handle_topology_change] #PID<0.1385.0> belongs on quantum_swarm_umbrella@172.23.0.4
web_1 | 23:40:53.069 [info] [:"quantum_swarm_umbrella@172.23.0.2"][Elixir.Quantum.JobBroadcaster] Handing of state to other cluster node
web_1 | 23:40:53.069 [debug] [swarm on quantum_swarm_umbrella@172.23.0.2] [tracker:handle_topology_change] QuantumSwarm.Scheduler.JobBroadcaster has requested to be resumed
web_1 | 23:40:53.074 [debug] [swarm on quantum_swarm_umbrella@172.23.0.2] [tracker:handle_topology_change] sending handoff for QuantumSwarm.Scheduler.JobBroadcaster to quantum_swarm_umbrella@172.23.0.4
web_1 | 23:40:53.074 [debug] [swarm on quantum_swarm_umbrella@172.23.0.2] [tracker:handle_topology_change] #PID<0.1386.0> belongs on quantum_swarm_umbrella@172.23.0.3
web_1 | 23:40:53.074 [info] [:"quantum_swarm_umbrella@172.23.0.2"][Elixir.Quantum.ExecutionBroadcaster] Handing of state to other cluster node
web_1 | 23:40:53.074 [info] GenStage consumer QuantumSwarm.Scheduler.ExecutionBroadcaster is stopping after receiving cancel from producer #PID<0.1385.0> with reason: :shutdown
web_1 |
web_1 | 23:40:53.074 [info] GenStage consumer QuantumSwarm.Scheduler.ExecutorSupervisor is stopping after receiving cancel from producer #PID<0.1386.0> with reason: :shutdown
web_1 |
web_1 | 23:40:53.074 [debug] [swarm on quantum_swarm_umbrella@172.23.0.2] [tracker:handle_topology_change] QuantumSwarm.Scheduler.ExecutionBroadcaster has requested to be resumed
web_1 | 23:40:53.079 [debug] [swarm on quantum_swarm_umbrella@172.23.0.2] [tracker:handle_topology_change] sending handoff for QuantumSwarm.Scheduler.ExecutionBroadcaster to quantum_swarm_umbrella@172.23.0.3
web_1 | 23:40:53.079 [info] [swarm on quantum_swarm_umbrella@172.23.0.2] [tracker:handle_topology_change] topology change complete
web_1 | 23:40:53.080 [debug] [swarm on quantum_swarm_umbrella@172.23.0.2] [tracker:handle_replica_event] replica event: untrack #PID<0.1386.0>
web_1 | 23:40:53.083 [debug] [swarm on quantum_swarm_umbrella@172.23.0.2] [tracker:handle_monitor] :"Elixir.#Reference<0.1346946828.911998984.32651>" is down: :shutdown
web_1 | 23:40:53.085 [debug] [swarm on quantum_swarm_umbrella@172.23.0.2] [tracker:handle_call] registering QuantumSwarm.Scheduler.TaskRegistry as process started by Elixir.Quantum.TaskRegistry.start_link/1 with args [%Quantum.TaskRegistry.StartOpts{name: QuantumSwarm.Scheduler.TaskRegistry}]
web_1 | 23:40:53.085 [debug] [swarm on quantum_swarm_umbrella@172.23.0.2] [tracker:do_track] starting QuantumSwarm.Scheduler.TaskRegistry on remote node quantum_swarm_umbrella@172.23.0.4
web_3 | 23:40:54.371 [info] [swarm on quantum_swarm_umbrella@172.23.0.4] [tracker:cluster_wait] joining cluster..
web_3 | 23:40:54.371 [info] [swarm on quantum_swarm_umbrella@172.23.0.4] [tracker:cluster_wait] found connected nodes: [:"quantum_swarm_umbrella@172.23.0.2", :"quantum_swarm_umbrella@172.23.0.3"]
web_3 | 23:40:54.371 [info] [swarm on quantum_swarm_umbrella@172.23.0.4] [tracker:cluster_wait] selected sync node: quantum_swarm_umbrella@172.23.0.3
web_3 | 23:41:04.382 [info] Application quantum_swarm exited: QuantumSwarm.Application.start(:normal, []) returned an error: shutdown: failed to start child: QuantumSwarm.Scheduler
web_3 | ** (EXIT) exited in: :gen_statem.call(Swarm.Tracker, {:track, QuantumSwarm.Scheduler.TaskRegistry, %{mfa: {Quantum.TaskRegistry, :start_link, [%Quantum.TaskRegistry.StartOpts{name: QuantumSwarm.Scheduler.TaskRegistry}]}}}, 15000)
web_3 | ** (EXIT) time out
web_1 | 23:41:04.386 [error] [swarm on quantum_swarm_umbrella@172.23.0.2] [tracker:start_pid_remotely] ** (exit) exited in: :gen_statem.call({Swarm.Tracker, :"quantum_swarm_umbrella@172.23.0.4"}, {:track, QuantumSwarm.Scheduler.TaskRegistry, %{mfa: {Quantum.TaskRegistry, :start_link, [%Quantum.TaskRegistry.StartOpts{name: QuantumSwarm.Scheduler.TaskRegistry}]}}}, :infinity)
web_1 | ** (EXIT) shutdown
web_1 | (stdlib) gen.erl:177: :gen.do_call/4
web_1 | (stdlib) gen_statem.erl:598: :gen_statem.call_dirty/4
web_1 | (swarm) lib/swarm/tracker/tracker.ex:1115: Swarm.Tracker.start_pid_remotely/6
web_1 | (elixir) lib/task/supervised.ex:89: Task.Supervised.do_apply/2
web_1 | (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
web_1 |
web_1 | 23:41:04.386 [warn] [swarm on quantum_swarm_umbrella@172.23.0.2] [tracker:start_pid_remotely] failed to start QuantumSwarm.Scheduler.TaskRegistry on quantum_swarm_umbrella@172.23.0.4: {:shutdown, {:gen_statem, :call, [{Swarm.Tracker, :"quantum_swarm_umbrella@172.23.0.4"}, {:track, QuantumSwarm.Scheduler.TaskRegistry, %{mfa: {Quantum.TaskRegistry, :start_link, [%Quantum.TaskRegistry.StartOpts{name: QuantumSwarm.Scheduler.TaskRegistry}]}}}, :infinity]}}
web_1 | 23:41:04.387 [debug] [swarm on quantum_swarm_umbrella@172.23.0.2] [tracker:handle_call] registering QuantumSwarm.Scheduler.JobBroadcaster as process started by Elixir.Quantum.JobBroadcaster.start_link/1 with args [%Quantum.JobBroadcaster.StartOpts{debug_logging: true, jobs: [%Quantum.Job{name: #Reference<0.1346946828.911998979.32608>, overlap: true, run_strategy: %Quantum.RunStrategy.Random{nodes: :cluster}, schedule: ~e[*/3 * * * * * *]e, state: :active, task: {QuantumSwarm.Pinger, :ping, []}, timezone: :utc}], name: QuantumSwarm.Scheduler.JobBroadcaster, scheduler: QuantumSwarm.Scheduler, storage: Quantum.Storage.Noop}]
web_1 | 23:41:04.387 [debug] [swarm on quantum_swarm_umbrella@172.23.0.2] [tracker:do_track] starting QuantumSwarm.Scheduler.JobBroadcaster on remote node quantum_swarm_umbrella@172.23.0.4
web_1 | 23:41:04.388 [warn] [swarm on quantum_swarm_umbrella@172.23.0.2] [tracker:start_pid_remotely] remote tracker on quantum_swarm_umbrella@172.23.0.4 went down during registration, retrying operation..
...message repeated many times...
web_1 | 23:41:05.418 [warn] [swarm on quantum_swarm_umbrella@172.23.0.2] [tracker:start_pid_remotely] remote tracker on quantum_swarm_umbrella@172.23.0.4 went down during registration, retrying operation..
web_1 | 23:41:05.418 [warn] [swarm on quantum_swarm_umbrella@172.23.0.2] [tracker:start_pid_remotely] failed to start QuantumSwarm.Scheduler.JobBroadcaster on quantum_swarm_umbrella@172.23.0.4: nodedown, retrying operation..
web_1 | 23:41:05.419 [info] [swarm on quantum_swarm_umbrella@172.23.0.2] [tracker:nodedown] nodedown quantum_swarm_umbrella@172.23.0.4
web_1 | 23:41:05.419 [debug] [swarm on quantum_swarm_umbrella@172.23.0.2] [tracker:handle_topology_change] topology change (nodedown for quantum_swarm_umbrella@172.23.0.4)
web_1 | 23:41:05.419 [info] [swarm on quantum_swarm_umbrella@172.23.0.2] [tracker:handle_topology_change] topology change complete
web_3 | {"Kernel pid terminated",application_controller,"{application_start_failure,quantum_swarm,{{shutdown,{failed_to_start_child,'Elixir.QuantumSwarm.Scheduler',{timeout,{gen_statem,call,['Elixir.Swarm.Tracker',{track,'Elixir.QuantumSwarm.Scheduler.TaskRegistry',#{mfa => {'Elixir.Quantum.TaskRegistry',start_link,[#{'__struct__' => 'Elixir.Quantum.TaskRegistry.StartOpts',name => 'Elixir.QuantumSwarm.Scheduler.TaskRegistry'}]}}},15000]}}}},{'Elixir.QuantumSwarm.Application',start,[normal,[]]}}}"}
web_3 | Kernel pid terminated (application_controller) ({application_start_failure,quantum_swarm,{{shutdown,{failed_to_start_child,'Elixir.QuantumSwarm.Scheduler',{timeout,{gen_statem,call,['Elixir.Swarm.Trac
web_3 |
web_3 | Crash dump is being written to: erl_crash.dump...done
quantum_swarm_umbrella_web_3 exited with code 1
@doughsay I believe that the error comes from swarm this time (the timeout). The QuantumSwarm.Scheduler.ExecutionBroadcaster stopping is not really a problem, since it has restart: :permanent and will therefore come back and connect again. This is also expected when the gen_stage consumers migrate to another node.
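As a generic illustration of restart: :permanent (placeholder names, not Quantum's internal child spec), the supervisor restarts the child after any exit, including the :shutdown it gets during a Swarm handoff:
children = [
  # restart: :permanent → restarted after every exit, even a clean :shutdown
  Supervisor.child_spec({MyApp.Broadcaster, []}, restart: :permanent)
]

Supervisor.start_link(children, strategy: :one_for_one)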
This part should however not happen:
web_3 | 23:41:04.382 [info] Application quantum_swarm exited: QuantumSwarm.Application.start(:normal, []) returned an error: shutdown: failed to start child: QuantumSwarm.Scheduler
web_3 | ** (EXIT) exited in: :gen_statem.call(Swarm.Tracker, {:track, QuantumSwarm.Scheduler.TaskRegistry, %{mfa: {Quantum.TaskRegistry, :start_link, [%Quantum.TaskRegistry.StartOpts{name: QuantumSwarm.Scheduler.TaskRegistry}]}}}, 15000)
web_3 | ** (EXIT) time out
web_1 | 23:41:04.386 [error] [swarm on quantum_swarm_umbrella@172.23.0.2] [tracker:start_pid_remotely] ** (exit) exited in: :gen_statem.call({Swarm.Tracker, :"quantum_swarm_umbrella@172.23.0.4"}, {:track, QuantumSwarm.Scheduler.TaskRegistry, %{mfa: {Quantum.TaskRegistry, :start_link, [%Quantum.TaskRegistry.StartOpts{name: QuantumSwarm.Scheduler.TaskRegistry}]}}}, :infinity)
web_1 | ** (EXIT) shutdown
web_1 | (stdlib) gen.erl:177: :gen.do_call/4
web_1 | (stdlib) gen_statem.erl:598: :gen_statem.call_dirty/4
web_1 | (swarm) lib/swarm/tracker/tracker.ex:1115: Swarm.Tracker.start_pid_remotely/6
web_1 | (elixir) lib/task/supervised.ex:89: Task.Supervised.do_apply/2
web_1 | (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
I will go ahead and get the PR merged and also open an issue with swarm.
The latest issue with swarm is tracked further in #374; the rest will be released with #373.
We run our Elixir apps in docker-compose for testing and development and in Kubernetes for production. We also use libcluster to help our nodes discover one another. The new swarm-based version of quantum doesn't handle scaling in these environments very well.
I've put together a minimal test case showing this behavior; please see my repo here: https://github.com/peek-travel/quantum_swarm If you have Docker and Compose installed, it should be really easy to reproduce the error.
The setup is pretty simple: one global scheduler with one job that just prints "PING!" to stdout once every 3 seconds.
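For reference, the job amounts to roughly this (a sketch using Quantum's runtime API; the repo itself wires it up via config, and QuantumSwarm.Scheduler / QuantumSwarm.Pinger are the module names from the repro repo):
import Crontab.CronExpression

# Extended cron expression (second granularity): run every 3 seconds
QuantumSwarm.Scheduler.new_job()
|> Quantum.Job.set_schedule(~e[*/3 * * * * * *]e)
|> Quantum.Job.set_task({QuantumSwarm.Pinger, :ping, []})
|> QuantumSwarm.Scheduler.add_job()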
The error I get when scaling up the number of nodes is this:
Please have a look at the repo and let me know if you have any questions about what I did to set it up. I'll also be trying to look into the error above and see if I can figure out what's going on. If I do, I'll send a PR!
It should also be noted that quantum 2.2.7 does not have this bug, but it has others: while scaling up the number of nodes the same job might run multiple times, and while scaling down some runs might be missed. This isn't such a big deal in our case, so we are reverting to 2.2.7 until we can get to the bottom of the above error.
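For anyone following along, pinning back looks roughly like this in mix.exs (a sketch assuming a standard deps list):
# Keep quantum on the 2.2 line until the swarm-based scheduler stabilizes
defp deps do
  [
    {:quantum, "~> 2.2.7"}
  ]
end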