quantum-elixir / quantum-core

:watch: Cron-like job scheduler for Elixir
https://hexdocs.pm/quantum/
Apache License 2.0

Problem: Is there one master node by default when using a cluster? #380

Closed xinz closed 6 years ago

xinz commented 6 years ago

When setting up multiple nodes locally, e.g.

iex --name ac1@127.0.0.1 -S mix
iex --name ac2@127.0.0.1 -S mix
iex --name ac3@127.0.0.1 -S mix

After starting these nodes, I manually ping them so they connect to each other.
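For reference, connecting the nodes by hand can be done with Node.ping/1; a minimal sketch from the iex shell of ac1, using the node names from the commands above:

Node.ping(:"ac2@127.0.0.1")  # returns :pong once the connection is established
Node.ping(:"ac3@127.0.0.1")  # returns :pong
Node.list()                  # e.g. [:"ac2@127.0.0.1", :"ac3@127.0.0.1"]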

Scheduler

defmodule QuantumDemo.Scheduler do
  use Quantum.Scheduler, otp_app: :quantum_demo
end
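The scheduler is also started in the application's supervision tree, per the quantum setup guide. A minimal sketch, assuming the usual generated QuantumDemo.Application module:

defmodule QuantumDemo.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      # Starts the quantum scheduler defined above
      QuantumDemo.Scheduler
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: QuantumDemo.Supervisor)
  end
end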

config.exs

config :swarm,
  node_whitelist: [~r/^ac[\d]@.*$/]

config :quantum_demo, QuantumDemo.Scheduler,
  global: true,
  overlap: false,
  jobs: [
    [
      name: "cleaner",
      schedule: "* * * * *",
      task: {QuantumDemo, :run_cleaner, []},
      run_strategy: {Quantum.RunStrategy.Random, :cluster}
    ]
  ]
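QuantumDemo.run_cleaner/0 itself isn't shown here; a hypothetical minimal version used for this reproduction could look like this:

defmodule QuantumDemo do
  require Logger

  # Hypothetical job body referenced by {QuantumDemo, :run_cleaner, []};
  # logging Node.self() shows which node actually ran the job.
  def run_cleaner do
    Logger.info("run_cleaner executed on #{Node.self()}")
  end
end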

Using: gen_stage 0.14.1, quantum 2.3.3, swarm 3.3.1
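For completeness, the matching mix.exs dependency entries would look roughly like this:

defp deps do
  [
    {:gen_stage, "~> 0.14.1"},
    {:quantum, "~> 2.3.3"},
    {:swarm, "~> 3.3.1"}
  ]
end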

Looking at it from observer, I find that within the swarm application there is one "master" node (ac1@127.0.0.1, see the screenshots below) hosting the JobBroadcaster/ExecutionBroadcaster/TaskRegistry processes:

[screenshots: swarm process tree in observer, showing JobBroadcaster, ExecutionBroadcaster and TaskRegistry registered on ac1@127.0.0.1]
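(The process trees above come from the Erlang observer GUI, which can be opened from any of the iex shells:)

:observer.start()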

The quantum_demo processes prepared on both of the other nodes are also ready:

[screenshots: quantum_demo supervision trees on the other nodes]

If I stop the ac2 or ac3 node, the application keeps working; nodes that are restarted and rejoin the cluster continue to run the job globally.

However, if I stop the ac1@127.0.0.1 node, the rest of the cluster nodes won't work together any more and fail with the following errors:

2018-10-23 18:30:12.759 [warn] [swarm on ac2@127.0.0.1] [tracker:broadcast_event] broadcast of event ({:untrack, #PID<21404.219.0>}) was not recevied by [:"ac1@127.0.0.1"]
2018-10-23 18:30:12.759 [debug] [swarm on ac2@127.0.0.1] [tracker:handle_monitor] lost connection to QuantumDemo.Scheduler.TaskRegistry (#PID<21404.216.0>) on ac1@127.0.0.1, node is down
2018-10-23 18:30:12.759 [info] [swarm on ac2@127.0.0.1] [tracker:nodedown] nodedown ac1@127.0.0.1
2018-10-23 18:30:12.759 [debug] [swarm on ac2@127.0.0.1] [tracker:handle_topology_change] topology change (nodedown for ac1@127.0.0.1)
2018-10-23 18:30:12.759 [debug] [swarm on ac2@127.0.0.1] [tracker:handle_topology_change] restarting QuantumDemo.Scheduler.ExecutionBroadcaster on ac2@127.0.0.1
2018-10-23 18:30:12.760 [debug] [swarm on ac2@127.0.0.1] [tracker:do_track] starting QuantumDemo.Scheduler.ExecutionBroadcaster on ac2@127.0.0.1
2018-10-23 18:30:12.760 [debug] [:"ac2@127.0.0.1"][Elixir.Quantum.ExecutionBroadcaster] Unknown last execution time, using now
2018-10-23 18:30:12.761 [debug] [swarm on ac2@127.0.0.1] [tracker:do_track] started QuantumDemo.Scheduler.ExecutionBroadcaster on ac2@127.0.0.1
2018-10-23 18:30:12.761 [debug] [swarm on ac2@127.0.0.1] [tracker:handle_topology_change] restarting QuantumDemo.Scheduler.JobBroadcaster on ac2@127.0.0.1
2018-10-23 18:30:12.761 [debug] [swarm on ac2@127.0.0.1] [tracker:do_track] starting QuantumDemo.Scheduler.JobBroadcaster on ac2@127.0.0.1
2018-10-23 18:30:12.761 [debug] [:"ac2@127.0.0.1"][Elixir.Quantum.JobBroadcaster] Loading Initial Jobs from Config
2018-10-23 18:30:12.761 [debug] [swarm on ac2@127.0.0.1] [tracker:do_track] started QuantumDemo.Scheduler.JobBroadcaster on ac2@127.0.0.1
2018-10-23 18:30:12.761 [debug] [swarm on ac2@127.0.0.1] [tracker:handle_topology_change] restarting QuantumDemo.Scheduler.TaskRegistry on ac2@127.0.0.1
2018-10-23 18:30:12.761 [debug] [swarm on ac2@127.0.0.1] [tracker:do_track] starting QuantumDemo.Scheduler.TaskRegistry on ac2@127.0.0.1
2018-10-23 18:30:12.762 [debug] [swarm on ac2@127.0.0.1] [tracker:do_track] started QuantumDemo.Scheduler.TaskRegistry on ac2@127.0.0.1
2018-10-23 18:30:12.762 [info] [swarm on ac2@127.0.0.1] [tracker:handle_topology_change] topology change complete
2018-10-23 18:30:12.763 [info] GenStage consumer QuantumDemo.Scheduler.ExecutionBroadcaster is stopping after receiving cancel from producer #PID<21404.217.0> with reason: :noconnection

2018-10-23 18:30:12.786 [error] GenServer QuantumDemo.Scheduler.ExecutorSupervisor terminating
** (stop) no connection
Last message: {:DOWN, #Reference<0.1873331614.2499280897.10951>, :process, #PID<21404.218.0>, :noconnection}
State: %ConsumerSupervisor{args: %Quantum.ExecutorSupervisor.InitOpts{cluster_task_supervisor_registry_reference: QuantumDemo.Scheduler.ClusterTaskSupervisorRegistry, debug_logging: true, execution_broadcaster_reference: {:via, :swarm, QuantumDemo.Scheduler.ExecutionBroadcaster}, task_registry_reference: {:via, :swarm, QuantumDemo.Scheduler.TaskRegistry}, task_supervisor_reference: QuantumDemo.Scheduler.Task.Supervisor}, children: %{}, max_restarts: 3, max_seconds: 5, mod: Quantum.ExecutorSupervisor, name: QuantumDemo.Scheduler.ExecutorSupervisor, producers: %{}, restarting: 0, restarts: [], strategy: :one_for_one, template: {Quantum.Executor, {Quantum.Executor, :start_link, [%Quantum.Executor.StartOpts{cluster_task_supervisor_registry_reference: QuantumDemo.Scheduler.ClusterTaskSupervisorRegistry, debug_logging: true, task_registry_reference: {:via, :swarm, QuantumDemo.Scheduler.TaskRegistry}, task_supervisor_reference: QuantumDemo.Scheduler.Task.Supervisor}]}, :temporary, 5000, :worker, [Quantum.Executor]}}
2018-10-23 18:30:12.802 [error] GenServer QuantumDemo.Scheduler.ExecutionBroadcaster terminating
** (stop) no connection
Last message: {:DOWN, #Reference<0.1873331614.2499280897.16020>, :process, #PID<21404.217.0>, :noconnection}
State: %Quantum.ExecutionBroadcaster.State{debug_logging: true, jobs: [], scheduler: QuantumDemo.Scheduler, storage: Quantum.Storage.Noop, time: ~N[2018-10-23 10:30:12.760287], timer: nil}
2018-10-23 18:30:12.802 [debug] [swarm on ac2@127.0.0.1] [tracker:handle_monitor] lost connection to QuantumDemo.Scheduler.ExecutionBroadcaster (#PID<0.351.0>) on ac2@127.0.0.1, node is down
2018-10-23 18:30:12.803 [info] GenStage consumer QuantumDemo.Scheduler.ExecutorSupervisor is stopping after receiving cancel from producer #PID<0.351.0> with reason: :noconnection

2018-10-23 18:30:12.804 [error] GenServer QuantumDemo.Scheduler.ExecutorSupervisor terminating
** (stop) no connection
Last message: {:DOWN, #Reference<0.1873331614.2499280898.12891>, :process, #PID<0.351.0>, :noconnection}
State: %ConsumerSupervisor{args: %Quantum.ExecutorSupervisor.InitOpts{cluster_task_supervisor_registry_reference: QuantumDemo.Scheduler.ClusterTaskSupervisorRegistry, debug_logging: true, execution_broadcaster_reference: {:via, :swarm, QuantumDemo.Scheduler.ExecutionBroadcaster}, task_registry_reference: {:via, :swarm, QuantumDemo.Scheduler.TaskRegistry}, task_supervisor_reference: QuantumDemo.Scheduler.Task.Supervisor}, children: %{}, max_restarts: 3, max_seconds: 5, mod: Quantum.ExecutorSupervisor, name: QuantumDemo.Scheduler.ExecutorSupervisor, producers: %{}, restarting: 0, restarts: [], strategy: :one_for_one, template: {Quantum.Executor, {Quantum.Executor, :start_link, [%Quantum.Executor.StartOpts{cluster_task_supervisor_registry_reference: QuantumDemo.Scheduler.ClusterTaskSupervisorRegistry, debug_logging: true, task_registry_reference: {:via, :swarm, QuantumDemo.Scheduler.TaskRegistry}, task_supervisor_reference: QuantumDemo.Scheduler.Task.Supervisor}]}, :temporary, 5000, :worker, [Quantum.Executor]}}
2018-10-23 18:30:12.805 [info] GenStage consumer QuantumDemo.Scheduler.ExecutorSupervisor is stopping after receiving cancel from producer #PID<0.351.0> with reason: :noproc

...
...

2018-10-23 18:30:12.827 [error] GenServer QuantumDemo.Scheduler.ExecutorSupervisor terminating
** (stop) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
Last message: {:DOWN, #Reference<0.1873331614.2499280898.13046>, :process, #PID<0.351.0>, :noproc}
State: %ConsumerSupervisor{args: %Quantum.ExecutorSupervisor.InitOpts{cluster_task_supervisor_registry_reference: QuantumDemo.Scheduler.ClusterTaskSupervisorRegistry, debug_logging: true, execution_broadcaster_reference: {:via, :swarm, QuantumDemo.Scheduler.ExecutionBroadcaster}, task_registry_reference: {:via, :swarm, QuantumDemo.Scheduler.TaskRegistry}, task_supervisor_reference: QuantumDemo.Scheduler.Task.Supervisor}, children: %{}, max_restarts: 3, max_seconds: 5, mod: Quantum.ExecutorSupervisor, name: QuantumDemo.Scheduler.ExecutorSupervisor, producers: %{}, restarting: 0, restarts: [], strategy: :one_for_one, template: {Quantum.Executor, {Quantum.Executor, :start_link, [%Quantum.Executor.StartOpts{cluster_task_supervisor_registry_reference: QuantumDemo.Scheduler.ClusterTaskSupervisorRegistry, debug_logging: true, task_registry_reference: {:via, :swarm, QuantumDemo.Scheduler.TaskRegistry}, task_supervisor_reference: QuantumDemo.Scheduler.Task.Supervisor}]}, :temporary, 5000, :worker, [Quantum.Executor]}}
2018-10-23 18:30:12.827 [debug] [swarm on ac2@127.0.0.1] [tracker:handle_monitor] :"Elixir.#Reference<0.1873331614.2499280898.13027>" is down: :shutdown
2018-10-23 18:30:12.829 [info] Application quantum_demo exited: shutdown

Could you please advise whether this is known behavior/by design?

If quantum elects one node as the "master" by default, can we configure more nodes to act as "master" to improve the application's availability?

maennchen commented 6 years ago

@xinz This is a known issue (#374, https://github.com/quantum-elixir/quantum-core/issues/368). I haven't had the time yet to come up with a final solution.

I'm therefore going to close this issue as a duplicate.

If quantum elects one node as the "master" by default, can we configure more nodes to act as "master" to improve the application's availability?

This is not possible with the current design of the application. If we wanted to implement it like that, we'd need a much more complicated setup.

If you would like multi-master as a feature and are willing to help out with the implementation, feel free to open an issue and start a discussion about it.

xinz commented 6 years ago

@maennchen thanks. What I'm hoping for is that shutting down any node won't stop the job(s) from continuing to run on the remaining nodes (if any exist); every node in the cluster should have the same role/priority when it comes to distributing the worker processes.

I will keep watching the related issues.