rakutentech / shibuya

Apache License 2.0

Shibuya distributed mode #19

Open iandyh opened 3 years ago

iandyh commented 3 years ago

This is a big one. Let me break it into smaller tasks:

iandyh commented 2 years ago

c.resumeRunningPlans()
go c.streamToApi()
go c.readConnectedEngines()
go c.checkRunningThenTerminate()
go c.fetchEngineMetrics()
go c.cleanLocalStore()
go c.autoPurgeDeployments()

Because of these goroutines, the controller is currently not stateless. These are the paths we can follow:

[1] Move all of the stateful logic to the workers.
[2] If that is not possible, some kind of leader election might be required, so that only the leader does the stateful work and the other replicas just handle API requests.

If we need [2], then we need to consider what happens during leader election, for example during a release or when the leader goes down.

Another challenge is that once the controller has replicas, Prom can no longer get the metrics.

resumeRunningPlans

This is required to continue reading the metrics when the controller process is restarted. [1]

streamToApi

This is for raw metrics streaming. Currently the metrics are collected in heap memory. We need the workers to report the metrics to a broker (Redis is a good candidate) and let the controller be the consumer. [1]
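The worker-to-broker-to-controller flow could look roughly like the sketch below. All names here (`Metric`, `Broker`, `ChanBroker`, `Drain`) are hypothetical and not taken from Shibuya. The channel-backed broker is only an in-memory stand-in so the example runs on its own; a Redis-backed type would implement the same interface.

```go
package main

// Metric is a hypothetical raw metric sample reported by a worker.
type Metric struct {
	CollectionID int64
	PlanID       int64
	Line         string
}

// Broker abstracts the message broker that sits between the workers
// (producers) and the controller (consumer).
type Broker interface {
	Publish(m Metric)
	Consume() <-chan Metric
}

// ChanBroker is an in-memory stand-in for a real broker.
type ChanBroker struct {
	ch chan Metric
}

// NewChanBroker creates a broker that buffers up to size metrics.
func NewChanBroker(size int) *ChanBroker {
	return &ChanBroker{ch: make(chan Metric, size)}
}

// Publish is what a worker would call for every metric it produces.
func (b *ChanBroker) Publish(m Metric) { b.ch <- m }

// Consume exposes the stream the controller reads from.
func (b *ChanBroker) Consume() <-chan Metric { return b.ch }

// Close signals that no more metrics will be published.
func (b *ChanBroker) Close() { close(b.ch) }

// Drain is the controller side: it reads metrics until the stream is closed,
// hands each one to forward, and returns how many it processed.
func Drain(b Broker, forward func(Metric)) int {
	n := 0
	for m := range b.Consume() {
		forward(m)
		n++
	}
	return n
}
```

Because the controller only consumes from the broker, it no longer needs to hold the raw metrics in its own heap, which is the property option [1] is after.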

checkRunningThenTerminate

We track the progress of each running plan and stop (GC) everything once the duration is reached. Currently we fetch all the running plans. It seems pretty difficult to move this logic into the workers. [2]

fetchEngineMetrics

This is for showing the engines' metric usage on the executor side. Currently we fetch all the engines with the GetDeployedCollection method. This is also difficult to move to the workers. [2] In fact, we cannot keep this method in the controller, because Prom cannot fetch the metrics once we scale up the controller.

cleanLocalStore

This is to clean Prom data. Easy to move. [1]

autoPurgeDeployments

This is the GC process that cleans up idle engines. We use GetDeployedCollection to fetch all the engines and then filter them. [2]

iandyh commented 1 year ago

Before going into details, there are also some items that need to be done.