pangeo-forge / pangeo-forge-cloud-federation

Infrastructure for running pangeo-forge across multiple bakeries
Apache License 2.0

WIP: Session deployment #22

Open thodson-usgs opened 6 months ago

thodson-usgs commented 6 months ago

This PR would configure Flink to run in session mode. Essentially, it would create a single job manager for the cluster, and all pangeo-forge-recipes would submit their jobs to that job manager. One of the main advantages is that all infrastructure configuration would be centralized in pangeo-forge-cloud-federation. Currently, infrastructure configuration is spread across pangeo-forge-cloud-federation, pangeo-forge-runner, and each individual recipe's config.py, which makes the cluster difficult to configure. Ideally, we could have multiple node pools of on-demand and spot instances, high-availability job managers, reactive scaling, default failure strategies, etc., and set all of that within pangeo-forge-cloud-federation. The recipe and pangeo-forge-runner would then require minimal configuration, like setting parallelism and the job name.
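
To illustrate the split I have in mind, here is a rough Python sketch. It is not code from this PR: `CLUSTER_DEFAULTS`, `JOB_LEVEL_KEYS`, `build_job_config`, and every key name and value are hypothetical placeholders, not real Flink, pangeo-forge-runner, or pangeo-forge-cloud-federation options. It only shows cluster-wide settings being owned in one place while a job supplies just its name and parallelism.

```python
# Hypothetical sketch: none of these names or keys come from Flink,
# pangeo-forge-runner, or pangeo-forge-cloud-federation.

# Cluster-wide defaults that would be owned by pangeo-forge-cloud-federation.
CLUSTER_DEFAULTS = {
    "node_pools": ["on-demand", "spot"],
    "high_availability": True,
    "failure_strategy": "restart",
}

# The only settings a recipe or pangeo-forge-runner would still supply.
JOB_LEVEL_KEYS = {"job_name", "parallelism"}


def build_job_config(job_settings: dict) -> dict:
    """Combine job-level settings with cluster defaults.

    Any key outside JOB_LEVEL_KEYS is rejected, so cluster-level
    settings can only be changed centrally.
    """
    unexpected = set(job_settings) - JOB_LEVEL_KEYS
    if unexpected:
        raise ValueError(
            f"these keys must be set centrally: {sorted(unexpected)}"
        )
    return {**CLUSTER_DEFAULTS, **job_settings}


config = build_job_config({"job_name": "example-recipe", "parallelism": 4})
```

With this shape, a recipe that tried to pass something like `node_pools` would fail fast instead of silently diverging from the cluster configuration.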