pangeo-forge / pangeo-forge-cloud-federation

Infrastructure for running pangeo-forge across multiple bakeries
Apache License 2.0

WIP: Split job and task manager node groups #21

Closed thodson-usgs closed 6 months ago

thodson-usgs commented 7 months ago

This PR would split the job managers and task managers between two node groups, so that job managers could run on On-Demand nodes while task managers run on Spot.

I'm not sure of the best strategy for using Spot, so I'm advancing this one for testing and comment.
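For anyone following along, here is a minimal sketch of how the split could be expressed as Flink configuration. It assumes Flink's native Kubernetes options `kubernetes.jobmanager.node-selector` and `kubernetes.taskmanager.node-selector`, which take comma-separated `key:value` pairs, and it assumes the node groups carry the EKS managed node group label `eks.amazonaws.com/capacityType`. The helper name `node_selector_config` is hypothetical and not part of this PR, and whether a given deployment honors these settings needs to be checked on the cluster.

```python
def node_selector_config(jm_labels, tm_labels):
    """Build Flink config entries that pin JobManager and TaskManager pods
    to nodes carrying the given labels (hypothetical helper)."""

    def fmt(labels):
        # Flink expects "key1:value1,key2:value2"
        return ",".join(f"{k}:{v}" for k, v in sorted(labels.items()))

    return {
        "kubernetes.jobmanager.node-selector": fmt(jm_labels),
        "kubernetes.taskmanager.node-selector": fmt(tm_labels),
    }


# Assumed labels: EKS managed node groups tag nodes with their capacity type.
flink_conf = node_selector_config(
    {"eks.amazonaws.com/capacityType": "ON_DEMAND"},
    {"eks.amazonaws.com/capacityType": "SPOT"},
)

for key, value in flink_conf.items():
    print(f"{key}: {value}")
```

The resulting dictionary could be merged into whatever Flink configuration the deployment already passes; the same two keys could equally be written by hand in a YAML config file.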

thodson-usgs commented 7 months ago

😮‍💨 the node-selector tags are being ignored and all nodes are going to one node group. Attempted several fixes, but none were successful.

...wondering if this is a case where I need to pass the selector tag through config.py rather than flink-config.yaml

thodson-usgs commented 6 months ago

I'm stuck on flink-config.yaml. Using config.py might work, but this is quickly becoming an antipattern. I think it's time to refactor runner and flink-deployment. Namely, we need to move the pod template from runner to deployment, which will decouple the two and give us much more flexibility in optimizing the deployment. I believe Flink's session mode will be the easiest way to achieve this.
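A rough sketch of what a deployment-owned pod template might look like, written as a Python dict for illustration. It assumes the Flink Kubernetes Operator's `FlinkDeployment` resource, where `spec.jobManager.podTemplate` and `spec.taskManager.podTemplate` accept ordinary pod specs and a resource with no `spec.job` section runs as a session cluster. The helper names and the label values are hypothetical, and the fragment omits required fields such as the image and Flink version, so it is not a complete manifest.

```python
def pod_template(node_selector):
    """Pod template fragment that only sets a nodeSelector (hypothetical helper)."""
    return {"spec": {"nodeSelector": dict(node_selector)}}


def session_deployment_fragment(jm_selector, tm_selector):
    """Partial FlinkDeployment with separate pod templates per role.

    No "job" key is set under "spec", which is how a session cluster
    is expressed in the operator (assumption stated in the lead-in).
    """
    return {
        "apiVersion": "flink.apache.org/v1beta1",
        "kind": "FlinkDeployment",
        "spec": {
            "jobManager": {"podTemplate": pod_template(jm_selector)},
            "taskManager": {"podTemplate": pod_template(tm_selector)},
        },
    }


# Assumed node labels; substitute whatever labels the node groups actually carry.
fragment = session_deployment_fragment(
    {"eks.amazonaws.com/capacityType": "ON_DEMAND"},
    {"eks.amazonaws.com/capacityType": "SPOT"},
)
```

With the templates living in the deployment, the runner would only need to submit jobs to the session cluster and would not have to know about node placement.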