pangeo-forge / pangeo-forge-cloud-federation

Infrastructure for running pangeo-forge across multiple bakeries
Apache License 2.0

Enable spot instances #11

Closed thodson-usgs closed 7 months ago

thodson-usgs commented 8 months ago

This PR adds the option to use spot instances. Ref #10.

NOTE: This deploys without error, but I haven't started testing the checkpointing yet.

Ideas for checkpoint tuning

References

thodson-usgs commented 7 months ago

@ranchodeluxe, @yuvipanda I've come to understand that Flink does not use checkpointing in batch mode, so that extra configuration is unnecessary. Running the cluster on spot instances may be as simple as setting capacity_type="SPOT". However, I have left the default as "ON_DEMAND" and will continue testing "SPOT" on the USGS runner.
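For reference, a rough Terraform sketch of what that could look like, assuming the node group is declared with the stock aws_eks_node_group resource (the resource and variable names here are illustrative, not necessarily what this repo uses):

      # Illustrative only: names don't match the repo's actual resources.
      variable "capacity_type" {
        type        = string
        default     = "ON_DEMAND"
        description = "EKS node group capacity type: ON_DEMAND or SPOT"
      }

      resource "aws_eks_node_group" "flink_workers" {
        cluster_name    = aws_eks_cluster.this.name
        node_group_name = "flink-workers"
        node_role_arn   = aws_iam_role.node.arn
        subnet_ids      = var.subnet_ids

        # Flip to "SPOT" to run workers on spot capacity
        capacity_type = var.capacity_type

        scaling_config {
          desired_size = 1
          min_size     = 1
          max_size     = 10
        }
      }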

One thing that is a little unclear is whether we need to configure the job manager and autoscaler to always run "ON_DEMAND". Some blogs suggest this is a best practice, but their fault tolerance may already be covered by the managed Kubernetes service (EKS).

P.S. One more thought: my limited understanding is that in batch mode, Flink will restart the job (task?) on failure. So, if a recipe includes a particularly long and costly job, like a big rechunk and transfer, it might be advisable to stick with "ON_DEMAND".

thodson-usgs commented 7 months ago

Digging in a little further. I think we would:

  1. leave the core node group as on-demand
  2. create a second node group of spot instances
  3. configure the autoscaler to scale each node group: job managers to core and task managers to spot (see the sketch below). I'll need to investigate the last piece further.
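
As a sketch of step 2 (again with illustrative names, assuming the stock aws_eks_node_group resource):

      # Hypothetical spot node group for task managers; the existing core
      # (on-demand) group would keep hosting the job manager and autoscaler.
      resource "aws_eks_node_group" "spot_workers" {
        cluster_name    = aws_eks_cluster.this.name
        node_group_name = "spot-workers"
        node_role_arn   = aws_iam_role.node.arn
        subnet_ids      = var.subnet_ids
        capacity_type   = "SPOT"

        # Label the nodes so pods can be steered here with a node selector
        labels = {
          "node-lifecycle" = "spot"
        }

        scaling_config {
          desired_size = 0
          min_size     = 0
          max_size     = 20
        }
      }

For step 3, Flink's native Kubernetes integration exposes kubernetes.jobmanager.node-selector and kubernetes.taskmanager.node-selector, which should let us pin job managers to the core group and task managers to the spot group, but I still need to confirm how this interacts with the cluster autoscaler.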
thodson-usgs commented 7 months ago

Learning more, we might want to take advantage of Flink's High Availability (HA) Kubernetes Services: "The Operator supports both Kubernetes HA Services and Zookeeper HA Services for providing High-availability for Flink jobs. The HA solution can benefit from using additional Standby replicas, it will result in a faster recovery time, but Flink jobs will still restart when the Leader JobManager goes down."

Essentially, we would deploy an all-spot cluster and use standby replicas in case an instance terminates. So flink-config.yaml would include something like:

      # Enable HA cluster
      "high-availability.type"       : "kubernetes",
      "high-availability.storageDir" : "s3://${aws_s3_bucket.flink_store.id}/recovery",
      "kubernetes.cluster-id"        : <cluster_id>,

where <cluster_id> is the ID of the job manager (?). I don't think we need to worry about the flink-operator because it's on the control plane, which is managed by AWS.
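To actually get the standby replicas the docs mention, I believe we'd also bump the JobManager replica count in the same config block (hypothetical value; kubernetes.jobmanager.replicas can only exceed 1 when HA is enabled):

      # Standby JobManager for faster recovery; requires HA to be enabled
      "kubernetes.jobmanager.replicas" : "2",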

thodson-usgs commented 7 months ago

This PR will enable spot instances for testing. Before using this in production, we'll want to make some additional changes to handle failures (these changes will benefit "on demand" clusters too).

yuvipanda commented 7 months ago

Thanks @thodson-usgs