Closed thodson-usgs closed 9 months ago
@ranchodeluxe, @yuvipanda
I've come to understand that Flink does not use checkpointing in batch mode, so that extra configuration is unnecessary. Running the cluster on spot instances may be as simple as setting capacity_type="SPOT". However, I have left the default as "ON_DEMAND" and will continue testing "SPOT" on the USGS runner.
One thing that is still unclear is whether we need to configure the job manager and autoscaler to always run "ON_DEMAND". Some blogs suggest this is a best practice, but that fault tolerance might already be covered by the managed Kubernetes service (EKS).
P.S. One more thought
My limited understanding is that in batch mode, Flink will restart the job (task?) on failure. So, if a recipe includes a particularly long and costly job, like a big rechunk and transfer, it might be advisable to stick with "ON_DEMAND".
Digging in a little further. I think we would:
Learning more, we might want to take advantage of Flink's High Availability (HA) Kubernetes Services: "The Operator supports both Kubernetes HA Services and Zookeeper HA Services for providing High-availability for Flink jobs. The HA solution can benefit from using additional Standby replicas, it will result in a faster recovery time, but Flink jobs will still restart when the Leader JobManager goes down."
Essentially, we would deploy an all-spot cluster and use standby replicas in case an instance terminates. So flink-config.yaml would include something like:
# Enable HA cluster
"high-availability.type" : "kubernetes",
"high-availability.storageDir" : "s3://${aws_s3_bucket.flink_store.id}/recovery",
"kubernetes.cluster-id" : <cluster_id>,
where <cluster_id> is the id of the job manager (?). I don't think we need to worry about flink-operator because it's on the control plane, which is managed by AWS.
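To make the three settings above concrete, here is a minimal Python sketch that assembles them and renders them as key/value lines. The helper names (ha_flink_config, render_flink_conf), the example cluster id, and the bucket name are hypothetical and not part of this PR. The 45-character check reflects the limit the Flink documentation describes for kubernetes.cluster-id, so treat it as an assumption to verify against the Flink version in use.

```python
from typing import Dict

# Assumption: Flink's docs describe kubernetes.cluster-id as at most 45 characters.
MAX_CLUSTER_ID_LENGTH = 45


def ha_flink_config(cluster_id: str, bucket: str) -> Dict[str, str]:
    """Build the extra Flink settings for Kubernetes HA (hypothetical helper)."""
    if not cluster_id:
        raise ValueError("cluster_id must not be empty")
    if len(cluster_id) > MAX_CLUSTER_ID_LENGTH:
        raise ValueError(
            f"cluster_id is {len(cluster_id)} characters; "
            f"expected at most {MAX_CLUSTER_ID_LENGTH}"
        )
    return {
        # Use Kubernetes (ConfigMap-based) HA services instead of ZooKeeper.
        "high-availability.type": "kubernetes",
        # Durable location for JobManager recovery metadata.
        "high-availability.storageDir": f"s3://{bucket}/recovery",
        # Identifies this Flink cluster within the Kubernetes namespace.
        "kubernetes.cluster-id": cluster_id,
    }


def render_flink_conf(settings: Dict[str, str]) -> str:
    """Render settings as 'key: value' lines, one per setting."""
    return "\n".join(f"{key}: {value}" for key, value in settings.items())


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(render_flink_conf(ha_flink_config("example-cluster", "example-bucket")))
```

Validating the cluster id up front is meant to surface a bad value at deploy time rather than as a failed Flink startup.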
This PR will enable spot instances for testing. Before this is used in production, we'll want to make some additional changes to handle failures (this will serve "on demand" clusters too).
Thanks @thodson-usgs
This PR enables optionally using spot instances. Ref #10.
NOTE: This deploys without error, but I haven't started testing the checkpointing yet.
Ideas for checkpoint tuning
checkpointing.min-pause
References