So, EFS is NFS. And NFS is one of those 'you have a problem, you think you will use NFS, and now you have two problems' situations. It plays poorly with a lot of data formats that use any kind of file locking (see https://www.sqlite.org/howtocorrupt.html#_filesystems_with_broken_or_missing_lock_implementations), and the file corruption only shows up at the worst possible times. So I think the primary, and perhaps the only, time to use NFS (and hence EFS) is when providing home directories.
Given we already have the EBS provisioner set up and use it for prometheus, can we not use EBS here too? It does mean that only one pod can write to an EBS volume at a time, but relying on NFS for multiple-replica high availability eventually leads only to tears, pain, blood, stale file handle crashes, and death.
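For concreteness, the difference shows up in the PVC access mode. A minimal sketch, assuming stock storage class names (not necessarily what this repo defines):

```yaml
# EBS-backed volumes top out at ReadWriteOnce: a single node mounts the
# volume read-write, which is fine for a single-pod writer like prometheus.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: single-writer-claim
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp2   # assumed EBS-backed storage class name
  resources:
    requests:
      storage: 10Gi
---
# EFS/NFS is what you reach for when you need ReadWriteMany: many pods
# across many nodes mounting the same filesystem read-write.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: many-writer-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc   # assumed EFS-backed storage class name
  resources:
    requests:
      storage: 10Gi
```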
Left some inline comments about the kubernetes provider.
Thanks for giving me the deep deets on why EFS/NFS is bad. I was going to use EBS but then I realized something when playing with multiple job managers that made me switch back to EFS:
- There's no reason we need to start the historyserver as the docs recommend. It seems the job manager REST API already serves the history API (that's basically how the job manager UI works).
- More importantly, even if a job manager DID NOT RUN a job, it can still find the archived job in the EFS mount and return information about it. This is important b/c it means any of the existing job manager REST APIs can tell us about all history, even if the job manager that actually ran a job is killed (hence needing multiple pods to have the EFS mount); a hedged config sketch follows below. In the future we are probably going to need to create some type of `kind: Job` || `kind: CronJob` reaper that cleans up `kind: FlinkDeployment` resources on a regular basis. If we do that, we can't expect job-manager pods to stick around anyway.
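The Flink knob involved is a one-liner; a hedged sketch, assuming the EFS volume is mounted at /mnt/flink-history in every job-manager pod (the mount path is my assumption, not fixed by Flink):

```yaml
# flink-conf.yaml (sketch): each job manager archives completed jobs here.
# If this directory lives on the shared EFS mount, any job manager can
# read the archive of a job it never ran.
jobmanager.archive.fs.dir: file:///mnt/flink-history/completed-jobs
```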
Does any of that assuage your fears and persuade you one way or the other @yuvipanda?
doh, so poor: https://github.com/hashicorp/terraform-provider-kubernetes/issues/1775#issuecomment-1193859982
maybe I just write a helm config since that works
YESSS, I always prefer this over raw manifests :)
Thanks for engaging with me on the EFS issues :) My goal here is not to say 'no EFS ever', but just to make sure we are only using it after we have completely determined that EBS is not an option.
So if I understand this correctly, the reasons for EFS over EBS are as follows. I think answers to these questions will help me a lot :)
1. Multiple pods may be writing to this filesystem.
   a. QUESTION: Will these be _concurrently_ writing to the same filesystem, or non-concurrently? What is the 'level' of concurrency: one writer per job, or multiple writers per job?
Since jobs (and hence pods) can run concurrently, then yes, these will be writing to the same filesystem concurrently.
I will need to look into the Flink source code more to discover how many writers there are per job. The logs make it seem like a single service handles the archival process, so I'm guessing one writer per job.
   b. QUESTION: Will these multiple writers be writing to the _same_ files, or different files? And concurrently, or serially?
I don't know the answer to this question until I investigate more.
This question anticipates another thing to confirm in the Flink source: how are Job IDs determined? Will they be unique across jobs (and hence pods), or only unique per job manager? Or are they hashes of the source? If the Job IDs are not unique, then multiple writers "could" be trying to write to the same file in the case of two jobs running simultaneously.
2. Will this reaper process require direct read and write access to the files dropped there by the flink servers? I don't think I fully understand the relationship between the reaper and EFS.
No, the reaper process doesn't need to access the EFS mount. It only checks `kind: FlinkDeployment` resources and their ages and then runs `kubectl delete` on any past some age expiry (see the sketch below).
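If we ever get there, a hypothetical sketch of that reaper (the schedule, expiry, image, and service account names are all illustrative assumptions; it also assumes `jq` is available in the image):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: flinkdeployment-reaper
spec:
  schedule: "0 * * * *"   # hourly; assumption
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          serviceAccountName: flinkdeployment-reaper   # needs RBAC to list/delete FlinkDeployments
          containers:
            - name: reaper
              image: bitnami/kubectl:latest   # assumed image with kubectl + jq
              command:
                - /bin/sh
                - -c
                - |
                  # delete FlinkDeployments whose creationTimestamp is past the expiry cutoff
                  cutoff=$(date -u -d '-1 day' +%Y-%m-%dT%H:%M:%SZ)
                  kubectl get flinkdeployments -o json \
                    | jq -r --arg cutoff "$cutoff" \
                        '.items[] | select(.metadata.creationTimestamp < $cutoff) | .metadata.name' \
                    | xargs -r kubectl delete flinkdeployment
```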
These clowns removed the 1.5.0 operator: https://downloads.apache.org/flink/flink-kubernetes-operator-1.5.0
> These clowns removed the 1.5.0 operator: https://downloads.apache.org/flink/flink-kubernetes-operator-1.5.0
Got confirmation from one of the devs that only the latest two operator versions are supported, and one was just released. He's not sure if this documentation applies to the operators as well, but it pretty much aligns:
https://flink.apache.org/downloads/#update-policy-for-old-releases
specific to the operator: https://cwiki.apache.org/confluence/display/FLINK/Release+Schedule+and+Planning
Thanks for working with me on this, @ranchodeluxe. I think using EFS is alright here! I've left some other minor comments, but overall lgtm
Sorry @yuvipanda, I thought I had muted this by turning it back into a draft so it wouldn't ping you. I'll do that now (it still needs a bit of work) and I'll incorporate your feedback before requesting another review. Here are answers to some of the previous questions:
- Multiple pods may be writing to this filesystem.
  a. QUESTION: Will these be concurrently writing to the same filesystem, or non-concurrently? What is the 'level' of concurrency: one writer per job, or multiple writers per job?
The `JobID`(s) returned are statistically unique. And the writer of history to the NFS is a single process/thread.
@yuvipanda gentle nudge with some 🧁 for dessert 😄
alrighty then @yuvipanda, back at this with recent changes so @thodson-usgs can use EFS
Looks good to me
Mount EFS to Job Managers so they can archive jobs for historical status lookups
Addresses: https://github.com/pangeo-forge/pangeo-forge-runner/issues/122
Related PR: https://github.com/pangeo-forge/pangeo-forge-runner/pull/131
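For reviewers, a hedged sketch of roughly what this mount looks like on a `FlinkDeployment` (the claim name, mount path, image, and Flink version are illustrative assumptions, not necessarily what this PR ships):

```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: example-session-cluster
spec:
  image: flink:1.16            # assumed Flink image/version
  flinkVersion: v1_16
  serviceAccount: flink
  flinkConfiguration:
    # archive completed jobs onto the shared EFS mount
    jobmanager.archive.fs.dir: file:///mnt/flink-history/completed-jobs
  jobManager:
    resource:
      memory: "2048m"
      cpu: 1
  taskManager:
    resource:
      memory: "2048m"
      cpu: 1
  podTemplate:
    spec:
      containers:
        - name: flink-main-container   # the operator's conventional main-container name
          volumeMounts:
            - name: history-archive
              mountPath: /mnt/flink-history
      volumes:
        - name: history-archive
          persistentVolumeClaim:
            claimName: flink-history-archive   # assumed EFS-backed ReadWriteMany claim
```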