Open batpad opened 2 years ago
This repo is awesome! Thanks for getting this started. I also really like the name.
Re: GCP terraform, each time the orchestrator FastAPI backend is released, the release script runs the terraform in pangeo-forge/dataflow-status-monitoring@github-app-hook to set up (or check, if it already exists) the infrastructure required for sending job completion notifications back to ourselves when Dataflow jobs either succeed or fail.
That dataflow-status-monitoring code is mounted in orchestrator as a submodule, and called from here. Note that orchestrator imports dataflow-status-monitoring into a few different terraform environments (here), so that releasing development instances of the app doesn't inadvertently break the production infrastructure.
Now that the runner is using Flink (https://github.com/pangeo-forge/pangeo-forge-runner/pull/21), is any external Beam cluster (Dataflow on GCP) still needed?
I am still trying to understand the architecture by reading https://flink.apache.org/ecosystem/2020/02/22/apache-beam-how-beam-runs-on-top-of-flink.html and https://beam.apache.org/documentation/runners/flink, and I wonder if Beam is still in the picture or if Flink is enough to handle the jobs?
Well, I guess Dataflow is still needed; I am still trying to find where Flink is configured to use it.
Another question: any appetite to run Beam on Kubernetes and get rid of Dataflow, as described in https://python.plainenglish.io/apache-beam-flink-cluster-kubernetes-python-a1965f37b7cb?
Hi @echarles, thanks for chiming in here. This repo is a placeholder that we have not done much work on. Currently I can say we are interested in supporting Flink in addition to Dataflow, but not as a replacement for it. Some basic Flink configuration can be found in these tests, but we do not currently run any Flink in production. All of our production workloads are currently on Dataflow. If you're interested in participating in the conversation, we'd welcome you to join our recurring Pangeo Forge coordination call, which is listed on this calendar and also discussed here for any on-the-fly schedule adjustments.
Thx @cisaacstern, I will join the meeting next Monday, 2nd Jan.
Great, @echarles! Looking forward to it.
Thx for the warm welcome at today's meeting. I understand things are evolving ATM with the introduction of the new GCP Cloud Runner. I guess my goal is to run the services on K8s and not depend on GCP. Is that already possible/documented? If not, what is missing to make this happen?
We should add a gcp directory in the terraform folder to provision Dataflow on GCP, so the setup of GCP bakeries can be managed within this repository. From chat with @yuvipanda - "it just needs to provision the service account for dataflow".
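
As a starting point for that gcp directory, here is a minimal sketch of what provisioning the service account might look like. Everything specific in it is an assumption rather than something taken from this repo: the variable name project_id, the resource labels, and the account_id are hypothetical, and roles/dataflow.worker is just the standard role usually granted to a Dataflow worker service account. A bakery may need further roles (for example, access to its storage buckets). It also assumes the google provider is configured elsewhere.

```hcl
# Hypothetical sketch: names below are illustrative, not from this repo.
# Assumes the "google" provider is already configured.

variable "project_id" {
  type        = string
  description = "GCP project in which Dataflow jobs will run"
}

# Service account that Dataflow workers run as.
resource "google_service_account" "dataflow" {
  project      = var.project_id
  account_id   = "pangeo-forge-dataflow" # hypothetical name
  display_name = "Pangeo Forge Dataflow worker"
}

# Grant the Dataflow worker role to that account at the project level.
resource "google_project_iam_member" "dataflow_worker" {
  project = var.project_id
  role    = "roles/dataflow.worker"
  member  = "serviceAccount:${google_service_account.dataflow.email}"
}
```

Using google_project_iam_member (rather than an authoritative binding) only adds this one member to the role, so it should not disturb IAM grants managed elsewhere in the project.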