pangeo-forge / pangeo-forge-cloud-federation

Infrastructure for running pangeo-forge across multiple bakeries
Apache License 2.0

Add terraform to setup Dataflow on GCP #2

Open batpad opened 2 years ago

batpad commented 2 years ago

We should add a gcp directory in the terraform folder to provision Dataflow on GCP, so the setup of GCP bakeries can be managed within this repository.

From chat with @yuvipanda - "it just needs to provision the service account for dataflow".

cisaacstern commented 2 years ago

This repo is awesome! Thanks for getting this started. I also really like the name.

Re: GCP terraform, each time the orchestrator FastAPI backend is released, the release script runs the terraform in pangeo-forge/dataflow-status-monitoring@github-app-hook to set up (or check, if it already exists) the infrastructure required for sending job completion notifications back to ourselves when Dataflow jobs either succeed or fail.

That dataflow-status-monitoring code is mounted in orchestrator as a submodule, and called from here. Note that orchestrator imports dataflow-status-monitoring into a few different terraform environments (here), so that releasing development instances of the app doesn't inadvertently break the production infrastructure.

echarles commented 1 year ago

Now that the runner is using Flink (https://github.com/pangeo-forge/pangeo-forge-runner/pull/21), is any external Beam cluster (Dataflow on GCP) still needed?

I am still trying to understand the architecture by reading https://flink.apache.org/ecosystem/2020/02/22/apache-beam-how-beam-runs-on-top-of-flink.html and https://beam.apache.org/documentation/runners/flink and I wonder whether Beam is still in the picture or whether Flink alone is enough to handle the jobs.

echarles commented 1 year ago

Well, I guess Dataflow is still needed. I am still trying to find where Flink is configured to use it.

Another question: is there any appetite to run Beam on Kubernetes and get rid of Dataflow? This is described in https://python.plainenglish.io/apache-beam-flink-cluster-kubernetes-python-a1965f37b7cb

cisaacstern commented 1 year ago

Hi @echarles, thanks for chiming in here. This repo is a placeholder that we have not done much work on. Currently I can say we are interested in supporting Flink in addition to Dataflow, but not as a replacement for it. Some basic Flink configuration can be found in these tests, but we do not currently run any Flink in production. All of our production workloads are currently on Dataflow. If you're interested in participating in the conversation, we'd welcome you to join our recurring Pangeo Forge coordination call, which is listed on this calendar and also discussed here for any on-the-fly schedule adjustments.
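
For illustration only (this is not taken from the pangeo-forge codebase): in Beam, the execution backend is chosen through pipeline options, so supporting Flink alongside Dataflow mostly comes down to passing a different set of flags to the same pipeline. The sketch below builds those flag lists with the standard library only. The helper name `beam_runner_args` and all setting values are hypothetical; the flag names are the usual Beam pipeline options for each runner.

```python
def beam_runner_args(target, **settings):
    """Return Beam pipeline option flags for a given execution backend.

    `target` and the keys in `settings` are hypothetical names used for
    this sketch; only the flag names follow Beam's pipeline options.
    """
    if target == "dataflow":
        return [
            "--runner=DataflowRunner",
            f"--project={settings['project']}",
            f"--region={settings['region']}",
            f"--temp_location={settings['temp_location']}",
            # The service account that terraform would provision.
            f"--service_account_email={settings['service_account_email']}",
        ]
    if target == "flink":
        return [
            "--runner=FlinkRunner",
            # Address of the Flink cluster, e.g. one running on Kubernetes.
            f"--flink_master={settings['flink_master']}",
        ]
    raise ValueError(f"unknown target: {target!r}")


# Placeholder values, not real project settings.
dataflow_flags = beam_runner_args(
    "dataflow",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    service_account_email="bakery@my-project.iam.gserviceaccount.com",
)
flink_flags = beam_runner_args("flink", flink_master="localhost:8081")
```

Either list could then be handed to `apache_beam.options.pipeline_options.PipelineOptions(flags)` when constructing the pipeline, assuming `apache_beam` and the relevant runner dependencies are installed.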

echarles commented 1 year ago

Thx @cisaacstern, I will join the meeting next Monday, 2nd Jan.

cisaacstern commented 1 year ago

Great, @echarles! Looking forward to it.

echarles commented 1 year ago

Thx for the warm welcome at today's meeting. I understand things are evolving ATM with the introduction of the new GCP Cloud Runner. I guess my goal is to run the services on K8s and not depend on GCP. Is that already possible/documented? If not, what is missing to make this happen?