pangeo-forge / terraclimate-feedstock-archive

A pangeo-smithy repository for the terraclimate dataset.
Apache License 2.0

Connecting to Prefect-Cloud and CI #3

Open jhamman opened 4 years ago

jhamman commented 4 years ago

I've thrown this feedstock together to help us explore how to connect a CI system (e.g. GitHub Actions) to a Dask-Gateway Cluster / Prefect Cloud. The repository is laid out as follows:

@TomAugspurger - hopefully this has the needed pieces to connect to what you were working on with Prefect Cloud.

cc @rabernat

TomAugspurger commented 4 years ago

Thanks. I need to run in a few minutes, but my WIP is at https://github.com/TomAugspurger/prefect-demo.

I'm close-ish to getting something working, but the handling of DaskExecutor <-> Dask Cluster on our k8s cluster is a bit awkward. May be best to sync up sometime in the next couple days to talk things through.

jhamman commented 4 years ago

Great! I'm around the next few days so just ping me when you're ready to chat.

Also, for reference: on a Pangeo hub, adding the following bit to run.py lets you run the flow "manually":

    from dask_gateway import Gateway
    from prefect.engine.executors import DaskExecutor

    # Spin up a Dask-Gateway cluster and point Prefect's DaskExecutor at it.
    # `pipeline` is the recipe's pipeline object (e.g. TerraclimatePipeline).
    gateway = Gateway()
    options = gateway.cluster_options()
    options.worker_cores = 8
    options.worker_memory = 51
    cluster = gateway.new_cluster(cluster_options=options)
    cluster.scale(4)
    executor = DaskExecutor(
        address=cluster.scheduler_address,
        client_kwargs={"security": cluster.security},
    )
    pipeline.flow.run(executor=executor)
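One hedged addition to the snippet above: tearing the cluster down once the run finishes, so manual runs don't leave workers up on the hub.

    # Hedged addition, assuming the `cluster` object from the snippet above.
    cluster.shutdown()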
TomAugspurger commented 4 years ago

@jhamman are you free around 12:30 Central / 10:30 Pacific?

I think Ryan was right about Dask-Gateway creating unnecessary difficulties. I'm able to get the DaskKubernetesExecutor working OK. I'm porting this workflow now and will make a PR.

jhamman commented 4 years ago

Yes I am!

TomAugspurger commented 4 years ago

I was able to complete this flow using my prefect cloud account and our staging k8s cluster.

One-time setup, plus some steps that probably need repeating per k8s namespace. Note that some names may differ (previously I used the pangeo-forge bucket; now it's pangeo-forge-scratch).

1. Create an account with prefect-cloud (will do this with the pangeo bot).

2. Create an auth token:

```console
$ prefect auth create-token -n pangeo-forge-token --scope=RUNNER
```

3. Install the prefect agent:

```console
$ prefect agent install kubernetes -t <token> -l github-flow-storage -l gcp --rbac --namespace=staging --image-pull-policy=Always | kubectl apply -n staging -f -
deployment.apps/prefect-agent configured
role.rbac.authorization.k8s.io/prefect-agent-rbac created
rolebinding.rbac.authorization.k8s.io/prefect-agent-rbac created
```

Note: I tried adding a `--label=...` but that messed up prefect; the flow just didn't run.

Note: We'll need to repeat this for prod.

4. Add a Kubernetes ServiceAccount. This needs read / write access to pangeo-scratch (like `pangeo`) *and* the ability to start / stop pods for dask-kubernetes. We don't want to give the `pangeo` SA those permissions, so we'll make a new one:

```console
$ kubectl apply -f daskkubernetes.yaml
serviceaccount/pangeo-forge created
role.rbac.authorization.k8s.io/pangeo-forge created
rolebinding.rbac.authorization.k8s.io/pangeo-forge created
```

5. Create a GSA:

```console
$ gcloud iam service-accounts create pangeo-forge --display-name=pangeo-forge --description="GSA for pangeo-forge. Grant read / write access to gcs://pangeo-scratch."
Created service account [pangeo-forge].
```

Bind it to the Kubernetes ServiceAccount for Workload Identity:

```console
$ gcloud iam service-accounts add-iam-policy-binding \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:pangeo-181919.svc.id.goog[staging/pangeo-forge]" \
    pangeo-forge@pangeo-181919.iam.gserviceaccount.com
Updated IAM policy for serviceAccount [pangeo-forge@pangeo-181919.iam.gserviceaccount.com].
bindings:
- members:
  - serviceAccount:pangeo-181919.svc.id.goog[staging/pangeo-forge]
  role: roles/iam.workloadIdentityUser
etag: BwWuWOjv5tg=
version: 1
```

Grant read / write roles to that service account:

```console
$ gsutil iam ch serviceAccount:pangeo-forge@pangeo-181919.iam.gserviceaccount.com:roles/storage.objectAdmin gs://pangeo-forge-scratch
```
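Not part of the original setup, but as a sanity check: a minimal sketch of verifying the binding from a pod running as the `pangeo-forge` KSA (`token="cloud"` tells gcsfs to use the metadata-server credentials).

```python
# Hedged sketch: run inside a pod using the pangeo-forge ServiceAccount to
# confirm the GSA's storage.objectAdmin role is picked up via Workload Identity.
import gcsfs

fs = gcsfs.GCSFileSystem(token="cloud")  # metadata-server / Workload Identity creds
fs.touch("pangeo-forge-scratch/_permissions_check")  # write check
print(fs.ls("pangeo-forge-scratch"))                 # read check
fs.rm("pangeo-forge-scratch/_permissions_check")     # delete check / cleanup
```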

Per-flow things

  1. Register the flow

It's unclear how best to do this. For now, calling it from Python works:

>>> pipeline = TerraclimatePipeline(cache_location, target_location, variables, years)
>>> pipeline.flow.register(project="pangeo-forge")

I think that ideally this is done in the GitHub Action; the GitHub worker just needs a token to log into our Prefect Cloud account. A sketch of what that script could look like is below.
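For illustration, a hedged sketch of a standalone registration script the Action could run; the module path and all argument values are placeholders, and it assumes the runner exports the Prefect 0.x `PREFECT__CLOUD__AUTH_TOKEN` environment variable from a repository secret:

```python
# register.py -- hypothetical CI entrypoint; names and values are illustrative.
# Prefect reads PREFECT__CLOUD__AUTH_TOKEN from the environment, so the GitHub
# secret only needs to be exported before this script runs.
from recipe.pipeline import TerraclimatePipeline  # assumed module layout

cache_location = "gs://pangeo-forge-scratch/terraclimate-cache"  # placeholder
target_location = "gs://pangeo-forge-scratch/terraclimate.zarr"  # placeholder
variables = ["ppt", "tmax", "tmin"]  # placeholder subset of variables
years = range(1958, 2020)            # placeholder year range

pipeline = TerraclimatePipeline(cache_location, target_location, variables, years)
pipeline.flow.register(project="pangeo-forge")
```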

  2. Execute the flow

This probably just happens on pushes to master, since this is the computationally expensive part?

$ prefect run flow --name=terraclimate --project=pangeo-forge

This is also done in the GitHub Action worker.

At this point, a bunch of stuff happens on the k8s cluster:

  1. The agent sees the flow run and starts a prefect-job
  2. The prefect-job does something, then starts a prefect-dask-job
  3. The prefect-dask-job runs the actual tasks

Random thoughts

  1. The Flow object needs to have the storage and environment (/executor) baked into it when it's registered. How best to do that? Ideally recipe authors don't need to worry about it (a sketch follows this list).
  2. On environments, do we have a standard pangeo-forge environment? Recipes can specify extra packages to install on top.
  3. We have to specify worker_pod.yaml and job.yaml just to set serviceAccount and serviceAccountName. Would be nice to avoid that (open issue at prefect)
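On item 1, a hedged sketch using the Prefect 0.x pieces this thread already mentions (GitHub storage to match the agent's `github-flow-storage` label, and `DaskKubernetesEnvironment` for the prefect-dask-job); the repo slug, file paths, and worker counts are assumptions, not the confirmed feedstock setup:

```python
# Hedged sketch of baking storage + environment into the flow at registration.
from prefect.environments import DaskKubernetesEnvironment
from prefect.environments.storage import GitHub

pipeline.flow.storage = GitHub(
    repo="pangeo-forge/terraclimate-feedstock",  # assumed repo slug
    path="recipe/pipeline.py",                   # assumed flow location
)
pipeline.flow.environment = DaskKubernetesEnvironment(
    min_workers=1,
    max_workers=10,                             # illustrative
    scheduler_spec_file="recipe/job.yaml",      # sets serviceAccountName
    worker_spec_file="recipe/worker_pod.yaml",  # sets serviceAccount
)
pipeline.flow.register(project="pangeo-forge")
```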
jhamman commented 4 years ago

@TomAugspurger - amazing! How should we go about pulling in your proof of concept here? As it stands, I think this repo is a good candidate location for testing out a full CI workflow. I can spend some cycles on the GitHub Actions side tomorrow and Friday if you want to push your modifications (storage/environment/register) to master here.

TomAugspurger commented 4 years ago

I think hooking up to the CI is the next step.

  1. Get a login token from cloud.prefect.io (you should have an invite, or log in through the bot's GitHub)
  2. Add that to this repository's / org's secrets
  3. Add actions to log in, register, and run the flow.

After that it'd be good to think through how we structure things like the environment and storage.

jhamman commented 4 years ago

@TomAugspurger - I think I'm just missing the storage and environment setups you had in your flow. Once I have those, I think we'll be all set.

TomAugspurger commented 4 years ago

Whoops, https://github.com/TomAugspurger/terraclimate-feedstock/commit/78b127eee021404dc0831b679710a17c3379bf99 has all that (and probably some other stuff).

TomAugspurger commented 4 years ago

I'm registering the flow with `python recipe/pipeline.py`, but that would probably belong in an external file. You just need to get the `Flow` object to the top level of the module, I think (see the sketch below).
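A minimal sketch of what the bottom of recipe/pipeline.py could look like (same argument names as in the registration snippet above; their values stay whatever the recipe defines):

```python
# Hedged sketch: expose the Flow at module top level so tooling that imports
# the module can find it, and register only when the file is run directly.
pipeline = TerraclimatePipeline(cache_location, target_location, variables, years)
flow = pipeline.flow  # module-level Flow object

if __name__ == "__main__":
    flow.register(project="pangeo-forge")
```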

jhamman commented 4 years ago

🎉 We're up and rolling on GitHub Actions: https://github.com/pangeo-forge/terraclimate-feedstock/runs/1067909406?check_suite_focus=true

However, the job failed on the prefect side:

Task Run State Message:

    Unexpected error: OSError('Forbidden: https://www.googleapis.com/storage/v1/b/pangeo-scratch/o/terraclimate-cache%2F8052597365796761289\nPrimary: /namespaces/pangeo-181919.svc.id.goog with additional claims does not have storage.objects.get access to the Google Cloud Storage object.')

Everything is in master here now. A few notes about what I did:

@TomAugspurger - I know you were working on the service account permissions yesterday. Any hints on what to do next?

TomAugspurger commented 4 years ago

Cool... I think I was surprised about not having to add any gcs-specific permissions to the pangeo-forge Google Service Account. Can you try poking around there?

Though that doesn't explain why it worked yesterday. Maybe it didn't actually work?

TomAugspurger commented 4 years ago

For reference, pangeo has these roles:

[screenshot: IAM roles on the pangeo service account]

pangeo-forge doesn't. Maybe add those roles?

jhamman commented 4 years ago

@TomAugspurger - I'm at a loss, I think. I now have things set up as:

[screenshot: updated IAM roles on the pangeo-forge service account]

Yet, I'm still getting permission errors. How do we debug this?

TomAugspurger commented 4 years ago

Looks like `fs = gcsfs.GCSFileSystem(token="cloud")` was needed.


jhamman commented 4 years ago

Hmmm, how do we specify this with fsspec?

TomAugspurger commented 4 years ago

I'm not sure, but possibly setting `GCSFS_DEFAULT_PROJECT` in https://github.com/dask/gcsfs/blob/e142fca992556479930083363e87b8f9509f6175/gcsfs/core.py#L80? I'm really not sure how this works on the hubs; I would think it's the same.

That's assuming we aren't able to use `storage_options`. If we can, something like the sketch below might work.
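For illustration, a hedged sketch of passing the gcsfs token through fsspec-style storage options (the target URL is a placeholder; whether the pipeline actually forwards these kwargs is the open question):

```python
# Hedged sketch: standard fsspec usage; token="cloud" tells gcsfs to use the
# GCE / Workload Identity metadata credentials, matching the fix above.
import fsspec

mapper = fsspec.get_mapper(
    "gs://pangeo-forge-scratch/terraclimate.zarr",  # placeholder target
    token="cloud",
)
```

An xarray `to_zarr(mapper)` call would then write through those credentials.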

jhamman commented 4 years ago

After a few small fixes to the pipeline, we're there!

[screenshot of the successful flow run]

rabernat commented 4 years ago

Amazing!

The pangeo-scratch bucket is globally r/w for all gcs Pangeo users. We should set up a new bucket and link it to the service account for pangeo-forge.

jhamman commented 4 years ago

@TomAugspurger - is the prefect agent still alive? I tried testing the full dataset pipeline (all years, all vars) and the GitHub Action is getting a timeout.

jhamman commented 4 years ago

> The pangeo-scratch bucket is globally r/w for all gcs Pangeo users. We should set up a new bucket and link it to the service account for pangeo-forge.

I've added a pangeo-forge-scratch bucket to GCS and given the pangeo-forge service account read/write access.

TomAugspurger commented 4 years ago

It is alive.

These are the logs from the kubernetes pod started by prefect.

/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/cli/dask_worker.py:277: UserWarning: The --bokeh/--no-bokeh flag has been renamed to --dashboard/--no-dashboard. 
  "The --bokeh/--no-bokeh flag has been renamed to --dashboard/--no-dashboard. "
Task exception was never retrieved
future: <Task finished coro=<connect.<locals>._() done, defined at /srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/core.py:288> exception=CommClosedError()>
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/core.py", line 297, in _
    handshake = await asyncio.wait_for(comm.read(), 1)
  File "/srv/conda/envs/notebook/lib/python3.7/asyncio/tasks.py", line 435, in wait_for
    await waiter
concurrent.futures._base.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/core.py", line 304, in _
    raise CommClosedError() from e
distributed.comm.core.CommClosedError
(the CommClosedError traceback above repeats several more times in the log)
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/core.py", line 295, in _
    await asyncio.wait_for(self.start(), timeout=timeout)
  File "/srv/conda/envs/notebook/lib/python3.7/asyncio/tasks.py", line 449, in wait_for
    raise futures.TimeoutError()
concurrent.futures._base.TimeoutError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/srv/conda/envs/notebook/bin/dask-worker", line 11, in <module>
    sys.exit(go())
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/cli/dask_worker.py", line 446, in go
    main()
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/cli/dask_worker.py", line 432, in main
    loop.run_sync(run)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/ioloop.py", line 532, in run_sync
    return future_cell[0].result()
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/cli/dask_worker.py", line 426, in run
    await asyncio.gather(*nannies)
  File "/srv/conda/envs/notebook/lib/python3.7/asyncio/tasks.py", line 630, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/core.py", line 301, in _
    type(self).__name__, timeout
concurrent.futures._base.TimeoutError: Nanny failed to start in 60 seconds

I don't see those logs anywhere in the prefect UI, perhaps because they're from before the job really gets started: the scheduler or workers failed to start.

TomAugspurger commented 4 years ago

I'll try to figure this out before the call today.

TomAugspurger commented 4 years ago

Fixed the permissions with `gsutil iam ch serviceAccount:pangeo-forge@pangeo-181919.iam.gserviceaccount.com:roles/storage.objectAdmin gs://pangeo-forge-scratch`.

TomAugspurger commented 3 years ago

FYI, I modified the GCP settings yesterday. There was a "Bucket retention policy" set on pangeo-forge-scratch which prevented objects from being deleted. We instead want a lifecycle policy on the scratch bucket that cleans things up.

Presumably, we'll also want a non-scratch bucket.

$ gsutil lifecycle set lifecycle.json gs://pangeo-forge-scratch
Setting lifecycle configuration on gs://pangeo-forge-scratch/...

$ cat lifecycle.json
{
  "lifecycle": {
    "rule": [
      {
        "action": {
          "type": "Delete"
        },
        "condition": {
          "age": 7,
          "isLive": true
        }
      }
    ]
  }
}