pangeo-forge / terraclimate-feedstock-archive

A pangeo-smithy repository for the terraclimate dataset.
Apache License 2.0

Connecting to Prefect-Cloud and CI #3

Open jhamman opened 4 years ago

jhamman commented 4 years ago

I've thrown this feedstock together to help us explore how to connect a CI system (e.g. GitHub Actions) to a Dask-Gateway Cluster / Prefect Cloud. The repository is laid out as follows:

@TomAugspurger - hopefully this has the needed pieces to connect to what you were working on with Prefect Cloud.

cc @rabernat

TomAugspurger commented 4 years ago

Thanks. I need to run in a few minutes, but my WIP is at https://github.com/TomAugspurger/prefect-demo.

I'm close-ish to getting something working, but the handling of DaskExecutor <-> Dask Cluster on our k8s cluster is a bit awkward. May be best to sync up sometime in the next couple days to talk things through.

jhamman commented 4 years ago

Great! I'm around the next few days so just ping me when you're ready to chat.

Also, for reference: on a Pangeo hub, adding the following bit to run.py lets you run the flow "manually":

    from dask_gateway import Gateway
    from prefect.engine.executors import DaskExecutor

    # Spin up a Dask-Gateway cluster and point Prefect's DaskExecutor at it.
    # `pipeline` is the recipe's pipeline object (e.g. TerraclimatePipeline).
    gateway = Gateway()
    options = gateway.cluster_options()
    options.worker_cores = 8
    options.worker_memory = 51
    cluster = gateway.new_cluster(cluster_options=options)
    cluster.scale(4)
    executor = DaskExecutor(
        address=cluster.scheduler_address,
        client_kwargs={"security": cluster.security},
    )
    pipeline.flow.run(executor=executor)
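One hedged addition to the snippet above: tearing the cluster down once the run finishes, so manual runs don't leave workers up on the hub.

    # Hedged addition, assuming the `cluster` object from the snippet above.
    cluster.shutdown()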
TomAugspurger commented 4 years ago

@jhamman are you free around 12:30 Central / 10:30 Pacific?

I think Ryan was right about Dask-Gateway creating unnecessary difficulties. I'm able to get the DaskKubernetesExecutor working OK. I'm porting this workflow now and will make a PR.

jhamman commented 4 years ago

Yes I am!

TomAugspurger commented 4 years ago

I was able to complete this flow using my prefect cloud account and our staging k8s cluster.

One-time setup, plus some steps that probably need repeating per k8s namespace. Note that some names may differ (previously I used the pangeo-forge bucket; now it's pangeo-forge-scratch).

1. Create an account with prefect-cloud (will do this with the pangeo bot).

2. Create an auth token:

```console
$ prefect auth create-token -n pangeo-forge-token --scope=RUNNER
```

3. Install the prefect agent:

```console
$ prefect agent install kubernetes -t <token> -l github-flow-storage -l gcp --rbac --namespace=staging --image-pull-policy=Always | kubectl apply -n staging -f -
deployment.apps/prefect-agent configured
role.rbac.authorization.k8s.io/prefect-agent-rbac created
rolebinding.rbac.authorization.k8s.io/prefect-agent-rbac created
```

Note: I tried adding a `--label=...` but that messed up prefect; the flow just didn't run.

Note: We'll need to repeat this for prod.

4. Add a Kubernetes ServiceAccount. This needs read / write access to pangeo-scratch (like `pangeo`) *and* the ability to start / stop pods for dask-kubernetes. We don't want to give the `pangeo` SA those permissions, so we'll make a new one:

```console
$ kubectl apply -f daskkubernetes.yaml
serviceaccount/pangeo-forge created
role.rbac.authorization.k8s.io/pangeo-forge created
rolebinding.rbac.authorization.k8s.io/pangeo-forge created
```

5. Create a GSA:

```console
$ gcloud iam service-accounts create pangeo-forge --display-name=pangeo-forge --description="GSA for pangeo-forge. Grant read / write access to gcs://pangeo-scratch."
Created service account [pangeo-forge].
```

Bind it to the Kubernetes ServiceAccount for Workload Identity:

```console
$ gcloud iam service-accounts add-iam-policy-binding \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:pangeo-181919.svc.id.goog[staging/pangeo-forge]" \
    pangeo-forge@pangeo-181919.iam.gserviceaccount.com
Updated IAM policy for serviceAccount [pangeo-forge@pangeo-181919.iam.gserviceaccount.com].
bindings:
- members:
  - serviceAccount:pangeo-181919.svc.id.goog[staging/pangeo-forge]
  role: roles/iam.workloadIdentityUser
etag: BwWuWOjv5tg=
version: 1
```

Grant read / write roles to that service account:

```console
$ gsutil iam ch serviceAccount:pangeo-forge@pangeo-181919.iam.gserviceaccount.com:roles/storage.objectAdmin gs://pangeo-forge-scratch
```
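Not part of the original setup, but as a sanity check: a minimal sketch of verifying the binding from a pod running as the `pangeo-forge` KSA (`token="cloud"` tells gcsfs to use the metadata-server credentials).

```python
# Hedged sketch: run inside a pod using the pangeo-forge ServiceAccount to
# confirm the GSA's storage.objectAdmin role is picked up via Workload Identity.
import gcsfs

fs = gcsfs.GCSFileSystem(token="cloud")  # metadata-server / Workload Identity creds
fs.touch("pangeo-forge-scratch/_permissions_check")  # write check
print(fs.ls("pangeo-forge-scratch"))                 # read check
fs.rm("pangeo-forge-scratch/_permissions_check")     # delete check / cleanup
```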

Per-flow things

  1. Register the flow

It's unclear how best to do this. For now, calling it from Python works:

>>> pipeline = TerraclimatePipeline(cache_location, target_location, variables, years)
>>> pipeline.flow.register(project="pangeo-forge")

I think that ideally this is done in the GitHub Action; the GitHub worker just needs a token to log into our Prefect Cloud account. A sketch of what that script could look like is below.
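For illustration, a hedged sketch of a standalone registration script the Action could run; the module path and all argument values are placeholders, and it assumes the runner exports the Prefect 0.x `PREFECT__CLOUD__AUTH_TOKEN` environment variable from a repository secret:

```python
# register.py -- hypothetical CI entrypoint; names and values are illustrative.
# Prefect reads PREFECT__CLOUD__AUTH_TOKEN from the environment, so the GitHub
# secret only needs to be exported before this script runs.
from recipe.pipeline import TerraclimatePipeline  # assumed module layout

cache_location = "gs://pangeo-forge-scratch/terraclimate-cache"  # placeholder
target_location = "gs://pangeo-forge-scratch/terraclimate.zarr"  # placeholder
variables = ["ppt", "tmax", "tmin"]  # placeholder subset of variables
years = range(1958, 2020)            # placeholder year range

pipeline = TerraclimatePipeline(cache_location, target_location, variables, years)
pipeline.flow.register(project="pangeo-forge")
```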

  2. Execute the flow

This probably just happens on pushes to master, since this is the computationally expensive part?

$ prefect run flow --name=terraclimate --project=pangeo-forge

This is also done in the GitHub Action worker.

At this point, a bunch of stuff happens on the k8s cluster:

  1. The agent sees the flow run and starts a prefect-job
  2. The prefect-job does something, then starts a prefect-dask-job
  3. The prefect-dask-job runs the actual tasks

Random thoughts

  1. The Flow object needs to have the storage and environment (/executor) baked into it when it's registered. How best to do that? Ideally recipe authors don't need to worry about it (a sketch follows this list).
  2. On environments, do we have a standard pangeo-forge environment? Recipes can specify extra packages to install on top.
  3. We have to specify worker_pod.yaml and job.yaml just to set serviceAccount and serviceAccountName. Would be nice to avoid that (open issue at prefect)
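On item 1, a hedged sketch using the Prefect 0.x pieces this thread already mentions (GitHub storage to match the agent's `github-flow-storage` label, and `DaskKubernetesEnvironment` for the prefect-dask-job); the repo slug, file paths, and worker counts are assumptions, not the confirmed feedstock setup:

```python
# Hedged sketch of baking storage + environment into the flow at registration.
from prefect.environments import DaskKubernetesEnvironment
from prefect.environments.storage import GitHub

pipeline.flow.storage = GitHub(
    repo="pangeo-forge/terraclimate-feedstock",  # assumed repo slug
    path="recipe/pipeline.py",                   # assumed flow location
)
pipeline.flow.environment = DaskKubernetesEnvironment(
    min_workers=1,
    max_workers=10,                             # illustrative
    scheduler_spec_file="recipe/job.yaml",      # sets serviceAccountName
    worker_spec_file="recipe/worker_pod.yaml",  # sets serviceAccount
)
pipeline.flow.register(project="pangeo-forge")
```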
jhamman commented 4 years ago

@TomAugspurger - amazing! How should we go about pulling in your proof of concept here? As it stands, I think this repo is a good candidate location for testing out a full CI workflow. I can spend some cycles on the GitHub Actions side tomorrow and Friday if you want to push your modifications (storage/environment/register) to master here.

TomAugspurger commented 4 years ago

I think hooking up to the CI is the next step.

  1. Get a login token from cloud.prefect.io (you should have an invite, or log in through the bot's GitHub)
  2. Add that to this repository's / org's secrets
  3. Add actions to log in, register, and run the flow.

After that it'd be good to think through how we structure things like the environment and storage.

jhamman commented 4 years ago

@TomAugspurger - I think I'm just missing the storage and environment setups you had in your flow. Once I have those, I think we'll be all set.

TomAugspurger commented 4 years ago

Whoops, https://github.com/TomAugspurger/terraclimate-feedstock/commit/78b127eee021404dc0831b679710a17c3379bf99 has all that (and probably some other stuff).

TomAugspurger commented 4 years ago

I'm registering the flow with `python recipe/pipeline.py`, but that would probably belong in an external file. You just need to get the `Flow` object to the top level of the module, I think (see the sketch below).
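A minimal sketch of what the bottom of recipe/pipeline.py could look like (same argument names as in the registration snippet above; their values stay whatever the recipe defines):

```python
# Hedged sketch: expose the Flow at module top level so tooling that imports
# the module can find it, and register only when the file is run directly.
pipeline = TerraclimatePipeline(cache_location, target_location, variables, years)
flow = pipeline.flow  # module-level Flow object

if __name__ == "__main__":
    flow.register(project="pangeo-forge")
```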

jhamman commented 4 years ago

🎉 We're up and rolling on GitHub Actions: https://github.com/pangeo-forge/terraclimate-feedstock/runs/1067909406?check_suite_focus=true

However, the job failed on the prefect side:

Task Run State Message:

    Unexpected error: OSError('Forbidden: https://www.googleapis.com/storage/v1/b/pangeo-scratch/o/terraclimate-cache%2F8052597365796761289\nPrimary: /namespaces/pangeo-181919.svc.id.goog with additional claims does not have storage.objects.get access to the Google Cloud Storage object.')

Everything is in master here now. A few notes about what I did:

@TomAugspurger - I know you were working on the service account permissions yesterday. Any hints on what to do next?

TomAugspurger commented 4 years ago

Cool... I think I was surprised about not having to add any gcs-specific permissions to the pangeo-forge Google Service Account. Can you try poking around there?

Though that doesn't explain why it worked yesterday. Maybe it didn't actually work?

TomAugspurger commented 4 years ago

For reference, pangeo has these roles:

[screenshot: IAM roles on the pangeo service account]

pangeo-forge doesn't. Maybe add those roles?

jhamman commented 4 years ago

@TomAugspurger - I'm at a loss, I think. I now have things set up as:

[screenshot: updated IAM roles on the pangeo-forge service account]

Yet, I'm still getting permission errors. How do we debug this?

TomAugspurger commented 4 years ago

Looks like `fs = gcsfs.GCSFileSystem(token="cloud")` was needed.


jhamman commented 4 years ago

Hmmm, how do we specify this with fsspec?

TomAugspurger commented 4 years ago

I'm not sure, but possibly setting `GCSFS_DEFAULT_PROJECT` in https://github.com/dask/gcsfs/blob/e142fca992556479930083363e87b8f9509f6175/gcsfs/core.py#L80? I'm really not sure how this works on the hubs; I would think it's the same.

That's assuming we aren't able to use `storage_options`. If we can, something like the sketch below might work.
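For illustration, a hedged sketch of passing the gcsfs token through fsspec-style storage options (the target URL is a placeholder; whether the pipeline actually forwards these kwargs is the open question):

```python
# Hedged sketch: standard fsspec usage; token="cloud" tells gcsfs to use the
# GCE / Workload Identity metadata credentials, matching the fix above.
import fsspec

mapper = fsspec.get_mapper(
    "gs://pangeo-forge-scratch/terraclimate.zarr",  # placeholder target
    token="cloud",
)
```

An xarray `to_zarr(mapper)` call would then write through those credentials.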

jhamman commented 4 years ago

After a few small fixes to the pipeline, we're there!

[screenshot of the successful flow run]

rabernat commented 4 years ago

Amazing!

The pangeo-scratch bucket is globally r/w for all gcs Pangeo users. We should set up a new bucket and link it to the service account for pangeo-forge.

jhamman commented 4 years ago

@TomAugspurger - is the prefect agent still alive? I tried testing the full dataset pipeline (all years, all vars) and the GitHub Action is getting a timeout.

jhamman commented 4 years ago

> The pangeo-scratch bucket is globally r/w for all gcs Pangeo users. We should set up a new bucket and link it to the service account for pangeo-forge.

I've added a pangeo-forge-scratch bucket to GCS and given the pangeo-forge service account read/write access.

TomAugspurger commented 4 years ago

It is alive.

These are the logs from the kubernetes pod started by prefect.

/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/cli/dask_worker.py:277: UserWarning: The --bokeh/--no-bokeh flag has been renamed to --dashboard/--no-dashboard. 
  "The --bokeh/--no-bokeh flag has been renamed to --dashboard/--no-dashboard. "
Task exception was never retrieved
future: <Task finished coro=<connect.<locals>._() done, defined at /srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/core.py:288> exception=CommClosedError()>
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/core.py", line 297, in _
    handshake = await asyncio.wait_for(comm.read(), 1)
  File "/srv/conda/envs/notebook/lib/python3.7/asyncio/tasks.py", line 435, in wait_for
    await waiter
concurrent.futures._base.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/core.py", line 304, in _
    raise CommClosedError() from e
distributed.comm.core.CommClosedError
(the CommClosedError traceback above repeats several more times in the log)
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/core.py", line 295, in _
    await asyncio.wait_for(self.start(), timeout=timeout)
  File "/srv/conda/envs/notebook/lib/python3.7/asyncio/tasks.py", line 449, in wait_for
    raise futures.TimeoutError()
concurrent.futures._base.TimeoutError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/srv/conda/envs/notebook/bin/dask-worker", line 11, in <module>
    sys.exit(go())
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/cli/dask_worker.py", line 446, in go
    main()
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/cli/dask_worker.py", line 432, in main
    loop.run_sync(run)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/ioloop.py", line 532, in run_sync
    return future_cell[0].result()
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/cli/dask_worker.py", line 426, in run
    await asyncio.gather(*nannies)
  File "/srv/conda/envs/notebook/lib/python3.7/asyncio/tasks.py", line 630, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/core.py", line 301, in _
    type(self).__name__, timeout
concurrent.futures._base.TimeoutError: Nanny failed to start in 60 seconds

I don't see those logs anywhere in the prefect UI, perhaps because they're from before the job really gets started: the scheduler or workers failed to start.

TomAugspurger commented 4 years ago

I'll try to figure this out before the call today.

TomAugspurger commented 4 years ago

Fixed the permissions with `gsutil iam ch serviceAccount:pangeo-forge@pangeo-181919.iam.gserviceaccount.com:roles/storage.objectAdmin gs://pangeo-forge-scratch`.

TomAugspurger commented 3 years ago

FYI, I modified the GCP settings yesterday. There was a "Bucket retention policy" set on pangeo-forge-scratch which prevented objects from being deleted. We instead want a lifecycle policy on the scratch bucket that cleans things up.

Presumably, we'll also want a non-scratch bucket.

$ gsutil lifecycle set lifecycle.json gs://pangeo-forge-scratch
Setting lifecycle configuration on gs://pangeo-forge-scratch/...

$ cat lifecycle.json
{
  "lifecycle": {
    "rule": [
      {
        "action": {
          "type": "Delete"
        },
        "condition": {
          "age": 7,
          "isLive": true
        }
      }
    ]
  }
}