Integration testing requirements

rabernat commented 3 years ago

On today's Pangeo ML WG call, @nbren12 re-raised the perennial issue of the need for integration testing of the many layers of our stack, particularly when it comes to cloud i/o.

This has been discussed in many different related issues:

The feeling from heavy users is that there are still lingering intermittent bugs related to cloud i/o that render workflows unstable. One way to the bottom of this is to get more rigorous about system-wide integration testing.

Some questions we need to resolve to move forward with this idea, and my initial responses, are:

What are the workflows we want to test?
- Write random data to cloud
- Read back data
- Copy data
- Rechunk data
How big does the test need to be in order to be realistic?
- My sense: > 100 GB
Which combinations of libraries do we want to include?
- dask (w/o xarray)
- xarray (via dask)
- gcsfs
- s3fs
- adlfs
- rechunker
How do we orchestrate an integration test like this? Here is one idea
- Build dedicated docker containers with the desired environments using GitHub workflows
- Connect to dask clusters from GitHub workflows (maybe could use Coiled to make this easy; otherwise will need dedicated dask_kubernetes or dask_gateway set up for the purpose)
- Launch integration tests on remote dask clusters from github workflows

If we can agree on a clearly scoped plan, I think we can support the cloud computing costs. We can also considering asking Anaconda (@martindurant) to spend some time on this via our Pangeo Forge contract with them.

cc @jhamman @cisaacstern @tomnicholas @dcherian

rabernat commented 3 years ago

Noah, it would be great to get your thoughts on what sort of testing would help surface the issues you've been having.

nbren12 commented 3 years ago

Thanks so much for setting up this repo! I think it will be a useful collaboration point. ccing @spencerkclark @oliverwm1 @frodre for attention. We are particularly attuned to the I/O related issues.

What are the workflows we want to test?

Your suggestion are good. Also, my experience is that coarse-graining operations tend to cause many problems. Unlike a time-mean the output will be too large for memory, but the data reduction is enough that the input and output chunk-sizes should differ.

Generally, I think "writes" are less robust than "reads", but the latter is more frequently used by this community.

How big does the test need to be in order to be realistic?

I think so, but I need a better sense in the error rate per GB/HTTP request.

Which combinations of libraries do we want to include?

This often shows up in apache beam jobs for us, since we don't use dask clusters. We still use dask + xarray as a "lazy" array format, just not for parallelism much.

How do we orchestrate an integration test like this?

Agreed that we can build/push docker images from github CI.

For actually running the jobs, a cronjob in an existing k8s cluster might simplify deployment since there will be no need to authenticate access via dask gateway, firewall punching, kubernetes keys, etc. Also, E2E testing can take a long time, so CI is more costly especially if the compute is not actually happening on the CI server. This is the philosophy of CD tools like https://argoproj.github.io/argo-cd/.

If we can agree on a clearly scoped plan, I think we can support the cloud computing costs.

This would be great! I think I should be to justify investing our time on this effort to my bosses, especially since we rely on this stack so much and already contribute to it.

dcherian commented 3 years ago

@scottyhq will be interested. If our NASA proposal gets funded, we were planning to put some time towards something like this.

nbren12 commented 3 years ago

Common problems arise from using too small chunks. It is easy to exceed google's limit of 1000 API requests/s (for reads), 100 API requests/s (writes), and 1 request/s (write to same object).

rabernat commented 3 years ago

It is easy to exceed google's limit of 1000 API requests/s (for reads), 100 API requests/s (writes), and 1 request/s (write to same object).

What would you expect the client libraries to do to handle these limits, particularly in a distributed context? Retries? Exponential backoff?

nbren12 commented 3 years ago

Exponential backoff is Google's recommend solution. Gcsfs has this, but not for all operations (see https://github.com/dask/gcsfs/pull/387) and the test suite doesn't have great coverage for it (e.g. here's a bug fix I recently contributed https://github.com/dask/gcsfs/pull/380).

rabernat commented 3 years ago

So the idea here is that the sort of integration testing we are proposing would help surface specific issues (like the ones you already raised), which would help guide upstream development?

nbren12 commented 3 years ago

Exactly. Currently I think gcsfs relies on user provided bug reports to find these problems, but usually these problems only occur at scale, and it is hard for users to construct minimal examples that are divorced from their idiosyncratic infrastructure.

nbren12 commented 3 years ago

The usual workflow would be

bug shows up here
open issue on gcsfs/...
reproduce the bug in a local unit test of the buggy library (refactoring the code to be testable if required)
fix the bug.

oliverwm1 commented 3 years ago

Thanks for getting the ball rolling on this @rabernat! I think the tests you propose would reveal the main issues that have challenged us and make sure these issues don't creep back into the codebase once they are fixed. At least on the gcsfs side, I get the impression that a lot of people have been challenged by intermittent I/O issues: dask/gcsfs#327, dask/gcsfs#323, dask/gcsfs#315, dask/gcsfs#316, dask/gcsfs#290

Beyond just reliability and catching bugs, would be cool if these tests were also leveraged to monitor performance.

rabernat commented 3 years ago

If our NASA proposal gets funded, we were planning to put some time towards something like this.

@scottyhq and @dcherian - when do you expect to know about the NASA proposal? My understanding is that the scope is a lot broader than just cloud i/o. Would it be a problem if we bootstrapped something on a shorter timescale and then handed it off if / when your proposal is funded?

nbren12 commented 3 years ago

Another simple way to get some coverage is to install a fixed set of libraries and run their test suites. This will often find errors related to API changes. This is the usual process of preparing a "distribution" like nix or Debian. FYI, I have been working on packaging some of the pangeo stack with the nix package manager. Here's a package I added for gcsfs package. Therefore, there is some level of integration testing being done by their CI servers against specific versions of dask/xarray etc: https://hydra.nixos.org/build/142494940.

dcherian commented 3 years ago

when do you expect to know about the NASA proposal?

:man_shrugging:

My understanding is that the scope is a lot broader than just cloud i/o.

Yeah I think so.

Would it be a problem if we bootstrapped something on a shorter timescale and then handed it off if / when your proposal is funded?

Of course! I don't see an issue at all.

scottyhq commented 3 years ago

when do you expect to know about the NASA proposal?

Our proposal has a start date of 6/30/2021 to give you a ballpark timeline. If funded I'd definitely be interested in collaborating on this. I don't know how many issues are specific to gcsfs, but NASA is still pretty focused on AWS, so we'd want to use s3 storage as well.

rabernat commented 3 years ago

So here is one version of the steps that need to happen to get this working.

On each cloud provider we wish to test

[ ] Someone sets up a dask_gateway / Coiled / whatever dask cluster we can connect to using credentials stored as secrets in this repo
[ ] Someone sets up a bucket with r/w access via crendentials stored as secrets in this repo

In this repo

[ ] We configure a workflow to build a docker image used by the workers (perhaps could bootstrap or skip this step by using pangeo-docker-images)
[ ] We add python scripts that do the desired computations
[ ] We configure a workflow to run the scripts using a matrix of different clusters / configuration options

nbren12 commented 3 years ago

Someone sets up a dask_gateway / Coiled / whatever dask cluster we can connect to using credentials stored as secrets in this repo

Just wondering if it would be simpler to put a k8s cron job like this in each cluster:

apiVersion: batch/v1
kind: CronJob
metadata:
  generateName: pangeo-integration-test
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: accountWithNeededPrivs
          containers:
          - name: hello
            image: pangeo-integration-test:latest
            imagePullPolicy: IfNotPresent
            # all logic happens here
            # and publishes a integration report to a static webpage
            command: ["/bin/bash", "run_tests.sh"]

          restartPolicy: OnFailure

Then, this repo wouldn't need to authenticate against each cluster (or even know they exist) beyond providing a link to the integration test reports. Also, the CI wouldn't need to wait around while the jobs finish. Basically, this decouples the integration tests from the update cycle of this repo (i.e. continuous delivery). We could run some basic checks like read/write from a bucket, run test suite, etc from this repo's CI before publishing the docker image pangeo-integration-test:latest.

yuvipanda commented 3 years ago

How about we run a github self-hosted runner on a Kubernetes cluster that also has dask gateway configured? That way, we get all the goodness of GitHub actions without having to deal with nasty auth stuff.

martindurant commented 3 years ago

A couple of notes:

for gcsfs, recent improvements should have helped things in the version currently waiting on conda-forge. Of course, full-scale testing would still be useful. This would require a real bucket on GCS and payment
for s3fs, there is already a proposal to add minio as an additional test runner to moto, which would certainly find some edge cases, particularly for non-AWS S3s. Testing against real S3 would, again, require money.
to my mind, designing xarray+dask+storage tests amounts to running some specialised recipes that fill much of the parameter space in (number of chunks, size of chunks, number of files, number of threads/processes

rabernat commented 3 years ago

Thanks for the input Martin. The plan is to indeed spend real money to test the whole stack at scale on real cloud providers.

pangeo-data / pangeo-integration-tests

Integration testing requirements #1

On each cloud provider we wish to test

In this repo