rapidsai / build-planning

Tracking for RAPIDS-wide build tasks

Consider consolidating conda solves in CI tests #22

charlesbluca opened 6 months ago

charlesbluca commented 6 months ago

Currently (using cuML as an example here), the conda test environment initialization for most CI jobs looks something like the following. First, the test environment is created:

# Generate a conda environment file for this test matrix entry
rapids-dependency-file-generator \
  --output conda \
  --file_key test_python \
  --matrix "cuda=${RAPIDS_CUDA_VERSION%.*};arch=$(arch);py=${RAPIDS_PY_VERSION}" | tee env.yaml

# First conda solve: create the test environment
rapids-mamba-retry env create --force -f env.yaml -n test

And then downloading and installing build artifacts from previous jobs on top of this environment:

# Fetch the conda packages built by earlier jobs in this workflow
CPP_CHANNEL=$(rapids-download-conda-from-s3 cpp)
PYTHON_CHANNEL=$(rapids-download-conda-from-s3 python)

...

# Second conda solve: install the CI artifacts on top of the environment
rapids-mamba-retry install \
  --channel "${CPP_CHANNEL}" \
  --channel "${PYTHON_CHANNEL}" \
  libcuml cuml

In addition to forcing us to eat the cost of a second conda environment solve, this second install can in many cases make drastic changes to the environment, which can be blocking. For example, consider this cuML run, which fails because conda is unable to solve a downgrade from Arrow 15.0.0 (build 5) to 14.0.1.

Our current workaround is to manually add pinnings to the initially solved testing dependencies so that the artifact installation can be solved, but maintaining those pinnings introduces a significant ongoing burden.

Would it be possible to consolidate some (or all) of these conda environment solves by instead (see the sketch after this list):

  1. downloading the conda artifacts before creating the environment
  2. updating the dependencies.yaml (or, if this isn't possible, the generated conda environment file) to include the desired packages, making sure to explicitly specify the source channels so that we pick up the build artifacts
  3. creating the environment with this patched file
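
A minimal sketch of what that consolidated flow could look like, assuming yq (or any other YAML-aware tool) is available to patch the generated file; the package names match the cuML example above:

# 1. Download the CI artifacts first, before any environment exists
export CPP_CHANNEL=$(rapids-download-conda-from-s3 cpp)
export PYTHON_CHANNEL=$(rapids-download-conda-from-s3 python)

# 2. Generate the test environment file as before
rapids-dependency-file-generator \
  --output conda \
  --file_key test_python \
  --matrix "cuda=${RAPIDS_CUDA_VERSION%.*};arch=$(arch);py=${RAPIDS_PY_VERSION}" | tee env.yaml

# 3. Patch the generated file: put the local artifact channels first and
#    add the artifact packages, so everything is part of a single solve
yq -i '.channels = [strenv(CPP_CHANNEL), strenv(PYTHON_CHANNEL)] + .channels' env.yaml
yq -i '.dependencies += ["libcuml", "cuml"]' env.yaml

# 4. One conda solve, one environment
rapids-mamba-retry env create --force -f env.yaml -n test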

In my mind, the main blocker to this working would be if rapids-download-conda-from-s3 itself requires conda packages from the testing environment in order to run.

EDIT: Updating to capture the state of other projects.

msarahan commented 6 months ago

I don't fully understand how much you need to be rebuilding, but if the answer is "not much" then maybe you could use https://conda.github.io/conda-pack/ or consider building a docker image instead of re-creating the environment each time?
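
For reference, conda-pack's documented workflow is roughly the following (the environment and file names here are illustrative):

# On the build machine: archive an existing, already-solved environment
conda pack -n test -o test_env.tar.gz

# On the test machine: unpack and activate without re-solving
mkdir -p test_env
tar -xzf test_env.tar.gz -C test_env
source test_env/bin/activate
conda-unpack   # fixes up hard-coded path prefixes inside the environment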

bdice commented 6 months ago

I am proposing a flag --prepend-channels for rapids-dependency-file-generator that will enable a one-step conda environment creation. This would let us use local channels with CI artifacts and would eliminate the second conda solve used to install CI artifacts. https://github.com/rapidsai/dependency-file-generator/pull/67
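
Assuming the flag lands as proposed in that PR, usage could look roughly like this; the semicolon-delimited channel list is an assumption modeled on the --matrix syntax:

CPP_CHANNEL=$(rapids-download-conda-from-s3 cpp)
PYTHON_CHANNEL=$(rapids-download-conda-from-s3 python)

# Generate an environment file whose channel list starts with the local
# artifact channels, then create the environment with a single solve
rapids-dependency-file-generator \
  --output conda \
  --file_key test_python \
  --matrix "cuda=${RAPIDS_CUDA_VERSION%.*};arch=$(arch);py=${RAPIDS_PY_VERSION}" \
  --prepend-channels "${CPP_CHANNEL};${PYTHON_CHANNEL}" | tee env.yaml

rapids-mamba-retry env create --force -f env.yaml -n test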

bdice commented 6 months ago

consider building a docker image instead of re-creating the environment each time?

This is a possible solution, but we'd need to maintain a separate Docker image for every repository and every test matrix configuration (e.g. a container with the cuDF Python test environment on ARM using Python 3.9 and CUDA 12.2). I don't think we have the infrastructure to manage that while also ensuring the conda environments in all of those containers are regularly updated (every night?) so that we stay aware of upstream changes that affect us. The workflows we have today would require an explosively large number of large containers, many of which would need to contain an entire CTK installed by conda.

This would also still require us to have a conda environment in that container that is compatible with the CI artifacts that we are installing. This doesn't avoid the "can't downgrade Arrow 15" problem because we must install the CI artifacts (conda packages built from the current PR) into the container.

raydouglass commented 6 months ago

The docker repo suffers from a similar problem, and it was (partly) mitigated by dynamically editing the environment file: https://github.com/rapidsai/docker/blob/branch-24.04/context/notebooks.sh#L31, so just throwing that out there.
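
The pattern there is roughly the following; this is a generic illustration, not the exact sed expression in notebooks.sh, and RAPIDS_VERSION is a hypothetical variable:

# Rewrite the pinnings in the generated environment file in place so the
# subsequent single solve already agrees with what will be installed
sed -i "s/- cuml==.*/- cuml==${RAPIDS_VERSION}.*/" env.yaml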

vyasr commented 6 months ago

Once https://github.com/rapidsai/cuml/pull/5781 is merged, do we plan to roll out the same changes to the rest of RAPIDS? IIUC that's what it would take to close this issue.

jakirkham commented 6 months ago

Seems reasonable

We may want to ensure that we are always getting cached CI packages too, which would close out this issue: https://github.com/rapidsai/build-planning/issues/14

bdice commented 6 months ago

@vyasr Yes, my plan is to convert all of RAPIDS to this single-solve approach.

@jakirkham The approach I'm taking in https://github.com/rapidsai/cuml/pull/5781 will pin the major.minor versions of the packages as shown below. That isn't a full solution for #14 because nightlies of the same major.minor could still be selected, but it should help constrain the solver and reduce the likelihood of problems. If you have further suggestions please weigh in on #14 or on that cuML PR (if the code suggestions are small) and I'll adopt your proposals in a follow-up change.

  test_libcuml:
    common:
      - output_types: conda
        packages:
          - libcuml==24.4.*
          - libcuml-tests==24.4.*
  test_cuml:
    common:
      - output_types: conda
        packages:
          - libcuml==24.4.*
          - cuml==24.4.*

jakirkham commented 6 months ago

Ok it sounds like that is mostly an issue when we are getting close to (or in the midst of) a release. Is that right?

Agree that could be an issue, but it is a much smaller one than picking up the prior release (especially well after it went out).

bdice commented 6 months ago

Yeah, I think the problematic case would be something like changing a package's dependencies in a way that makes the conda packages from CI artifacts incompatible with the rest of the testing environment, so that the solver chooses recent nightlies instead of the PR artifacts. @charlesbluca proposed that we could pin the exact channels, like ${CPP_CHANNEL}::libcuml, but we'd have to find a way to insert that channel information into the package's entry in dependencies.yaml, since the channel (a local filepath) isn't known ahead of time.
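
For context, conda/mamba already accept channel-pinned specs like this on the command line; the open question is only how to get the late-bound local channel path into dependencies.yaml:

# Channel-pinned specs restrict the solver to exactly these channels,
# so equivalent nightlies cannot be substituted for the PR artifacts
rapids-mamba-retry install \
  "${CPP_CHANNEL}::libcuml" \
  "${PYTHON_CHANNEL}::cuml"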

vyasr commented 6 months ago

An additional minor benefit is that this will speed up CI by reducing the number of solves.

bdice commented 6 months ago

An additional minor benefit is that this will speed up CI by reducing the number of solves.

This might not have a large impact on CI runtime. See https://github.com/rapidsai/cudf/pull/15201#issuecomment-1973562102. tl;dr, test startup time is dominated by network activity (downloading multiple GB of packages from RAPIDS and CTK dependencies) and the potential conda solve time savings are comparatively small.

msarahan commented 3 months ago

Taking this on with high priority because cuspatial CI is blocked, and this might help there.

bdice commented 3 months ago

@msarahan Is there a plan to continue on this work? We have implemented it for cuml and cuspatial but nothing else. It would be good to have alignment across RAPIDS to prevent future issues.

msarahan commented 2 months ago

Added task list to parent issue. We'll work on this as time allows.