
Onboard vLLM upstream CI project to MOC OpenShift Clusters #779

Open · dystewart opened this issue 1 month ago

dystewart commented 1 month ago

Motivation

The objective of this project is to install and test, in the MOC, the upstream CI pipeline that Anyscale has already deployed to Google (using free credits) and elsewhere. We would like to become a continuing provider of the vLLM code builds from Berkeley, as a first step toward collaborating with them.

Completion Criteria

The definition of done is that this CI build kicks off automatically in production once a night after the project has been integrated and tested, and that any errors or issues are reported both to a local engineer (MOC or Red Hat) and to the regular CI pipeline owners, using their existing methods, with revisions if needed to comply with MOC production rules. The project should run in production, but we can start building in a test cluster if necessary.
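
One possible mechanism for the nightly kickoff (a sketch only: it assumes the build ends up wrapped in a BuildConfig named vllm-nightly and a service account permitted to start builds; a Buildkite scheduled build would be an alternative):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: vllm-nightly-build          # hypothetical name
spec:
  schedule: "0 6 * * *"             # once a night (06:00 UTC)
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: kick-build
              image: quay.io/openshift/origin-cli:latest   # image providing the oc client
              command: ["oc", "start-build", "vllm-nightly", "--follow"]
```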

Description

Heidi has created this project in ColdFront and added Taj as manager. We didn't request any resources yet, because we should be able to accommodate this project in one of the existing test clusters (nerc-ocp-test or rhoai-test) before moving into nerc-ocp-prod.

(Note that we think Red Hat already uses vLLM in RHOAI, but that would be an older release. This project is for building whatever the newest release is, nightly.) Here's the project repo: https://github.com/vllm-project/vllm

They said that they can do this with any GPUs, so let's try the A100s first, since they seem to be relatively unused in NERC production. We can try V100s etc. later if we succeed with the A100s. Please note the resource usage during builds once you are running in production, so that we can estimate the ongoing usage charges for the project.
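
When the GPU runs start, pinning the CI pods to the A100 nodes could look like the fragment below (a sketch: the nvidia.com/gpu.product label value is an assumption about NERC's GPU feature discovery labels, and the image path is a placeholder):

```yaml
# Fragment of a CI pod spec pinned to A100 nodes
spec:
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB   # assumed label on NERC A100 nodes
  containers:
    - name: vllm-test
      image: image-registry.openshift-image-registry.svc:5000/<namespace>/<vllm-image>:latest
      resources:
        limits:
          nvidia.com/gpu: 1        # one GPU per test pod; this usage feeds the cost estimate
```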

The RHOAI team is paying for resource usage on the MOC for this project (Jen is setting this up, but we can start by collecting the project usage with Heidi as the PI).

dystewart commented 1 month ago

TO DO: Make the bullet points in the list above into dedicated smaller issues

dystewart commented 1 month ago

cc: @hpdempsey

tssala23 commented 1 month ago

The Buildkite operator is working on the test cluster in the buildkite namespace. Since none of the resources are cluster-scoped, I have not made a PR to add the manifests to the config repo, but instead just applied them to that namespace on the test cluster. The deployed manifests are in this repo: https://github.com/tssala23/buildkite. The manifests were obtained by rendering the Helm template, following https://github.com/dtrifiro/buildkite-on-openshift.

A few quirks while deploying:

joachimweyl commented 1 month ago

@tssala23 or @dystewart please provide an estimate for this issue.

schwesig commented 4 weeks ago

/cc @schwesig: working in the buildkite namespace, on test now, then on prod.

maxamillion commented 2 weeks ago

@schwesig @dystewart there hasn't been an update in a couple weeks. What's the current status? Any blockers? Thanks!

dystewart commented 1 week ago

Hey all, sorry for the delay in updates. Here is our progress so far:

  1. We have installed a Buildkite agent
  2. Had some discussion of how and where we would build and store container images (we went with BuildConfigs to build the images and ImageStreams to store them; see the sketch after this list)
  3. We built the Dockerfile.cpu image successfully, though I did have to change the base image reference from docker.io to quay.io to sidestep pull-rate errors
  4. I created a fork of vllm to make sure our BuildConfigs were triggered automatically on push events, which worked
  5. We also attempted to build the main Dockerfile but ran into a gcc error (digging into this in the morning)
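
Roughly, the shape of the BuildConfig we are using (a sketch under assumptions: the names, the quay.io mirror reference, and the webhook secret are illustrative, not our exact manifest):

```yaml
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: vllm-cpu                 # hypothetical name
  namespace: buildkite
spec:
  source:
    git:
      uri: https://github.com/vllm-project/vllm   # or the fork used to test push triggers
  strategy:
    dockerStrategy:
      dockerfilePath: Dockerfile.cpu
      from:                      # overrides the Dockerfile's (last) FROM, which is how
        kind: DockerImage        # the docker.io base can be swapped for a quay.io mirror
        name: quay.io/<mirrored-base-image>        # placeholder, not the real reference
  output:
    to:
      kind: ImageStreamTag       # the built image lands in an ImageStream in this namespace
      name: vllm-cpu:latest
  triggers:
    - type: GitHub               # fires a build on push webhooks from the repo/fork
      github:
        secret: <webhook-secret>
```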

Next steps:

  1. Figure out what is causing the main Dockerfile build to fail
  2. Create the test-pipeline.yaml as a sanity check on our config so far (a minimal sketch follows)
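
For the sanity check, something minimal should do; a sketch (placeholder step only, not vLLM's actual .buildkite/test-pipeline.yaml):

```yaml
# Hypothetical minimal test-pipeline.yaml: proves the agent picks up work
# and can reach the cluster, nothing more.
steps:
  - label: "smoke test"
    command: |
      echo "buildkite agent is alive"
      oc whoami    # confirms the step container can talk to the OpenShift API
```
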
tssala23 commented 6 days ago

Update: The main pipeline is failing due to this check. It also fails here, as BuildConfigs are not able to interpret the EOF heredoc syntax. By removing the EOF and replacing it with just the plain command, and skipping the check, we are able to get the main image built.

We have been able to call the BuildConfig from the Buildkite pipeline and wait for the build to complete before moving on. This involved getting the oc command into the containers created by Buildkite and adding some extra permissions (sketched below).
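
Roughly what that looks like (a sketch with assumed names; oc start-build goes through the buildconfigs/instantiate subresource, hence its appearance in the Role):

```yaml
# Buildkite pipeline step that kicks off the OpenShift build and blocks until it finishes
steps:
  - label: "build vllm image"
    command: oc start-build vllm-cpu -n buildkite --follow --wait
---
# Extra RBAC for the service account the Buildkite agent pods run as (names assumed)
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: buildkite-builder
  namespace: buildkite
rules:
  - apiGroups: ["build.openshift.io"]
    resources: ["buildconfigs", "buildconfigs/instantiate", "builds", "builds/log"]
    verbs: ["get", "list", "watch", "create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: buildkite-builder
  namespace: buildkite
subjects:
  - kind: ServiceAccount
    name: buildkite-agent        # assumed agent service account
    namespace: buildkite
roleRef:
  kind: Role
  name: buildkite-builder
  apiGroup: rbac.authorization.k8s.io
```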

Next steps:

  1. Figure out handling concurrent builds (the runPolicy sketch below may be enough)
  2. Replicate the whole pipeline to perform tests on the created image (likely to be broken down into smaller tasks)
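
On concurrent builds: BuildConfigs already have a spec.runPolicy knob that covers the common cases, which may be all we need (the field and its values are real; which one to pick is the open question):

```yaml
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: vllm-cpu               # same hypothetical BuildConfig as above
spec:
  # Serial (the default): queue builds and run them one at a time, in order.
  # SerialLatestOnly: run one at a time, cancelling queued builds in favor of the newest.
  # Parallel: allow builds from this BuildConfig to run concurrently.
  runPolicy: SerialLatestOnly
```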