dystewart opened this issue 1 month ago (status: Open)
TO DO: Make the bullet points in the list above into dedicated smaller issues
cc: @hpdempsey
The Buildkite operator is working on the test cluster in the `buildkite` namespace.
As none of the resources are cluster scoped, I have not made a PR to add the manifests to the config repo, but instead just applied them to that namespace on the test cluster. Here is a link to a repo containing the deployed manifests: https://github.com/tssala23/buildkite
The Helm template, along with this repo https://github.com/dtrifiro/buildkite-on-openshift, was used to obtain the manifests.
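For reference, the manifests could be rendered with a `helm template` invocation along these lines. This is a sketch only: the release name, chart path, and value are assumptions, not the exact command that was used.

```shell
# Hypothetical sketch: render the Buildkite chart to plain manifests so they
# can be applied directly to the buildkite namespace, without a cluster-side
# Helm release.
helm template buildkite ./buildkite-chart \
  --namespace buildkite \
  --set config.org=my-org \
  > manifests.yaml

oc apply -n buildkite -f manifests.yaml
```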
A few quirks while deploying:

- The organization had to be set with `--set config.org=` in the helm command.
- `stringData` had to be used, as values were not getting encoded properly.

@tssala23 or @dystewart please provide an estimate for this issue.
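The `stringData` workaround can be sketched as follows; the secret name and key are assumptions, but the mechanism is standard Kubernetes: values under `stringData` are plain text and the API server base64-encodes them on write, avoiding manual encoding mistakes.

```shell
# Write a Secret manifest using stringData (plain text) instead of data
# (base64-encoded). The secret name and key are hypothetical.
cat > buildkite-secret.yaml <<'YAML'
apiVersion: v1
kind: Secret
metadata:
  name: buildkite-agent-token
  namespace: buildkite
type: Opaque
stringData:
  token: placeholder-token
YAML
```

It would then be applied with `oc apply -f buildkite-secret.yaml`.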
/CC @schwesig
Now on test, then on prod, in the `buildkite` namespace.
@schwesig @dystewart there hasn't been an update in a couple weeks. What's the current status? Any blockers? Thanks!
Hey all, sorry for the delay in updates. Here is our progress so far:
Next steps:
Update: the main pipeline is failing due to this check. It also fails here, as BuildConfigs are not able to interpret the EOF heredoc. By removing the EOF heredoc, replacing it with just the command, and skipping the check, we are able to get the main image built.
We have been able to call the BuildConfig from the Buildkite pipeline and wait for the build to complete before moving on. This involved getting the `oc` command into the containers created by Buildkite and adding some extra permissions.
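The trigger-and-wait step described above can be approximated with `oc start-build`; the BuildConfig name here is an assumption:

```shell
# Hypothetical Buildkite step: start the OpenShift build and block until it
# finishes, failing the pipeline step if the build fails.
set -eu
oc start-build vllm-main -n buildkite --wait --follow
```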
Next steps:
Motivation
The objective of the project is to install and test, in the MOC, the upstream CI pipeline that Anyscale has already deployed to Google (using free credits) and other places. We would like to be a continuing provider of the vLLM code builds from Berkeley as a first step to collaborating with them.
Completion Criteria
The definition of done is that this CI build kicks off automatically in production once a night after the project has been integrated and tested. Any errors or issues should be reported both to a local engineer (MOC or Red Hat) and to the regular CI pipeline owners, using their existing methods, or revisions if needed to comply with MOC production rules. The project should run in production, but we can start building in a test cluster if necessary.
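One way the nightly kickoff could be wired up is a Kubernetes CronJob that starts the build once a night. This is a sketch only; the names, image, schedule, and service account are assumptions:

```shell
# Generate a CronJob manifest that runs oc start-build nightly.
# All names and the schedule are hypothetical.
cat > vllm-nightly-build.yaml <<'YAML'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: vllm-nightly-build
  namespace: buildkite
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: builder
          restartPolicy: Never
          containers:
          - name: start-build
            image: quay.io/openshift/origin-cli:latest
            command: ["oc", "start-build", "vllm-main", "--wait"]
YAML
```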
Description
Heidi has created this project in ColdFront and added Taj as manager. We didn't request any resources yet, because we should be able to accommodate this project in one of the existing test clusters (nerc-ocp-test or rhoai-test) before moving into nerc-ocp-prod.
(Note that Red Hat already uses vLLM in RHOAI we think, but that would be using an older release. This project is for building whatever is the newest release nightly.) Here's the project repo: https://github.com/vllm-project/vllm
They said that they can do this with any GPUs, so let's start with the A100s, since they seem to be relatively unused in production NERC. We can try V100s etc. later if we succeed with the A100s. Please note the resource usage during builds once you are running in production, so that we will be able to estimate the ongoing usage charges for the project.
The RHOAI team is paying for resource usage on the MOC for this project (Jen is setting this up, but we can start by collecting the project usage with Heidi as the PI).