opendatahub-io / ai-edge

ODH integration with AI at the Edge use cases
Apache License 2.0
9 stars · 18 forks

RHOAIENG-805: Fix for pipelineRuns failure and check-model-and-contai… #199

Closed Sara4994 closed 7 months ago

Sara4994 commented 7 months ago

…nerfile-exists task

Description

This PR fixes the failure of the Pipelines that happens when both (Tensorflow and Bike rentals) run in parallel. [[JIRA]](https://issues.redhat.com/browse/RHOAIENG-805) It also fixes the task `check-model-and-containerfile-exists` to check the correct `modelRelativePath` and to fail if it is incorrect. [[JIRA]](https://issues.redhat.com/browse/RHOAIENG-862)

How Has This Been Tested?

Create all the required resources along with both of the PipelineRuns.

To Reproduce the issue:

Paste all of the commands in one go, then check the PipelineRuns in the console.

Alternatively, run:

```shell
oc create -f tekton/build-container-image-pipeline/aws-env-real.yaml
oc apply -k tekton/build-container-image-pipeline/
oc create -f tekton/build-container-image-pipeline/build-container-image-pipelinerun-tensorflow-housing.yaml
```

watch the first TaskRun (`kserve-download-model`) turn green in the console, then paste:

```shell
oc create -f tekton/build-container-image-pipeline/build-container-image-pipelinerun-bike-rentals.yaml
```

and observe both PipelineRuns in the console.

Merge criteria:

- [ ] The commits are squashed in a cohesive manner and have meaningful messages.
- [ ] Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
- [ ] The developer has manually tested the changes and verified that the changes work

openshift-ci-robot commented 7 months ago

@Sara4994: This pull request references RHOAIENG-805 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.16.0" version, but no target version was set.

In response to [this](https://github.com/opendatahub-io/ai-edge/pull/199). Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=opendatahub-io%2Fai-edge). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
grdryn commented 7 months ago

@Sara4994 In the PR description here (and maybe even in the commit message), I'd suggest linking to both jira issues. Additionally, in the "Description" section, you could put information on what the issues were and how this change fixes them. In the "How has this been tested" section, you could put details on how you were reproducing the issues (with the implication that those steps no longer cause the issues).

grdryn commented 7 months ago

> It would be good to have 2 commits, each one having the corresponding JIRA as the commit message.

+1

These changes need to be done for the aiedge-e2e pipeline also, since that's the current main one now.

@grdryn I think it shouldn't be needed for the aiedge-e2e, as that one uses a volumeClaimTemplate, so each PipelineRun will have its own persistent volume and there won't be any collision.

My understanding is that that's just in the example PipelineRun, rather than an aspect of the Pipeline itself. Users will create their own PipelineRun files (or trigger Pipelines from the UI), where they may specify either PersistentVolumeClaim or VolumeClaimTemplate to satisfy the workspace requirements for their case.

An advantage of VolumeClaimTemplate is that it avoids any state collisions like this. An advantage of using an existing PersistentVolumeClaim is that in a more restricted edge environment, there may only be a set number of pre-provisioned PersistentVolumes, rather than an almost limitless on-demand provisioning that would typically be seen in a cluster hosted in the cloud. So depending on where it's running (among other reasons), a user may choose one or the other.
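To make the two options being contrasted here concrete, a PipelineRun might bind the workspace either way; the following snippets are a hypothetical sketch (the workspace name `build-workspace`, the claim name, and the storage size are illustrative, not taken from this repo):

```yaml
# Option A: bind an existing, pre-provisioned PVC.
# Shared state across runs, so concurrent runs can collide.
workspaces:
  - name: build-workspace
    persistentVolumeClaim:
      claimName: pre-provisioned-pvc
---
# Option B: let Tekton provision a fresh PVC per run via a template.
# No state collisions, but requires a storage provisioner that can
# create PersistentVolumes on demand.
workspaces:
  - name: build-workspace
    volumeClaimTemplate:
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 1Gi
```

Option A suits a constrained edge site with a fixed pool of PersistentVolumes; Option B suits clusters with dynamic provisioning.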

MarianMacik commented 7 months ago

> My understanding is that that's just in the example PipelineRun, rather than an aspect of the Pipeline itself. Users will create their own PipelineRun files (or trigger Pipelines from the UI), where they may specify either PersistentVolumeClaim or VolumeClaimTemplate to satisfy the workspace requirements for their case.
>
> An advantage of VolumeClaimTemplate is that it avoids any state collisions like this. An advantage of using an existing PersistentVolumeClaim is that in a more restricted edge environment, there may only be a set number of pre-provisioned PersistentVolumes, rather than an almost limitless on-demand provisioning that would typically be seen in a cluster hosted in the cloud. So depending on where it's running (among other reasons), a user may choose one or the other.

You are right, I originally thought we would support just the VolumeClaimTemplate, for example, but the case you describe makes sense. Then it is definitely worth splitting the directories.

Also, is it enough to split just by the model name? Maybe it is just safe to expect that there won't be 2 pipelines running for the same model concurrently?

@Sara4994 FYI there are 3 commits now, ideally you would have 2, each covering a separate jira.

Sara4994 commented 7 months ago

> @Sara4994 FYI there are 3 commits now, ideally you would have 2, each covering a separate jira.

@MarianMacik yep the commits are resolved now... thank you :)

Sara4994 commented 7 months ago

/retest-required

grdryn commented 7 months ago

> Is it enough to split just by the model name? Maybe it is just safe to expect that there won't be 2 pipelines running for the same model concurrently?

That's a good question, and one that we discussed a little while pairing on this change, and I think it's a good conversation to continue here.

In my opinion, there are a couple of variables to consider:

As we know, the issue only occurs when the same PV is reused for two PipelineRuns at the same time, and your question highlights that this fix only solves the case where the two in-flight runs are for different model names; if two runs are in flight for the same model, the problem could still occur. So this just reduces the likelihood, but it could still happen, right?

Some possible solutions, and their limitations:

  1. We could generate some unique part for the directory, rather than using the model name, but I'd expect that that would lead to a different problem, where the volume is filled up over time, since there would be directories for old runs kept indefinitely (at least with VolumeClaimTemplate, old PVCs would get deleted when the corresponding old PipelineRun is deleted).

  2. We could document that we only support VolumeClaimTemplate to satisfy the workspace, but that might not be ideal if the build is happening in an edge location that doesn't have a storage provisioner that can create new PVs on demand (maybe there are N PVs pre-provisioned).

  3. We could try to implement some sort of concurrency management, where we block a run from continuing if there's already one from the same pipeline in flight (I don't believe Tekton has this ability right now?). I'm not sure what the best approach to that would be, or whether it should be in Tekton itself or outside. Maybe there's some existing pattern we could use for this?

  4. We could document the limitation as part of this PR, that only one run for a particular model should be run at a time, if using a shared PV?
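As a rough illustration of option 3 above, a caller-side guard could refuse to start a run when one for the same model is already in flight. This is only a sketch: `list_running_runs` is a hypothetical stand-in for a real cluster query (e.g. an `oc get pipelinerun` with a label selector), stubbed here with a fixed value.

```shell
# Stand-in for a real query of in-flight PipelineRuns; a real version
# would ask the cluster, e.g. via `oc get pipelinerun ...`.
list_running_runs() {
  printf '%s\n' "tensorflow-housing-run-abc12"
}

# Succeed only if no in-flight run's name starts with "<model>-".
can_start_run() {
  model="$1"
  if list_running_runs | grep -q "^${model}-"; then
    echo "refusing: a run for ${model} is already in flight" >&2
    return 1
  fi
  return 0
}

can_start_run "bike-rentals" && echo "ok to start bike-rentals"
can_start_run "tensorflow-housing" || echo "blocked: tensorflow-housing"
```

This moves the concurrency check outside Tekton entirely, which is fragile (it is racy and every caller must remember to do it), so it would be a stopgap rather than a real fix.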

WDYT @MarianMacik? Also @LaVLaS you might have better thoughts on this :)

LaVLaS commented 7 months ago

There are a lot of good conversations about making the model directory unique to support multiple PipelineRuns in the same PVC. Given that we can't enforce concurrency beyond the k8s scheduler managing multiple Pods requesting access to the same PVC, and we can't prevent a user from running parallel pipelines on the same model, we could settle on a "globally unique [tekton] identifier" like the PipelineRun name (`$(context.pipelineRun.name)`) OR PipelineRun uid (`$(context.pipelineRun.uid)`) as the directory name, which could be supplied to all tasks as a parameter.

Would this simplify the initial problem to

```shell
mkdir -p model_dir/<model_name>/$(context.pipelineRun.name)
```

And all affected standalone tasks could accept this value as a "Context Dir" parameter
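A rough sketch of what that could look like in a standalone Task follows; the Task name, parameter name (`context-dir`), workspace name, and step image are all hypothetical, and `$(context.pipelineRun.name)` would be resolved in the Pipeline and passed down as the parameter value:

```yaml
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: example-task
spec:
  params:
    - name: context-dir
      type: string
      description: >
        Unique per-run subdirectory; the Pipeline would pass e.g.
        $(context.pipelineRun.name) as this value.
  workspaces:
    - name: model-dir
  steps:
    - name: prepare-dir
      image: registry.access.redhat.com/ubi9/ubi-minimal
      script: |
        # Create an isolated directory for this run inside the shared workspace.
        mkdir -p "$(workspaces.model-dir.path)/$(params.context-dir)"
```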

grdryn commented 7 months ago

We had a group discussion about this today, and our conclusion is that we'll continue with the current approach in this pull request (`model-dir_<model name>`) for now (we may do something better in future), but document some things around this.

MarianMacik commented 7 months ago

@grdryn sounds reasonable. Another option would be to incorporate the uid/name of the pipeline and document that cleanup is up to the user if a reusable PersistentVolumeClaim is used. The idea is that if management of the PVCs is "manual" by the user, so should be the cleanup, at least for now.

But I am fine with the current approach as well.

grdryn commented 7 months ago

@MarianMacik Yes, this is along the lines of what @LaVLaS was originally suggesting. We could do that, but in that case we should provide instructions on cleanup in our docs, and the user has to do something outside the standard flow. IMHO, the easiest thing for us to do right now is to make it clear that our recommendation is VolumeClaimTemplate if you want concurrency, otherwise just make sure you don't run two of the same model concurrently.

Sara4994 commented 7 months ago

> @MarianMacik Yes, this is along the lines of what @LaVLaS was originally suggesting. We could do that, but in that case we should provide instructions on cleanup in our docs, and the user has to do something outside the standard flow. IMHO, the easiest thing for us to do right now is to make it clear that our recommendation is VolumeClaimTemplate if you want concurrency, otherwise just make sure you don't run two of the same model concurrently.

@grdryn I included this instruction as part of the README.

Sara4994 commented 7 months ago

/retest-required

openshift-ci[bot] commented 7 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: grdryn, Sara4994

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/opendatahub-io/ai-edge/blob/main/OWNERS)~~ [grdryn]

Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.