Closed Sara4994 closed 7 months ago
@Sara4994: This pull request references RHOAIENG-805 which is a valid jira issue.
Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.16.0" version, but no target version was set.
@Sara4994 In the PR description here (and maybe even in the commit message), I'd suggest linking to both jira issues. Additionally, in the "Description" section, you could put information on what the issues were and how this change fixes them. In the "How has this been tested" section, you could put details on how you were reproducing the issues (with the implication that those steps no longer cause the issues).
It would be good to have 2 commits, each one having the corresponding JIRA as the commit message.
+1
These changes need to be done for the aiedge-e2e pipeline also, since that's the current main one now.
@grdryn I think that it shouldn't be needed for the aiedge-e2e as that one uses volumeClaimTemplate, so each pipelineRun will have its own persistent volume, so there won't be any collision.
My understanding is that that's just in the example `PipelineRun`, rather than an aspect of the `Pipeline` itself. Users will create their own `PipelineRun` files (or trigger `Pipeline`s from the UI), where they may specify either `PersistentVolumeClaim` or `VolumeClaimTemplate` to satisfy the workspace requirements for their case.
An advantage of `VolumeClaimTemplate` is that it avoids any state collisions like this. An advantage of using an existing `PersistentVolumeClaim` is that in a more restricted edge environment, there may only be a set number of pre-provisioned PersistentVolumes, rather than the almost limitless on-demand provisioning that would typically be seen in a cluster hosted in the cloud. So depending on where it's running (among other reasons), a user may choose one or the other.
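For illustration, the two workspace-binding styles being compared might look like this in a `PipelineRun` (a sketch only; the workspace name `buildah-cache` comes from this discussion, while the claim name and storage size are hypothetical):

```yaml
# Option A: bind an existing, pre-provisioned PVC. State persists across
# runs, so two concurrent PipelineRuns share (and can collide on) the
# same directories.
workspaces:
  - name: buildah-cache
    persistentVolumeClaim:
      claimName: buildah-cache-pvc   # hypothetical pre-created PVC
---
# Option B: a volumeClaimTemplate. Tekton creates a fresh PVC for each
# PipelineRun, so runs can't collide, but the cluster needs a storage
# provisioner that can create PVs on demand.
workspaces:
  - name: buildah-cache
    volumeClaimTemplate:
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi
```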
You are right, I originally thought we would, for example, support just the `VolumeClaimTemplate`, but the example you gave makes sense. Then it is definitely worth splitting the directories.
Also, is it enough to split just by the model name? Maybe it is just safe to expect that there won't be 2 pipelines running for the same model concurrently?
@Sara4994 FYI there are 3 commits now, ideally you would have 2, each covering a separate jira.
@MarianMacik yep the commits are resolved now... thank you :)
/retest-required
> Is it enough to split just by the model name? Maybe it is just safe to expect that there won't be 2 pipelines running for the same model concurrently?
That's a good question, and one that we discussed a little while pairing on this change, and I think it's a good conversation to continue here.
In my opinion, there are a couple of variables to consider:
- how the `buildah-cache` workspace requirement is fulfilled (e.g. `PersistentVolumeClaim` vs `VolumeClaimTemplate`)

As we know, the issue will only occur when the same PV is reused for two PipelineRuns at the same time, and your question highlights that this fix will only solve for the case where two runs are in flight for different model names; if there are two runs in flight for the same model, then the problem could still occur. So this kind of just reduces the possibility, but it could still occur, right?
Some possible solutions, and their limitations:
- We could generate some unique part for the directory, rather than using the model name, but I'd expect that that would lead to a different problem, where the volume fills up over time, since directories for old runs would be kept indefinitely (at least with `VolumeClaimTemplate`, old PVCs get deleted when the corresponding old PipelineRun is deleted).
- We could document that we only support `VolumeClaimTemplate` to satisfy the workspace, but that might not be ideal if the build is happening in an edge location that doesn't have a storage provisioner that can create new PVs on demand (maybe there are N PVs pre-provisioned).
- We could try to implement some sort of concurrency management, where we block a run from continuing if there's already one from the same pipeline in flight (I don't believe Tekton has this ability right now?). I'm not sure what the best approach to that would be, or whether it should be in Tekton itself or outside. Maybe there's some existing pattern we could use for this?
- We could document the limitation as part of this PR, that only one run for a particular model should be run at a time, if using a shared PV?
WDYT @MarianMacik? Also @LaVLaS you might have better thoughts on this :)
There are a lot of good conversations about making the model directory unique to support multiple PipelineRuns in the same PVC. Given that we can't enforce concurrency beyond the k8s scheduler managing multiple Pods requesting access to the same PVC, and we can't prevent a user from running parallel pipelines on the same model, we could settle on a "globally unique [tekton] identifier" like the PipelineRun name (`$(context.pipelineRun.name)`) OR PipelineRun uid (`$(context.pipelineRun.uid)`) as the directory name, which could be supplied to all tasks as a parameter.
Would this simplify the initial problem to

```
mkdir -p model_dir/<model_name>/$(context.pipelineRun.name)
```
And all affected standalone tasks could accept this value as a "Context Dir" parameter
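As a sketch of what that suggestion could look like in the Pipeline definition, the per-run directory could be passed down to each affected task as a parameter (`$(context.pipelineRun.name)` is a real Tekton variable that is unique per run; the `contextDir` parameter name and `model-name` param are hypothetical here):

```yaml
tasks:
  - name: check-model-and-containerfile-exists   # task discussed in this PR
    params:
      # Each PipelineRun writes under its own subdirectory of the shared
      # workspace, so concurrent runs for the same model cannot collide.
      - name: contextDir                          # hypothetical parameter
        value: "model_dir/$(params.model-name)/$(context.pipelineRun.name)"
```

The trade-off noted earlier still applies: with a reused `PersistentVolumeClaim`, these per-run directories accumulate until someone cleans them up.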
We had a group discussion about this today, and our conclusion is that we'll continue with the current approach in this pull request (`model-dir_<model name>`) for now (we may do something better in future), but document some things around this:
@grdryn sounds reasonable. Another option would be to incorporate the uid/name of the pipeline and document that cleanup is up to the user if a reusable PersistentVolumeClaim is used. The idea is that if the management of the PVCs is "manual" by the user, then so should be the cleanup, at least for now.
But I am fine with the current approach as well.
@MarianMacik Yes, this is along the lines of what @LaVLaS was originally suggesting. We could do that, but in that case we should provide instructions on cleanup in our docs, and the user has to do something outside the standard flow. IMHO, the easiest thing for us to do right now is to make it clear that our recommendation is VolumeClaimTemplate if you want concurrency, otherwise just make sure you don't run two of the same model concurrently.
@grdryn included this instruction as part of the README
/retest-required
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: grdryn, Sara4994
The full list of commands accepted by this bot can be found here.
The pull request process is described here
…nerfile-exists task
Description
This PR fixes the failure of Pipelines that happens when both PipelineRuns (Tensorflow and Bike rentals) run in parallel. [JIRA] Also, it fixes the `check-model-and-containerfile-exists` task to check the correct `modelRelativePath` and to fail if it is incorrect. [JIRA]
How Has This Been Tested?
Mount all the required resources along with both the pipelineRuns.
To reproduce the issue:
Paste the commands all in one go and then check the PipelineRuns in the console. Alternatively, do

```
oc create -f tekton/build-container-image-pipeline/aws-env-real.yaml
oc apply -k tekton/build-container-image-pipeline/
oc create -f tekton/build-container-image-pipeline/build-container-image-pipelinerun-tensorflow-housing.yaml
```

watch the first TaskRun (kserve-download-model) turn green in the console, then paste

```
oc create -f tekton/build-container-image-pipeline/build-container-image-pipelinerun-bike-rentals.yaml
```

and observe both the PipelineRuns in the console.
Merge criteria: