open-services-group / byon

Bring Your Own Notebook (BYON) project repository.
GNU General Public License v3.0

Periodic polling of container images #17

Closed tumido closed 1 year ago

tumido commented 2 years ago

Investigate how to manage imagestream updates. This issue is looking for answers/analysis on the following:

Based on my initial screening, there seems to be no way to gate an image stream tag import once it is set to scheduled. It is also unclear when the image cache is cleaned or accessed (is this per node?) and how an image is cached. Investigate what happens when pods are started: is the image pulled from the remote repository, or mirrored to the internal container registry on the first start and pulled from there? What happens when a pod is scheduled to a different node (again, pulling from the internal or external registry?) Experiments needed. This affects when we get the updated image and whether we know about it, or whether a user is served a newer version without the admin/validation pipeline noticing first.

Besides the scheduled import described above, there is no common reactive facility to initiate/trigger notifications from remote registries (webhooks etc.); this is remote-repository specific and must be configured on the remote repository side.

tumido commented 2 years ago

/assign @tumido

tumido commented 2 years ago

/label wg/byon-build-pipelines

sesheta commented 2 years ago

@tumido: The label(s) /label wg/byon-build-pipelines cannot be applied. These labels are supported: community/discussion, community/group-programming, community/maintenance, community/question, deployment_name/ocp4-stage, deployment_name/ocp4-test, deployment_name/moc-prod, hacktoberfest, hacktoberfest-accepted, kind/cleanup, kind/demo, kind/deprecation, kind/documentation, kind/question, sig/advisor, sig/build, sig/cyborgs, sig/devops, sig/documentation, sig/indicators, sig/investigator, sig/knowledge-graph, sig/slo, sig/solvers, thoth/group-programming, thoth/human-intervention-required, thoth/potential-observation, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, triage/accepted, triage/duplicate, triage/needs-information, triage/not-reproducible, triage/unresolved, lifecycle/submission-accepted, lifecycle/submission-rejected

In response to [this](https://github.com/open-services-group/byon/issues/17#issuecomment-1026073684): >/label wg/byon-build-pipelines Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
tumido commented 2 years ago

/assign @mimotej

mimotej commented 2 years ago

So far I have conducted some experiments and I have the following information:

  1. importPolicy.scheduled on an ImageStream causes OCP to periodically (every 15 minutes by default) check the image for changes (i.e. whether its digest has changed). This is useful for automatic updates, but it does not generate any event, so changes might be harder to detect.
  2. referencePolicy works the same way in OCP 4.x as in OCP 3.x - not sure why it is not mentioned in the docs, but there doesn't seem to be any change. One bit that is mentioned in the docs: when using referencePolicy.type Local we have to set importPolicy.insecure, otherwise the image will not be pulled; with referencePolicy.type Source this is not necessary. One interesting bit: it seems that even with referencePolicy.type Source the ImageStream stores the image in the internal registry, but only instructs the client to pull from the external registry when possible (it also points to the manifest SHA address, not the tag). In my test I created an ImageStream with referencePolicy.type Source, and even after I deleted the image from the external image registry it was still able to pull the now-nonexistent image. (note *)
  3. Another important part is the pull policy, which is defined at pod creation. imagePullPolicy can take three values: IfNotPresent, Always, and Never. Never isn't useful for us, because it only starts the container when the image is pre-cached on the node. IfNotPresent checks whether an image with the same name and tag exists (it does not consider differences in digests); if it is available it creates the container, otherwise it pulls the image. Always also checks whether the image is pre-cached, but additionally compares digests; if they differ (for example the tag didn't change, but there were security patches), it pulls the new image.
  4. Triggering pipelines is another problem we need to resolve. I have found examples from Andrew Block on how to trigger pipelines when an image changes, but as mentioned earlier, an ImageStream does not generate an event upon image change, so this option is not usable in our case. Combining policies 1-3 seems to result in predictable behavior that follows what is written in the docs and what I have written above. There are still some scenarios that need further testing; once I try the other possible scenarios I will extend this comment with those notes as well.

*EDIT: It seems that OCP has a bug which causes the ImageStream to not be resolved properly; more details here: https://bugzilla.redhat.com/show_bug.cgi?id=2000216 . In my testing I used a "workaround" where I pointed the image to the internal registry, which means that at least my observation about referencePolicy might not be 100% correct. I will try looking into this problem more (maybe use annotation-based triggers as was suggested in the thread linked above...).
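To make points 1-3 concrete, the policies under discussion can be sketched on a single ImageStream plus a consuming pod. This is only an illustrative fragment; the names, namespace, and remote registry URL are hypothetical, not the project's actual manifests:

```yaml
# Hypothetical ImageStream exercising the import/reference policies above
apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  name: custom-notebook            # hypothetical name
  namespace: byon-demo             # hypothetical namespace
spec:
  tags:
    - name: latest
      from:
        kind: DockerImage
        name: quay.io/example/custom-notebook:latest  # hypothetical remote image
      importPolicy:
        scheduled: true            # point 1: re-import checked every ~15 minutes
      referencePolicy:
        type: Local                # point 2: clients resolve via the internal registry
---
# Pod-side pull behavior (point 3)
apiVersion: v1
kind: Pod
metadata:
  name: notebook-test
  namespace: byon-demo
spec:
  containers:
    - name: notebook
      image: image-registry.openshift-image-registry.svc:5000/byon-demo/custom-notebook:latest
      imagePullPolicy: Always      # re-check the digest on every start, not just name:tag
```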

tumido commented 2 years ago

Great research! Thank you!

> ...but it does not generate any event, so it might be harder to detect.

I think it won't be a problem. It generates an event/action via additional Kubernetes resources, like the creation of an Image resource and an ImageStreamTag resource update. That should do the trick.

> ...in my test I have created an ImageStream with referencePolicy.type Source and even though I deleted the image from the external image registry it was still able to pull an image which didn't exist.

I think this is the behavior we want to achieve, correct? Cache the image in local registry and reuse it from there, without ever pulling from a remote for pod spawns. Good.

> ...IfNotPresent checks if an image exists with same name and tag (it doesn't consider differences in digests) and if it is available it creates a container, otherwise it pulls an image.

What does it mean if the image is available? In the internal registry or in node cache? When it is not available where does it pull from (internal registry or remote location)? Is this affected by referencePolicy.type?

> ...how to trigger pipelines when image changes, but as I mentioned earlier ImageStream does not generate event upon image change

Yes, something like Andrew's demo will solve it for us. Again, we don't need Events: Knative can trigger on API actions/changes, and that's all we need. I was eyeballing Knative for this even before knowing about Andrew's demo. I think we're heading in the right direction here.
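As a sketch of that Knative route: an ApiServerSource can watch ImageStreamTag changes through the API server and forward them as CloudEvents to a sink, which could be a Tekton EventListener service. The resource name, service account, and sink below are assumptions for illustration:

```yaml
# Hypothetical Knative source watching ImageStreamTag updates
apiVersion: sources.knative.dev/v1
kind: ApiServerSource
metadata:
  name: imagestreamtag-watcher          # hypothetical name
spec:
  serviceAccountName: events-sa         # needs RBAC to watch imagestreamtags
  mode: Resource                        # embed the full resource in the CloudEvent
  resources:
    - apiVersion: image.openshift.io/v1
      kind: ImageStreamTag
  sink:
    ref:
      apiVersion: v1
      kind: Service
      name: pipeline-trigger-listener   # hypothetical sink, e.g. a Tekton EventListener
```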

tumido commented 2 years ago

In my test I've used an image stream with referencePolicy.type: Local and importPolicy.scheduled: true:

  1. Spawn a pod in JupyterHub - it pulls the image from the internal registry
  2. Shut down the pod
  3. Spawn a new pod in JupyterHub - it pulls the image from the node cache
  4. Shut down the pod
  5. Update the image tag in the remote container image repository - after a while this triggers a scheduled import
  6. Spawn a new pod in JupyterHub - it pulls the image from the internal registry
  7. Shut down the pod
  8. Delete the tag in the remote container image repository - after a while the ImageStream reports an import error
  9. Spawn a new pod in JupyterHub - it pulls the image from the node cache
  10. Shut down the pod
  11. Mark the node of the previous pods unschedulable
  12. Spawn a pod in JupyterHub - it pulls the image from the internal registry

I think this solution works, wdyt @mimotej ?

mimotej commented 2 years ago

Yes, I think this configuration will give us the most consistent results out of all the options :+1: .

tumido commented 2 years ago

Good, then the proposed design will be:

```mermaid
flowchart TD
  A[Image updated in remote container repository]
  B[ImageStreamTag updated in internal container repository]
  C[Knative eventing triggers a PipelineRun]
  D[Mark phase 'Updating/Validating' and hide it from the UI]
  E[Run validation checks]
  F{Did checks pass?}
  G[Mark phase 'Success' and show in UI]
  H[Mark phase 'Failed']
  A --> B --> C --> D --> F
  C --> E --> F
  F -- Yes --> G
  F -- No --> H
```
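On the pipeline side, step C could be realized with Tekton Triggers: an EventListener receives the event and instantiates a PipelineRun from a TriggerTemplate. The resource names and the validation pipeline below are hypothetical sketches, not the project's actual manifests:

```yaml
# Hypothetical trigger wiring for the validation pipeline
apiVersion: triggers.tekton.dev/v1beta1
kind: TriggerTemplate
metadata:
  name: validate-notebook-image         # hypothetical
spec:
  resourcetemplates:
    - apiVersion: tekton.dev/v1beta1
      kind: PipelineRun
      metadata:
        generateName: validate-image-
      spec:
        pipelineRef:
          name: image-validation        # hypothetical validation pipeline
---
apiVersion: triggers.tekton.dev/v1beta1
kind: EventListener
metadata:
  name: imagestream-update-listener     # hypothetical
spec:
  serviceAccountName: pipeline
  triggers:
    - name: on-imagestreamtag-update
      template:
        ref: validate-notebook-image
```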

This approach has a few drawbacks.

  1. The image stream disappears from the user-facing UI without explanation (we probably need to show it as "disabled" to reduce user confusion).
  2. There is a small window between the validation pipeline marking the imagestream as unavailable (due to running validation) and the actual imagestream update, potentially allowing users to click the image in the Spawner UI while the image is already updated but not yet validated.
  3. An automatic update and validation may result in the image failing validation. ImageStreams don't support any rollback options for that case, so we would need to work around it (possibly by adding a backup tag pointing to the previously resolved image, but I'm not sure yet if that's possible).

Local resolution has 1 significant positive effect:

Question: is there a way to trigger the image stream import immediately?

Possible solution:

  1. Each ImageStream contains 2 tags, which are initially the same:
     • The 1st tag is the "served" one, on display in the UI.
     • The 2nd tag is hidden from the UI and has scheduled: true.
  2. When an update is received, the second tag is updated and validated by a pipeline.
  3. During this whole time the 1st tag is still being served.
  4. If validation on the 2nd tag passes, the 1st tag is updated to point to the new image. If it fails, the 1st tag stays the same and is kept/marked as outdated.
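The two-tag scheme could be expressed on a single ImageStream roughly as follows. The stream name, tag names, and remote image are hypothetical; the retagging step on validation success (e.g. via `oc tag`) is an assumption about how the pipeline would repoint the served tag:

```yaml
# Hypothetical two-tag ImageStream for validated updates
apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  name: custom-notebook                # hypothetical
spec:
  tags:
    # 1st "served" tag, shown in the UI. Not scheduled: it is only
    # repointed (e.g. with `oc tag`) after validation passes.
    - name: latest
      from:
        kind: DockerImage
        name: quay.io/example/custom-notebook:v1      # hypothetical pinned image
      referencePolicy:
        type: Local
    # 2nd "backup" tag, hidden from the UI; follows the remote
    # automatically and feeds the validation pipeline.
    - name: backup
      from:
        kind: DockerImage
        name: quay.io/example/custom-notebook:latest  # hypothetical remote
      importPolicy:
        scheduled: true
      referencePolicy:
        type: Local
```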

```mermaid
flowchart TD
  A[Image updated in remote container repository]
  B[ImageStreamTag updated in internal container repository on the 'backup' tag]
  C[Knative eventing triggers a PipelineRun]
  D[Mark phase 'Updating']
  E[Run validation checks]
  F{Did checks pass?}
  G[Mark phase 'Success']
  H[Mark phase 'Outdated - update failed']
  I[Replace pointer in visible tag with the updated backup]
  A --> B --> C --> D --> F
  C --> E --> F
  F -- Yes --> G --> I
  F -- No --> H
```

tumido commented 2 years ago

Decision: this is postponed/deprioritized due to the necessity of introducing a Knative dependency. Revisit post-MVP.

sesheta commented 2 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

sesheta commented 2 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

codificat commented 2 years ago

going to /lifecycle frozen this to prevent auto-closure for the time being

goern commented 2 years ago

/kind feature /priority backlog

schwesig commented 2 years ago

mentioned to @codificat

codificat commented 1 year ago

As BYON evolved into Custom Notebook Images, which currently use an operator to manage the imagestreams and pipeline runs, I believe this is not relevant anymore.

/close

Having said that, it's worth exploring what meteor-operator could do to keep custom images current. But that will be handled in its own repo and issue.

sesheta commented 1 year ago

@codificat: Closing this issue.

In response to [this](https://github.com/open-services-group/byon/issues/17#issuecomment-1410817847): >As BYON evolved into Custom Notebook Images, which currently use an operator to manage the imagestreams and pipeline runs, I believe this is not relevant anymore. > >/close > >Having said that, it's worth exploring what [meteor-operator](https://github.com/thoth-station/meteor-operator) could do to keep custom images current. But that will be handled in its own repo and issue Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.