vultr / slik

Slurm in Kubernetes
https://vultr.com
Apache License 2.0

[BUG] - Image is not available to pull #9

Closed odellem closed 4 months ago

odellem commented 4 months ago

Describe the bug: Pods fail with ErrImagePull and cannot pull the image from the public repository.

To Reproduce:

  1. Install the operator via Helm
  2. The operator pod cannot start because its image pull from the repository fails

Expected behavior: The pod starts.

happytreees commented 4 months ago

Hello @odellem

We are seeing a lot of successful anonymous pulls on this public repo, so this does not appear to be a widespread problem. Can you share the following:

  1. What country is the pull being attempted from?
  2. Can you share the full event output of the failing pod?

odellem commented 4 months ago

  1. US East
  2. Full event output:
    16h                    Warning   Failed                            Pod/slinkee-operator-5d684988-wd7jx    Failed to pull image "ewr.vultrcr.com/slurm/slinkee:v0.0.1": failed to pull and unpack image "ewr.vultrcr.com/slurm/slinkee:v0.0.1": failed to copy: httpReadSeeker: failed open: failed to do request: Get "https://cne-ewr-minio-000.vultr.dev/vcr-ewr/docker/registry/v2/blobs/sha256/25/25d895424c791d82fafe605921a05618ff2262447a470d53a93dbc66e84c1fa6/data?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=yg1F6lopMheulWYqqHf1%2F20240613%2Fus-west-1%2Fs3%2Faws4_request&X-Amz-Date=20240613T221814Z&X-Amz-Expires=1200&X-Amz-SignedHeaders=host&X-Amz-Signature=066881ddb3156aaae5a6191c3237b5556120ed4034dd743b95e17d45ba0da062": EOF
    16h                    Warning   Failed                            Pod/slinkee-operator-5d684988-wd7jx    Failed to pull image "ewr.vultrcr.com/slurm/slinkee:v0.0.1": failed to pull and unpack image "ewr.vultrcr.com/slurm/slinkee:v0.0.1": failed to copy: httpReadSeeker: failed open: failed to do request: Get "https://cne-ewr-minio-000.vultr.dev/vcr-ewr/docker/registry/v2/blobs/sha256/25/25d895424c791d82fafe605921a05618ff2262447a470d53a93dbc66e84c1fa6/data?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=yg1F6lopMheulWYqqHf1%2F20240613%2Fus-west-1%2Fs3%2Faws4_request&X-Amz-Date=20240613T221831Z&X-Amz-Expires=1200&X-Amz-SignedHeaders=host&X-Amz-Signature=ffc03ed230e844705ca3bd934f25daf947cd7f8fe1eec6555170e7551d2db803": EOF
    16h (x3 over 16h)      Normal    Pulling                           Pod/slinkee-operator-5d684988-wd7jx    Pulling image "ewr.vultrcr.com/slurm/slinkee:v0.0.1"
    16h (x3 over 16h)      Warning   Failed                            Pod/slinkee-operator-5d684988-wd7jx    Error: ErrImagePull
    16h                    Warning   Failed                            Pod/slinkee-operator-5d684988-wd7jx    Failed to pull image "ewr.vultrcr.com/slurm/slinkee:v0.0.1": failed to pull and unpack image "ewr.vultrcr.com/slurm/slinkee:v0.0.1": failed to copy: httpReadSeeker: failed open: failed to do request: Get "https://cne-ewr-minio-000.vultr.dev/vcr-ewr/docker/registry/v2/blobs/sha256/25/25d895424c791d82fafe605921a05618ff2262447a470d53a93dbc66e84c1fa6/data?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=yg1F6lopMheulWYqqHf1%2F20240613%2Fus-west-1%2Fs3%2Faws4_request&X-Amz-Date=20240613T221859Z&X-Amz-Expires=1200&X-Amz-SignedHeaders=host&X-Amz-Signature=6f19ad539dc4fecf72e88fccede5309ac325fc502d73a400cb0b1845aa710290": EOF
    16h (x4 over 16h)      Normal    BackOff                           Pod/slinkee-operator-5d684988-wd7jx    Back-off pulling image "ewr.vultrcr.com/slurm/slinkee:v0.0.1"
    16h (x4 over 16h)      Warning   Failed                            Pod/slinkee-operator-5d684988-wd7jx    Error: ImagePullBackOff
    115s (x229 over 16h)   Warning   FailedToRetrieveImagePullSecret   Pod/slinkee-operator-5d684988-wd7jx    Unable to retrieve some image pull secrets (vcr); attempting to pull the image may not succeed.

happytreees commented 4 months ago

Hey @odellem, thank you for providing that. It looks like you are seeing an EOF, which is likely due to network latency between the machine pulling the image and the CR. It's a rather small image (150 MB), so there aren't any limiters in place that would prevent the pull from going through; latency is the only likely cause on our side.

I attempted a pull from an AWS EC2 instance in us-east-2 and there were zero issues, so pulls from outside the Vultr network appear to be working as well.

# docker pull ewr.vultrcr.com/slurm/slinkee:latest
latest: Pulling from slurm/slinkee
...
Status: Downloaded newer image for ewr.vultrcr.com/slurm/slinkee:latest
ewr.vultrcr.com/slurm/slinkee:latest

We will explore mirroring the image to other CRs; at this time, however, it is only uploaded to the Vultr CR. One way to get around this is to download the image onto another machine, upload it to a CR of your choice, and modify the Helm values to point at the new repo.
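
A minimal sketch of that workaround, assuming Docker is available on a machine that can reach the Vultr CR. `registry.example.com` is a placeholder for the target registry, and `run`, `retarget_image`, `mirror_image`, and `DRY_RUN` are illustrative names, not part of the chart or the operator:

```shell
#!/bin/sh
# Sketch: copy the operator image into a registry the cluster can reach.
# registry.example.com is a placeholder; substitute your own registry.

SOURCE_IMAGE="ewr.vultrcr.com/slurm/slinkee:v0.0.1"

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Swap the registry host of an image reference, keeping repository and tag.
# ${1#*/} strips the shortest prefix ending in "/", which is the registry host.
retarget_image() {
  printf '%s/%s\n' "$2" "${1#*/}"
}

# Pull from the source registry, retag, and push to the target registry.
mirror_image() {
  dst=$(retarget_image "$1" "$2")
  run docker pull "$1" &&
    run docker tag "$1" "$dst" &&
    run docker push "$dst"
}

# Example (prints the commands without running them):
#   DRY_RUN=1 mirror_image "$SOURCE_IMAGE" registry.example.com
```

After the push, the chart's image settings need to point at the new reference. The exact value keys depend on the chart's values.yaml and are not guessed at here.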

I will be closing this issue as it does not appear to be a larger issue outside of possible latency.

odellem commented 4 months ago

When I tried switching machines, the other machines had the same issue. I am assuming this is an IT firewall issue. Thanks for doing the sanity check for me.
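
One way to test the firewall theory, as a rough sketch: the events show the failing request going to a blob storage host (`cne-ewr-minio-000.vultr.dev`), not to the registry host (`ewr.vultrcr.com`), so a firewall could allow one and block the other. The helper names `url_host` and `probe_host` are made up for illustration, and the 15-second timeout is an arbitrary choice:

```shell
#!/bin/sh
# Sketch: check whether both hosts involved in the pull are reachable
# from the affected machine. Helper names are illustrative only.

# Print the host part of a URL that has a scheme and a path.
url_host() {
  rest="${1#*://}"              # drop the scheme
  printf '%s\n' "${rest%%/*}"   # drop the path and query string
}

# Print the HTTP status code returned by a host, or fail on a
# connection error or timeout.
# -s silent, -S still show errors, -o discard the body,
# -m total timeout in seconds, -w print the status code.
probe_host() {
  curl -sS -o /dev/null -m 15 -w '%{http_code}\n' "https://$1/"
}

# Example, using the hosts from the event output above:
#   probe_host ewr.vultrcr.com
#   probe_host "$(url_host 'https://cne-ewr-minio-000.vultr.dev/vcr-ewr/docker')"
```

Any HTTP status, even an error status such as 403, means the connection reached the host. A timeout or an abrupt EOF on only the second host would be consistent with something in between blocking the blob download.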