openshift / oc-mirror

Lifecycle manager for internet-disconnected OpenShift environments
Apache License 2.0

oc mirror very slow, failure prone / inconsistent on bandwidth constrained network #793

Closed BadgerOps closed 5 months ago

BadgerOps commented 9 months ago

First off, I am excited about what I'm seeing being developed here - I can see a lot of improvements coming soon, and would love to help contribute a solution to the problem I am outlining below.

Version

$ oc-mirror version
4.12 - 4.15-rc.4

What happened?

Hello team! I am on a bandwidth constrained network (~20mb average) in the EU. I am attempting to mirror something similar to the following imageset:

kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
storageConfig:
  local:
    path: /var/mirror/offline
mirror:
  platform:
    channels:
    - name: stable-4.12
      minVersion: '4.12.33'
      maxVersion: '4.12.40' # (I've also tried with min/max being the same version...)
      shortestPath: true
    graph: true
  operators:
    - catalog: registry.redhat.io/redhat/redhat-marketplace-index:v4.12
      packages:
      - name: percona-postgresql-operator-certified-rhmp
      - < snip - a couple more marketplace operators here >
    - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.12
      packages:
      - name: rhbk-operator
        channels:
        - name: stable-v22
      < snip, but about 10 other red hat operators >
  additionalImages:
  - name: registry.redhat.io/ubi8/ubi:latest
  - < snip  - about 3 other upstream images >

I'm mirroring with the following syntax:

oc-mirror -v 4 --config imageset.yaml file:///var/mirror/offline

This process takes anywhere from 500 to 1100 minutes to complete, but it often fails, either with a connection reset error (probably our network) or with some upstream error that usually looks like a rate limit error.

It also seems to take forever (at least 10+ minutes on my system) to initialize the working directory (I'd love to know why a whole filesystem tree is created? :confusedbadger: ) and doesn't seem to cache anything on failure, only on a completely successful sync.

This is becoming frustrating for our team, as we're unable to reliably sync platform updates & operators to our disconnected OpenShift installation.

What did you expect to happen?

Reliable source of upstream updates for platform & operators

How to reproduce it (as minimally and precisely as possible)?

I will try to get some sanitized logs to provide if they would be helpful. What specifically can I provide to help dig into this?

Anything else we need to know?

I would love to help identify and resolve the issues described above. I suspect that some retry logic with exponential backoff could help with some of them, and I would love to dig into whether there is a way to recover from a partially mirrored imageset.
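As a rough illustration of the retry-with-exponential-backoff idea, here is a minimal sketch that wraps the whole command from the outside. It is not how oc-mirror handles retries internally; `run_with_backoff` and its parameters are hypothetical names, and the delays are arbitrary assumptions. Since a failed run reportedly keeps no cache, each retry may start the sync over, so this only helps with unattended re-runs.

```python
import random
import subprocess
import time
from typing import Callable, Sequence


def run_with_backoff(
    cmd: Sequence[str],
    max_attempts: int = 5,
    base_delay: float = 30.0,
    max_delay: float = 900.0,
    run: Callable[[Sequence[str]], int] = lambda c: subprocess.run(c).returncode,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run cmd until it exits 0 or max_attempts is reached; return the last exit code."""
    code = 1
    for attempt in range(1, max_attempts + 1):
        code = run(cmd)
        if code == 0:
            return 0
        if attempt == max_attempts:
            break
        # Exponential backoff: base, 2*base, 4*base, ... capped at max_delay,
        # plus up to 10% random jitter so parallel jobs do not retry in lockstep.
        delay = min(max_delay, base_delay * 2 ** (attempt - 1))
        sleep(delay + random.uniform(0, delay * 0.1))
    return code


# Example call, using the command line from this report:
# run_with_backoff(["oc-mirror", "-v", "4", "--config", "imageset.yaml",
#                   "file:///var/mirror/offline"])
```

The `run` and `sleep` parameters exist only so the loop can be exercised without actually invoking oc-mirror or waiting.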

I've been trying different combinations of imagesets & oc-mirror versions (4.12, 4.14, 4.15-rc2,3,4) to try to get a reliable imageset downloaded to our internet facing server, but pretty consistently see the aforementioned problems.

Thank you, and I look forward to figuring out a good path forward - and, of course, I'd love it if I was just doing it wrong.

dadav commented 9 months ago

Just today I opened a ticket with Red Hat about the slowness of this tool. Listing the channels of a single operator takes ~11 minutes. Love the tool, but this is kind of frustrating.

BadgerOps commented 9 months ago

Yeah, it definitely depends on the network - but the channel listing is super slow, and when trying to check multiple operators across multiple registries, it can take a ton of time. I'm not sure what optimization/caching can be done with the current architecture, but would love for that to be possible!

dmc5179 commented 9 months ago

@BadgerOps I've had another person from the EU region ask about oc mirror being very slow recently. 4-5 minutes or over 10 minutes in some cases. We ran this command:

oc-mirror --verbose 9 list operators --catalog=registry.redhat.io/redhat/redhat-operator-index:v4.12 --package=rhacs-operator 2>&1 | tee mirror-time.log

With the verbosity increased I can see cloudfront cache hits like

Hit from cloudfront response.status=200 OK

If you're seeing cloudfront cache misses, that could easily be part of the problem. I would think that given the number of OpenShift clusters in any region, the redhat-operator-index images should almost never have a cache miss. Even if there was one, it shouldn't happen again if you rerun the same command.

Are you seeing any cache misses in the verbose oc-mirror logs or any other messages that indicate slowness?
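To answer that question on a long log, the cache results can be tallied rather than eyeballed. The sketch below assumes the verbose log lines carry a field shaped like `x-cache=Hit from cloudfront`, matching the hit line shown above; the exact line format is an assumption, and `tally_x_cache` is a hypothetical helper, not part of oc-mirror.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed log shape (not confirmed against oc-mirror's source):
#   "... x-cache=Hit from cloudfront response.status=200 OK ..."
X_CACHE = re.compile(r"x-cache=(\w+) from cloudfront")


def tally_x_cache(lines: Iterable[str]) -> Counter:
    """Count the first word of each x-cache value (e.g. 'Hit', 'Miss')."""
    counts: Counter = Counter()
    for line in lines:
        match = X_CACHE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Example call against the log written by the tee command above:
# with open("mirror-time.log") as fh:
#     print(tally_x_cache(fh))
```

Any key other than `Hit` in the result would be worth a closer look.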

BadgerOps commented 9 months ago

I'll have to check my larger run log files, it is currently at around ~30hr run time. (Edit: I only have -v 4 set so no x-cache logs...)

For grins, here's my output of the same command, cache hitting as expected:

time oc-mirror --verbose 9 list operators --catalog=registry.redhat.io/redhat/redhat-operator-index:v4.12 --package=rhacs-operator 2>&1 | tee mirror-time.log
<snip>
NAME            DISPLAY NAME                              DEFAULT CHANNEL
rhacs-operator  Advanced Cluster Security for Kubernetes  stable

PACKAGE         CHANNEL     HEAD
rhacs-operator  latest      rhacs-operator.v3.74.8
rhacs-operator  rhacs-3.62  rhacs-operator.v3.62.1
rhacs-operator  rhacs-3.64  rhacs-operator.v3.64.2
<snip>
rhacs-operator  rhacs-4.3   rhacs-operator.v4.3.4
rhacs-operator  stable      rhacs-operator.v4.3.4

real    1m49.465s
user    0m53.767s
sys     0m18.740s
cat mirror-time.log | sed -n -e 's/^.*x-cache=//p' | cut -d' ' -f1-4
Hit from cloudfront response.status=200
Hit from cloudfront response.status=200
Hit from cloudfront response.status=200
<snip>

BadgerOps commented 9 months ago

Also, to clarify, this is when running oc-mirror to synchronize packages. Below is my exact imageset.yaml for a run that started at 2024-02-05T10:36:33Z, so as of this posting we're at a little over 48hr of run time and a 775 GB repo size.

$ du -h -s oc-mirror-workspace
775G    oc-mirror-workspace
./oc-mirror version
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.15.0-202401241750.p0.g6ddf902.assembly.stream-6ddf902", GitCommit:"6ddf902e42c93a3fd1cb155d52584bb8dd912c43", GitTreeState:"clean", BuildDate:"2024-01-24T22:09:12Z", GoVersion:"go1.20.12 X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}

# what _does_ --continue-on-error do and should I be using it :thonk: 

./oc-mirror -v 4 --continue-on-error --config imageset.yaml file://.

imageset.yaml:

kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
storageConfig:
  local:
    path: /var/quay/oc-mirror/offline
mirror:
  platform:
    architectures:
      - "amd64"
    channels:
    - name: stable-4.12
      type: ocp
      minVersion: '4.12.40'
      maxVersion: '4.12.40'
      shortestPath: true
    graph: true
  operators:
    - catalog: registry.redhat.io/redhat/redhat-marketplace-index:v4.12
      packages:
      - name: percona-postgresql-operator-certified-rhmp
    - catalog: registry.redhat.io/redhat/certified-operator-index:v4.12
      packages:
      - name: gitlab-operator-kubernetes
      - name: gitlab-runner-operator
      - name: dell-csm-operator-certified
      - name: splunk-operator
    - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.12
      full: true
  additionalImages:
  - name: registry.redhat.io/ubi8/ubi:latest
  - name: registry.redhat.io/rhel8/support-tools:latest
  - name: registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.8.0
  - name: registry.k8s.io/sig-storage/csi-resizer:v1.8.0
  - name: registry.k8s.io/sig-storage/csi-attacher:v4.3.0
  - name: registry.k8s.io/sig-storage/csi-provisioner:v3.5.0
  - name: registry.k8s.io/sig-storage/csi-snapshotter:v6.2.2
  - name: docker.io/dellemc/csi-metadata-retriever:v1.4.0
  - name: registry.access.redhat.com/ubi8/nginx-120:latest
  - name: registry.gitlab.com/gitlab-org/build/cng/kubectl:v16.5.1

Note, this attempt changed from:

mirror:
  operators:
    - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.12
      packages:
      - name: advanced-cluster-management                                  
        channels:
        - name: release-2.6
        - name: release-2.8             
      - name: compliance-operator
        channels:
        - name: stable
      - name: ansible-automation-platform-operator                              
        channels:
        - name:  stable-2.4-cluster-scoped
      - name: container-security-operator                                  
        channels:
        - name: stable-3.9
      - name: file-integrity-operator                                  
        channels:
        - name: stable 
      - name: kubernetes-nmstate-operator                                  
        channels:
        - name: stable 
      - name: kubevirt-hyperconverged                                
        channels:
        - name: stable
      - name: local-storage-operator                                
        channels:
        - name: stable 
      - name: mtv-operator                                 
        channels:
        - name: release-v2.5 
      - name: odf-operator                                
        channels:
        - name: stable-4.12
      - name: openshift-gitops-operator
        channels:
        - name: latest
      - name: openshift-pipelines-operator-rh
        channels:
        - name: latest
      - name: quay-bridge-operator
        channels:
        - name: stable-3.9
      - name: quay-operator
        channels:
        - name: stable-3.9
      - name: rhacs-operator
        channels:
        - name: stable
      - name: rhsso-operator
        channels:
        - name: stable
      - name: multicluster-engine
        channels:
        - name: stable-2.3
        - name: stable-2.4
      - name: rhbk-operator
        channels:
        - name: stable-v22
      - name: odf-operator
        channels:
        - name: stable-4.12
      - name: openshift-gitops-operator
        channels:
        - name: latest
      - name: openshift-pipelines-operator-rh
        channels:
        - name: latest
      - name: quay-bridge-operator
        channels:
        - name: stable-3.9
      - name: quay-operator
        channels:
        - name: stable-3.9
      - name: rhacs-operator
        channels:
        - name: stable
      - name: rhsso-operator
        channels:
        - name: stable
      - name: multicluster-engine
        channels:
        - name: stable-2.3
      - name: netobserv-operator
        channels:
        - name: stable
      - name: loki-operator
        channels:
        - name: stable-5.8
      - name: web-terminal
        channels:
        - name: fast
      - name: devspaces
        channels:
        - name: stable
      - name: devworkspace-operator
        channels:
        - name: fast

to:

    - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.12
      full: true

We made this change because we realized we were missing some necessary Red Hat operators (specifically the logging operator) and hoped that mirroring the whole catalog would solve that problem.

I'm happy to grab any details you might want to dig into this, and will make sure to re-run with -v 9 next time so we can gather more details.

dadav commented 9 months ago

I'll have to check my larger run log files, it is currently at around ~30hr run time. (Edit: I only have -v 4 set so no x-cache logs...)

For grins, here's my output of the same command, cache hitting as expected:

time oc-mirror --verbose 9 list operators --catalog=registry.redhat.io/redhat/redhat-operator-index:v4.12 --package=rhacs-operator 2>&1 | tee mirror-time.log
<snip>
NAME            DISPLAY NAME                              DEFAULT CHANNEL
rhacs-operator  Advanced Cluster Security for Kubernetes  stable

PACKAGE         CHANNEL     HEAD
rhacs-operator  latest      rhacs-operator.v3.74.8
rhacs-operator  rhacs-3.62  rhacs-operator.v3.62.1
rhacs-operator  rhacs-3.64  rhacs-operator.v3.64.2
<snip>
rhacs-operator  rhacs-4.3   rhacs-operator.v4.3.4
rhacs-operator  stable      rhacs-operator.v4.3.4

real    1m49.465s
user    0m53.767s
sys     0m18.740s
cat mirror-time.log | sed -n -e 's/^.*x-cache=//p' | cut -d' ' -f1-4
Hit from cloudfront response.status=200
Hit from cloudfront response.status=200
Hit from cloudfront response.status=200
<snip>

Looks exactly the same for me

BadgerOps commented 9 months ago

Womp, womp, ran out of disk space (Disk had 3T) after ~5.8 days. It would be awesome if there was a way to calculate what disk space would be required from an imageset.

error: failed to create archive: write /var/quay/oc-mirror/offline/mirror_seq2_000000.tar: no space left on device

real    8382m56.456s
user    377m32.351s
sys     367m54.076s
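Nothing in this thread shows oc-mirror reporting the required size up front, so the most a wrapper script can do is refuse to start when free space is below the operator's own estimate. A minimal sketch, assuming the estimate comes from elsewhere (for example the `du` output of a previous run); `has_headroom` and the 20% safety margin are hypothetical choices.

```python
import shutil


def has_headroom(path: str, needed_bytes: int, safety_factor: float = 1.2) -> bool:
    """True if the filesystem holding path has at least needed_bytes * safety_factor free."""
    free = shutil.disk_usage(path).free
    return free >= needed_bytes * safety_factor


# Example call: require room for an estimated 3 TB before starting a sync.
# has_headroom("/var/quay/oc-mirror/offline", 3 * 10**12)
```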

So again, given the above imageset.yaml and ~30mb average bandwidth, it took almost 6 days to run oc-mirror. It's a miracle that a network reset didn't break it in that time. Also, since the run failed, restarting will start from 0, meaning another ~6 days of waiting.
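For a sense of scale, the arithmetic behind these numbers can be sketched as below. It assumes "30mb" means 30 megabits per second, which the thread does not state, and it uses decimal gigabytes. The result is only a lower bound that ignores protocol overhead, retries and local processing, and the function names are hypothetical.

```python
def min_transfer_hours(size_gb: float, link_mbit_per_s: float) -> float:
    """Lower bound on hours needed to move size_gb (decimal GB) over the given link."""
    bits = size_gb * 1e9 * 8
    seconds = bits / (link_mbit_per_s * 1e6)
    return seconds / 3600


def minutes_to_days(minutes: float) -> float:
    """Convert a wall-clock duration in minutes to days."""
    return minutes / (60 * 24)


# 775 GB at 30 Mbit/s needs at least about 57.4 hours on the wire.
print(round(min_transfer_hours(775, 30), 1))
# The 8382m56s "real" time above is about 5.8 days.
print(round(minutes_to_days(8382 + 56.456 / 60), 1))
```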

Differential downloads/picking up from cached download would be very nice.

BadgerOps commented 9 months ago

So, here we are after 2 more days of attempted syncs.

I did learn that

    - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.12
      full: true

appears to mirror every operator version instead of just the default. It seems that if I only want the latest/default version of each operator, I should just have

- catalog: registry.redhat.io/redhat/redhat-operator-index:v4.12

but I don't see that explicitly called out.

Back to the random failures. I tried running with --continue-on-error, which doesn't seem to be documented at all. I then re-ran the exact same imageset sync to file:// several times in a row on the internet-connected mirror, until I saw no files downloaded and no errors.

I then copied the (several) mirror_seq tar files over to my disconnected network and ran oc-mirror --from /path/to/mirror_seq1 docker://internal-quay-registry:8443 - which ran for ~8min, leading me to premature celebration :tada:, only to then fail with

error: error occured during image processing: error finding remote layer "sha256: <>": layer "sha256:<>" is not present in the archive.

Google, other issues, Stack Overflow and the AI bots are all failing me in trying to move forward here. Any thoughts? Am I doing this completely wrong?
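One way to narrow down the "not present in the archive" error is to check which of the copied tar files, if any, contains the layer named in the message. The sketch below assumes blobs are stored under member names that include the hex digest; that layout is an assumption about the archive format, not something confirmed in this thread. `archives_containing` is a hypothetical helper, and walking very large archives will be slow.

```python
import tarfile
from pathlib import Path
from typing import Iterable, List


def archives_containing(digest: str, archives: Iterable[Path]) -> List[Path]:
    """Return the archives that have a member whose name contains the digest's hex part."""
    hex_part = digest.split(":", 1)[-1].strip()
    if not hex_part:
        raise ValueError("empty digest")
    found = []
    for archive in archives:
        with tarfile.open(archive) as tar:
            # Iterating a TarFile yields TarInfo headers without extracting data.
            if any(hex_part in member.name for member in tar):
                found.append(archive)
    return found


# Example call (the directory is a placeholder; the digest comes from the error message):
# archives_containing("sha256:<digest>", sorted(Path("/path/to").glob("mirror_seq*.tar")))
```

If no archive in the set contains the digest, the layer was never written during the connected-side runs, which points at the download step rather than the publish step.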

BadgerOps commented 9 months ago

For the red hatters, I submitted a support case with the same details here - FYSA.

dadav commented 9 months ago

@BadgerOps FYI, the caching feature will be available in v2 of oc-mirror, which will be released around OpenShift ~4.16. In my case the slowness is probably due to my mediocre host, which could use better specs.

BadgerOps commented 8 months ago

Another update - we've tried quite a few different ways of consistently mirroring Platform, Operator and Container images using oc-mirror.

Given the restrictions mentioned above, and in our Support Case, we're having significant issues utilizing this tool - and really need some better guidance on how this tool is supposed to be used.

Are there people out there on restricted networks that are successfully using oc-mirror to move data across networks? Feel free to reach out to me via my profile email to coordinate a discussion.

openshift-bot commented 5 months ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

BadgerOps commented 5 months ago

I'm no longer on the team dealing with this. Some of the errors seem to have gone away thanks to a combination of more reliable quay.io / redhat.io and 4.14.x oc-mirror binary improvements.

It is still extremely slow, and there are still some errors that need to be addressed, but since there hasn't really been much movement here, I'll close this out and hope that future versions of oc-mirror focus on speed/caching/reliability.