Just today I opened a ticket with Red Hat about the slowness of this tool. Listing the channels of a single operator takes ~11 minutes. Love the tool, but this is kind of frustrating.
Yeah, it definitely depends on the network - but the channel listing is super slow, and when trying to check multiple operators across multiple registries, it can take a ton of time. I'm not sure what optimization/caching can be done with the current architecture, but would love for that to be possible!
@BadgerOps I've had another person from the EU region ask about oc mirror being very slow recently. 4-5 minutes or over 10 minutes in some cases. We ran this command:
oc-mirror --verbose 9 list operators --catalog=registry.redhat.io/redhat/redhat-operator-index:v4.12 --package=rhacs-operator 2>&1 | tee mirror-time.log
With the verbosity increased I can see cloudfront cache hits like
Hit from cloudfront response.status=200 OK
If you're seeing cloudfront cache misses, that could easily be part of the problem. Given the number of OpenShift clusters in any region, I would think the redhat-operator-index images should almost never have a cache miss, and even if there were one, it shouldn't recur when you rerun the same command.
Are you seeing any cache misses in the verbose oc-mirror logs or any other messages that indicate slowness?
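A quick way to tally hits versus misses from a verbose run (a small sketch, assuming the x-cache= field appears in the log as in the line above) is:

sed -n 's/^.*x-cache=//p' mirror-time.log | cut -d' ' -f1 | sort | uniq -c

which prints a count per cache result (Hit, Miss, etc.).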
I'll have to check my larger run log files; that run is currently at ~30hr of run time. (Edit: I only have -v 4 set, so no x-cache logs...)
For grins, here's my output of the same command, cache hitting as expected:
time oc-mirror --verbose 9 list operators --catalog=registry.redhat.io/redhat/redhat-operator-index:v4.12 --package=rhacs-operator 2>&1 | tee mirror-time.log
<snip>
NAME DISPLAY NAME DEFAULT CHANNEL
rhacs-operator Advanced Cluster Security for Kubernetes stable
PACKAGE CHANNEL HEAD
rhacs-operator latest rhacs-operator.v3.74.8
rhacs-operator rhacs-3.62 rhacs-operator.v3.62.1
rhacs-operator rhacs-3.64 rhacs-operator.v3.64.2
<snip>
rhacs-operator rhacs-4.3 rhacs-operator.v4.3.4
rhacs-operator stable rhacs-operator.v4.3.4
real 1m49.465s
user 0m53.767s
sys 0m18.740s
cat mirror-time.log | sed -n -e 's/^.*x-cache=//p' | cut -d' ' -f1-4
Hit from cloudfront response.status=200
Hit from cloudfront response.status=200
Hit from cloudfront response.status=200
<snip>
Also, to clarify, this is when running oc-mirror to synchronize packages. Here is my exact imageset.yaml; the run has been going since 2024-02-05T10:36:33Z, so as of the time of this posting we're a little over 48hr of run time and a 775G repo size.
$ du -h -s oc-mirror-workspace
775G oc-mirror-workspace
./oc-mirror version
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.15.0-202401241750.p0.g6ddf902.assembly.stream-6ddf902", GitCommit:"6ddf902e42c93a3fd1cb155d52584bb8dd912c43", GitTreeState:"clean", BuildDate:"2024-01-24T22:09:12Z", GoVersion:"go1.20.12 X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
# what _does_ --continue-on-error do and should I be using it :thonk:
./oc-mirror -v 4 --continue-on-error --config imageset.yaml file://.
imageset.yaml:
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
storageConfig:
  local:
    path: /var/quay/oc-mirror/offline
mirror:
  platform:
    architectures:
      - "amd64"
    channels:
      - name: stable-4.12
        type: ocp
        minVersion: '4.12.40'
        maxVersion: '4.12.40'
        shortestPath: true
    graph: true
  operators:
    - catalog: registry.redhat.io/redhat/redhat-marketplace-index:v4.12
      packages:
        - name: percona-postgresql-operator-certified-rhmp
    - catalog: registry.redhat.io/redhat/certified-operator-index:v4.12
      packages:
        - name: gitlab-operator-kubernetes
        - name: gitlab-runner-operator
        - name: dell-csm-operator-certified
        - name: splunk-operator
    - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.12
      full: true
  additionalImages:
    - name: registry.redhat.io/ubi8/ubi:latest
    - name: registry.redhat.io/rhel8/support-tools:latest
    - name: registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.8.0
    - name: registry.k8s.io/sig-storage/csi-resizer:v1.8.0
    - name: registry.k8s.io/sig-storage/csi-attacher:v4.3.0
    - name: registry.k8s.io/sig-storage/csi-provisioner:v3.5.0
    - name: registry.k8s.io/sig-storage/csi-snapshotter:v6.2.2
    - name: docker.io/dellemc/csi-metadata-retriever:v1.4.0
    - name: registry.access.redhat.com/ubi8/nginx-120:latest
    - name: registry.gitlab.com/gitlab-org/build/cng/kubectl:v16.5.1
Note, this attempt changed from:
mirror:
  operators:
    - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.12
      packages:
        - name: advanced-cluster-management
          channels:
            - name: release-2.6
            - name: release-2.8
        - name: compliance-operator
          channels:
            - name: stable
        - name: ansible-automation-platform-operator
          channels:
            - name: stable-2.4-cluster-scoped
        - name: container-security-operator
          channels:
            - name: stable-3.9
        - name: file-integrity-operator
          channels:
            - name: stable
        - name: kubernetes-nmstate-operator
          channels:
            - name: stable
        - name: kubevirt-hyperconverged
          channels:
            - name: stable
        - name: local-storage-operator
          channels:
            - name: stable
        - name: mtv-operator
          channels:
            - name: release-v2.5
        - name: odf-operator
          channels:
            - name: stable-4.12
        - name: openshift-gitops-operator
          channels:
            - name: latest
        - name: openshift-pipelines-operator-rh
          channels:
            - name: latest
        - name: quay-bridge-operator
          channels:
            - name: stable-3.9
        - name: quay-operator
          channels:
            - name: stable-3.9
        - name: rhacs-operator
          channels:
            - name: stable
        - name: rhsso-operator
          channels:
            - name: stable
        - name: multicluster-engine
          channels:
            - name: stable-2.3
            - name: stable-2.4
        - name: rhbk-operator
          channels:
            - name: stable-v22
        - name: odf-operator
          channels:
            - name: stable-4.12
        - name: openshift-gitops-operator
          channels:
            - name: latest
        - name: openshift-pipelines-operator-rh
          channels:
            - name: latest
        - name: quay-bridge-operator
          channels:
            - name: stable-3.9
        - name: quay-operator
          channels:
            - name: stable-3.9
        - name: rhacs-operator
          channels:
            - name: stable
        - name: rhsso-operator
          channels:
            - name: stable
        - name: multicluster-engine
          channels:
            - name: stable-2.3
        - name: netobserv-operator
          channels:
            - name: stable
        - name: loki-operator
          channels:
            - name: stable-5.8
        - name: web-terminal
          channels:
            - name: fast
        - name: devspaces
          channels:
            - name: stable
        - name: devworkspace-operator
          channels:
            - name: fast
to:
    - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.12
      full: true
Because we realized we were missing some necessary Red Hat operators (specifically the logging operator), and we were hoping that just mirroring the whole catalog would solve that problem.
I'm happy to grab any details you might want to dig into this, and I will make sure to re-run with -v 9 next time so we can gather more details.
That listing output looks exactly the same for me.
Womp womp - ran out of disk space (the disk had 3T) after ~5.8 days. It would be awesome if there were a way to calculate the required disk space from an imageset (a crude monitoring workaround is sketched at the end of this comment).
error: failed to create archive: write /var/quay/oc-mirror/offline/mirror_seq2_000000.tar: no space left on device
real 8382m56.456s
user 377m32.351s
sys 367m54.076s
So again, given the above imageset.yaml and ~30mb of average bandwidth, it took almost six days to run oc-mirror - it's a miracle that a network reset didn't break it in that window. Also, since we did fail, restarting will begin from 0, meaning another ~6 days of waiting.
Differential downloads/picking up from cached download would be very nice.
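In the meantime, a crude way to keep an eye on growth during a run (a sketch; the paths are the storage path and workspace from the config and output above) is just to poll du:

# check the download path and workspace size every 5 minutes
watch -n 300 'du -sh /var/quay/oc-mirror/offline oc-mirror-workspace'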
So, here we are, +2 more days of attempted syncs.
I did learn that

    - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.12
      full: true

appears to mirror every operator version instead of just the default. It seems like, if I just want whatever the latest/default operator version is, I should just have

    - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.12

but I don't see that explicitly called out.
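One way to sanity-check the scope of an imageset before committing to a multi-day download - a sketch, assuming your oc-mirror build supports the --dry-run flag, which is meant to print what would be mirrored without pulling it - is:

./oc-mirror --config imageset.yaml file://. --dry-run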
Back to the random failures. I tried running with --continue-on-error, which doesn't seem to be documented at all. I then re-run the exact same imageset-to-file:// sync several times in a row on the internet-connected mirror, until I see no files downloaded and no errors.
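Roughly, that re-run step looks like this loop (a sketch, assuming a failed sync still exits non-zero even with --continue-on-error set):

# keep re-running the same sync until it completes cleanly
until ./oc-mirror -v 4 --continue-on-error --config imageset.yaml file://. ; do
    echo "sync failed, retrying in 5 minutes"
    sleep 300
done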
I then copy the (several) mirror_seq tar files over to my disconnected network and run oc-mirror --from /path/to/mirror_seq1 docker://internal-quay-registry:8443, which runs for ~8 minutes, leading me to premature celebration :tada: only to then fail with
error: error occured during image processing: error finding remote layer "sha256: <>": layer "sha256:<>" is not present in the archive.
Google, other issues, Stack Overflow, and the AI bots are all failing me in trying to move forward here. Any thoughts? Am I completely doing this wrong?
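One rough check I can think of (a sketch; the digest below is a placeholder for the value from the error, and the internal layout of the seq archives may differ) is to grep the archive listings for the missing layer:

DIGEST='sha256:<digest from the error>'   # placeholder, not a real value
for t in mirror_seq*.tar; do
    echo "== $t"
    tar -tf "$t" | grep -F "${DIGEST#sha256:}" || echo "   (not found)"
done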
For the red hatters, I submitted a support case with the same details here - FYSA.
@BadgerOps FYI, the caching feature will be available in v2 of oc-mirror, which will be released around OpenShift ~4.16. In my case, the reason for the slowness is probably my mediocre host, which could use some better specs.
Another update - we've tried quite a few different ways of consistently mirroring Platform, Operator and Container images using oc-mirror.
Given the restrictions mentioned above, and in our Support Case, we're having significant issues utilizing this tool - and really need some better guidance on how this tool is supposed to be used.
Are there people out there on restricted networks that are successfully using oc-mirror to move data across networks? Feel free to reach out to me via my profile email to coordinate a discussion.
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
I'm no longer on the team dealing with this. Some of the errors seem to have gone away with a combination of more reliable quay.io / redhat.io and improvements in the 4.14.x oc-mirror binary. It is still extremely slow, and there are still some errors that need to be addressed, but since there hasn't really been much movement here, I'll close this out and hope that future versions of oc-mirror focus on speed, caching, and reliability.
First off, I'm excited about what I'm seeing being developed here - I can see a lot of improvements coming soon, and I'd love to help contribute a solution to the problem I'm outlining below.
Version
What happened?
Hello team! I am on a bandwidth-constrained network (~20mb average) in the EU. I am attempting to mirror something similar to the following imageset:
I'm mirroring with the following syntax:
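In its simplest form it is something like the command below (the exact verbosity and flags varied between attempts; this matches the invocations quoted elsewhere in the thread):

./oc-mirror -v 4 --config imageset.yaml file://.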
This process takes anywhere from 500 to 1100 minutes to complete, but unfortunately it often fails due to either a connection reset error (probably our network) or some upstream error that usually looks like a rate limit.
It also seems to take forever (at least 10+ minutes on my system) to initialize the working directory (I'd love to know why a whole filesystem tree is created? :confusedbadger: ) and doesn't seem to cache anything on failure, only on a completely successful sync.
This is becoming frustrating for our team, as we're unable to reliably sync platform updates & operators to our disconnected OpenShift installation.
What did you expect to happen?
Reliable source of upstream updates for platform & operators
How to reproduce it (as minimally and precisely as possible)?
I will try to get some sanitized logs to provide if they would be helpful - what specifically can I provide to help dig into this?
Anything else we need to know?
I would love to help identify and resolve the issues described above. I suspect some retry logic with exponential backoff could help with a number of them, and if there is a way to recover from a partially mirrored imageset, I would love to dig into that possibility.
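To illustrate the kind of behaviour I mean, here's a rough external sketch of retry with exponential backoff wrapped around a single sync (just the pattern I'd hope to see built in, not a description of oc-mirror's current behaviour):

# hypothetical wrapper: retry the sync up to 5 times, doubling the wait after each failure
delay=60
for attempt in 1 2 3 4 5; do
    ./oc-mirror -v 4 --config imageset.yaml file://. && break
    echo "attempt ${attempt} failed, sleeping ${delay}s before retrying"
    sleep "${delay}"
    delay=$((delay * 2))
done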
I've been trying different combinations of imagesets & oc-mirror versions (4.12, 4.14, 4.15-rc2,3,4) to get a reliable imageset downloaded to our internet-facing server, but I pretty consistently see the aforementioned problems.
Thank you, and I look forward to figuring out a good path forward - and, of course, I'd love it if I were just doing it wrong.