OTA-1177: Gather OSUS data

oarribas commented 5 months ago

Collect data from OSUS operator if installed in the cluster.

openshift-ci-robot commented 5 months ago

@oarribas: This pull request references OTA-1177 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to [this](https://github.com/openshift/must-gather/pull/416):

> Collect data from OSUS operator if installed in the cluster.

Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fmust-gather). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.

ingvagabund commented 1 month ago

/approve /lgtm

ingvagabund commented 1 month ago

@kasturinarra just out of curiosity, do we have any test cases/jobs monitoring how much an average must-gather image grows over time across various installations?

@oarribas are there any statistics about must-gather size? E.g. a matrix of which operators are installed -> how much data can be collected. Or, what are the variable parts that can significantly increase the size?

@sferich888 is it possible to make a matrix of all flavors of an OCP cluster, including layered products, to see how complex a must-gather collection can be? I am quite blind here. I would like to extend my perspective so we can make better decisions when reviewing these kinds of additions.

oarribas commented 1 month ago

@ingvagabund, checking in OTA-1177

sferich888 commented 6 days ago

@ingvagabund by my count (before we add on layered products) the matrix you're looking at has 87k+ combinations in it.

```python
>>> versions = ['4.12', '4.13', '4.14', '4.15', '4.16']
>>> IaaS_providers = ['Alibaba', 'AWS', 'Azure', 'Azure Stack Hub', 'GCP', 'IBM', 'Nutanix', 'BareMetal', 'OpenStack', 'Vsphere', 'OCI']
>>> Install_Method = ['IPI', 'UPI', 'Assisted Installer']
>>> Install_Mode = ['Connected', 'Disconnected']
>>> Deployment_Pattern = ['SingleNode', 'SingleNode+', '3C2W', '3C3I2W', '3CW', '3CW3I']
>>> ### C = Control Plane, W = Worker, I = Infrastructure, + = Added workers
>>> Arch = ['x86_64', 'S390', 'ARM', 'Power']
>>>
>>> from itertools import product
>>>
>>> m_lists = [versions, IaaS_providers, Install_Method, Install_Mode, Deployment_Pattern, Arch, IaaS_providers]
>>> cp = list(product(*m_lists))
>>> len(cp)
87120
```

However, when it comes to must-gather and testing whether we are building a tool that works for the majority of our user base, I think the more important thing to consider is only about 9k of those combinations (or 10% of that matrix).

The biggest issues I have seen are related to operating at specific sizes and scales, i.e. with our Deployment Patterns (combinations). We see the biggest challenges when must-gather can't find a host to run on (SingleNode clusters, or clusters with schedulable control planes that are loaded with work), or has to crowd out a workload to start (people really don't like this, but it's necessary), or when we try to operate at large scales (500+ nodes, with workloads).

The biggest issues we see are with 'time to collect' data and with how much data we collect. (Note: we don't automatically compress archives; there is an RFE for this that hasn't been actioned yet, so we probably shouldn't make collection estimates based on compression.) The size of our 'archive' is an issue for most customers because, in a lot of situations, they have to move the data from one system to another just so that they can upload it to Red Hat. That is 2+ data transfers for many customers (mostly customers in disconnected or restricted network environments). Paired with the time to collect a must-gather (20+ min in some situations), we could have a customer collecting and transferring data for up to 30 to 40 min (based on some estimates).
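To make the arithmetic behind that estimate concrete, here is a rough sketch of how collection time and repeated transfers of an uncompressed archive add up. The function name, the archive size, and the link speed are hypothetical placeholders chosen for illustration, not measurements from any cluster; only the 20 min collection time and the two transfers come from the comment above.

```python
def end_to_end_minutes(collect_min, archive_mb, link_mbit_s, transfers):
    """Collection time plus `transfers` sequential copies of the archive.

    archive_mb is the uncompressed archive size in megabytes and
    link_mbit_s is the usable link speed in megabits per second.
    """
    seconds_per_transfer = archive_mb * 8 / link_mbit_s  # megabytes -> megabits
    return collect_min + transfers * seconds_per_transfer / 60


# Hypothetical inputs: 20 min collection, 3000 MB archive,
# 50 Mbit/s link, 2 transfers (jump host, then upload).
total = end_to_end_minutes(collect_min=20, archive_mb=3000, link_mbit_s=50, transfers=2)
print(total)
```

With these assumed inputs each transfer takes 8 minutes, so the transfers add 16 minutes on top of collection. Every extra hop or any growth in archive size scales that part linearly.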

sferich888 commented 6 days ago

/lgtm

openshift-ci[bot] commented 6 days ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ingvagabund, oarribas, sferich888

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[collection-scripts/OWNERS](https://github.com/openshift/must-gather/blob/master/collection-scripts/OWNERS)~~ [sferich888]

Approvers can indicate their approval by writing `/approve` in a comment

Approvers can cancel approval by writing `/approve cancel` in a comment

openshift-ci[bot] commented 6 days ago

@oarribas: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).

ingvagabund commented 5 days ago

@sferich888 IaaS_providers is mentioned twice in m_lists. Is that on purpose?
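For reference, a minimal sketch of what the repeated entry does to the count, reusing the lists from the snippet above. The names `with_duplicate` and `deduplicated` are illustrative and not part of the original snippet, and the sizes are computed from list lengths instead of materializing every tuple.

```python
from math import prod

versions = ['4.12', '4.13', '4.14', '4.15', '4.16']
IaaS_providers = ['Alibaba', 'AWS', 'Azure', 'Azure Stack Hub', 'GCP', 'IBM',
                  'Nutanix', 'BareMetal', 'OpenStack', 'Vsphere', 'OCI']
Install_Method = ['IPI', 'UPI', 'Assisted Installer']
Install_Mode = ['Connected', 'Disconnected']
Deployment_Pattern = ['SingleNode', 'SingleNode+', '3C2W', '3C3I2W', '3CW', '3CW3I']
Arch = ['x86_64', 'S390', 'ARM', 'Power']

# m_lists as posted: IaaS_providers appears a second time at the end.
with_duplicate = [versions, IaaS_providers, Install_Method, Install_Mode,
                  Deployment_Pattern, Arch, IaaS_providers]
# The same dimensions with the trailing repeat dropped.
deduplicated = with_duplicate[:-1]

# The size of a Cartesian product is the product of the list lengths.
count_with_duplicate = prod(len(dim) for dim in with_duplicate)
count_deduplicated = prod(len(dim) for dim in deduplicated)

print(count_with_duplicate)  # 87120
print(count_deduplicated)    # 7920
```

The second `IaaS_providers` multiplies the total by 11, so with each dimension counted once the matrix has 7,920 combinations.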

openshift-bot commented 5 days ago

[ART PR BUILD NOTIFIER]

Distgit: ose-must-gather

This PR has been included in build ose-must-gather-container-v4.18.0-202409190709.p0.gab95e6a.assembly.stream.el9. All builds following this will include this PR.

oarribas commented 5 days ago

/cherry-pick release-4.17

openshift-cherrypick-robot commented 5 days ago

@oarribas: new pull request created: #443

In response to [this](https://github.com/openshift/must-gather/pull/416#issuecomment-2360646331):

> /cherry-pick release-4.17
