Closed oarribas closed 6 days ago
@oarribas: This pull request references OTA-1177 which is a valid jira issue.
Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.
/approve /lgtm
@kasturinarra just out of curiosity, do we have any test cases/jobs monitoring how much an average must-gather image grows over time across various installations?
@oarribas are there any statistics about must-gather size? E.g. a matrix of which operators are installed -> how much data can be collected. Or, what are the variable parts that can significantly increase the size?
@sferich888 is it possible to make a matrix of all flavors of an OCP cluster, including layered products, to see how complex a must-gather collection can be? I am quite blind here and would like to extend my perspective so we can make better decisions when reviewing these kinds of additions.
@ingvagabund by my count (before we add on layered products) the matrix you're looking at has 87k+ combinations in it.
>>> versions = ['4.12', '4.13', '4.14', '4.15', '4.16']
>>> IaaS_providers = ['Alibaba', 'AWS', 'Azure', 'Azure Stack Hub', 'GCP', 'IBM', 'Nutanix', 'BareMetal', 'OpenStack', 'Vsphere', 'OCI']
>>> Install_Method = ['IPI', 'UPI', 'Assisted Installer']
>>> Install_Mode = ['Connected', 'Disconnected']
>>> Deployment_Pattern = ['SingleNode', 'SingleNode+', '3C2W', '3C3I2W', '3CW', '3CW3I']
>>> ### C = Control Plane, W = Worker, I = Infrastructure, + = Added workers
>>> Arch = ['x86_64', 'S390', 'ARM', 'Power']
>>>
>>> from itertools import product
>>>
>>> m_lists = [versions, IaaS_providers, Install_Method, Install_Mode, Deployment_Pattern, Arch, IaaS_providers]
>>> cp = list(product(*m_lists))
>>> len(cp)
87120
However, when it comes to must-gather and testing, are we building a tool that works for the majority of our user base? I think the combinations that matter most are only about 9k of those (or roughly 10% of that matrix).
The biggest issues I have seen are related to operating at specific sizes and scales, i.e., with our Deployment Patterns (combinations). We see the biggest challenges when must-gather can't find a host to run on (SingleNode clusters, or clusters with schedulable control planes that are loaded with work), or has to crowd out a workload to start (people really don't like this, but it's necessary), or when we try to operate at large scales (500+ nodes, with workloads).
The biggest issues we see are with 'time to collect' data and with how much data we collect. (Note: we don't automatically compress archives; there is an RFE for this that hasn't been actioned yet, so we probably shouldn't make collection estimates based on compression.) The size of our 'archive' is an issue for most customers because, in a lot of situations, they have to move the data from one system to another just so that they can upload it to Red Hat. That is 2+ data transfers for many customers (mostly customers in disconnected or restricted network environments). Paired with the time to collect a must-gather (20+ min in some situations), we could have a customer collecting and transferring data for up to 30 to 40 min (based on some estimates).
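Since archives are not compressed automatically, one option is to compress the collected directory by hand before moving it between systems. The following is only a minimal sketch using Python's standard `tarfile` module; the function name `compress_directory` and the directory name in the usage note are hypothetical and not part of must-gather itself.

```python
import tarfile
from pathlib import Path


def compress_directory(src_dir, dest_tar=None):
    """Pack src_dir into a gzip-compressed tarball and return the tarball path."""
    src = Path(src_dir)
    if not src.is_dir():
        raise NotADirectoryError(f"{src} is not a directory")
    # Default to a sibling file named <directory>.tar.gz
    dest = Path(dest_tar) if dest_tar else src.parent / (src.name + ".tar.gz")
    with tarfile.open(dest, "w:gz") as tar:
        # arcname keeps only the top-level directory name inside the archive
        tar.add(src, arcname=src.name)
    return dest
```

For example, `compress_directory("must-gather.local.example")` would write `must-gather.local.example.tar.gz` next to the directory (the directory name here is a placeholder). How much this saves depends on the data collected, so it should not be used as a basis for size estimates.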
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: ingvagabund, oarribas, sferich888
@oarribas: all tests passed!
@sferich888 IaaS_providers is mentioned twice in m_lists. Is that on purpose?
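For reference, here is a sketch of the same count with each dimension listed only once, assuming the second IaaS_providers entry was unintentional (the thread does not confirm this). The lists are copied from the snippet above; only the variable names are lowercased.

```python
from itertools import product

versions = ['4.12', '4.13', '4.14', '4.15', '4.16']
iaas_providers = ['Alibaba', 'AWS', 'Azure', 'Azure Stack Hub', 'GCP', 'IBM',
                  'Nutanix', 'BareMetal', 'OpenStack', 'Vsphere', 'OCI']
install_method = ['IPI', 'UPI', 'Assisted Installer']
install_mode = ['Connected', 'Disconnected']
deployment_pattern = ['SingleNode', 'SingleNode+', '3C2W', '3C3I2W', '3CW', '3CW3I']
arch = ['x86_64', 'S390', 'ARM', 'Power']

# Each dimension appears exactly once (assumption: the duplicate was a typo).
dimensions = [versions, iaas_providers, install_method, install_mode,
              deployment_pattern, arch]
combinations = list(product(*dimensions))
print(len(combinations))  # 5 * 11 * 3 * 2 * 6 * 4 = 7920
```

With the duplicate removed the matrix has 7,920 combinations; multiplying by the 11 providers a second time gives the 87,120 reported above.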
[ART PR BUILD NOTIFIER]
Distgit: ose-must-gather This PR has been included in build ose-must-gather-container-v4.18.0-202409190709.p0.gab95e6a.assembly.stream.el9. All builds following this will include this PR.
/cherry-pick release-4.17
@oarribas: new pull request created: #443
Collect data from OSUS operator if installed in the cluster.