OCPBUGS-34360: Run ppc node collection in parallel

MarSik commented 4 months ago

The PPC node data collection was running in serial and that increased the time needed to collect the whole cluster. This change executes each per-node sequence in parallel and waits for all the sub tasks.

The only disadvantage is having a per-node process (potentially hundreds in huge clusters), however this case has always been lower in the code as well. The load on the controlling must-gather pod should be low as it only executes remote commands and waits for results.

This can be tested (at least while the PR is open) by executing:

oc adm must-gather --image quay.io/msivak/must-gather:test416c

Thanks to our scale testing colleagues we have a comparison between the stock and the parallelized runs using OCP 4.16.0-rc.3 on a cluster with 120 worker nodes.

stock:

$ time oc adm must-gather
real    48m39.426s
user    0m30.998s
sys     0m14.527s
file size:5969MB

with PR applied:

$ time oc adm must-gather --image quay.io/msivak/must-gather:test416c
[must-gather      ] OUT 2024-06-06T00:55:35.965950772Z Using must-gather plug-in image: quay.io/msivak/must-gather:test416c
real    8m16.947s
user    0m34.009s
sys     0m16.852s
file size:6993MB (I checked, the size is fine, one extra hour of run time generated some extra logs on the cluster)

openshift-ci-robot commented 4 months ago

@MarSik: This pull request references Jira Issue OCPBUGS-34360, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

* bug is open, matching expected state (open) * bug target version (4.17.0) matches configured target version for branch (4.17.0) * bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @yliu127

The bug has been updated to refer to the pull request using the external bug tracker.

In response to [this](https://github.com/openshift/must-gather/pull/424): >The PPC node data collection was running in serial and that increased the time needed to collect the whole cluster. This change executes each per-node sequence in parallel and waits for all the sub tasks. > >The only disadvantage is having a per-node process (potentially hundreds in huge clusters), however this case has always been lower in the code as well. The load on the controlling must-gather pod should be low as it only executes remote commands and waits for results. Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fmust-gather). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.

openshift-ci[bot] commented 4 months ago

Skipping CI for Draft Pull Request. If you want CI signal for your change, please convert it to an actual PR. You can still manually trigger a test run with /test all

openshift-ci-robot commented 4 months ago

@MarSik: This pull request references Jira Issue OCPBUGS-34360, which is valid.

3 validation(s) were run on this bug

* bug is open, matching expected state (open) * bug target version (4.17.0) matches configured target version for branch (4.17.0) * bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @yliu127

In response to [this](https://github.com/openshift/must-gather/pull/424): >The PPC node data collection was running in serial and that increased the time needed to collect the whole cluster. This change executes each per-node sequence in parallel and waits for all the sub tasks. > >The only disadvantage is having a per-node process (potentially hundreds in huge clusters), however this case has always been lower in the code as well. The load on the controlling must-gather pod should be low as it only executes remote commands and waits for results. > >This can be tested (at least while the PR is open) by executing: > >``` >oc adm must-gather --image quay.io/msivak/must-gather:test416b >``` Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fmust-gather). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.

MarSik commented 3 months ago

@ffromani Even the simple change makes quite a difference.

ffromani commented 3 months ago

/lgtm

MarSik commented 3 months ago

/assign davemulford

MarSik commented 3 months ago

@sferich888 @davemulford Folks, who can provide the last missing approve here?

sferich888 commented 3 months ago

/lgtm

openshift-ci[bot] commented 3 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ffromani, MarSik, sferich888

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[collection-scripts/OWNERS](https://github.com/openshift/must-gather/blob/master/collection-scripts/OWNERS)~~ [sferich888] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment

openshift-ci-robot commented 3 months ago

/retest-required

Remaining retests: 0 against base HEAD 408ab8661885cc7eb2140f71f9da9bef5f303805 and 2 for PR HEAD b952bd35c19f553fbe19aaf4c1116a1d8ad1a4b4 in total

MarSik commented 3 months ago

/retest-required

openshift-ci[bot] commented 3 months ago

@MarSik: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).

openshift-ci-robot commented 3 months ago

@MarSik: Jira Issue OCPBUGS-34360: All pull requests linked via external trackers have merged:

openshift/must-gather#424

Jira Issue OCPBUGS-34360 has been moved to the MODIFIED state.

In response to [this](https://github.com/openshift/must-gather/pull/424): >The PPC node data collection was running in serial and that increased the time needed to collect the whole cluster. This change executes each per-node sequence in parallel and waits for all the sub tasks. > >The only disadvantage is having a per-node process (potentially hundreds in huge clusters), however this case has always been lower in the code as well. The load on the controlling must-gather pod should be low as it only executes remote commands and waits for results. > >This can be tested (at least while the PR is open) by executing: > >``` >oc adm must-gather --image quay.io/msivak/must-gather:test416c >``` > >Thanks to our scale testing colleagues we have a comparison between the stock and the parallelized runs using OCP 4.16.0-rc.3 on a cluster with 120 worker nodes. > >stock: > >``` >$ time oc adm must-gather >real 48m39.426s >user 0m30.998s >sys 0m14.527s >file size:5969MB >``` > >with PR applied: > >``` >$ time oc adm must-gather --image quay.io/msivak/must-gather:test416c >[must-gather ] OUT 2024-06-06T00:55:35.965950772Z Using must-gather plug-in image: quay.io/msivak/must-gather:test416c >real 8m16.947s >user 0m34.009s >sys 0m16.852s >file size:6993MB (I checked, the size is fine, one extra hour of run time generated some extra logs on the cluster) >``` Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fmust-gather). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.

MarSik commented 3 months ago

/cherry-pick release-4.16

openshift-cherrypick-robot commented 3 months ago

@MarSik: new pull request created: #426

In response to [this](https://github.com/openshift/must-gather/pull/424#issuecomment-2163197807): >/cherry-pick release-4.16 Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

openshift-bot commented 3 months ago

[ART PR BUILD NOTIFIER]

This PR has been included in build ose-must-gather-container-v4.17.0-202406121512.p0.gb8be7da.assembly.stream.el9 for distgit ose-must-gather. All builds following this will include this PR.

openshift / must-gather

OCPBUGS-34360: Run ppc node collection in parallel #424