openshift / assisted-installer-agent

Apache License 2.0
24 stars 81 forks source link

MGMT-18923: Refactor getGPUs #787

Closed mlorenzofr closed 1 month ago

mlorenzofr commented 2 months ago

These changes fix an issue in inventory when the GPUs are discovered on the system. Due to how ghw library works, only devices with a DRM interface are considered GPUs, which excludes devices like nvidia A100 or Habana Gaudi due to them not create that interface. inventory will now check all PCI devices looking for GPU devices. To provide more flexibility in the future, a new configuration option has been provided to inventory. The configuration can be set via configuration file or via INVENTORY_GPU_CONFIG environment variable (documentation and examples provided in the README)

The inventory binary has been built (make build) and tested on my own laptop, a beaker machine, a VM and also a physical machine with Nvidia A100 hardware. In all cases, the GPUs were detected correctly, and filtering via environment variable and configuration file was also successful.

List all the issues related to this PR

How was this code tested?

Checklist

Reviewers Checklist

openshift-ci-robot commented 2 months ago

@mlorenzofr: This pull request references MGMT-18923 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.18.0" version, but no target version was set.

In response to [this](https://github.com/openshift/assisted-installer-agent/pull/787): >These changes fix an issue in `inventory` when the GPUs are discovered on the system. Due to how ghw library works, only devices with a DRM interface are considered GPUs, which excludes devices like nvidia A100 or Habana Gaudi due to them not create that interface. `inventory` will now check all PCI devices looking for GPU devices. To provide more flexibility in the future, a new configuration option has been provided to `inventory`. The configuration can be set via configuration file or via **INVENTORY_GPU_CONFIG** environment variable (documentation and examples provided in the README) > >The `inventory` binary has been built (`make build`) and tested on my own laptop, a beaker machine, a VM and also a physical machine with Nvidia A100 hardware. In all cases, the GPUs were detected correctly, and filtering via environment variable and configuration file was also successful. > ># List all the issues related to this PR >- [ ] New Feature >- [x] Enhancement >- [x] Bug fix >- [x] Tests >- [x] Documentation >- [ ] CI/CD > ># How was this code tested? >- [ ] Reviewer's test appreciated >- [ ] Waiting for CI to do a full test run >- [x] Manual (Elaborate on how it was tested) >- [ ] No tests needed > ># Checklist >- [x] Title and description added to both, commit and PR. >- [ ] Relevant issues have been associated (see CONTRIBUTING guide) >- [ ] This change does not require a documentation update (docstring, docs, README, etc) >- [x] Does this change include unit-tests (note that code changes require unit-tests) > ># Reviewers Checklist >* Are the title and description (in both PR and commit) meaningful and clear? >* Is there a bug required (and linked) for this change? >* Please take a careful look if something is unclear in the code or in its workflow > Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fassisted-installer-agent). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
codecov[bot] commented 2 months ago

Codecov Report

Attention: Patch coverage is 73.23944% with 19 lines in your changes missing coverage. Please review.

Project coverage is 59.70%. Comparing base (0ae2fc8) to head (c89a326). Report is 4 commits behind head on master.

Files with missing lines Patch % Lines
src/inventory/gpu.go 76.47% 9 Missing and 7 partials :warning:
src/util/dependencies.go 0.00% 2 Missing :warning:
src/inventory/inventory.go 0.00% 1 Missing :warning:
Additional details and impacted files [![Impacted file tree graph](https://app.codecov.io/gh/openshift/assisted-installer-agent/pull/787/graphs/tree.svg?width=650&height=150&src=pr&token=ZYXZPU4167&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openshift)](https://app.codecov.io/gh/openshift/assisted-installer-agent/pull/787?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openshift) ```diff @@ Coverage Diff @@ ## master #787 +/- ## ========================================== + Coverage 59.60% 59.70% +0.10% ========================================== Files 74 74 Lines 3755 3809 +54 ========================================== + Hits 2238 2274 +36 - Misses 1356 1367 +11 - Partials 161 168 +7 ``` | [Flag](https://app.codecov.io/gh/openshift/assisted-installer-agent/pull/787/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openshift) | Coverage Δ | | |---|---|---| | [](https://app.codecov.io/gh/openshift/assisted-installer-agent/pull/787/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openshift) | `59.70% <73.23%> (?)` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openshift#carryforward-flags-in-the-pull-request-comment) to find out more. | [Files with missing lines](https://app.codecov.io/gh/openshift/assisted-installer-agent/pull/787?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openshift) | Coverage Δ | | |---|---|---| | [src/inventory/inventory.go](https://app.codecov.io/gh/openshift/assisted-installer-agent/pull/787?src=pr&el=tree&filepath=src%2Finventory%2Finventory.go&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openshift#diff-c3JjL2ludmVudG9yeS9pbnZlbnRvcnkuZ28=) | `57.14% <0.00%> (ø)` | | | [src/util/dependencies.go](https://app.codecov.io/gh/openshift/assisted-installer-agent/pull/787?src=pr&el=tree&filepath=src%2Futil%2Fdependencies.go&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openshift#diff-c3JjL3V0aWwvZGVwZW5kZW5jaWVzLmdv) | `0.00% <0.00%> (ø)` | | | [src/inventory/gpu.go](https://app.codecov.io/gh/openshift/assisted-installer-agent/pull/787?src=pr&el=tree&filepath=src%2Finventory%2Fgpu.go&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openshift#diff-c3JjL2ludmVudG9yeS9ncHUuZ28=) | `77.77% <76.47%> (-22.23%)` | :arrow_down: |
rccrdpccl commented 2 months ago

/lgtm

rccrdpccl commented 2 months ago

/lgtm

mlorenzofr commented 2 months ago

/restest

rccrdpccl commented 2 months ago

/lgtm

rccrdpccl commented 2 months ago

/lgtm

/hold

Looks good to me, @CrystalChun would u mind taking a quick look in case I am missing something?

rccrdpccl commented 2 months ago

/approve

openshift-ci[bot] commented 2 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mlorenzofr, rccrdpccl

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/openshift/assisted-installer-agent/blob/master/OWNERS)~~ [rccrdpccl] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
mlorenzofr commented 2 months ago

/cc @ori-amizur

mlorenzofr commented 2 months ago

Does the change allow to detect other PCI devices (non GPU)?

Yes, indeed Gaudi devices are FPGAs and classified under the processing accelerators branch (0x12 instead of 0x03). However they will be added to GPU list.

But for this development the requirements were to detect Nvidia and Gaudi cards as GPUs to find them in the inventory. RHEL AI uses them and detecting this hardware using the inventory would do some work easier.

The changes added here should:

mlorenzofr commented 1 month ago

/retest

openshift-ci[bot] commented 1 month ago

@mlorenzofr: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).
ori-amizur commented 1 month ago

/lgtm

rccrdpccl commented 1 month ago

/unhold

openshift-bot commented 1 month ago

[ART PR BUILD NOTIFIER]

Distgit: ose-agent-installer-node-agent This PR has been included in build ose-agent-installer-node-agent-container-v4.18.0-202410010938.p0.gaab6665.assembly.stream.el9. All builds following this will include this PR.