kubernetes / minikube

Run Kubernetes locally
https://minikube.sigs.k8s.io/
Apache License 2.0

Set `NVIDIA_DRIVER_CAPABILITIES` to `all` when GPU is enabled #19345

Closed · chubei-urus closed this 1 month ago

chubei-urus commented 2 months ago

fixes #19318
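The gist of the change can be sketched as follows (an illustrative sketch, not the actual minikube code; the `GPU_ENABLED`/`GPU_ARGS` variables are hypothetical stand-ins, though `--gpus` and `--env` are real `docker run` flags). When GPUs are enabled, the node container should also receive `NVIDIA_DRIVER_CAPABILITIES=all`, since the NVIDIA container runtime otherwise mounts driver libraries only for a default subset of capabilities, so graphics (e.g. Vulkan) workloads fail:

```shell
# Sketch: how the docker run invocation for the node container changes.
# GPU_ENABLED stands in for minikube's --gpus flag being set.
GPU_ENABLED=true
GPU_ARGS=""
if [ "$GPU_ENABLED" = true ]; then
  # Before this PR: only --gpus all. After: also export the env var so the
  # NVIDIA runtime exposes all capabilities (graphics, compute, video, ...),
  # not just the default compute/utility set.
  GPU_ARGS="--gpus all --env NVIDIA_DRIVER_CAPABILITIES=all"
fi
echo "docker run -d --privileged $GPU_ARGS gcr.io/k8s-minikube/kicbase"
```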

linux-foundation-easycla[bot] commented 2 months ago

CLA Signed


The committers listed above are authorized under a signed CLA.

k8s-ci-robot commented 2 months ago

Welcome @chubei-urus!

It looks like this is your first PR to kubernetes/minikube 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/minikube has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. :smiley:

k8s-ci-robot commented 2 months ago

Hi @chubei-urus. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

minikube-bot commented 2 months ago

Can one of the admins verify this patch?

chubei-urus commented 2 months ago

I'm new to the repo and don't know how this feature should be tested. Many thanks to anyone who can give some pointers!

medyagh commented 2 months ago

Thank you @chubei-urus for creating this PR. Do you mind sharing a before/after example of running a workload with this PR, and how you verified that it was NOT using the graphics card before this PR?

chubei-urus commented 2 months ago

Thank you for your quick reply. I'll create a minimal example.

medyagh commented 2 months ago

/ok-to-test

minikube-pr-bot commented 2 months ago

kvm2 driver with docker runtime

+----------------+----------+---------------------+
|    COMMAND     | MINIKUBE | MINIKUBE (PR 19345) |
+----------------+----------+---------------------+
| minikube start | 49.8s    | 49.4s               |
| enable ingress | 26.5s    | 25.0s               |
+----------------+----------+---------------------+
Times for minikube start: 52.0s 46.2s 49.2s 50.8s 50.7s
Times for minikube (PR 19345) start: 51.2s 49.4s 50.6s 48.4s 47.5s
Times for minikube ingress: 29.0s 27.0s 24.9s 27.0s 24.4s
Times for minikube (PR 19345) ingress: 27.5s 24.9s 23.9s 24.9s 24.0s

docker driver with docker runtime

+----------------+----------+---------------------+
|    COMMAND     | MINIKUBE | MINIKUBE (PR 19345) |
+----------------+----------+---------------------+
| minikube start | 23.1s    | 22.2s               |
| enable ingress | 21.4s    | 22.1s               |
+----------------+----------+---------------------+
Times for minikube start: 23.9s 23.6s 23.2s 20.9s 23.8s
Times for minikube (PR 19345) start: 21.5s 22.4s 21.4s 21.8s 24.0s
Times for minikube ingress: 21.2s 21.7s 21.3s 21.7s 21.2s
Times for minikube (PR 19345) ingress: 22.7s 21.8s 21.7s 21.7s 22.7s

docker driver with containerd runtime

+----------------+----------+---------------------+
|    COMMAND     | MINIKUBE | MINIKUBE (PR 19345) |
+----------------+----------+---------------------+
| minikube start | 21.3s    | 21.6s               |
| enable ingress | 48.2s    | 48.1s               |
+----------------+----------+---------------------+
Times for minikube start: 22.8s 20.8s 19.9s 23.6s 19.6s
Times for minikube (PR 19345) start: 19.9s 20.0s 22.6s 23.0s 22.7s
Times for minikube ingress: 48.3s 48.2s 48.2s 48.2s 48.2s
Times for minikube (PR 19345) ingress: 48.2s 48.2s 48.3s 48.2s 47.8s

minikube-pr-bot commented 2 months ago

Here are the top 10 failed tests in each environment with the lowest flake rates.

Environment | Test Name | Flake Rate

Besides these, the following environments also have failed tests:

To see the flake rates of all tests by environment, click here.

chubei-urus commented 2 months ago

Steps

  1. Follow https://minikube.sigs.k8s.io/docs/tutorials/nvidia/ to set up GPU support with the docker driver
  2. `minikube start --gpus all`
  3. Create `vulkan.yaml` with the following content.
    apiVersion: v1
    kind: Pod
    metadata:
      name: vulkan
    spec:
      containers:
      - name: vulkan
        env:
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: "graphics"
        image: dualvtable/vulkan-sample
        resources:
          limits:
            nvidia.com/gpu: 1
      restartPolicy: Never
  4. `kubectl apply -f vulkan.yaml`
  5. Wait for the container to finish, then `kubectl logs vulkan`
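The steps above can be collected into a small script (a sketch: by default it only prints the commands, since they need a GPU-capable docker host; the `kubectl wait --for=jsonpath=...` form is standard kubectl, but the 5-minute timeout is an arbitrary choice):

```shell
#!/bin/sh
# Dry-run sketch of the reproduction steps. Set RUN=1 to actually execute them.
run() {
  if [ "${RUN:-0}" = 1 ]; then "$@"; else echo "+ $*"; fi
}

run minikube start --gpus all
run kubectl apply -f vulkan.yaml
# The pod runs to completion (restartPolicy: Never), so wait for phase
# Succeeded before reading its logs.
run kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/vulkan --timeout=5m
run kubectl logs vulkan
```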

Before

The logs look like:

computeheadless: /build/Vulkan/examples/computeheadless/computeheadless.cpp:181: VulkanExample::VulkanExample(): Assertion `res == VK_SUCCESS' failed.
/build/entrypoint.sh: line 4:    14 Done                    echo 'y'
        15 Aborted                 (core dumped) | ${EXAMPLES}/$i
\n
renderheadless: /build/Vulkan/examples/renderheadless/renderheadless.cpp:211: VulkanExample::VulkanExample(): Assertion `res == VK_SUCCESS' failed.
/build/entrypoint.sh: line 4:    16 Done                    echo 'y'
        17 Aborted                 (core dumped) | ${EXAMPLES}/$i
\n

After

The logs look like:

Running headless compute example
GPU: NVIDIA GeForce RTX 4060 Laptop GPU
Compute input:
0       1       2       3       4       5       6       7       8       9       10      11      12      13      14      15      16      17      18      19      20      21      22      23      24      25      26      27      28      29      30      31 
Compute output:
0       1       1       2       3       5       8       13      21      34      55      89      144     233     377     610     987     1597    2584    4181    6765    10946   17711   28657   46368   75025   121393  196418  317811  514229  832040  1346269 
Finished. Press enter to terminate...\n
Running headless rendering example
GPU: NVIDIA GeForce RTX 4060 Laptop GPU
Framebuffer image saved to headless.ppm
Finished. Press enter to terminate...\n

Tested on

(base) bei@bei-urus:~/minikube$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 24.04 LTS
Release:        24.04
Codename:       noble
(base) bei@bei-urus:~/minikube$ nvidia-smi 
Tue Jul 30 10:08:20 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4060 ...    Off | 00000000:01:00.0  On |                  N/A |
| N/A   42C    P4              10W /  35W |    827MiB /  8188MiB |      2%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2613      G   /usr/lib/xorg/Xorg                          271MiB |
|    0   N/A  N/A      2942      G   /usr/bin/gnome-shell                        172MiB |
|    0   N/A  N/A      3914      G   ...yOnDemand --variations-seed-version       91MiB |
|    0   N/A  N/A      4839      G   ...seed-version=20240729-050126.230000      109MiB |
|    0   N/A  N/A      6026      G   ...erProcess --variations-seed-version      137MiB |
+---------------------------------------------------------------------------------------+

Note that this is not the workload I was running, but I believe it shows the same issue.

medyagh commented 1 month ago

@chubei-urus I could merge this PR, and if you like I would love to see a follow-up adding an integration test: https://github.com/kubernetes/minikube/issues/19486

medyagh commented 1 month ago

/lgtm

k8s-ci-robot commented 1 month ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chubei-urus, medyagh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/kubernetes/minikube/blob/master/OWNERS)~~ [medyagh]

Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.

chubei-urus commented 1 month ago

Thank you! I'd like to add an integration test but have been busy with other things.