manuelbuil closed this 3 weeks ago
Reproduced using VERSION=v1.31.1+k3s1
Validated using COMMIT=221ab22ca911b548d7278afb0df7fca17d2fe596
Infrastructure
AWS EC2 p3.2xlarge instance
00:1e.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
sudo nvidia-smi
Mon Oct 21 23:02:38 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-SXM2-16GB           Off |   00000000:00:1E.0 Off |                    0 |
| N/A   35C    P0             25W /  300W |       1MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
(The process list will change quickly when testing with the vector-add image.)
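For reference, a vector-add check like the one mentioned above can be run as a one-shot pod. The manifest below is a minimal sketch, not the exact test used in this validation: the pod name, the `nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0` image tag, and the presence of a RuntimeClass named `nvidia` (for example, one set up by the NVIDIA GPU Operator) are all assumptions.

```yaml
# Hypothetical test pod: the name and image tag are assumptions, not taken from this report.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  # Run the sample once and do not restart it after it exits successfully.
  restartPolicy: OnFailure
  # Assumes a RuntimeClass named "nvidia" exists on the cluster.
  runtimeClassName: nvidia
  containers:
  - name: cuda-vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
    resources:
      limits:
        # Request one GPU from the NVIDIA device plugin.
        nvidia.com/gpu: 1
```

After applying it with `kubectl apply -f <file>`, the result can be checked with `kubectl logs cuda-vectoradd`. While the pod runs, its CUDA process should briefly appear in the `nvidia-smi` process list.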
Node(s) CPU architecture, OS, and version:
Linux 6.4.0-150600.23.17-default x86_64 GNU/Linux
PRETTY_NAME="SUSE Linux Enterprise Server 15 SP6"
Cluster Configuration:
NAME          STATUS   ROLES                       AGE   VERSION
ip-1-1-1-23   Ready    control-plane,etcd,master   53m   v1.31.1+k3s-221ab22c
Config.yaml:
node-external-ip: 1.1.1.23
token: YOUR_TOKEN_HERE
write-kubeconfig-mode: 644
debug: true
cluster-init: true
embedded-registry: true
Backport fix for Nvidia operator not working correctly #11087