Closed Zhitao-Li closed 4 months ago
Hey, can you tell us more about the issues you had on your newer GPU? And on which GPU did you encounter them?
We are using the current NVCCFLAGS on GPUs from the end of 2022, and see no issues there.
Hi, so we have 2 machines with the same generation of Nvidia GPUs: one of them is running CentOS 7, and the other one is running Ubuntu 22.04.3 LTS.
The CentOS 7 one is the one with the problem. I also have to modify the Makefile to include lapacke. I understand that CentOS is an outdated OS at this point, but I thought I should let you guys know anyway.
On the CentOS machine, the gcc is 11.4.0, GPU driver is 515.65.01, CUDA is 11.7.
Thanks!
Hmm, do you mind posting the output of nvidia-smi and nvidia-smi -q on a system where you had to change the NVCCFLAGS? That might help us understand the underlying issue.
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    On   | 00000000:17:00.0 Off |                  Off |
| 30%   44C    P2    98W / 300W |  25100MiB / 49140MiB |     22%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000    On   | 00000000:31:00.0 Off |                  Off |
| 30%   58C    P2   164W / 300W |  31131MiB / 49140MiB |     73%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A6000    On   | 00000000:B1:00.0 Off |                  Off |
| 30%   28C    P8    31W / 300W |  12952MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A6000    On   | 00000000:CA:00.0 Off |                  Off |
| 30%   41C    P2    97W / 300W |  19541MiB / 49140MiB |     40%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
```
Timestamp : Tue Apr 2 16:47:53 2024
Driver Version : 515.65.01
CUDA Version : 11.7
Attached GPUs : 4
GPU 00000000:17:00.0
Product Name : NVIDIA RTX A6000
Product Brand : NVIDIA RTX
Product Architecture : Ampere
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Enabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1324521059351
GPU UUID : GPU-ca78cd83-364f-e645-8bd2-33390e5a7aef
Minor Number : 0
VBIOS Version : 94.02.5C.00.02
MultiGPU Board : No
Board ID : 0x1700
GPU Part Number : 900-5G133-1700-000
Module ID : 0
Inforom Version
Image Version : G133.0500.00.05
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x17
Device : 0x00
Domain : 0x0000
Device Id : 0x223010DE
Bus Id : 00000000:17:00.0
Sub System Id : 0x145910DE
GPU Link Info
PCIe Generation
Max : 4
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 30 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 49140 MiB
Reserved : 454 MiB
Used : 26786 MiB
Free : 21899 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 8 MiB
Free : 248 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Disabled
Pending : Disabled
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 192 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 30 C
GPU Shutdown Temp : 98 C
GPU Slowdown Temp : 95 C
GPU Max Operating Temp : 93 C
GPU Target Temperature : 84 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 29.92 W
Power Limit : 300.00 W
Default Power Limit : 300.00 W
Enforced Power Limit : 300.00 W
Min Power Limit : 100.00 W
Max Power Limit : 300.00 W
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 405 MHz
Video : 555 MHz
Applications Clocks
Graphics : 1800 MHz
Memory : 8001 MHz
Default Applications Clocks
Graphics : 1800 MHz
Memory : 8001 MHz
Max Clocks
Graphics : 2100 MHz
SM : 2100 MHz
Memory : 8001 MHz
Video : 1950 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 750.000 mV
Processes
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 57888
Type : C
Name : python
Used GPU Memory : 3151 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 67599
Type : C
Name : /usr/local/MATLAB/R2022b/bin/glnxa64/MATLAB
Used GPU Memory : 4997 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 126377
Type : C
Name : /usr/local/MATLAB/R2022b/bin/glnxa64/MATLAB
Used GPU Memory : 18633 MiB
GPU 00000000:31:00.0
Product Name : NVIDIA RTX A6000
Product Brand : NVIDIA RTX
Product Architecture : Ampere
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Enabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1324521012524
GPU UUID : GPU-676b4b2c-cd66-9d9f-f771-5494e6192f41
Minor Number : 1
VBIOS Version : 94.02.5C.00.02
MultiGPU Board : No
Board ID : 0x3100
GPU Part Number : 900-5G133-1700-000
Module ID : 0
Inforom Version
Image Version : G133.0500.00.05
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x31
Device : 0x00
Domain : 0x0000
Device Id : 0x223010DE
Bus Id : 00000000:31:00.0
Sub System Id : 0x145910DE
GPU Link Info
PCIe Generation
Max : 4
Current : 4
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 5000 KB/s
Rx Throughput : 27000 KB/s
Fan Speed : 30 %
Performance State : P2
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 49140 MiB
Reserved : 454 MiB
Used : 31087 MiB
Free : 17598 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 6 MiB
Free : 250 MiB
Compute Mode : Default
Utilization
Gpu : 25 %
Memory : 13 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Disabled
Pending : Disabled
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 192 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 50 C
GPU Shutdown Temp : 98 C
GPU Slowdown Temp : 95 C
GPU Max Operating Temp : 93 C
GPU Target Temperature : 84 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 103.47 W
Power Limit : 300.00 W
Default Power Limit : 300.00 W
Enforced Power Limit : 300.00 W
Min Power Limit : 100.00 W
Max Power Limit : 300.00 W
Clocks
Graphics : 1800 MHz
SM : 1800 MHz
Memory : 7600 MHz
Video : 1590 MHz
Applications Clocks
Graphics : 1800 MHz
Memory : 8001 MHz
Default Applications Clocks
Graphics : 1800 MHz
Memory : 8001 MHz
Max Clocks
Graphics : 2100 MHz
SM : 2100 MHz
Memory : 8001 MHz
Video : 1950 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 931.250 mV
Processes
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 57888
Type : C
Name : python
Used GPU Memory : 8595 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 107143
Type : C
Name : /usr/local/MATLAB/R2022b/bin/glnxa64/MATLAB
Used GPU Memory : 22489 MiB
GPU 00000000:B1:00.0
Product Name : NVIDIA RTX A6000
Product Brand : NVIDIA RTX
Product Architecture : Ampere
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Enabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1324521060624
GPU UUID : GPU-40ce844c-3875-5de5-f4e3-dbbbe31092d7
Minor Number : 2
VBIOS Version : 94.02.5C.00.02
MultiGPU Board : No
Board ID : 0xb100
GPU Part Number : 900-5G133-1700-000
Module ID : 0
Inforom Version
Image Version : G133.0500.00.05
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0xB1
Device : 0x00
Domain : 0x0000
Device Id : 0x223010DE
Bus Id : 00000000:B1:00.0
Sub System Id : 0x145910DE
GPU Link Info
PCIe Generation
Max : 4
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 30 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 49140 MiB
Reserved : 454 MiB
Used : 12952 MiB
Free : 35733 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 4 MiB
Free : 252 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Disabled
Pending : Disabled
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 192 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 29 C
GPU Shutdown Temp : 98 C
GPU Slowdown Temp : 95 C
GPU Max Operating Temp : 93 C
GPU Target Temperature : 84 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 31.81 W
Power Limit : 300.00 W
Default Power Limit : 300.00 W
Enforced Power Limit : 300.00 W
Min Power Limit : 100.00 W
Max Power Limit : 300.00 W
Clocks
Graphics : 0 MHz
SM : 0 MHz
Memory : 405 MHz
Video : 555 MHz
Applications Clocks
Graphics : 1800 MHz
Memory : 8001 MHz
Default Applications Clocks
Graphics : 1800 MHz
Memory : 8001 MHz
Max Clocks
Graphics : 2100 MHz
SM : 2100 MHz
Memory : 8001 MHz
Video : 1950 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 0.000 mV
Processes
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 126233
Type : C
Name : /opt/anaconda3/envs/torch-env2/bin/python
Used GPU Memory : 12949 MiB
GPU 00000000:CA:00.0
Product Name : NVIDIA RTX A6000
Product Brand : NVIDIA RTX
Product Architecture : Ampere
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Enabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1324521060341
GPU UUID : GPU-83f78936-a6e1-3cb8-b6ab-41dd1af7bbd9
Minor Number : 3
VBIOS Version : 94.02.5C.00.02
MultiGPU Board : No
Board ID : 0xca00
GPU Part Number : 900-5G133-1700-000
Module ID : 0
Inforom Version
Image Version : G133.0500.00.05
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0xCA
Device : 0x00
Domain : 0x0000
Device Id : 0x223010DE
Bus Id : 00000000:CA:00.0
Sub System Id : 0x145910DE
GPU Link Info
PCIe Generation
Max : 4
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 30 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 49140 MiB
Reserved : 454 MiB
Used : 19389 MiB
Free : 29296 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 6 MiB
Free : 250 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Disabled
Pending : Disabled
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 192 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 28 C
GPU Shutdown Temp : 98 C
GPU Slowdown Temp : 95 C
GPU Max Operating Temp : 93 C
GPU Target Temperature : 84 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 20.79 W
Power Limit : 300.00 W
Default Power Limit : 300.00 W
Enforced Power Limit : 300.00 W
Min Power Limit : 100.00 W
Max Power Limit : 300.00 W
Clocks
Graphics : 0 MHz
SM : 0 MHz
Memory : 405 MHz
Video : 555 MHz
Applications Clocks
Graphics : 1800 MHz
Memory : 8001 MHz
Default Applications Clocks
Graphics : 1800 MHz
Memory : 8001 MHz
Max Clocks
Graphics : 2100 MHz
SM : 2100 MHz
Memory : 8001 MHz
Video : 1950 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 0.000 mV
Processes
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 8753
Type : C
Name : /opt/anaconda3/envs/torch-env2/bin/python
Used GPU Memory : 981 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 75196
Type : C
Name : /usr/local/MATLAB/R2022b/bin/glnxa64/MATLAB
Used GPU Memory : 18405 MiB
I forgot to include the above nvidia-smi -q output.
Very strange, we have GPUs that are also from that time frame and do not need to change the NVCCFLAGS. Normally, CUDA should handle that by default and generate appropriate code, and your CUDA version definitely supports the A6000.
Still, I think we should leave this to CUDA by default.
By the way, you do not need to edit the Makefile for this: if you create a file called Makefile.local with contents
NVCCFLAGS += -gencode arch=compute_80,code=sm_80
it should work as well.
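A minimal shell sketch of that tip, for anyone who wants to script it. The function name `add_local_nvccflags` is hypothetical and not part of the project; it only assumes, as stated above, that the build picks up a Makefile.local placed next to the Makefile.

```shell
#!/bin/sh
# Hypothetical helper (not part of the project): append the flag from the
# comment above to Makefile.local in the given directory, unless that exact
# line is already present.
add_local_nvccflags() {
    dir="$1"
    line='NVCCFLAGS += -gencode arch=compute_80,code=sm_80'
    file="$dir/Makefile.local"
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
    if [ -f "$file" ] && grep -qxF -e "$line" "$file"; then
        return 0
    fi
    printf '%s\n' "$line" >> "$file"
}
```

Calling `add_local_nvccflags .` from the source directory and then running `make` should be equivalent to creating the file by hand. Running it twice does not duplicate the line.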
Thanks for the tip.
I think it might have to do with CentOS as well. It's just old and weird at this point.
Just for your information, some newer GPUs need the following line in order to work.
NVCCFLAGS += -gencode arch=compute_80,code=sm_80
I am not familiar enough with Makefiles to implement a more flexible solution, so I leave that to you guys.
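One possible direction for a more flexible approach, sketched in shell rather than in the Makefile itself, is to derive the flag from the compute capability that the driver reports. This is only a sketch under assumptions: the function name `gencode_flag` is hypothetical, and the `compute_cap` query field used in the commented usage is assumed to be supported by the installed nvidia-smi (older drivers may not have it). Note that an RTX A6000 reports compute capability 8.6, so this would produce `sm_86` rather than the `sm_80` shown above. Code built for `sm_80` also runs on 8.6 devices, which is consistent with the flag above working.

```shell
#!/bin/sh
# Hypothetical helper (not part of the project): turn a compute capability
# string such as "8.6" into the matching nvcc -gencode flag.
gencode_flag() {
    # Drop the dot and any stray spaces: "8.6" -> "86"
    cap=$(printf '%s' "$1" | tr -d '. ')
    case "$cap" in
        ''|*[!0-9]*)
            echo "unexpected compute capability: $1" >&2
            return 1
            ;;
    esac
    printf '%s\n' "-gencode arch=compute_${cap},code=sm_${cap}"
}

# Example use (assumes nvidia-smi supports the compute_cap query field):
#   cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)
#   echo "NVCCFLAGS += $(gencode_flag "$cap")" > Makefile.local
```

The helper takes the capability as an argument instead of calling nvidia-smi itself, so it can also be used when building on a machine without a GPU.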