tkestack / vcuda-controller

ECC status shows ERR! in vcuda container #7

Closed Eric-918 closed 4 years ago

Eric-918 commented 4 years ago

Test environment: GPU GN7 (T4) on Tencent Cloud, k8s 1.16, gpu-manager image ccr.ccs.tencentyun.com/tkeimages/gpu-manager:latest

nvidia-smi result: [image]

nvidia-smi -q -i 0 result: [image]

More details:

root@gg-d678bdd9d-5wnfh:/usr/local/apache2# nvidia-smi -q -i 0

==============NVSMI LOG==============

Timestamp                           : Mon Jul 20 08:48:45 2020
Driver Version                      : 418.67
CUDA Version                        : 10.1

Attached GPUs                       : 1
GPU 00000000:00:08.0
    Product Name                    : Tesla T4
    Product Brand                   : Tesla
    Display Mode                    : Enabled
    Display Active                  : Disabled
    Persistence Mode                : Enabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 4000
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 0325118127917
    GPU UUID                        : GPU-276df694-0844-f357-45d1-e6a1ee805e17
    Minor Number                    : 0
    VBIOS Version                   : 90.04.38.00.03
    MultiGPU Board                  : No
    Board ID                        : 0x8
    GPU Part Number                 : 900-2G183-0000-001
    Inforom Version
        Image Version               : G183.0200.00.02
        OEM Object                  : 1.1
        ECC Object                  : 5.0
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization mode         : Pass-Through
    IBMNPU
        Relaxed Ordering Mode       : N/A
    PCI
        Bus                         : 0x00
        Device                      : 0x08
        Domain                      : 0x0000
        Device Id                   : 0x1EB810DE
        Bus Id                      : 00000000:00:08.0
        Sub System Id               : 0x12A210DE
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 1
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays Since Reset         : 0
        Replay Number Rollovers     : 0
        Tx Throughput               : 0 KB/s
        Rx Throughput               : 0 KB/s
    Fan Speed                       : N/A
    Performance State               : P8
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
            HW Thermal Slowdown     : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active
        Display Clock Setting       : Not Active
    FB Memory Usage
        Total                       : 15079 MiB
        Used                        : 0 MiB
        Free                        : 15079 MiB
    BAR1 Memory Usage
        Total                       : 256 MiB
        Used                        : 2 MiB
        Free                        : 254 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Encoder Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    FBC Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    Ecc Mode
        Current                     : Function Not Found
        Pending                     : Function Not Found
    ECC Errors
        Volatile
            SRAM Correctable        : N/A
            SRAM Uncorrectable      : N/A
            DRAM Correctable        : N/A
            DRAM Uncorrectable      : N/A
        Aggregate
            SRAM Correctable        : N/A
            SRAM Uncorrectable      : N/A
            DRAM Correctable        : N/A
            DRAM Uncorrectable      : N/A
    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending                     : No
    Temperature
        GPU Current Temp            : 32 C
        GPU Shutdown Temp           : 96 C
        GPU Slowdown Temp           : 93 C
        GPU Max Operating Temp      : 85 C
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A
    Power Readings
        Power Management            : Supported
        Power Draw                  : 9.54 W
        Power Limit                 : 70.00 W
        Default Power Limit         : 70.00 W
        Enforced Power Limit        : 70.00 W
        Min Power Limit             : 60.00 W
        Max Power Limit             : 70.00 W
    Clocks
        Graphics                    : 300 MHz
        SM                          : 300 MHz
        Memory                      : 405 MHz
        Video                       : 540 MHz
    Applications Clocks
        Graphics                    : 585 MHz
        Memory                      : 5001 MHz
    Default Applications Clocks
        Graphics                    : 585 MHz
        Memory                      : 5001 MHz
    Max Clocks
        Graphics                    : 1590 MHz
        SM                          : 1590 MHz
        Memory                      : 5001 MHz
        Video                       : 1470 MHz
    Max Customer Boost Clocks
        Graphics                    : 1590 MHz
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes                       : None
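
In a report this long, the fields that failed are easy to miss. The following is a minimal Python sketch, not part of the original report, for listing every field whose value is "Function Not Found". It assumes the text has the indented "name : value" layout that nvidia-smi -q normally prints; the helper name find_unsupported_fields is hypothetical. It only locates the affected fields and says nothing about why the query failed inside the container.

```python
def find_unsupported_fields(report, marker="Function Not Found"):
    """Return the section paths of all fields whose value equals `marker`.

    Assumes `report` is indented `nvidia-smi -q` style text, where section
    headers have no " : " separator and fields look like "Name   : value".
    """
    stack = []  # (indent, name) of the section headers enclosing the current line
    hits = []
    for line in report.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("="):
            continue  # skip blank lines and the "====NVSMI LOG====" banner
        indent = len(line) - len(line.lstrip())
        # Leave any sections that are not parents of this line.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        key, sep, value = stripped.partition(" : ")
        key = key.strip()
        if not sep:
            stack.append((indent, key))  # a section header such as "Ecc Mode"
        elif value.strip() == marker:
            path = [name for _, name in stack] + [key]
            hits.append(" / ".join(path))
    return hits


if __name__ == "__main__":
    sample = (
        "GPU 00000000:00:08.0\n"
        "    Product Name                    : Tesla T4\n"
        "    Ecc Mode\n"
        "        Current                     : Function Not Found\n"
        "        Pending                     : Function Not Found\n"
    )
    for field in find_unsupported_fields(sample):
        print(field)
```

To check a live system, the output of nvidia-smi -q -i 0 could be captured (for example with subprocess.run) and passed to the same function.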

mYmNeo commented 4 years ago

You should use tkestack/gpu-manager, not ccr.ccs.tencentyun.com/tkeimages/gpu-manager:latest.