preda / gpuowl

GPU Mersenne primality test.
GNU General Public License v3.0

Card0 iGPU lacks uniqueid file #167

Closed valeriob01 closed 8 months ago

valeriob01 commented 4 years ago

What rocm-smi outputs:

/opt/rocm-3.3.0/bin/rocm-smi --showuniqueid

========================ROCm System Management Interface========================
================================================================================
GPU[1]      : Unique ID: 592c190172fd5d40
GPU[2]      : Unique ID: c6f220c172dc76bb
================================================================================
==============================End of ROCm SMI Log ==============================

What gpuOwl reads:

 0  : gfx906+sram-ecc-Vega 20 [Radeon VII] AMD
 1 592c190172fd5d40 : gfx906+sram-ecc-Vega 20 [Radeon VII] AMD
valeriob01 commented 4 years ago

It seems that in the gpuowl output GPU[1] is device 1, but GPU[2] is device 0.

valeriob01 commented 4 years ago

Listing cardN/device shows they differ: card0 lacks the unique_id file, because the cards are numbered differently than gpuowl assumes. Look here, this system has 2 Radeon VII cards:

```
/sys/class/drm# ls card0/device
ari_enabled  class  current_link_speed  device  driver_override  firmware_node
i2c-2  irq  local_cpulist  max_link_width  msi_irqs  remove  resource
resource2_wc  rom  subsystem_vendor  boot_vga  config  current_link_width
dma_mask_bits  drm  graphics  i2c-3  label  local_cpus  modalias  numa_node
rescan  resource0  resource4  subsystem  uevent

/sys/class/drm# ls card1/device
aer_dev_correctable  config  dma_mask_bits  hwmon  local_cpulist
mem_info_vis_vram_total  msi_irqs  pp_cur_state  pp_features  product_name
resource0_wc  serial_number  vendor  aer_dev_fatal  consistent_dma_mask_bits
driver  i2c-10  local_cpus  mem_info_vis_vram_used  numa_node  pp_dpm_dcefclk
pp_force_state  product_number  resource2  subsystem  aer_dev_nonfatal
current_link_speed  driver_override  i2c-4  max_link_speed  mem_info_vram_total
pcie_bw  pp_dpm_fclk  pp_mclk_od  remove  resource2_wc  subsystem_device
ari_enabled  current_link_width  drm  i2c-6  max_link_width  mem_info_vram_used
pcie_replay_count  pp_dpm_mclk  pp_num_states  rescan  resource4
subsystem_vendor  boot_vga  d3cold_allowed  enable  i2c-8  mem_busy_percent
mem_info_vram_vendor  power  pp_dpm_pcie  pp_power_profile_mode  reset
resource5  uevent  broken_parity_status  device  fw_version  irq
mem_info_gtt_total  modalias  power_dpm_force_performance_level  pp_dpm_sclk
pp_sclk_od  resource  revision  unique_id  class  df_cntr_avail
gpu_busy_percent  link  mem_info_gtt_used  msi_bus  power_dpm_state
pp_dpm_socclk  pp_table  resource0  rom  vbios_version

/sys/class/drm# ls card2/device
aer_dev_correctable  config  dma_mask_bits  hwmon  local_cpulist
mem_info_vis_vram_total  msi_irqs  pp_cur_state  pp_features  product_name
resource0_wc  serial_number  vendor  aer_dev_fatal  consistent_dma_mask_bits
driver  i2c-12  local_cpus  mem_info_vis_vram_used  numa_node  pp_dpm_dcefclk
pp_force_state  product_number  resource2  subsystem  aer_dev_nonfatal
current_link_speed  driver_override  i2c-14  max_link_speed  mem_info_vram_total
pcie_bw  pp_dpm_fclk  pp_mclk_od  remove  resource2_wc  subsystem_device
ari_enabled  current_link_width  drm  i2c-16  max_link_width  mem_info_vram_used
pcie_replay_count  pp_dpm_mclk  pp_num_states  rescan  resource4
subsystem_vendor  boot_vga  d3cold_allowed  enable  i2c-18  mem_busy_percent
mem_info_vram_vendor  power  pp_dpm_pcie  pp_power_profile_mode  reset
resource5  uevent  broken_parity_status  device  fw_version  irq
mem_info_gtt_total  modalias  power_dpm_force_performance_level  pp_dpm_sclk
pp_sclk_od  resource  revision  unique_id  class  df_cntr_avail
gpu_busy_percent  link  mem_info_gtt_used  msi_bus  power_dpm_state
pp_dpm_socclk  pp_table  resource0  rom  vbios_version
```

/sys/class/drm# ls
amdttm  card0  card0-DP-1  card0-HDMI-A-1  card1  card1-DP-2  card1-DP-3  card1-DP-4  card1-HDMI-A-2  card2  card2-DP-5  card2-DP-6  card2-DP-7  card2-HDMI-A-3  renderD128  renderD129  renderD130  version

there is card0, card1, card2:

root@sel:/sys/class/drm# cat card0/device/unique_id
cat: card0/device/unique_id: No such file or directory
root@sel:/sys/class/drm# cat card1/device/unique_id
592c190172fd5d40
root@sel:/sys/class/drm# cat card2/device/unique_id
c6f220c172dc76bb

Thus, to read all the unique IDs correctly you must start at 1, not at 0.
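The skip logic implied here can be sketched as a small mapping from a zero-based discrete-GPU index to a DRM card number. This is an illustrative sketch, not gpuowl code; the unique_id check is passed in as a predicate so the mapping can be exercised without real sysfs (in practice it would test for `/sys/class/drm/cardN/device/unique_id`):

```cpp
#include <functional>

// Map a zero-based "discrete GPU" index to a /sys/class/drm/cardN number,
// skipping any card that lacks a unique_id file (e.g. an iGPU at card0).
// hasUniqueId is a predicate so this logic is testable without sysfs.
int cardForIndex(int seqId, int numCards,
                 const std::function<bool(int)>& hasUniqueId) {
  int seen = 0;
  for (int card = 0; card < numCards; ++card) {
    if (!hasUniqueId(card)) { continue; }  // skip boards without unique_id
    if (seen == seqId) { return card; }
    ++seen;
  }
  return -1;  // no such discrete GPU
}
```

With the layout on this system (card0 lacks unique_id, card1 and card2 have it), index 0 maps to card1 and index 1 maps to card2.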

valeriob01 commented 4 years ago

> seems like in gpuowl output GPU[1] is 1, but GPU[2] is 0.

No, you are just reading card0, which lacks unique_id and is not a discrete GPU.

valeriob01 commented 4 years ago

Please see https://github.com/preda/gpuowl/pull/170

rocm-smi output:


root@sel:/home/sel/test/gpuowl# /opt/rocm-3.3.0/bin/rocm-smi --showuniqueid

========================ROCm System Management Interface========================
================================================================================
GPU[1]      : Unique ID: 592c190172fd5d40
GPU[2]      : Unique ID: c6f220c172dc76bb
================================================================================
==============================End of ROCm SMI Log ==============================

gpuowl output:

-device <N>        : select a specific device:
 0 592c190172fd5d40 : gfx906+sram-ecc-Vega 20 [Radeon VII] AMD
 1 c6f220c172dc76bb : gfx906+sram-ecc-Vega 20 [Radeon VII] AMD
preda commented 4 years ago

Yes, I'm aware of the problem; I confirm it's a bug. The fix is not easy though: OpenCL does not make it easy to link the "logical" enumeration to the cardN list.

valeriob01 commented 4 years ago

Theoretically it would suffice to try to open card0/device/unique_id; if it doesn't exist then card0 is an iGPU, and we should skip it. Please take my fix as a smoke test for this concept.

valeriob01 commented 4 years ago

I think the fix will work on every mainboard with integrated graphics.

valeriob01 commented 4 years ago

Even better: for each card cardN, first test for the existence of the cardN/device/unique_id file; if it doesn't exist, skip the board, since with high probability it is an iGPU.

preda commented 4 years ago

iGPUs are not the only ones that don't have unique_id. In fact only a subset of GPUs under ROCm have unique_id. As a clear example, Nvidia GPUs most likely don't expose unique_id in the same way ROCm does.

valeriob01 commented 4 years ago

Yes, the ROCm documentation clearly states that the unique_id file is available only for AMD GFX9; even AMD GFX8 lacks unique_id. Thus, first test for AMD, then test for GFX9 (maybe just test for the "Radeon VII" string?), then for the unique_id file...
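The chain of checks suggested here (AMD vendor first, then GFX9) could look roughly like the sketch below. This is an illustrative heuristic, not gpuowl's actual code; the name format assumed is the one gpuowl prints (e.g. "gfx906+sram-ecc-Vega 20 [Radeon VII] AMD"):

```cpp
#include <string>

// Decide whether a device is expected to expose unique_id in sysfs:
// it must be an AMD device and belong to the GFX9 family.
bool expectsUniqueId(const std::string& vendor, const std::string& name) {
  if (vendor.find("AMD") == std::string::npos) { return false; }
  return name.find("gfx9") != std::string::npos;  // GFX9 family only
}
```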

valeriob01 commented 4 years ago

> Yes, the ROCm documentation states clearly that the unique_id file is available only for AMD GFX9, even AMD GFX8 lacks unique_id. Thus, first test for AMD, then test for GFX9 (maybe just test for "Radeon VII" string?), then for unique_id file...

Well, at this point it is sufficient to test for AMD and GFX9.

valeriob01 commented 4 years ago

Please see https://github.com/RadeonOpenCompute/ROCm/issues/310#issuecomment-581435434

and https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/blob/2106edaef95ae2262097652454bce1fe720b2b9b/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c#L1765

valeriob01 commented 4 years ago

So there is this location_id field inside the properties file (e.g. /sys/class/kfd/kfd/topology/nodes/0/properties), which is a decimal representation of the PCI location. It can be used to discover the position of the GPU, which should make it easier to match unique_id files to PCI slots.

Alternatively you could include the ROCm SMI library (https://github.com/RadeonOpenCompute/rocm_smi_lib), but it is only available with ROCm; if some system has AMDGPU-PRO I guess it will not work.

NVIDIA is another story: they have the NVML library, but I read that "Uniqueness and ordering across reboots and system configurations is not guaranteed (i.e. if a Tesla K40c returns 0x100 and the two GPUs on a Tesla K10 in the same system return 0x200 it is not guaranteed they will always return those values but they will always be different from each other)", see https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html#group__nvmlDeviceQueries_1gbf4a80f14b6093ce98e4c2dd511275c5

Using the KFD topology seems the most reliable method for AMD GPUs; still searching for a method for NVIDIA.
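Extracting location_id from that properties file could be sketched as below. This is an illustrative helper (not from gpuowl) that assumes the properties file is a sequence of "key value" lines with decimal values:

```cpp
#include <sstream>
#include <string>

// Parse the decimal location_id field out of the text of a KFD topology
// properties file (/sys/class/kfd/kfd/topology/nodes/N/properties).
// Returns -1 if the field is absent. The value appears to pack the PCI
// location as (bus << 8) | devfn, i.e. bus = id >> 8,
// device = (id >> 3) & 0x1f, function = id & 0x7.
long locationIdFromProperties(const std::string& text) {
  std::istringstream in(text);
  std::string key;
  long value = 0;
  while (in >> key >> value) {
    if (key == "location_id") { return value; }
  }
  return -1;
}
```

For example, a location_id of 768 (0x300) would correspond to PCI bus 3, device 0, function 0.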

valeriob01 commented 4 years ago

Tentative patch for this issue in clwrap.cpp. The idea from the pseudocode: if card0 has a unique_id file then N = 0, else N = 1 (assume card0 is an iGPU and shift the index by one):

```cpp
std::string getUUID(int seqId) {
  // If card0 has no unique_id file, assume it is an iGPU and shift by one.
  int N = File::openRead("/sys/class/drm/card0/device/unique_id") ? 0 : 1;

  File f = File::openRead("/sys/class/drm/card"s + std::to_string(seqId + N) + "/device/unique_id");
  std::string uuid = f ? f.readLine() : "";
  if (!uuid.empty() && uuid.back() == '\n') { uuid.pop_back(); }
  return uuid;
}
```

This is a naive tentative patch!

valeriob01 commented 4 years ago

> Yes I'm aware of the problem, I confirm it's a bug. The fix is not easy though. OpenCL does not make it easy to link the "logical" enumarion to the cardN list.

For gpuowl the card list should start at 0 like it does now, just ignoring the excluded iGPU cards; nobody wants to use iGPUs for running gpuowl.

valeriob01 commented 4 years ago

I would like you to address this issue sooner or later :-)

preda commented 4 years ago

It'll have to wait until after the SP experiments, sorry. I do plan to come back to this at some point, but priorities are mixing in.

On Fri, 23 Oct 2020 at 17:40, valeriob01 notifications@github.com wrote:

> I would like you address this issue sooner or later :-)


valeriob01 commented 3 years ago

OpenCL news: the new OpenCL Specification 3.0.7 brings a number of useful extensions, and one of them is especially useful for resolving this issue: cl_khr_pci_bus_info, which exposes the PCI bus information of an OpenCL device. Information on this new extension is available in the OpenCL documentation at https://github.com/KhronosGroup/OpenCL-Registry/blob/master/specs/3.0-unified/pdf/OpenCL_Ext.pdf
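A sketch of how that extension could tie an OpenCL device back to its cardN entry: the struct below locally mirrors cl_device_pci_bus_info_khr from the extension so the sketch compiles without OpenCL headers; with a real device you would fill it in via clGetDeviceInfo with CL_DEVICE_PCI_BUS_INFO_KHR after verifying the extension is supported.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Local mirror of cl_device_pci_bus_info_khr (cl_khr_pci_bus_info).
struct PciBusInfo {
  uint32_t pci_domain;
  uint32_t pci_bus;
  uint32_t pci_device;
  uint32_t pci_function;
};

// Format as the conventional "domain:bus:device.function" PCI address,
// which can then be matched against /sys/bus/pci/devices/<addr>/drm/cardN
// to link the OpenCL device to its DRM card and unique_id file.
std::string pciAddress(const PciBusInfo& info) {
  char buf[32];
  std::snprintf(buf, sizeof(buf), "%04x:%02x:%02x.%x",
                (unsigned)info.pci_domain, (unsigned)info.pci_bus,
                (unsigned)info.pci_device, (unsigned)info.pci_function);
  return std::string(buf);
}
```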

valeriob01 commented 3 years ago

I hope you are not against OpenCL 3.0 (and specifically 3.0.7) merely because it makes previously mandatory features optional, and that your position has a stronger basis.

preda commented 3 years ago

No problem with OpenCL 3, only that I'm not sure it's implemented by ROCm yet. But yes, I will look in this direction; thanks for the pointers, just a bit busy ATM.

preda commented 8 months ago

This may be partially fixed in 7.5.