It seems that in the gpuowl output GPU[1] is 1, but GPU[2] is 0.
The listings of cardN/device differ: card0 lacks the unique_id file because the cards are numbered differently. Look here; this system has 2 Radeon VII:
```
/sys/class/drm# ls card0/device
ari_enabled  boot_vga  class  config  current_link_speed  current_link_width
device  dma_mask_bits  driver_override  drm  firmware_node  graphics
i2c-2  i2c-3  irq  label  local_cpulist  local_cpus  max_link_width
modalias  msi_irqs  numa_node  remove  rescan  resource  resource0
resource2_wc  resource4  rom  subsystem  subsystem_vendor  uevent

/sys/class/drm# ls card1/device
aer_dev_correctable  aer_dev_fatal  aer_dev_nonfatal  ari_enabled  boot_vga
broken_parity_status  class  config  consistent_dma_mask_bits
current_link_speed  current_link_width  d3cold_allowed  device  df_cntr_avail
dma_mask_bits  driver  driver_override  drm  enable  fw_version
gpu_busy_percent  hwmon  i2c-10  i2c-4  i2c-6  i2c-8  irq  link
local_cpulist  local_cpus  max_link_speed  max_link_width  mem_busy_percent
mem_info_gtt_total  mem_info_gtt_used  mem_info_vis_vram_total
mem_info_vis_vram_used  mem_info_vram_total  mem_info_vram_used
mem_info_vram_vendor  modalias  msi_bus  msi_irqs  numa_node  pcie_bw
pcie_replay_count  power  power_dpm_force_performance_level  power_dpm_state
pp_cur_state  pp_dpm_dcefclk  pp_dpm_fclk  pp_dpm_mclk  pp_dpm_pcie
pp_dpm_sclk  pp_dpm_socclk  pp_features  pp_force_state  pp_mclk_od
pp_num_states  pp_power_profile_mode  pp_sclk_od  pp_table  product_name
product_number  remove  rescan  reset  resource  resource0  resource0_wc
resource2  resource2_wc  resource4  resource5  revision  rom  serial_number
subsystem  subsystem_device  subsystem_vendor  uevent  unique_id
vbios_version  vendor

/sys/class/drm# ls card2/device
aer_dev_correctable  aer_dev_fatal  aer_dev_nonfatal  ari_enabled  boot_vga
broken_parity_status  class  config  consistent_dma_mask_bits
current_link_speed  current_link_width  d3cold_allowed  device  df_cntr_avail
dma_mask_bits  driver  driver_override  drm  enable  fw_version
gpu_busy_percent  hwmon  i2c-12  i2c-14  i2c-16  i2c-18  irq  link
local_cpulist  local_cpus  max_link_speed  max_link_width  mem_busy_percent
mem_info_gtt_total  mem_info_gtt_used  mem_info_vis_vram_total
mem_info_vis_vram_used  mem_info_vram_total  mem_info_vram_used
mem_info_vram_vendor  modalias  msi_bus  msi_irqs  numa_node  pcie_bw
pcie_replay_count  power  power_dpm_force_performance_level  power_dpm_state
pp_cur_state  pp_dpm_dcefclk  pp_dpm_fclk  pp_dpm_mclk  pp_dpm_pcie
pp_dpm_sclk  pp_dpm_socclk  pp_features  pp_force_state  pp_mclk_od
pp_num_states  pp_power_profile_mode  pp_sclk_od  pp_table  product_name
product_number  remove  rescan  reset  resource  resource0  resource0_wc
resource2  resource2_wc  resource4  resource5  revision  rom  serial_number
subsystem  subsystem_device  subsystem_vendor  uevent  unique_id
vbios_version  vendor
```
```
/sys/class/drm# ls
amdttm  card0  card0-DP-1  card0-HDMI-A-1  card1  card1-DP-2  card1-DP-3
card1-DP-4  card1-HDMI-A-2  card2  card2-DP-5  card2-DP-6  card2-DP-7
card2-HDMI-A-3  renderD128  renderD129  renderD130  version
```
There are card0, card1, and card2:
```
root@sel:/sys/class/drm# cat card0/device/unique_id
cat: card0/device/unique_id: No such file or directory
root@sel:/sys/class/drm# cat card1/device/unique_id
592c190172fd5d40
root@sel:/sys/class/drm# cat card2/device/unique_id
c6f220c172dc76bb
```
Thus, to read all the UUIDs correctly you must start at 1, not at 0.
> It seems that in the gpuowl output GPU[1] is 1, but GPU[2] is 0.
No, you are just reading card0, which lacks unique_id and is not a discrete GPU.
Please see https://github.com/preda/gpuowl/pull/170
rocm-smi output:
```
root@sel:/home/sel/test/gpuowl# /opt/rocm-3.3.0/bin/rocm-smi --showuniqueid
========================ROCm System Management Interface========================
================================================================================
GPU[1]  : Unique ID: 592c190172fd5d40
GPU[2]  : Unique ID: c6f220c172dc76bb
================================================================================
==============================End of ROCm SMI Log ==============================
```
gpuowl output:
```
-device <N> : select a specific device:
 0 592c190172fd5d40 : gfx906+sram-ecc-Vega 20 [Radeon VII] AMD
 1 c6f220c172dc76bb : gfx906+sram-ecc-Vega 20 [Radeon VII] AMD
```
Yes, I'm aware of the problem; I confirm it's a bug. The fix is not easy though: OpenCL does not make it easy to link the "logical" enumeration to the cardN list.
Theoretically it would suffice to try to open card0/device/unique_id; if it doesn't exist, then card0 is an iGPU and we should skip it. Please take my fix as a smoke test for this concept.
I think the fix will work on every mainboard with integrated graphics.
Even better: for each cardN, first test for the existence of the cardN/device/unique_id file; if it doesn't exist, skip the card, since with high probability it is an iGPU. A sketch of this idea follows below.
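A minimal sketch of that check (this is not gpuowl's actual code; the function name and the use of std::filesystem are my own illustration):

```cpp
// Sketch: enumerate /sys/class/drm/cardN and keep only the cards that expose
// a unique_id file; a card without one is most likely an iGPU (or a GPU whose
// driver does not provide unique_id).
#include <filesystem>
#include <fstream>
#include <string>
#include <vector>

std::vector<std::string> discreteCardUuids() {
  std::vector<std::string> uuids;
  for (int n = 0; ; ++n) {
    std::filesystem::path card = "/sys/class/drm/card" + std::to_string(n);
    if (!std::filesystem::exists(card)) { break; }  // no more cards
    std::ifstream f(card / "device" / "unique_id");
    std::string uuid;
    if (f && std::getline(f, uuid)) { uuids.push_back(uuid); }  // otherwise skip: likely an iGPU
  }
  return uuids;
}
```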
iGPUs are not the only devices that lack unique_id. In fact, only a subset of GPUs under ROCm have unique_id. As a clear example, Nvidia GPUs most likely don't expose unique_id the way ROCm does.
Yes, the ROCm documentation clearly states that the unique_id file is available only for AMD GFX9; even AMD GFX8 lacks unique_id. Thus, first test for AMD, then for GFX9 (maybe just test for the "Radeon VII" string?), then for the unique_id file...
Well, at this point it is sufficient to test for AMD and GFX9, along the lines of the sketch below.
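A hedged sketch of that test through OpenCL (the function names and the exact substrings matched are assumptions; vendor and device-name strings vary between drivers):

```cpp
// Sketch: consider a device a candidate for unique_id only if it reports as
// an AMD GFX9 part. The string matching is approximate: ROCm typically reports
// the vendor as "Advanced Micro Devices, Inc." and the device name as e.g. "gfx906".
#include <CL/cl.h>
#include <string>

static std::string deviceString(cl_device_id dev, cl_device_info what) {
  size_t size = 0;
  clGetDeviceInfo(dev, what, 0, nullptr, &size);
  std::string s(size, '\0');
  clGetDeviceInfo(dev, what, size, s.data(), nullptr);
  while (!s.empty() && s.back() == '\0') { s.pop_back(); }  // drop trailing NULs
  return s;
}

bool mayHaveUniqueId(cl_device_id dev) {
  return deviceString(dev, CL_DEVICE_VENDOR).find("Advanced Micro Devices") != std::string::npos
      && deviceString(dev, CL_DEVICE_NAME).find("gfx9") != std::string::npos;
}
```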
Please see RadeonOpenCompute/ROCm#310 (comment)
So there is this location_id field inside the properties file (e.g. /sys/class/kfd/kfd/topology/nodes/0/properties), which is a decimal representation of the PCI location. It can be exploited to discover the position of the GPU, which should make it easier to pinpoint unique_id files to PCI slots.

Alternatively you could use the ROCm SMI library (https://github.com/RadeonOpenCompute/rocm_smi_lib), but it is only available with ROCm; if a system has AMDGPU-PRO, I guess it will not work.

NVIDIA is another story. They have the NVML library, but I read that "Uniqueness and ordering across reboots and system configurations is not guaranteed (i.e. if a Tesla K40c returns 0x100 and the two GPUs on a Tesla K10 in the same system returns 0x200 it is not guaranteed they will always return those values but they will always be different from each other)."; see https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html#group__nvmlDeviceQueries_1gbf4a80f14b6093ce98e4c2dd511275c5

Using the KFD topology seems the most reliable method for AMD GPUs; I'm still searching for an equivalent for NVIDIA.
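A sketch of reading location_id from the KFD topology (the function name is mine; it assumes the whitespace-separated "name value" line format that the KFD properties files use):

```cpp
// Sketch: return the decimal location_id (PCI location) of KFD topology node
// `node`, or -1 if the node or the field is missing.
#include <fstream>
#include <string>

long kfdLocationId(int node) {
  std::ifstream f("/sys/class/kfd/kfd/topology/nodes/" + std::to_string(node) + "/properties");
  std::string key;
  long value;
  while (f >> key >> value) {
    if (key == "location_id") { return value; }
  }
  return -1;
}
```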
Tentative patch for this issue in clwrap.cpp:
```cpp
std::string getUUID(int seqId) {
  // If card0 has no unique_id file it is most likely an iGPU, so shift the index by one.
  File probe = File::openRead("/sys/class/drm/card0/device/unique_id");
  int N = probe ? 0 : 1;
  File f = File::openRead("/sys/class/drm/card"s + std::to_string(seqId + N) + "/device/unique_id");
  std::string uuid = f ? f.readLine() : "";
  if (!uuid.empty() && uuid.back() == '\n') { uuid.pop_back(); }
  return uuid;
}
```
This is a naive, tentative patch!
For gpuowl, the card list should start at 0 as it does now; just skip the excluded iGPU cards. Nobody wants to use iGPUs for running gpuowl.
I would like you to address this issue sooner or later :-)
It'll have to wait until after the SP experiments, sorry. I do plan to come back to this at some point, but other priorities are mixing in.
OpenCL news: the new OpenCL Specification 3.0.7 brings a number of useful extensions. One of them is especially useful for resolving this issue: cl_khr_pci_bus_info, which exposes the PCI bus information of an OpenCL device.
Information on this new extension is available in the OpenCL documentation at https://github.com/KhronosGroup/OpenCL-Registry/blob/master/specs/3.0-unified/pdf/OpenCL_Ext.pdf
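For illustration, querying the PCI location through this extension looks roughly like the sketch below (the function name is mine; the query is only valid if the device's extension string actually lists cl_khr_pci_bus_info):

```cpp
// Sketch: query the PCI location of an OpenCL device via cl_khr_pci_bus_info.
// cl_device_pci_bus_info_khr and CL_DEVICE_PCI_BUS_INFO_KHR come from CL/cl_ext.h.
#include <CL/cl_ext.h>
#include <cstdio>

bool printPciLocation(cl_device_id dev) {
  cl_device_pci_bus_info_khr info;
  if (clGetDeviceInfo(dev, CL_DEVICE_PCI_BUS_INFO_KHR, sizeof info, &info, nullptr) != CL_SUCCESS) {
    return false;  // driver does not support the extension
  }
  std::printf("%04x:%02x:%02x.%x\n", info.pci_domain, info.pci_bus, info.pci_device, info.pci_function);
  return true;
}
```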
I hope you are not against OpenCL 3.0 (and specifically 3.0.7) merely because it makes some previously mandatory features optional, and that your position has a stronger basis.
No problem with OpenCL 3; it's just that I'm not sure it's implemented by ROCm yet. But yes, I will look in this direction. Thanks for the pointers, just a bit busy ATM.
This may be partially fixed in 7.5.
What rocm-smi outputs vs. what gpuOwl reads: [screenshots not reproduced here]