Platforms for identical devices do not compare equal

fknorr commented 5 months ago

Describe the bug

To split work in a multi-GPU setting, we need to find sets of equal / compatible GPUs on a system.

On a system with 4x Nvidia RTX 3090, sycl::platform::get_platforms() returns four distinct platforms that stringify to sycl::platform(vendor="NVIDIA Corporation", name="NVIDIA CUDA BACKEND"), but compare unequal with operator== and do not produce the same hash. I would expect all devices to share a platform in this case.

Comparing the backend for finding a set of equal GPUs is not enough either, since DPC++ produces the same backend enumerator at least for the Intel(R) OpenCL and Intel(R) FPGA Emulation Platform for OpenCL(TM) platforms which clearly do not originate from a multi-GPU situation.

To reproduce

int main() {
    std::unordered_map<std::string, std::vector<sycl::platform>> platforms_by_string;
    for(const auto& pf : sycl::platform::get_platforms()) {
        const auto string = "vendor=" + pf.get_info<sycl::info::platform::vendor>()
                + " name=" + pf.get_info<sycl::info::platform::name>()
                + " version=" + pf.get_info<sycl::info::platform::version>();
        platforms_by_string[string].push_back(pf);
    }
    for(const auto& [string, pfs] : platforms_by_string) {
        printf("%zu platforms matching %s\n", pfs.size(), string.c_str());
        for(size_t i = 0; i < pfs.size(); ++i) {
            for(size_t j = 0; j < pfs.size(); ++j) {
                assert(std::hash<sycl::platform>{}(pfs[i]) == std::hash<sycl::platform>{}(pfs[j]));
                assert(pfs[i] == pfs[j]);
            }
        }
    }
}

Environment

OS: Ubuntu 22.04
DPC++ e3308557 (May 7, 2024)
CUDA 12.3

Additional context

No response

JackAKirk commented 5 months ago

This will be fixed when https://github.com/oneapi-src/unified-runtime/pull/1565 is merged. We are working on this as a priority.

JackAKirk commented 4 months ago

The ur changes have now been pulled in here: https://github.com/intel/llvm/pull/14030 Closing this issue.

intel / llvm