oneapi-src / oneDPL

oneAPI DPC++ Library (oneDPL) https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/dpc-library.html
Apache License 2.0
716 stars 113 forks source link

Work around the global policy issue in a backward-compatible manner #1618

Closed akukanov closed 1 month ago

akukanov commented 1 month ago

This patch proposes a different approach to address the issue #1060.

Previously we tried to re-implement the device_policy class in a way that would avoid eager initialization of a SYCL queue in the constructor, and instead would do it on the first use. That cannot be done however without having layout changes for the class, which we prefer to avoid for sake of preserving backward compatibility.

However, the issue #1060 does not really require lazy initialization; it asks for absence of exceptions or any other runtime errors in case a global SYCL queue object cannot be initialized. We use SYCL's default selector to create a queue used by the dpcpp_default predefined policy, which is supposed to always return a valid device because a SYCL implementation must provide at least one valid device. So our current implementation is formally correct, while the problem is observed in a strange environment where SYCL cannot find/initialize any device, even a CPU device; that seems to be a territory of undefined behavior.

Therefore a workaround can be implemented that detects during the construction of dpcpp_default that there are no SYCL devices and does not even try to initialize the queue, instead leaving the policy in an unusable state. According to the SYCL specification, the set of available platforms and devices does not change during the program lifetime, therefore our device policies, including dpcpp_default, cannot be used anyway (again, it's practically a UB territory) - so just silently ignoring the issue at global policy construction and letting it fire on the actual use seems OK.

The patch here attempts to implement this approach without breaking the class layout. For that, use of sycl::queue in the policies is replaced with a specially crafter __queue_holder, which has the same size and alignment as sycl::queue and uses "placement new" to construct a real queue in the space taken by the holder. In most cases, the queue is constructed unconditionally, but a special constructor for predefined policy instances checks if there is at least one device available. As an indication that the construction was omitted, the first sizeof(void*) bytes of the holder are nullified; the rationale is that a SYCL queue is likely implemented as a shared_ptr (that's certainly the case for DPC++), which in turn typically holds several pointers to the actual object, a service block or a reference counter, etc. It is highly unlikely therefore that the first bytes in a properly constructed queue object will be equivalent to a null pointer. That validity check, however, is only used in the destructor; all other methods assume that the queue was properly constructed, which is asserted in the __queue_ref method that returns a reference to the queue kept in the holder.

danhoeflinger commented 1 month ago

I ran a test similar to the one described here https://github.com/oneapi-src/oneDPL/pull/1154/files#r1317837347 using this branch. The test is a check that we can include oneDPL headers without crashing when no device is available.

On origin/main, when running the test, it passes when a sycl device is available, but crashes when ONEAPI_DEVICE_SELECTOR='!*:*' is specified with PI_ERROR_DEVICE_NOT_FOUND.

On this branch, the test passes with no error with or without ONEAPI_DEVICE_SELECTOR='!*:*'. It seems this branch resolves the issue successfully, if we are satisfied with the approach and if it succeeds in preserving the layout.

akukanov commented 1 month ago

In order to address #1631, the approach proposed in the patch is not enough and the predefined policies should never initialize SYCL queues before the first use. I think it might still be doable in backward-compatible manner, but it makes sense to create a new PR for that; see #1652.