[SYCL] [SPEC] USM Feedback

Ruyk commented 5 years ago

Some comments, questions and feedback about the SYCL USM proposal, very interesting work!

1. Use context instead of device?

All malloc functions target a specific device in the system. However, allocations on SYCL/OpenCL are bound to a context, rather than a specific device. For example, In some platforms, two devices sharing an OpenCL context can also share memory allocations. It is also possible for the same device to have different context's with different allocations on them.

From a SYCL implementation perspective, for the SYCL Runtime to be able to track USM allocations (at least to be aware of their existence and enable conversion to sycl::buffers). It would be useful to understand to which SYCL context a given allocation belongs to.

This will simplify SYCL implementations "emulating" USM behavior using existing OpenCL buffers (e.g. like we do with the virtual pointer utility in the SDK), because existing allocations can be implemented alongside traditional cl_mem objects

2. Use namespace instead of `sycl_` prefix

Is there any particular reason for all functions being pre-fixed with sycl_ ? Seems more natural to use a namespace (sycl::allocate?) in C++. It can still be used as a replacement of any allocation function in the same way, but allows for C++ users to write more generic code.

3. Default selection of device*

The allocation function forms that take no device as parameter is said to "use the device selected by the default selector on success".

However, Note that there is no guarantee on the SYCL specification that a default device selector will chose every time the same device: This means that two consecutive sycl_malloc may put data on different devices.

4. sycl_memcpy and sycl_memset interface*

4.1 Is there any particular reason syclmemcpy on the handler (a) needs the `syclprefix and (b) cannot be done simply by overloading the existingcopy` method?

4.2. There is no other explicit operation in the SYCL queue, so the direct sycl_memcpy queue operation seems odd. It makes sense for the in-order queue to have it there, but this is adding additional functionality to the SYCL queue for USM that is not matched with buffers at this point.

5. sycl_mem_advise` advice parameter

Is there any particular reason why the device-defined advice for the specified allocation is of type int? Can it not be a template type so implementations can use whatever they prefer? Ideally from my point of view, this should be using the SYCL properties mechanism so we have a more extensible (and coherent) interface for specifying custom behaviours on the API.

6. `get_pointer_info` query:

memory::allocation_type returns the type of allocation, but would it be possible to return more information? In particular, in which device/context a given pointer has been allocated can be useful to track allocations on different devices that do not necessarily can share them

7. Table 1, USM device information descriptors

If I understand correctly, the device descriptors additional properties that can be queried from the get_info method of the device class. If so:

7.1 Wording of the table indicates "adds a requirement" but this are info queries. Seems wording in general should change to indicate that (I can do PR if this helps)

7.2 What is the expected value of "info::memory::shared_granularity" for non-shared allocations? should this raise an error or be 0?

7.3 info::memory::valid_shared_devices returns a vector of device objects that can access a shared allocation. Shouldn't this devices be part of the same context anyway? If that is the case, why not simply return a context? See point 8 for details.

8. Multiple devices and USM

Its not clear to me at this point how multiple device allocations will work, when allocations are possible and when they can migrate across devices. Seems to me it will be clearer if USM is associated with SYCL context objects rather than individual devices. A SYCL context can encapsulate one or multiple devices, so all devices in a given context will share the same USM allocations. This doesn't affect the simple interfaces for sycl_malloc that don't take a device, and will only require one extra step from users to create a context before using the interfaces currently taking a device.

The info queries for "info::memory::valid_shared_devices" are now unnecessary, since, by definition, all devices on the SYCL context will be able to share the allocation.

A SYCL user that wants to ensure the allocations and the queue are using the same underlying resources, can keep the context alive and use it to create the queue, e.g:

// Platform with two devices that can share allocations
// Normal context creation in SYCL, context associated with the default device(s)
sycl::context myContext{default_device_selector()};
// Allocation bound to all devices on said context
void * myPointer = sycl_malloc(sizeof(float)*1024, myContext);
// SYCL queue created on the same context,
sycl::queue myQueue(myContext, myContext.get_devices()[0]);
sycl::queue myQueue(myContext, myContext.get_devices()[1]);

This has the associated benefit the context can be used to track USM allocations, which simplifies some operations such as keeping track of used memory from the SYCL runtime.

9. Conversions between USM pointers and Buffers

9.1 Why a new use_usm_pointer property and not use the existing use_host_ptr one? 9.2 host_no_access is something we implemented as a vendor extension in https://github.com/codeplaysoftware/standards-proposals/blob/master/host_access/sycl-1.2.1/host_access.md which may give some more flexibility

10. SYCL scheduling - DAGs

10.1 The example interface uses sycl_malloc with a template parameter which is not described in the sections above. I rather prefer that format than the different malloc functions :-)

11. Kernel capturing pointers:

The assumption here is that USM refers to allocations in what OpenCL would call "global" memory space.

11.1 SYCL 1.2.1 Section 6.3 restrictions on kernels states that:

[...] Structures containing pointers may be shared but the value of any pointer passed between SYCL devices or between the host and a SYCL device is undefined

However, when USM is available, pointers captured by lambdas are USM pointers - not undefined. This is a significant change on SYCL applications which particularly affects library developers: Even when they don't write their kernels to support USM, the code may be compiled with USM support. This means that pointers that were meaningless and ignored before now are expected to map to global address space.

11.2 In SYCL 1.2.1 Section 6.8:

[...] If no other rule above can be applied to a declaration of a pointer, then it is assumed to be in the private address space. This default assumption is expected to change to be the generic address space for OpenCL versions that support the generic address space.

In the case, the pointers captured by the kernel lambda will be pointers to global memory, rather than private. This changes the address-space deduction rules, which can change what routines get called further down the line (e.g. calling global specializations vs private ones).

If USM is relying on generic pointer support this is less of a problem, but we have then to be a bit careful with the potential fragmentation of the ecosystem: Some kernels will be written from USM, and need to be compiled with such support enabled, and some others will not.

11.3 Is USM a feature that is known at compile time? (e.g. compilation flag)

12. USM to Multi pointer:

12.1 Is it possible to convert a USM pointer to a multi-pointer? do you plan to offer an interface for that?

jbrodman commented 5 years ago

1) We'll have to think more on this. Things like device allocations especially are really meant to be only accessible on the device they're allocated on, which doesn't map perfectly to the context.

2) This is possible. However, I do think there is an advantage in having a C API for these allocation routines - it makes it easier to integrate into other C-based programming solutions. C++ wrappers could easily be built on top of a C API.

3) I really don't like that the spec doesn't have this guarantee as I don't think it matches what a user would expect to happen. I would expect to see a Spec issue opened about this (if there isn't already...).

4.1) No - there is no good reason we can't just do this other than maybe consistency with other approaches.

4.2) Purely a shortcut for enqueueing a memcpy on an OoO queue that doesn't have any dependences. This is not an uncommon case.

5) Consistency with Linux APIs and other programming models. I get where you're coming from, but I'm not sure it adds a lot of value for this particular method.

6) Yes, I've received similar feedback from other sources. At a minimum, it should also return what device the pointer was allocated against, which make it easier to say, launch a kernel on that device.

7.1) Appreciated.

7.2) If the device does not support shared allocations, this should probably be zero or an error. Willing to be sold on which option is better.

7.3) I'd only be comfortable with this if the context ONLY contained devices for which this is a true statement. Not immediately clear how to enforce this.

8) Yeah - you're not being unreasonable here. My only concern about going to the context approach is device allocations. I suppose it's probably ok to be clear about the behavior if you do something naughty. The proposal doesn't really say much about this, but the intent is that there would be optional P2P capabilities that would enable devices to read other devices' memories, which would fit with the context idea.

9.1) This is probably fine. 9.2) Cool.

10) D'oh - artifact of older draft. The template argument was dropped to avoid ever possibly seeing template<> sycl_malloc<...>(...). My eyes! My eyes!

11.1) Correct. 11.2) I think generic pointers (as an impl detail perhaps) is really the only way to go for this. This is one of the reasons all the USM stuff is opt-in and not positioned to be required by the spec. If you use it, you'd better support it. 11.3) I think this would be easier, but a more proper answer probably depends on a more formal extension mechanism. Probably worth defining a macro that the device compiler could set to specialize code.

12) Umm - I don't see why not. Hadn't thought about it tbh - happy to accept suggestions.

keryell commented 5 years ago

After thinking to it for a while, SYCL is based on modern C++ and I feel that it does not fit well the purpose to have a full extension just written in plain old C at the first place. I would prefer a good modern C++ API that looks like SYCL and then build a C, Fortran, Python, Cobol, APL... API on top of that, even if it just reusing the C API.

jeffhammond commented 5 years ago

I agree with both parties who would like a proper C++ API and not the C-style symbol naming. If C++ users wants C-style names, they can use (template) function aliasing.

bd4 commented 4 years ago

As a new user of SYCL+USM, for a mixed Fortran/CUDA/C++ fusion code (genecode.org) and a C++ multi-dimensional array library (https://github.com/wdmapp/gtensor), I'm fine with a C++ API. I likely have to maintain a small C layer anyway to gracefully handle CUDA/HIP/SYCL from Fortran parts of the code. Compared to other challenges, it's just a trivial bit of extra code. The most important thing is having the functionality available. Seems like the other initial users of USM (Kokkos and RAJA?) are already C++, so not an issue for them either.

One issue I am still working through for gtensor (the multi-d array lib) is how to handle the queue object. CUDA is a stateful library, in that it keeps track of a default context and you can set the device, and all the functions use it without having to pass it around. SYCL on the other hand requires you to keep and pass the queue object, particularly in the USM malloc/free calls. gtensor is a header only library, so I'm not sure yet how to handle the queue. I think there are clever ways to handle it, I don't think it's a blocker for us, but maybe something to consider if many other users are running up against it. Perhaps in the form of an example, rather than actually changing any spec.

keryell commented 4 years ago

@bd4 maintaining a state is difficult at scale. This is why SYCL, Vulkan, OpenCL... try to avoid it nowadays since high number of CPU cores and accelerators is the norm. That said, I can hear you. If you want to contribute a nice SYCL wrapper extension provided some thread_local state because it simplifies programming in some cases, I am pretty sure the SYCL committee will look at it. :-) Same if you have some great ideas about having a wrapper layer in C, even if it is not clear how to deal with the single-source aspects which makes the strength of SYCL... In the meantime, you can have some thread_local global variables in your header-only library. C++17 inline static initialization helps a lot for header-only libraries.

intel / llvm