arpieb commented 5 years ago

Opening for a thread of discussion - and willing to make the changes if interested. I'm testing the cl module on the following systems:

- Linux system running a pair of Nvidia Titan Xp GPUs with a 12-core Intel CPU + Nvidia's OpenCL drivers installed
- MacPro with an AMD Radeon RX580 with 2x12-core Intel CPUs + macOS 10.14.3 with native Apple OpenCL drivers installed
- MacBook Pro with 2xAMD GPUs, 4-core Intel CPU

While the Macs are executing OpenCL functions in sub-ms times, the Nvidia drivers are relatively slow when allocating memory objects on the GPU, in some cases taking on the order of hundreds of ms to complete (while actual operations on said allocated objects are damn snappy).

Since "dirty schedulers" were introduced ~Erlang 17.3 and compiled in as of ERTS 9.0, I was wondering if you'd be open to updating the NIF exports to execute all the cl NIF functions on the dirty schedulers if the ERTS being compiled for supports them.

BTW, really incredible undertaking here - before I found your repo I was building out my own OpenCL NIF so much respect for the fact you completed a full implementation!

tonyrog commented 5 years ago

If I remember correctly there is a bit of scheduling overhead involved with dirty nifs? What about a selective approach for nifs that are problematic? (Or are the majority of the allocations causing problems?) Possibly using enif_system_info to get the dirty_scheduler_support and then some compile time macro to check if we can actually call enif_schedule_nif?

arpieb commented 5 years ago

Yeah, there is some overhead, but it's measured in nanoseconds on modern hardware best I can tell. The "yielding NIF" approach is probably not going to be very tractable as most of the NIF functions map pretty much 1:1 to the OpenCL functions with the exception of unpacking terms.

I'll build some test cases where I can isolate and time the individual functions on thousands of runs on all three boxes and report back. There was a great presentation a couple years ago at ElixirConf US where they were performing timing studies on different NIF-handling approaches, maybe I can find the test scaffolding for that somewhere.

It's possible that it's only a subset of problem children, and also that it might be an OpenCL driver-vendor issue. After all, my MacPro is Mid-2010 vintage and I can't imagine that the bus between RAM and the GPU on it is that much faster than an i7-7800X with server-speed RAM and Nvidia Pascal GPUs...

tonyrog commented 5 years ago

I could wrap the nif table entries with something like:

//-------------------------------
#if (ERL_NIF_MAJOR_VERSION > 2) || ((ERL_NIF_MAJOR_VERSION == 2) && (ERL_NIF_MINOR_VERSION >= 12))
//#define NIF_FUNC(name,arity,fptr) {(name),(arity),(fptr),(ERL_NIF_DIRTY_JOB_CPU_BOUND)}
#define NIF_FUNC(name,arity,fptr) {(name),(arity),(fptr),(0)}
#elif (ERL_NIF_MAJOR_VERSION > 2) || ((ERL_NIF_MAJOR_VERSION == 2) && (ERL_NIF_MINOR_VERSION >= 7))
#define NIF_FUNC(name,arity,fptr) {(name),(arity),(fptr),(0)}
#else
#define NIF_FUNC(name,arity,fptr) {(name),(arity),(fptr)}
#endif
//-------------------------------

This way it would be fairly easy to switch to an all dirty nif approach, if it turns out that the overhead is ok. Or at least allow a switch to dirty nifs for anyone who wants to compile with the -DUSE_DIRTY_SCHEDULER flag? Perhaps even have a NIF_DIRTY_FUNC entry that is backward compatible?
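A rough sketch of what a backward-compatible NIF_DIRTY_FUNC next to NIF_FUNC could look like, following the version checks above. So that it compiles without erl_nif.h, the struct `sketch_nif_func`, the constant `SKETCH_DIRTY_JOB_CPU_BOUND`, the fallback version numbers, and the table entries are all stand-ins made up for illustration. In a real NIF these would be ErlNifFunc, ERL_NIF_DIRTY_JOB_CPU_BOUND, and the values from erl_nif.h.

```c
#include <stddef.h>

/* Fallback version numbers, used only when erl_nif.h is not included. */
#ifndef ERL_NIF_MAJOR_VERSION
#define ERL_NIF_MAJOR_VERSION 2
#define ERL_NIF_MINOR_VERSION 12
#endif

/* Stand-in for ERL_NIF_DIRTY_JOB_CPU_BOUND (value assumed for the sketch). */
#define SKETCH_DIRTY_JOB_CPU_BOUND 1u

/* Stand-in for ErlNifFunc: name, arity, function pointer, flags. */
typedef int (*sketch_fptr)(void);
typedef struct {
    const char *name;
    unsigned arity;
    sketch_fptr fptr;
    unsigned flags;
} sketch_nif_func;

#if (ERL_NIF_MAJOR_VERSION > 2) || \
    ((ERL_NIF_MAJOR_VERSION == 2) && (ERL_NIF_MINOR_VERSION >= 7))
/* Table entries carry a flags field; 0 means a normal scheduler. */
#  define NIF_FUNC(name, arity, fptr) {(name), (arity), (fptr), 0u}
#  if defined(USE_DIRTY_SCHEDULER) && \
      ((ERL_NIF_MAJOR_VERSION > 2) || \
       ((ERL_NIF_MAJOR_VERSION == 2) && (ERL_NIF_MINOR_VERSION >= 12)))
/* Opted in and new enough: mark the entry as a dirty CPU-bound job. */
#    define NIF_DIRTY_FUNC(name, arity, fptr) \
         {(name), (arity), (fptr), SKETCH_DIRTY_JOB_CPU_BOUND}
#  else
/* Not opted in, or too old: fall back to a normal entry. */
#    define NIF_DIRTY_FUNC(name, arity, fptr) NIF_FUNC(name, arity, fptr)
#  endif
#else
/* Oldest layout: no flags field in the initializer at all. */
#  define NIF_FUNC(name, arity, fptr) {(name), (arity), (fptr)}
#  define NIF_DIRTY_FUNC(name, arity, fptr) NIF_FUNC(name, arity, fptr)
#endif

/* Hypothetical NIF body, only here to fill the table. */
int sketch_noop(void) { return 0; }

/* Hypothetical table: one normal entry, one that becomes dirty on request. */
sketch_nif_func sketch_nif_table[] = {
    NIF_FUNC("fast_call", 0, sketch_noop),
    NIF_DIRTY_FUNC("slow_call", 0, sketch_noop),
};
```

Built without -DUSE_DIRTY_SCHEDULER both entries get flags 0. Built with it, only the NIF_DIRTY_FUNC entry changes, so the choice stays per function rather than all or nothing.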

tonyrog commented 5 years ago

And just to clarify: my idea with using enif_schedule_nif was not meant to break up the nif into several pieces, but rather to be a way to dynamically decide when to run a nif on a dirty scheduler. The idea is to have one entry point in the NIF (for cl:create_image/5, say ecl_create_image_dyn) where you can check the parameters to decide whether to call ecl_create_image indirectly, using enif_schedule_nif with the ERL_NIF_DIRTY_JOB_CPU_BOUND flag, or just call ecl_create_image directly.
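The parameter check in that dynamic entry point might be sketched like this. The function name `sketch_should_run_dirty`, the choice of image size as the deciding parameter, and the 1 MiB threshold are all assumptions for illustration; the real cut-off would have to come from timing measurements on the drivers in question.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cut-off: requests at or above this size go to a dirty scheduler. */
#define SKETCH_DIRTY_THRESHOLD_BYTES ((size_t)1 << 20) /* 1 MiB, assumed */

/* Returns 1 if the call should be rescheduled on a dirty scheduler,
 * 0 if it looks cheap enough to run directly on a normal scheduler. */
int sketch_should_run_dirty(size_t width, size_t height, size_t bytes_per_pixel)
{
    size_t bytes;

    /* Degenerate sizes: nothing large to allocate, run directly. */
    if (width == 0 || height == 0 || bytes_per_pixel == 0)
        return 0;

    /* If the size computation would overflow, treat the request as huge. */
    if (width > SIZE_MAX / height)
        return 1;
    bytes = width * height;
    if (bytes > SIZE_MAX / bytes_per_pixel)
        return 1;
    bytes *= bytes_per_pixel;

    return bytes >= SKETCH_DIRTY_THRESHOLD_BYTES;
}
```

Inside the NIF, ecl_create_image_dyn would decode its arguments, and when the check returns 1 it would return the result of enif_schedule_nif(env, "create_image", ERL_NIF_DIRTY_JOB_CPU_BOUND, ecl_create_image, argc, argv); otherwise it would call ecl_create_image(env, argc, argv) directly.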

tonyrog commented 5 years ago

I prepared the nif table so you can switch between dirty and non dirty. I also added the examples cl:noop_/0, which is dynamically dirty, and cl:dirty_noop/0, which is always dirty (if supported). You can find a small, simple benchmark in test/cl_noop that checks the call overhead.

arpieb commented 5 years ago

Nice, thanks! Once I finish up the 1.2 wrappers, docs and unit tests I'll take a swing at this.

arpieb commented 5 years ago

OK, just forked your latest to play around with dirty scheduler support and timings. So far I've made one tiny change to c_src/Makefile to allow USE_DIRTY_SCHEDULER to be set from the environment when being included as a dependency:

ifeq ($(USE_DIRTY_SCHEDULER), 1)
  $(info Compiling with support for dirty schedulers)
  CFLAGS += -DUSE_DIRTY_SCHEDULER
endif

I'll keep you posted on what I find out. It might wind up being a compile directive that will be enabled only for certain projects that know they are going to spend a lot of time in OpenCL calls...

tonyrog / cl

Update code to take advantage of "dirty" schedulers #36
