ufo-kit / ufo-core

GLib-based framework for GPU-based data processing
GNU Lesser General Public License v3.0

Task inputs limited to 16 #168

Closed (MarcusZuber closed this issue 2 years ago)

MarcusZuber commented 4 years ago

The number of inputs per task is limited to 16 here: https://github.com/ufo-kit/ufo-core/blob/master/ufo/ufo-scheduler.c#L324

Is there a reason for it, or could this be increased?

tfarago commented 4 years ago

No idea, I guess we should just try.

MarcusZuber commented 4 years ago

Ok, I will test it tomorrow. Should I remove the upper limit completely or just set it to something slightly larger?

The maximum could depend on OpenCL (and maybe even on the hardware).

MarcusZuber commented 4 years ago

What I found out so far: the opencl filter is the only node where we can have an arbitrary number of inputs. CL_DEVICE_MAX_PARAMETER_SIZE gives the memory available for kernel arguments. This is at least 256 bytes on all devices. We pass only pointers (8 bytes each), which results in 32 parameters (including the one result).

The 2080 Ti reports 4352 bytes, which is already enough for over 500 inputs.
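The arithmetic above can be written down as a small sketch. The helper names `max_pointer_args` and `max_task_inputs` are hypothetical and not part of ufo-core; the sketch assumes every kernel argument is a pointer-sized buffer handle and that one argument is reserved for the output.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of ufo-core): number of pointer-sized kernel
 * arguments that fit into max_param_size bytes, i.e. the value a device
 * reports for CL_DEVICE_MAX_PARAMETER_SIZE. */
size_t
max_pointer_args (size_t max_param_size, size_t pointer_size)
{
    assert (pointer_size > 0);
    return max_param_size / pointer_size;
}

/* Hypothetical helper: inputs that remain once one argument is reserved
 * for the output buffer. */
size_t
max_task_inputs (size_t max_param_size, size_t pointer_size)
{
    size_t n_args = max_pointer_args (max_param_size, pointer_size);

    return n_args > 0 ? n_args - 1 : 0;
}
```

With the numbers from this comment, `max_pointer_args (256, 8)` gives 32 and `max_pointer_args (4352, 8)` gives 544. In real code the first argument would presumably come from `clGetDeviceInfo` queried with `CL_DEVICE_MAX_PARAMETER_SIZE`, which returns a `size_t`.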

tfarago commented 4 years ago

That is one thing, another is that every task runs in a separate thread and if you have 500 inputs you are going to grill your machine.

matze commented 4 years ago

Is there a reason for it, or could this be increased?

It's completely arbitrary and could be increased.

CL_DEVICE_MAX_PARAMETER_SIZE gives the memory available for kernel arguments. This is at least 256 bytes on all devices. We pass only pointers (8 bytes each), which results in 32 parameters (including the one result).

The opencl task passes only buffers, yes, but other tasks also use constant data and image textures which have device-specific limitations. But I suppose that using CL_DEVICE_MAX_PARAMETER_SIZE is still a reasonable heuristic for the number of inputs.

That is one thing, another is that every task runs in a separate thread and if you have 500 inputs you are going to grill your machine.

I don't think so because most threads will idle either waiting for new data or kernels to finish. Wasting memory on 500 per-thread stacks is another story though, although with virtual memory even that does not matter ;-)

MarcusZuber commented 4 years ago

I played around a little in this pull request and now it works with 64 input nodes (the branch still needs heavy cleanup). Since the presence of edges in the graph is encoded in integers, increasing it beyond 64 needs some more work.

tfarago commented 4 years ago

I don't think so because most threads will idle either waiting for new data or kernels to finish. Wasting memory on 500 per-thread stacks is another story though, although with virtual memory even that does not matter ;-)

Actually that's not what I see. Unfortunately, even when "idling", htop tells me UFO is occupying a lot of system resources.

matze commented 4 years ago

Since the presence of edges in the graph is encoded in integers, increasing it beyond 64 needs some more work.

The edge label contains the input number as is, i.e. some 64-bit value and should not be limiting. Only the connection test uses a bitmap which you could replace with an array of booleans or whatever.
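As a rough illustration of that suggestion, here is a minimal sketch of a connection test backed by an array of booleans instead of a fixed-width bitmap. The `InputSet` type and its functions are hypothetical and not the ufo-core implementation; plain libc is used to keep the sketch self-contained, whereas the real code would more likely use GLib types and allocators.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical type (not from ufo-core): records which input ports of a
 * task have an incoming edge, without a 64-bit upper limit. */
typedef struct {
    bool   *connected;
    size_t  n_inputs;
} InputSet;

InputSet *
input_set_new (size_t n_inputs)
{
    InputSet *set = malloc (sizeof *set);

    if (set == NULL)
        return NULL;

    /* calloc zero-initializes, so every port starts out unconnected. */
    set->connected = calloc (n_inputs, sizeof (bool));

    if (set->connected == NULL && n_inputs > 0) {
        free (set);
        return NULL;
    }

    set->n_inputs = n_inputs;
    return set;
}

void
input_set_mark (InputSet *set, size_t input)
{
    assert (input < set->n_inputs);
    set->connected[input] = true;
}

/* Replaces a comparison of the bitmap against an "all bits set" mask. */
bool
input_set_all_connected (const InputSet *set)
{
    for (size_t i = 0; i < set->n_inputs; i++) {
        if (!set->connected[i])
            return false;
    }

    return true;
}

void
input_set_free (InputSet *set)
{
    if (set != NULL) {
        free (set->connected);
        free (set);
    }
}
```

The check becomes a linear scan rather than a single integer comparison, which should be negligible since it would only run while the graph is validated, not per processed item.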

Actually that's not what I see. Unfortunately, even when "idling", htop tells me UFO is occupying a lot of system resources.

Well, on the one hand that's a good sign. But I still doubt that the majority of CPU time is burned in each thread doing simple bookkeeping. I guess it comes either from the OpenCL implementation or from the GPU driver itself.

tfarago commented 4 years ago

I fear it's the GPU driver, hence we can't do anything about it...

MarcusZuber commented 4 years ago

The edge label contains the input number as is, i.e. some 64-bit value and should not be limiting. Only the connection test uses a bitmap which you could replace with an array of booleans or whatever.

I will make it more flexible when I finish it.

How would I write a proper test? All the implemented filters live in ufo-filters, which I would need in order to test this.