gunrock / graphblast

High-Performance Linear Algebra-based Graph Primitives on GPUs
Apache License 2.0

Feature: Runtime Kernel Fusion #3

Open ctcyang opened 5 years ago

ctcyang commented 5 years ago

If you have a DAG of binary operations, you can traverse it in some topological order and generate the proper bitcode for your GPU kernel. OmniSci does it on parsed SQL queries (different filter combinations, etc.). TensorFlow does it based on the equation that needs to be minimized for gradient descent. Can we do it for GraphBLAS?
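As a rough illustration (hypothetical `Node` type and naming, not anything in graphblast), the traversal step could emit one statement per DAG node in topological order, producing a straight-line body for a single fused kernel:

```cpp
#include <string>
#include <vector>

// Hypothetical node in a DAG of elementwise binary ops (not graphblast's
// internal representation).
struct Node {
  std::string op;          // e.g. "+", "*", "min"
  int lhs = -1, rhs = -1;  // indices of operand nodes; -1 means leaf input
  std::string leaf;        // input array name if this node is a leaf
};

// Walk nodes in topological order (children before parents) and emit one
// statement per node, so the whole DAG becomes a single fused kernel body.
std::string emit_fused_body(const std::vector<Node>& topo) {
  std::string body;
  for (size_t i = 0; i < topo.size(); ++i) {
    const Node& n = topo[i];
    std::string t = "t" + std::to_string(i);
    if (n.lhs < 0) {
      body += "float " + t + " = " + n.leaf + "[i];\n";
    } else if (n.op == "min") {
      body += "float " + t + " = fminf(t" + std::to_string(n.lhs) +
              ", t" + std::to_string(n.rhs) + ");\n";
    } else {
      body += "float " + t + " = t" + std::to_string(n.lhs) + " " + n.op +
              " t" + std::to_string(n.rhs) + ";\n";
    }
  }
  body += "out[i] = t" + std::to_string(topo.size() - 1) + ";\n";
  return body;
}
```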

If you want to keep your kernel code as simple as possible, with minimal branching, there's no way to do it at compile time. In the SQL-query and TensorFlow cases, the exact details of the queries/equations aren't known until runtime, so unless you know ahead of time exactly what you're going to solve, you can't generate that code at compile time.

Both TensorFlow and OmniSci do real-time code generation using LLVM.
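For CUDA kernels, NVRTC is one way to do that real-time code generation without driving LLVM by hand. A minimal sketch (placeholder kernel name and signature, no error checking, and it assumes a CUDA context is already current on the calling thread):

```cpp
#include <cuda.h>
#include <nvrtc.h>
#include <string>
#include <vector>

// Compile a generated kernel body at runtime and return a launchable handle.
CUfunction compile_fused(const std::string& body) {
  std::string src =
      "extern \"C\" __global__ void fused_ewise(const float* a, const float* b,"
      " float* out, int n) {\n"
      "  int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
      "  if (i < n) {\n" + body + "  }\n}\n";

  // Compile the CUDA source string to PTX with NVRTC.
  nvrtcProgram prog;
  nvrtcCreateProgram(&prog, src.c_str(), "fused.cu", 0, nullptr, nullptr);
  const char* opts[] = {"--gpu-architecture=compute_70"};
  nvrtcCompileProgram(prog, 1, opts);

  size_t ptx_size;
  nvrtcGetPTXSize(prog, &ptx_size);
  std::vector<char> ptx(ptx_size);
  nvrtcGetPTX(prog, ptx.data());
  nvrtcDestroyProgram(&prog);

  // Load the PTX with the driver API and fetch the kernel handle.
  CUmodule mod;
  CUfunction fn;
  cuModuleLoadDataEx(&mod, ptx.data(), 0, nullptr, nullptr);
  cuModuleGetFunction(&fn, mod, "fused_ewise");
  return fn;  // launch later with cuLaunchKernel
}
```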

aydinbuluc commented 5 years ago

Great question and we might be able to write a proposal to do this.

Could we do it at compile time?


ctcyang commented 5 years ago

Two things are unknown at compile time:

1. If the user wants to run a few iterations of a for-loop, the amount of sparsity in the data after iteration X (which affects which kernel is optimal and should be chosen) is unknown at compile time. So the kernel fusion could vary on each iteration of the loop in a data-dependent way (a rough sketch of that per-iteration choice is below).
2. I'm inclined towards a shared-library approach rather than header-only, in order to support a Python frontend that can be used from a Jupyter notebook. The consequence is that the computation graph, which depends on which ops the user wants, is unknown at compile time (i.e. when the shared library is compiled).
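To make point 1 concrete, here is a hypothetical sketch (names, types, and threshold are invented, not graphblast's API) of the decision that can only be made at runtime, once the previous iteration's output sparsity is known:

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical frontier type; graphblast's real Vector interface differs.
struct Frontier { size_t nvals; size_t size; };
using KernelHandle = void*;  // e.g. a CUfunction from runtime compilation

// Pick the fused-kernel variant from the measured density of the frontier,
// which is only known after the previous iteration has produced it.
std::string choose_variant(const Frontier& f, double threshold = 0.1) {
  double density = static_cast<double>(f.nvals) / f.size;
  return density < threshold ? "spmspv_push_fused" : "spmv_pull_fused";
}

// Runtime-compiled kernels could be cached by (variant, op-sequence) key, so
// each fused combination is generated at most once even though the chosen
// variant can change from one iteration of the user's loop to the next.
std::map<std::string, KernelHandle> kernel_cache;
```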