Implement lazy linking of UDF/UDTFs

pearu commented 4 years ago

Lazily link UDFs/UDTFs so that we don't force all kernels to the CPU when a single UDF or UDTF cannot run on the GPU.

pearu commented 4 years ago

https://github.com/omnisci/omniscidb-internal/blob/master/QueryEngine/NativeCodegen.cpp#L2195

https://github.com/omnisci/omniscidb-internal/blob/master/QueryEngine/ExtensionsIR.cpp#L224

pearu commented 3 years ago

This issue concerns C++ UDF/UDTFs, not RBC UDF/UDTFs. The task is to make sure that C++ UDF/UDTFs that can run only on CPU do not force all UDF/UDTFs to run on CPU, which currently happens because all C++ UDF/UDTFs end up in the CPU-specific LLVM module (?).

pearu commented 3 years ago

Reproducer:

1. Define a runtime UDF test function:

def test_simple_udf(omnisci):

    @omnisci('int32(int32)')
    def simple_udf(x):
        return x + 1

    query = 'select simple_udf(123)'
    descr, result = omnisci.sql_execute(query)
    result = list(result)

2. Define a load-time C++ UDF (sample_udf.cpp):


#include <cstdint>
#define EXTENSION_NOINLINE extern "C" NEVER_INLINE DEVICE

EXTENSION_NOINLINE int32_t udf_diff(const int32_t x, const int32_t y) { return x - y; }

3. Start the server and run the runtime UDF test:

$ bin/omnisci_server --enable-runtime-udf --enable-table-functions
compileWorkUnit#2398: udf_cpu_module null
compileWorkUnit#2399: udf_gpu_module null
compileWorkUnit#2400: rt_udf_cpu_module defines: simple_udf__cpu_0,
compileWorkUnit#2401: rt_udf_gpu_module defines: simple_udf__gpu_0,
generateNativeGPUCode#974: module defines: multifrag_query_hoisted_literals, simple_udf__gpu_0, query_group_by_template, agg_id_shared, record_error_code, get_scan_output_slot, row_func_hoisted_literals, filter_func_hoisted_literals,

which indicates that the GPU implementation of `simple_udf` is used.

4. Start the server with the load-time UDF and run the runtime UDF test:

$ bin/omnisci_server --enable-runtime-udf --enable-table-functions --udf sample_udf.cpp
compileUdf#413: udf_filename="sample_udf.cpp"
error: cannot find libdevice for sm_75. Provide path to different CUDA installation via --cuda-path, or pass -nocudalib to build without linking with libdevice.
compileWorkUnit#2398: udf_cpu_module defines: udf_diff,
compileWorkUnit#2399: udf_gpu_module defines: udf_diff,
compileWorkUnit#2400: rt_udf_cpu_module defines: simple_udf__cpu_0,
compileWorkUnit#2401: rt_udf_gpu_module defines: simple_udf__gpu_0,
compileWorkUnit#2398: udf_cpu_module defines: udf_diff,
compileWorkUnit#2399: udf_gpu_module defines: udf_diff,
compileWorkUnit#2400: rt_udf_cpu_module defines: simple_udf__cpu_0,
compileWorkUnit#2401: rt_udf_gpu_module defines: simple_udf__gpu_0,
generateNativeCPUCode#364: module defines: agg_id, record_error_code, get_scan_output_slot, multifrag_query_hoisted_literals, simple_udf__cpu_0, query_group_by_template, row_func_hoisted_literals, filter_func_hoisted_literals,


which indicates that the CPU implementation of `simple_udf` is forced.
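
The two runs are told apart by which code generator reports the final module. Below is a minimal sketch of a helper that reads such server output and reports which implementation of a runtime UDF ended up in the generated module. The helper name `device_used` is hypothetical, the line format is copied from the output above and is not a stable interface, and the sample log is an abbreviated copy of the step 4 output:

```python
import re

# Matches lines such as
#   generateNativeGPUCode#974: module defines: a, b, c,
# The format is taken from the server output in this issue (assumption:
# debug output like this is available as plain text).
_CODEGEN_RE = re.compile(r"generateNative(CPU|GPU)Code#\d+: module defines: (.*)")


def device_used(log_text, udf_name):
    """Return 'CPU' or 'GPU' for the last codegen line that lists an
    implementation of `udf_name`, or None if no such line is found."""
    device = None
    for line in log_text.splitlines():
        m = _CODEGEN_RE.search(line)
        if m is None:
            continue
        names = [n.strip() for n in m.group(2).split(",") if n.strip()]
        # Runtime UDF symbols look like "<name>__cpu_0" / "<name>__gpu_0".
        if any(n.startswith(udf_name + "__") for n in names):
            device = m.group(1)
    return device


# Abbreviated copy of the step 4 output.
log = """\
compileWorkUnit#2400: rt_udf_cpu_module defines: simple_udf__cpu_0,
compileWorkUnit#2401: rt_udf_gpu_module defines: simple_udf__gpu_0,
generateNativeCPUCode#364: module defines: agg_id, simple_udf__cpu_0, query_group_by_template,
"""
print(device_used(log, "simple_udf"))  # CPU
```

Only the `generateNative...Code` lines are inspected, because the `compileWorkUnit` lines list both the CPU and the GPU module in either run.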

pearu commented 3 years ago

The "cannot find libdevice" error is explained in https://stackoverflow.com/questions/59826961/fail-to-link-cuda-example-with-clang-9-under-ubuntu-18-04. Solution: use clang 11 (I was using clang 9).

pearu commented 3 years ago

This issue requires a test that uses the SQL HAVING clause, as an example of a query that triggers multiple execution steps.

pearu commented 3 years ago

Actually, any composite query would trigger multiple execution steps. For instance,

select bar(out0) from table(foo(cursor(select x from mytable)))

that involves three execution steps:

1. select x from mytable
2. select ... from table(foo(cursor(...)))
3. select bar(out0) from ...

and the aim is to ensure that steps 1 and 3 are executed on GPU when 2 is running on CPU.
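
The intended outcome can be modelled in a few lines. This is only a sketch of the device choice per step, not the OmniSciDB implementation: the `DEVICE_SUPPORT` table, the `pick_device` helper, and the assumption that `foo` is CPU-only while `bar` is available on both devices are all hypothetical.

```python
# Hypothetical model: devices each extension function has an implementation for.
# Assumption: `foo` is a CPU-only UDTF, `bar` is a UDF built for CPU and GPU.
DEVICE_SUPPORT = {
    "foo": {"CPU"},
    "bar": {"CPU", "GPU"},
}

# The three execution steps of
#   select bar(out0) from table(foo(cursor(select x from mytable)))
# paired with the extension functions each step references.
STEPS = [
    ("select x from mytable", []),
    ("select ... from table(foo(cursor(...)))", ["foo"]),
    ("select bar(out0) from ...", ["bar"]),
]


def pick_device(functions):
    """Choose GPU only if every function considered has a GPU implementation."""
    return "GPU" if all("GPU" in DEVICE_SUPPORT[f] for f in functions) else "CPU"


# Eager linking: every registered function is considered for every step,
# so one CPU-only function forces all steps to CPU.
eager = [pick_device(DEVICE_SUPPORT) for _ in STEPS]

# Lazy linking: only the functions a step references are considered.
lazy = [pick_device(funcs) for _, funcs in STEPS]

print(eager)  # ['CPU', 'CPU', 'CPU']
print(lazy)   # ['GPU', 'CPU', 'GPU']
```

A test for this issue would check for the second result: the first and last steps on GPU, and only the `foo` step on CPU.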

heavyai / rbc

Implement lazy linking of UDF/UDTFs #186