A framework for writing performant user-defined functions (UDFs) that are portable across a variety of engines including Apache Spark, Apache Hive, and Presto.
BSD 2-Clause "Simplified" License
291 stars, 72 forks
Transport-Trino: Manage StdUDF state using instance factory #118
Currently, UDF state in Trino's StdUdfWrapper is initialized in the specialize() method and is updated in eval() under certain conditions. Initializing state in specialize() is not reliable, because the result of specialize() can be cached across multiple UDF invocations. One invocation can therefore use state that was initialized by another, leading to issues like query contamination.

This patch moves away from manipulating state through the specialize() method in Trino UDFs and instead uses a State class to keep track of state (in an object conventionally called an instance factory). A key property of the State class is that its constructor is parameterless. To keep the State class parameterless while still having it hold a reference to the enclosing StdUDF (see the patch for why the reference is needed), we resort to code generation to create a custom State class for each StdUDF, along with the expected StdUDF reference. All state manipulation now moves to the eval() function.
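The idea described above can be illustrated with a minimal, self-contained sketch. It does not use Trino's actual API or the code in the patch: PrefixUdf, PrefixUdfState, Specialization, and the eval helper are all hypothetical names, and the State class is written by hand here, whereas the patch generates one per StdUDF. The sketch assumes only what the description states: the specialize() result may be cached and shared, so it holds a factory rather than a state object, and each invocation gets its own state through a parameterless constructor.

```java
import java.util.function.Supplier;

public class InstanceFactorySketch {

    /** Hypothetical UDF whose mutable state must not leak between invocations. */
    static class PrefixUdf {
        private String prefix;

        void init(String prefix) {
            this.prefix = prefix;
        }

        String eval(String input) {
            return prefix + input;
        }
    }

    /**
     * Hand-written stand-in for a generated State class (an assumption for
     * illustration). The constructor takes no parameters, yet the object
     * still carries a reference to its own UDF instance.
     */
    static class PrefixUdfState {
        final PrefixUdf udf = new PrefixUdf();
        boolean initialized = false;

        PrefixUdfState() {
        }
    }

    /**
     * Stand-in for a cached specialize() result. It holds only the instance
     * factory, never a state object, so sharing it is harmless.
     */
    static class Specialization {
        final Supplier<PrefixUdfState> instanceFactory = PrefixUdfState::new;
    }

    /** All state manipulation happens here, on the per-invocation state. */
    static String eval(PrefixUdfState state, String prefix, String input) {
        if (!state.initialized) {
            state.udf.init(prefix);
            state.initialized = true;
        }
        return state.udf.eval(input);
    }

    public static void main(String[] args) {
        // One specialization object reused by two separate invocations.
        Specialization cached = new Specialization();

        PrefixUdfState first = cached.instanceFactory.get();
        PrefixUdfState second = cached.instanceFactory.get();

        System.out.println(eval(first, "a:", "x"));  // prints "a:x"
        System.out.println(eval(second, "b:", "x")); // prints "b:x"
    }
}
```

If Specialization stored a single PrefixUdfState instead of a factory, the second call would reuse the state initialized by the first and return "a:x", which is the kind of contamination the description refers to.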