linkedin / transport

A framework for writing performant user-defined functions (UDFs) that are portable across a variety of engines including Apache Spark, Apache Hive, and Presto.
BSD 2-Clause "Simplified" License

Transport-Trino: Manage StdUDF state using instance factory #118

Open wmoustafa opened 1 year ago

wmoustafa commented 1 year ago

Currently, UDF state in Trino's StdUdfWrapper is initialized in the specialize() method and is updated in eval() under certain conditions. Initializing state in specialize() is not reliable: the result of specialize() can be cached across multiple UDF invocations, so one invocation can end up using state that was initialized by another, leading to issues such as query contamination. This patch moves away from manipulating state through the specialize() method in Trino UDFs and instead uses a State class to keep track of state (in an object conventionally called an instance factory). A key property of the State class is that its constructor is parameterless. To keep the State class parameterless while still letting it hold a reference to the enclosing StdUDF (see the patch for why the reference is needed), we resort to code generation to create a custom State class for each StdUDF, along with the expected StdUDF reference. All state manipulation now moves to the eval() function.
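The contamination problem and the instance-factory remedy can be illustrated with a minimal, engine-independent sketch. It is not the code from the patch and does not use the Trino SPI: `StateFactorySketch`, `SharedStateSpecialization`, `FactorySpecialization`, and the string invocation IDs are all hypothetical names chosen for illustration, and the generated per-UDF State class with its StdUDF reference is not modeled here.

```java
import java.util.function.Supplier;

// Hypothetical stand-ins only; none of these names come from Transport or Trino.
final class StateFactorySketch {

    /** Per-invocation state with a parameterless constructor, initialized lazily in eval(). */
    static final class State {
        private String owner; // the invocation that first initialized this state
        private int calls;    // number of eval() calls seen by this state

        String eval(String invocationId) {
            if (owner == null) {
                owner = invocationId; // first eval() performs the initialization
            }
            calls++;
            return owner + ":" + calls;
        }
    }

    /** Problem case: state is created together with the cached specialization and shared. */
    static final class SharedStateSpecialization {
        private final State state = new State();

        String invoke(String invocationId) {
            return state.eval(invocationId);
        }
    }

    /** Remedy: the cached specialization only holds a factory; each invocation gets its own State. */
    static final class FactorySpecialization {
        private final Supplier<State> instanceFactory = State::new;

        State newInstance() {
            return instanceFactory.get();
        }
    }

    public static void main(String[] args) {
        // One cached specialization reused by two invocations: the second sees the first's state.
        SharedStateSpecialization shared = new SharedStateSpecialization();
        shared.invoke("q1");
        System.out.println(shared.invoke("q2")); // prints "q1:2"

        // With a factory, each invocation starts from a fresh State.
        FactorySpecialization cached = new FactorySpecialization();
        State first = cached.newInstance();
        State second = cached.newInstance();
        first.eval("q1");
        System.out.println(second.eval("q2")); // prints "q2:1"
    }
}
```

In this sketch the parameterless constructor is what allows the factory to be expressed as a plain `State::new` supplier, which mirrors the property the description highlights.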