Open mildbyte opened 2 years ago
Passing native primitive datatypes (i32, i64, f32, f64) from the host (wasmtime for seafowl) and receiving a native primitive result is straightforward. More complex data types, such as strings and structs, not so much.
In the long run, WebAssembly Interface Types (WIT) promise to provide an elegant solution to the problem of passing complex data between WebAssembly functions written in various high-level languages and the host. WIT includes an IDL, also called "wit", which can be used for code generation.
For example, below is the WIT description of a function which converts an input string to uppercase and returns the result:
upper: func(s: string) -> string
WIT-generated calling code, in our case run by the seafowl process.
#[allow(clippy::all)]
mod input {
    pub fn upper(s: & str,) -> String{
        unsafe {
            let vec0 = s;
            let ptr0 = vec0.as_ptr() as i32;
            let len0 = vec0.len() as i32;
            #[repr(align(4))]
            struct __InputRetArea([u8; 8]);
            let mut __input_ret_area: __InputRetArea = __InputRetArea([0; 8]);
            let ptr1 = __input_ret_area.0.as_mut_ptr() as i32;
            #[link(wasm_import_module = "input")]
            extern "C" {
                #[cfg_attr(target_arch = "wasm32", link_name = "upper: func(s: string) -> string")]
                #[cfg_attr(not(target_arch = "wasm32"), link_name = "input_upper: func(s: string) -> string")]
                fn wit_import(_: i32, _: i32, _: i32, );
            }
            wit_import(ptr0, len0, ptr1);
            let len2 = *((ptr1 + 4) as *const i32) as usize;
            String::from_utf8(Vec::from_raw_parts(*((ptr1 + 0) as *const i32) as *mut _, len2, len2)).unwrap()
        }
    }
}
WIT-generated wrapper around guest code (in our case the UDF).
#[allow(clippy::all)]
mod input {
    #[export_name = "upper: func(s: string) -> string"]
    unsafe extern "C" fn __wit_bindgen_input_upper(arg0: i32, arg1: i32, ) -> i32{
        let len0 = arg1 as usize;
        let result1 = <super::Input as Input>::upper(String::from_utf8(Vec::from_raw_parts(arg0 as *mut _, len0, len0)).unwrap());
        let ptr2 = __INPUT_RET_AREA.0.as_mut_ptr() as i32;
        let vec3 = (result1.into_bytes()).into_boxed_slice();
        let ptr3 = vec3.as_ptr() as i32;
        let len3 = vec3.len() as i32;
        core::mem::forget(vec3);
        *((ptr2 + 4) as *mut i32) = len3;
        *((ptr2 + 0) as *mut i32) = ptr3;
        ptr2
    }
    #[export_name = "cabi_post_upper"]
    unsafe extern "C" fn __wit_bindgen_input_upper_post_return(arg0: i32, ) {
        wit_bindgen_guest_rust::rt::dealloc(*((arg0 + 0) as *const i32), (*((arg0 + 4) as *const i32)) as usize, 1);
    }
    #[repr(align(4))]
    struct __InputRetArea([u8; 8]);
    static mut __INPUT_RET_AREA: __InputRetArea = __InputRetArea([0; 8]);
    pub trait Input {
        fn upper(s: String,) -> String;
    }
}
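With a wrapper like the one above, the UDF author would only have to implement the generated trait. A minimal sketch of that, with the trait copied by hand so the snippet compiles on its own (in real use it would come from the generated input module). The unit struct named Input is an assumption inferred from the wrapper's reference to super::Input:

```rust
mod input {
    // Copied from the generated wrapper above so this sketch stands alone;
    // normally the WIT tooling would emit it.
    pub trait Input {
        fn upper(s: String) -> String;
    }
}

// Assumption: the wrapper calls `<super::Input as Input>::upper`, so the
// implementing type is a struct called `Input` in the parent module.
pub struct Input;

impl input::Input for Input {
    // The only code the UDF author writes by hand.
    fn upper(s: String) -> String {
        s.to_uppercase()
    }
}
```

Everything else (pointer casts, the return area, the post-return deallocation hook) stays inside the generated code.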
There exists a very early pre-alpha WIT implementation for Rust supporting both Rust hosts and WASM guests. The developers urge everyone interested in using this in production to hold their horses and look for alternatives while the WIT standard is finalized, which I'd guess will happen somewhere between 12 and 18 months from now.
The least ambitious, but by no means easiest, approach is to extend the integer and float types currently supported in seafowl UDFs with strings. Not only would this provide support for using CHAR, TEXT and VARCHAR types in UDFs, but more complex data structures could also be submitted as strings serialized using JSON, MessagePack, CBOR, etc.
I wrote an example proof-of-concept upper() function based on this excellent blog post. Both the code invoking the WASM function and that of the upper() function itself are fairly complex.
The complexity stems from the following:
- malloc()-ing guest memory, copying the results back to host memory, and free()-ing input and output buffers. The result buffer must be allocated by the guest (since the size of the response isn't necessarily known in advance), but must be freed by the host (since the host must read the result before it is deallocated).
- C strings are just raw pointers to data terminated with \0. Pascal-style strings are prefixed with their length in bytes, which is generally considered a better design these days. Naively returning a (length, pointer) pair would require passing multiple values, which isn't possible, but receiving and passing a pointer to the i32-encoded string length followed by the string itself is possible (this is what the WIT-generated code above does).
If strings aren't necessarily UTF-8 strings, but rather MessagePack-encoded streams of values, then all of the function arguments could be encoded in a single string, resulting in a simplified UDF WASM function signature:
fn(len: u32, ptr: u32) -> u32
Here, the result is a pointer to a Pascal-style string, like in the WIT-generated code.
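To make the allocate / copy / read / free choreography concrete, here is a rough sketch that simulates it without a real WASM runtime. GuestMemory, guest_upper and host_call_upper are hypothetical names; the Vec<u8> stands in for the instance's linear memory, the bump allocator stands in for the guest's malloc(), and free() is a no-op placeholder. The little-endian length prefix is also an assumption.

```rust
/// Toy stand-in for a WASM instance's linear memory (hypothetical).
struct GuestMemory {
    bytes: Vec<u8>,
    next_free: usize,
}

impl GuestMemory {
    fn new(size: usize) -> Self {
        // Start at 4 so that address 0 can keep its usual "null" meaning.
        GuestMemory { bytes: vec![0; size], next_free: 4 }
    }

    /// Bump allocator standing in for the guest's malloc().
    fn alloc(&mut self, len: usize) -> u32 {
        let ptr = self.next_free;
        assert!(ptr + len <= self.bytes.len(), "guest out of memory");
        self.next_free += len;
        ptr as u32
    }

    /// Placeholder for the guest's free(); a bump allocator never reclaims.
    fn free(&mut self, _ptr: u32, _len: usize) {}

    fn write(&mut self, ptr: u32, data: &[u8]) {
        let start = ptr as usize;
        self.bytes[start..start + data.len()].copy_from_slice(data);
    }

    fn read(&self, ptr: u32, len: usize) -> &[u8] {
        let start = ptr as usize;
        &self.bytes[start..start + len]
    }
}

/// "Guest" side, shaped like fn(len: u32, ptr: u32) -> u32.
/// Returns a pointer to [i32 length, little-endian][string bytes].
fn guest_upper(mem: &mut GuestMemory, len: u32, ptr: u32) -> u32 {
    let input = mem.read(ptr, len as usize).to_vec();
    let upper = String::from_utf8(input)
        .expect("input must be UTF-8")
        .to_uppercase()
        .into_bytes();
    // Only the guest knows the result size, so the guest allocates it.
    let out = mem.alloc(4 + upper.len());
    mem.write(out, &(upper.len() as i32).to_le_bytes());
    mem.write(out + 4, &upper);
    out
}

/// "Host" side: copy the argument in, call, copy the result out, then free.
fn host_call_upper(mem: &mut GuestMemory, s: &str) -> String {
    let in_ptr = mem.alloc(s.len());
    mem.write(in_ptr, s.as_bytes());
    let out_ptr = guest_upper(mem, s.len() as u32, in_ptr);

    let mut len_bytes = [0u8; 4];
    len_bytes.copy_from_slice(mem.read(out_ptr, 4));
    let out_len = i32::from_le_bytes(len_bytes) as usize;
    let result = String::from_utf8(mem.read(out_ptr + 4, out_len).to_vec())
        .expect("guest must return UTF-8");

    // The host frees both buffers only after it has copied the result.
    mem.free(in_ptr, s.len());
    mem.free(out_ptr, 4 + out_len);
    result
}
```

With a real runtime, the reads and writes would go through the instance's exported memory and the alloc/free calls would be exported guest functions, but the ordering constraints are the same.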
The waPC project attempts to simplify WASM host-guest RPC. It provides a Rust host and a number of supported guest languages. waPC has its own GraphQL-inspired IDL (WIDL). Based on GitHub activity, it seems to be an active project but lacks significant backing (until recently, it was written mostly by three people at a startup called Vino). Links to step-by-step tutorials are all broken. waPC uses MessagePack to serialize data by default.
As a name that kept coming up during my research, wasm-bindgen deserves a mention. It's a mature solution for WASM RPC, but unfortunately limited to JavaScript host -> Rust WASM module guest calls. There was experimental support for WIT, but it's no longer supported. In a future where WIT support returns, wasm-bindgen could be an ergonomic route to UDFs with complex inputs / outputs. Currently, the guide on using it with Rust hosts does not work as advertised.
The WebAssembly System Interface (WASI) is an extension to WASM that gives module functions an interface for interacting with the host filesystem, command line arguments, environment variables, etc.
Like most things WASM-related, WASI itself is still in its infancy and subject to change (the compiled WASM links to wasi_snapshot_preview1). Still, unlike WIT, WASI is already used in production and using it doesn't require a PhD in compiler design. Based on this blog post, I implemented a version of upper() which gets its input from environment variables and prints the result to stdout. The env vars and standard output aren't the actual env vars and stdout of the host process; they're what seafowl passes as such to wasmtime. In other words, it's a convenient way to pass state to the WASM module function without having to deal with all the malloc and free choreography of the first solution. How much overhead this solution incurs compared to the first solution, I don't know yet.
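For illustration, the guest side of such an environment-variable-based upper() could be as small as the sketch below. This is not the actual proof of concept; the variable name UDF_INPUT and the function upper_from_env are made up for this example.

```rust
use std::io::Write;

// Hypothetical name of the variable the host would set for the guest.
#[allow(dead_code)]
const INPUT_VAR: &str = "UDF_INPUT";

/// Uppercases the value taken from the environment and writes it to `out`.
/// A missing variable is treated as an empty string in this sketch.
fn upper_from_env(input: Option<String>, mut out: impl Write) -> std::io::Result<()> {
    let s = input.unwrap_or_default();
    out.write_all(s.to_uppercase().as_bytes())
}
```

In a module compiled for WASI, main would presumably just call upper_from_env(std::env::var(INPUT_VAR).ok(), std::io::stdout()), with the host deciding what the guest sees as its environment and stdout.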
Everyone, including myself, looks upon WIT as the "ultimate" solution to WASM RPC. Unfortunately, when WIT stabilizes is anyone's guess. The good news is that we don't have to commit to a single UDF interface for all time.
Seafowl already expects a language field in its UDF function creation statement, which could be used to distinguish between calling conventions.
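A minimal sketch of what dispatching on that field could look like. The enum, the function and every language string below are assumptions made up for illustration, not values Seafowl is known to accept:

```rust
/// Hypothetical calling conventions a UDF could declare.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CallingConvention {
    /// Exported function taking and returning only i32/i64/f32/f64.
    Primitive,
    /// fn(len, ptr) -> ptr, with serialized arguments in guest memory.
    SerializedMemory,
    /// WASI module that reads stdin and writes stdout.
    WasiStdio,
}

/// Maps the `language` field to a calling convention.
/// All three strings are placeholders chosen for this sketch.
fn calling_convention(language: &str) -> Option<CallingConvention> {
    match language {
        "wasm" => Some(CallingConvention::Primitive),
        "wasm-serialized" => Some(CallingConvention::SerializedMemory),
        "wasi" => Some(CallingConvention::WasiStdio),
        _ => None,
    }
}
```

Returning None for unknown values lets the function creation statement be rejected up front, and a new convention (such as WIT, once it stabilizes) becomes one more match arm.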
If the overhead of using WASI is acceptable, reading serialized input from stdin and writing serialized output to stdout seems like a more ergonomic approach than requiring users creating UDFs to implement by hand code similar to what WIT generates. We could even allow error messages to be sent to stderr.
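As a rough sketch of how little the UDF author would have to write under this scheme, the function below reads all of its input from one stream, writes the result to a second and reports problems on a third. The name run_udf and the convention of returning an exit code are assumptions; the payload here is a plain UTF-8 string rather than a serialized tuple, to keep the example short.

```rust
use std::io::{Read, Write};

/// Reads the whole input, uppercases it and writes it to `output`.
/// Returns the exit code the module would finish with (0 on success).
fn run_udf(mut input: impl Read, mut output: impl Write, mut errors: impl Write) -> i32 {
    let mut raw = Vec::new();
    if let Err(e) = input.read_to_end(&mut raw) {
        let _ = writeln!(errors, "failed to read input: {}", e);
        return 1;
    }
    let text = match String::from_utf8(raw) {
        Ok(text) => text,
        Err(e) => {
            // Errors go to the third stream instead of polluting the result.
            let _ = writeln!(errors, "input is not valid UTF-8: {}", e);
            return 1;
        }
    };
    if let Err(e) = output.write_all(text.to_uppercase().as_bytes()) {
        let _ = writeln!(errors, "failed to write output: {}", e);
        return 1;
    }
    0
}
```

In a WASI build, main would presumably call run_udf(std::io::stdin(), std::io::stdout(), std::io::stderr()) and exit with the returned code. No pointers, allocation or deallocation appear in user code.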
For "normal" UDFs, the input consists of a tuple of supported Arrow types, so the serialized input could look something like this:
| i32: total bytes | MessagePack-encoded vector of Arrow types | MessagePack stream of serialized values |
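A hand-rolled sketch of building such a frame, using only a tiny subset of MessagePack (fixarray, fixstr and int 32) so that it needs no external crate. Several details are assumptions the layout above doesn't pin down: the length prefix is little-endian and counts only the bytes after it, types are encoded as their Arrow names ("Int32", "Utf8"), and the Value enum is hypothetical. A real implementation would use a MessagePack library instead.

```rust
/// Hypothetical argument values; only two Arrow types are covered here.
enum Value {
    Int32(i32),
    Utf8(String),
}

impl Value {
    /// Assumption: types are sent as their Arrow DataType names.
    fn type_name(&self) -> &'static str {
        match self {
            Value::Int32(_) => "Int32",
            Value::Utf8(_) => "Utf8",
        }
    }
}

/// MessagePack fixstr: 0xa0 | length, then the bytes (length <= 31).
fn write_str(buf: &mut Vec<u8>, s: &str) {
    assert!(s.len() <= 31, "this sketch only handles fixstr");
    buf.push(0xa0 | s.len() as u8);
    buf.extend_from_slice(s.as_bytes());
}

fn write_value(buf: &mut Vec<u8>, value: &Value) {
    match value {
        // MessagePack int 32: 0xd2 followed by 4 big-endian bytes.
        Value::Int32(i) => {
            buf.push(0xd2);
            buf.extend_from_slice(&i.to_be_bytes());
        }
        Value::Utf8(s) => write_str(buf, s),
    }
}

/// Builds | i32 total bytes | vector of types | stream of values |.
fn encode_frame(args: &[Value]) -> Vec<u8> {
    assert!(args.len() <= 15, "this sketch only handles fixarray");
    let mut payload = Vec::new();
    // MessagePack fixarray header: 0x90 | element count.
    payload.push(0x90 | args.len() as u8);
    for arg in args {
        write_str(&mut payload, arg.type_name());
    }
    // The values follow as a plain stream, not wrapped in an array.
    for arg in args {
        write_value(&mut payload, arg);
    }
    // Assumption: the prefix is little-endian and excludes its own 4 bytes.
    let mut frame = (payload.len() as i32).to_le_bytes().to_vec();
    frame.extend_from_slice(&payload);
    frame
}
```

For example, encode_frame(&[Value::Int32(1), Value::Utf8("hi".to_string())]) yields a 24-byte frame: a 4-byte prefix holding 20, followed by the 20-byte payload.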
Currently, our WASM functions only support passing basic types like ints and floats. In order to be able to pass something more complex like strings or datetimes, we want to put them in the WASM memory and point the UDF to it.
We need to figure out the most ergonomic way for the function writer to do this. For reference, something like this:
compiles to:
This should work out of the box, without having to write a wrapper that converts some binary representation into a C string.