DawChihLiou closed this issue 1 year ago.
ORT web already has the WASM backend baked in.
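To make the point concrete, here is a minimal sketch of running inference entirely on the WASM backend with onnxruntime-web. The model path, input name (`input_ids`), and tensor shape are placeholders, not part of any real model; the `ort.env.wasm` flags are the library's tuning knobs for the threaded/SIMD builds.

```typescript
// Sketch: end-to-end inference on the onnxruntime-web WASM backend.
// "model.onnx" and the feed name "input_ids" are illustrative placeholders.
import * as ort from "onnxruntime-web";

// Optional WASM tuning knobs exposed via ort.env
ort.env.wasm.numThreads = 4; // use the threaded build, if available
ort.env.wasm.simd = true;    // use the SIMD build when supported

async function embed(inputIds: BigInt64Array): Promise<ort.Tensor> {
  // The session is created from the model buffer/URL and runs in WASM.
  const session = await ort.InferenceSession.create("model.onnx", {
    executionProviders: ["wasm"],
  });
  const feeds = {
    input_ids: new ort.Tensor("int64", inputIds, [1, inputIds.length]),
  };
  const results = await session.run(feeds);
  return results[session.outputNames[0]];
}
```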
@visheratin thanks for the research! Do you know if running the ONNX runtime from Rust (compiled to WASM) is possible? It'd be great if I could pack a transformer into Voy so it doesn't depend on other libraries to create embeddings.
I'm not convinced that packing the transformers in is a good idea. It seems like everyone's going to want to use different libraries and have different needs. A couple of examples will probably suffice.
Absolutely. The idea is to pack a language-model runtime like ONNX so everyone can choose their own model to perform feature extraction. With that in place, Voy would be a standalone semantic search engine. It's also worth exploring because I have a feeling a WASM transformer will be more performant than the JS one.
onnxruntime-web performance issue: https://github.com/microsoft/onnxruntime/issues/11181
From my experience with onnxruntime-web internals, at least some of the issues from that comment are already solved: in WASM mode, the session is created from the model buffer, and all processing runs in WASM. Below is a profiler screenshot showing the same thing: the processing is end-to-end WASM calls.
Of course, one can create an optimized model that will run much faster than onnxruntime-web, but that comes at the price of flexibility. GGML is a good example: all GGML models run super-fast, but every new architecture has to be implemented almost from scratch using low-level building blocks. If there were a universally good model for embeddings, one could implement it in GGML and bake it into Voy. But there is no such model; in many cases, users run custom fine-tuned models.
Packing the whole runtime into the binary is also questionable. The beauty of Voy is that it is fast and small. WASM files for ONNX runtime are around 20 MB for each runtime type (SIMD, threaded SIMD, WebGPU JSEP, etc.). If you want to support all client runtimes (e.g., for the longest time, Safari didn't have SIMD support; maybe they have it now, maybe not), you'd have to pack all these files.
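The SIMD-availability question above can be answered at runtime instead of shipping every build blindly. A common trick (used by the wasm-feature-detect library) is to ask the engine to validate a tiny module containing a `v128` instruction; `WebAssembly.validate` returns `true` only if the engine understands every instruction. The byte sequence below is that minimal probe module.

```typescript
// Detect WebAssembly SIMD support by validating a minimal module that
// uses v128 instructions (same approach as wasm-feature-detect).
const simdTestModule = new Uint8Array([
  0, 97, 115, 109, 1, 0, 0, 0, // "\0asm" magic + version 1
  1, 5, 1, 96, 0, 1, 123,      // type section: () -> v128
  3, 2, 1, 0,                  // function section: one function of that type
  10, 10, 1, 8, 0, 65, 0, 253, 15, 253, 98, 11, // code: i32.const 0; i8x16.splat; i8x16.popcnt; end
]);

const hasSimd: boolean = WebAssembly.validate(simdTestModule);
console.log(`SIMD supported: ${hasSimd}`);
```

Based on the result, a loader could fetch only the matching ONNX runtime `.wasm` file rather than packing all variants into the binary.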
I think making Voy a high-performance embedding store is the best approach.
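That division of labor can be sketched in a few lines: the caller produces embeddings with whatever model it likes, and the store only keeps vectors and ranks them by similarity. This is an illustrative brute-force sketch, not Voy's actual API or index structure (Voy uses a proper spatial index, not a linear scan).

```typescript
// Hypothetical "embedding store" sketch: store vectors, rank by cosine
// similarity. Names (Doc, search) are illustrative, not Voy's API.
type Doc = { id: string; embedding: number[] };

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function search(docs: Doc[], query: number[], k: number): Doc[] {
  // Brute-force ranking; a real index would avoid the full scan.
  return [...docs]
    .sort((x, y) => cosine(y.embedding, query) - cosine(x.embedding, query))
    .slice(0, k);
}

const docs: Doc[] = [
  { id: "a", embedding: [1, 0] },
  { id: "b", embedding: [0, 1] },
  { id: "c", embedding: [0.9, 0.1] },
];
console.log(search(docs, [1, 0], 2).map((d) => d.id)); // ["a", "c"]
```

The embeddings themselves could come from transformers.js, onnxruntime-web, or any other source; the store stays model-agnostic.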
I agree with you @visheratin @maccman. I'll pivot and focus on the index :)
Using a Rust wrapper of ONNX.