DawChihLiou closed this issue 1 year ago.
ORT web already has the WASM backend baked in.
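To make the point concrete, here is a minimal sketch of running inference entirely on the WASM backend with onnxruntime-web. The model path, input name (`input_ids`), and tensor shape are placeholders, not part of any real model; the `ort.env.wasm` flags are the library's tuning knobs for the threaded/SIMD builds.

```typescript
// Sketch: end-to-end inference on the onnxruntime-web WASM backend.
// "model.onnx" and the feed name "input_ids" are illustrative placeholders.
import * as ort from "onnxruntime-web";

// Optional WASM tuning knobs exposed via ort.env
ort.env.wasm.numThreads = 4; // use the threaded build, if available
ort.env.wasm.simd = true;    // use the SIMD build when supported

async function embed(inputIds: BigInt64Array): Promise<ort.Tensor> {
  // The session is created from the model buffer/URL and runs in WASM.
  const session = await ort.InferenceSession.create("model.onnx", {
    executionProviders: ["wasm"],
  });
  const feeds = {
    input_ids: new ort.Tensor("int64", inputIds, [1, inputIds.length]),
  };
  const results = await session.run(feeds);
  return results[session.outputNames[0]];
}
```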
@visheratin thanks for the research! Do you know if running the ONNX runtime from Rust (compiled to WASM) is possible? It'd be great if I could pack a transformer into Voy so it doesn't depend on other libraries to create embeddings.
I'm not convinced that packing the transformers in is a good idea. It seems like everyone's going to want to use different libraries and have different needs. A couple of examples will probably suffice.
Absolutely. The idea is to pack a language-model runtime like ONNX so everyone can choose their own model to perform feature extraction. With that in place, Voy would be a standalone semantic search engine. It's also worth exploring because I have a feeling a WASM transformer will be more performant than the JS one.
onnxruntime-web performance issue: https://github.com/microsoft/onnxruntime/issues/11181
From my experience with onnxruntime-web internals, at least some of the issues from that comment are already solved: in WASM mode, the session is created from the model buffer, and all processing runs in WASM. Below is a profiler screenshot showing the same thing: the processing is end-to-end WASM calls.
Of course, one can create an optimized model that will run much faster than onnxruntime-web, but that comes at the price of flexibility. GGML is a good example: all GGML models run super-fast, but every new architecture has to be implemented almost from scratch using low-level building blocks. If there were a universally good model for embeddings, one could implement it in GGML and bake it into Voy. But there is no such model; in many cases, users run custom fine-tuned models.
Packing the whole runtime into the binary is also questionable. The beauty of Voy is that it is fast and small. WASM files for ONNX runtime are around 20 MB for each runtime type (SIMD, threaded SIMD, WebGPU JSEP, etc.). If you want to support all client runtimes (e.g., for the longest time, Safari didn't have SIMD support; maybe they have it now, maybe not), you'd have to pack all these files.
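The SIMD-availability question above can be answered at runtime instead of shipping every build blindly. A common trick (used by the wasm-feature-detect library) is to ask the engine to validate a tiny module containing a `v128` instruction; `WebAssembly.validate` returns `true` only if the engine understands every instruction. The byte sequence below is that minimal probe module.

```typescript
// Detect WebAssembly SIMD support by validating a minimal module that
// uses v128 instructions (same approach as wasm-feature-detect).
const simdTestModule = new Uint8Array([
  0, 97, 115, 109, 1, 0, 0, 0, // "\0asm" magic + version 1
  1, 5, 1, 96, 0, 1, 123,      // type section: () -> v128
  3, 2, 1, 0,                  // function section: one function of that type
  10, 10, 1, 8, 0, 65, 0, 253, 15, 253, 98, 11, // code: i32.const 0; i8x16.splat; i8x16.popcnt; end
]);

const hasSimd: boolean = WebAssembly.validate(simdTestModule);
console.log(`SIMD supported: ${hasSimd}`);
```

Based on the result, a loader could fetch only the matching ONNX runtime `.wasm` file rather than packing all variants into the binary.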
I think making Voy a high-performance embedding store is the best approach.
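That division of labor can be sketched in a few lines: the caller produces embeddings with whatever model it likes, and the store only keeps vectors and ranks them by similarity. This is an illustrative brute-force sketch, not Voy's actual API or index structure (Voy uses a proper spatial index, not a linear scan).

```typescript
// Hypothetical "embedding store" sketch: store vectors, rank by cosine
// similarity. Names (Doc, search) are illustrative, not Voy's API.
type Doc = { id: string; embedding: number[] };

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function search(docs: Doc[], query: number[], k: number): Doc[] {
  // Brute-force ranking; a real index would avoid the full scan.
  return [...docs]
    .sort((x, y) => cosine(y.embedding, query) - cosine(x.embedding, query))
    .slice(0, k);
}

const docs: Doc[] = [
  { id: "a", embedding: [1, 0] },
  { id: "b", embedding: [0, 1] },
  { id: "c", embedding: [0.9, 0.1] },
];
console.log(search(docs, [1, 0], 2).map((d) => d.id)); // ["a", "c"]
```

The embeddings themselves could come from transformers.js, onnxruntime-web, or any other source; the store stays model-agnostic.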
I agree with you @visheratin @maccman. I'll pivot and focus on the index :)
Using a Rust wrapper of ONNX.