AaronFriel opened 1 month ago
I wonder if the WASI streaming protocol helps here, instead of doing IPC with a futex plus efficient reads and writes to shared circular buffers?
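For reference, a minimal sketch of the futex-based signaling being referred to; this is illustrative only (not code from the aici repo) and elides the shared memory mapping and ring-buffer bookkeeping. Linux-only, using the `libc` crate.

```rust
// Hedged sketch: waking a peer process via a futex word that would live in a
// MAP_SHARED page next to the ring buffer's head/tail indices.
use std::sync::atomic::{AtomicU32, Ordering};

/// Block until `*futex` no longer holds `expected` (or a wake is issued).
fn futex_wait(futex: &AtomicU32, expected: u32) {
    unsafe {
        libc::syscall(
            libc::SYS_futex,
            futex as *const AtomicU32,
            libc::FUTEX_WAIT, // no FUTEX_PRIVATE_FLAG, so it works cross-process
            expected,
            std::ptr::null::<libc::timespec>(), // no timeout
        );
    }
}

/// Wake up to one waiter blocked on `futex`.
fn futex_wake(futex: &AtomicU32) {
    unsafe {
        libc::syscall(
            libc::SYS_futex,
            futex as *const AtomicU32,
            libc::FUTEX_WAKE,
            1i32,
        );
    }
}

fn main() {
    // In the real setup this word sits in a shared mmap that both processes map.
    let ready = AtomicU32::new(0);

    // Producer side: publish data into the ring buffer, flip the flag, wake.
    ready.store(1, Ordering::Release);
    futex_wake(&ready);

    // Consumer side: sleep only while the flag still reads 0.
    while ready.load(Ordering::Acquire) == 0 {
        futex_wait(&ready, 0);
    }
}
```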
I guess regardless of how the wasm code communicates with the host, the host has to organize the threads/processes for each sequence. In AICI this is done with separate processes, which means you can kill a process and limit execution time without overhead (if you put a time limit in wasmtime it seems to add significant overhead; I don't remember exactly, but I recall around 30% in the inner bias-computation loop).
The processes can also fork, though how much of AICI's functionality actually needs that is another question.
Threads would potentially be easier to synchronize (though one has to be careful with things like rayon - it seems to have issues when you have more than ~50 cores).
I did some cursory research into the libraries available for cross-process IPC in Rust, and essentially all of them either add a significant degree of complexity to the implementation or aren't as latency-optimized (e.g. using an eventfd with the ring buffer to signal).
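A minimal sketch of the eventfd-style signaling mentioned above (again illustrative, not taken from any particular crate): the counter fd acts as a doorbell for a ring buffer whose indices live in shared memory. Linux-only, using the `libc` crate.

```rust
fn main() {
    // Create the doorbell. In a cross-process setup the fd has to be sent to
    // the peer over a unix socket (SCM_RIGHTS), which is one source of the
    // extra complexity relative to a bare futex word in the shared mapping.
    let efd = unsafe { libc::eventfd(0, 0) };
    assert!(efd >= 0, "eventfd failed");

    // Producer: after publishing data into the ring buffer, ring the doorbell.
    let one: u64 = 1;
    unsafe { libc::write(efd, &one as *const u64 as *const libc::c_void, 8) };

    // Consumer: block in read() until the counter is non-zero, then drain the
    // ring buffer. Each wakeup costs a syscall on both sides, which is the
    // latency cost being weighed against futexes here.
    let mut val: u64 = 0;
    unsafe { libc::read(efd, &mut val as *mut u64 as *mut libc::c_void, 8) };

    unsafe { libc::close(efd) };
}
```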
What considerations went into making aicirt a Rust binary as opposed to a library? I can guess one of them would be integration with vLLM.
I ask because I'm wondering if it might make more sense to export a library interface to it, letting the host decide how to handle concurrency - e.g. to oxidize it as a Python library for integration with vLLM using PyO3?
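Roughly what such a binding could look like; this is a hypothetical sketch (the `Runtime`/`step` names are made up, not the actual aicirt API) using the pre-0.21 PyO3 module signature, built with maturin:

```rust
use pyo3::prelude::*;

/// Hypothetical handle owning the controller runtime; heavy native work runs
/// with the GIL released so vLLM's Python loop isn't blocked.
#[pyclass]
struct Runtime {
    // e.g. wasmtime engine, thread pool, per-sequence controller state ...
}

#[pymethods]
impl Runtime {
    #[new]
    fn new() -> Self {
        Runtime {}
    }

    /// Run one scheduling step for the given sequence ids and return
    /// per-sequence results (placeholder: echoes the ids back).
    fn step(&self, py: Python<'_>, seq_ids: Vec<u64>) -> PyResult<Vec<u64>> {
        py.allow_threads(|| {
            // Bias computation / wasm execution would happen here, with the
            // library deciding how to parallelize across cores.
            Ok(seq_ids)
        })
    }
}

#[pymodule]
fn aicirt_py(_py: Python<'_>, m: &PyModule) -> PyResult<()> {
    m.add_class::<Runtime>()?;
    Ok(())
}
```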
Indeed, exposing aicirt as a library might be better from a Python standpoint. However, there is still internal concurrency within aicirt - namely running multiple sequence controllers in parallel. I think you don't want to expose that to Python, as it would have performance implications. Either way, that can be handled with processes (as done now) or threads.
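If the thread route were taken, the per-sequence concurrency could stay entirely inside the library, roughly like the sketch below (names are illustrative, not aicirt's real types):

```rust
use std::thread;

struct SeqController {
    id: u64,
}

impl SeqController {
    fn compute_bias(&self) -> Vec<f32> {
        // wasm execution / logit bias computation would happen here
        vec![self.id as f32; 8]
    }
}

/// Run one step for every controller on its own worker thread and join them
/// before returning, so the caller (e.g. Python) never sees the concurrency.
fn step_all(controllers: &[SeqController]) -> Vec<Vec<f32>> {
    thread::scope(|s| {
        let handles: Vec<_> = controllers
            .iter()
            .map(|c| s.spawn(|| c.compute_bias()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let ctrls: Vec<_> = (0..4u64).map(|id| SeqController { id }).collect();
    let biases = step_all(&ctrls);
    assert_eq!(biases.len(), 4);
}
```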
From #84:
Originally posted by @mmoskal in https://github.com/microsoft/aici/issues/84#issuecomment-2342293184