microsoft / aici

AICI: Prompts as (Wasm) Programs
MIT License

Scaling performance as sequences exceed core count #115

Open AaronFriel opened 1 month ago

AaronFriel commented 1 month ago

From #84:

Just as a general heads up - the problem we run into with AICI in production is the case where there are more sequences in a batch (and thus parallel controller processes) than cores. This is because I spin for a while on futexes (to minimize latency), and this kills performance when we're out of cores. This would need to be fixed somehow. The latency minimization was mostly there when we still had post/pre_process(); for mid_process() it shouldn't matter that much.

Originally posted by @mmoskal in https://github.com/microsoft/aici/issues/84#issuecomment-2342293184
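The oversubscription problem above comes from the classic spin-vs-block tradeoff: spinning wins latency when a spare core exists, and loses badly when it doesn't. A minimal sketch of a spin-then-block wait (the names `Signal`, `notify`, and `wait` are invented for illustration, not aicirt's actual API) looks like this: spin for a bounded number of iterations, then park on a condvar so an oversubscribed machine stops burning the core.

```rust
use std::hint;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

// Hypothetical spin-then-block signal: low latency when a core is free,
// graceful fallback to blocking when the machine is oversubscribed.
struct Signal {
    ready: AtomicBool,
    lock: Mutex<()>,
    cv: Condvar,
}

impl Signal {
    fn new() -> Self {
        Signal {
            ready: AtomicBool::new(false),
            lock: Mutex::new(()),
            cv: Condvar::new(),
        }
    }

    fn notify(&self) {
        self.ready.store(true, Ordering::Release);
        // Take the lock before notifying so a waiter cannot miss the wakeup
        // between its ready-check and its cv.wait().
        let _g = self.lock.lock().unwrap();
        self.cv.notify_all();
    }

    fn wait(&self, spin_iters: u32) {
        // Fast path: bounded spin for minimal latency.
        for _ in 0..spin_iters {
            if self.ready.load(Ordering::Acquire) {
                return;
            }
            hint::spin_loop();
        }
        // Slow path: block on the condvar, yielding the core.
        let mut g = self.lock.lock().unwrap();
        while !self.ready.load(Ordering::Acquire) {
            g = self.cv.wait(g).unwrap();
        }
    }
}

fn main() {
    let sig = Arc::new(Signal::new());
    let s2 = Arc::clone(&sig);
    let h = thread::spawn(move || {
        thread::sleep(Duration::from_millis(10));
        s2.notify();
    });
    sig.wait(10_000);
    println!("signalled");
    h.join().unwrap();
}
```

The `spin_iters` knob is exactly what goes wrong in the scenario above: any fixed spin budget that helps when cores are free becomes wasted work when sequences outnumber cores, which is why an adaptive or purely blocking wait may be preferable for mid_process().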

mmoskal commented 1 month ago

I wonder if the streaming protocol of WASI helps here: instead of using a futex, IPC with efficient reading and writing to shared circular buffers?
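To make the circular-buffer idea concrete, here is a minimal single-producer/single-consumer ring buffer sketch (my own illustration, not AICI code). In a real cross-process setup the slots would live in shared memory (e.g. an mmap'd region); here the storage sits behind an `Arc` and the slots are atomics so the example stays safe Rust.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

const CAP: usize = 8; // power of two, so index wrapping is a mask

// Lock-free SPSC ring buffer: one producer advances `tail`, one
// consumer advances `head`; both cursors grow monotonically.
struct Ring {
    head: AtomicUsize,     // consumer cursor
    tail: AtomicUsize,     // producer cursor
    slots: [AtomicU64; CAP],
}

impl Ring {
    fn new() -> Self {
        Ring {
            head: AtomicUsize::new(0),
            tail: AtomicUsize::new(0),
            slots: std::array::from_fn(|_| AtomicU64::new(0)),
        }
    }

    // Returns false when the buffer is full.
    fn push(&self, v: u64) -> bool {
        let tail = self.tail.load(Ordering::Relaxed);
        let head = self.head.load(Ordering::Acquire);
        if tail - head == CAP {
            return false; // full
        }
        self.slots[tail & (CAP - 1)].store(v, Ordering::Relaxed);
        self.tail.store(tail + 1, Ordering::Release); // publish the slot
        true
    }

    // Returns None when the buffer is empty.
    fn pop(&self) -> Option<u64> {
        let head = self.head.load(Ordering::Relaxed);
        let tail = self.tail.load(Ordering::Acquire);
        if head == tail {
            return None; // empty
        }
        let v = self.slots[head & (CAP - 1)].load(Ordering::Relaxed);
        self.head.store(head + 1, Ordering::Release); // free the slot
        Some(v)
    }
}

fn main() {
    let ring = Arc::new(Ring::new());
    let r = Arc::clone(&ring);
    let producer = thread::spawn(move || {
        for i in 0..100u64 {
            while !r.push(i) {
                std::hint::spin_loop();
            }
        }
    });
    let (mut sum, mut got) = (0u64, 0);
    while got < 100 {
        if let Some(v) = ring.pop() {
            sum += v;
            got += 1;
        }
    }
    producer.join().unwrap();
    assert_eq!(sum, 4950); // 0 + 1 + ... + 99
    println!("sum = {sum}");
}
```

Note the loops above still spin when the buffer is empty or full; a cross-process variant would still need a wakeup mechanism (futex, eventfd, or similar) for the blocking case, which is where the thread-organization question below comes in.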

I guess regardless of how the wasm code communicates with the host, the host has to organize the threads/processes for each sequence. In AICI this is done with separate processes, which means you can kill a process and limit execution time without overhead (if you put a time limit in wasmtime it seems to have significant overhead; I don't remember exactly, but I recall around 30% in the inner bias-computation loop).
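The zero-overhead property of the process approach is that the guest needs no instrumentation at all: the host just kills the child when it runs over budget. A sketch of host-side deadline enforcement using only the standard library (the `run_with_deadline` helper and the `sleep 10` stand-in workload are my own illustration):

```rust
use std::process::Command;
use std::thread;
use std::time::{Duration, Instant};

// Run a child process, killing it if it exceeds `deadline`.
// Returns Some(status) on normal exit, None if it was killed.
fn run_with_deadline(
    mut cmd: Command,
    deadline: Duration,
) -> Option<std::process::ExitStatus> {
    let mut child = cmd.spawn().expect("spawn failed");
    let start = Instant::now();
    loop {
        // Non-blocking check whether the child has exited.
        if let Some(status) = child.try_wait().expect("wait failed") {
            return Some(status); // finished within the deadline
        }
        if start.elapsed() > deadline {
            let _ = child.kill(); // over budget: terminate it
            let _ = child.wait(); // reap, avoiding a zombie
            return None;
        }
        thread::sleep(Duration::from_millis(5));
    }
}

fn main() {
    let mut cmd = Command::new("sleep");
    cmd.arg("10"); // stand-in for a runaway controller
    let res = run_with_deadline(cmd, Duration::from_millis(50));
    assert!(res.is_none()); // killed, did not finish
    println!("child was killed after the deadline");
}
```

The per-instruction cost of this scheme is zero; the price is paid once at process spawn and in the polling loop, which is why it compares favorably to instrumenting the wasm inner loop with a fuel or deadline check.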

The processes can also fork, though how needed that is for AICI functionality is another question.

Threads would potentially be easier to synchronize (though one has to be careful with things like rayon; it seems to have issues when you have more than 50 cores or so).

AaronFriel commented 1 month ago

I did some cursory research into the libraries available for cross-process IPC in Rust, and, well, essentially all of them add a significant degree of complexity to the implementation or aren't as latency-optimized (e.g. using eventfd with a ring buffer for signaling).

What considerations went into using aicirt as a Rust binary as opposed to a library? I can guess one of those would be integration with vLLM.

I ask because I'm wondering if it might make more sense to export a library interface, allowing the host to decide how to handle concurrency, and e.g. to package it as a Python library for integration with vLLM using PyO3?

mmoskal commented 1 month ago

Indeed, exposing aicirt as a library might be better from the Python standpoint. However, there is still internal concurrency within aicirt, namely running multiple sequence controllers in parallel. I think you don't want to expose that to Python, as it would have performance implications. Either way, that can be handled with processes (as done now) or threads.
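To make the library-interface idea concrete, here is one possible shape for such a surface (all names here are invented for illustration; only `mid_process` echoes an actual AICI concept mentioned above). The host owns scheduling and calls into the runtime once per sequence per step, while the runtime keeps its internal parallelism, whether processes or threads, hidden behind the trait.

```rust
/// Per-step result handed back to the inference engine.
struct StepResult {
    logit_bias: Vec<f32>, // one bias value per vocab entry
    stop: bool,           // whether this sequence should terminate
}

/// One controller instance per sequence; the host drives it each step.
trait SequenceController {
    fn mid_process(&mut self, tokens: &[u32]) -> StepResult;
}

/// Trivial in-process stand-in: biases a single fixed token and
/// stops after `max_steps` calls. A real implementation would
/// dispatch into the wasm controller here.
struct FixedBiasController {
    vocab: usize,
    favored: usize,
    steps: usize,
    max_steps: usize,
}

impl SequenceController for FixedBiasController {
    fn mid_process(&mut self, _tokens: &[u32]) -> StepResult {
        self.steps += 1;
        let mut bias = vec![0.0; self.vocab];
        bias[self.favored] = 100.0; // strongly favor one token
        StepResult {
            logit_bias: bias,
            stop: self.steps >= self.max_steps,
        }
    }
}

fn main() {
    let mut c = FixedBiasController {
        vocab: 4,
        favored: 2,
        steps: 0,
        max_steps: 2,
    };
    let r1 = c.mid_process(&[1]);
    assert!(!r1.stop);
    let r2 = c.mid_process(&[1, 2]);
    assert!(r2.stop);
    assert_eq!(r2.logit_bias[2], 100.0);
    println!("two steps, then stopped");
}
```

With a surface like this, the Python binding (via PyO3 or similar) would only need to wrap the trait's entry points, and the question of whether controllers run as processes or threads stays entirely inside the library, which matches the point above about not exposing internal concurrency to Python.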