stikkireddy / mlflow-extensions

Deploy models quickly to Databricks via MLflow-based serving infra.
Apache License 2.0

[RFC] Batch inference using ez_deploy_config #19

Open stikkireddy opened 1 week ago

stikkireddy commented 1 week ago

Initially we can stick to this ideal interface:

def perform_batch(table, ez_deploy_config, batch_config) -> bool

The batch_config should expose only a few knobs: execution target (endpoint or local), engine (Ray vs. Spark vs. httpx), parallelism, and checkpointing strategy. For Spark we can use streaming, but partitions need to be tuned so that we transact to the Delta table roughly every 1000 rows. GPUs MUST be saturated with requests (this may be easier to achieve with Ray actors).
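As a rough sketch, batch_config could be a small dataclass; all names, fields, and defaults below are hypothetical and only illustrate the knobs above:

from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class BatchConfig:
    # where inference runs: a deployed serving endpoint or the local cluster
    target: Literal["endpoint", "local"] = "endpoint"
    # execution engine used to fan out requests
    engine: Literal["ray", "spark", "httpx"] = "spark"
    # concurrent in-flight requests per worker/actor
    parallelism: int = 8
    # commit results to the Delta table roughly every N rows
    commit_every_n_rows: int = 1000
    # optional location for engine checkpoint state (e.g. Spark streaming)
    checkpoint_location: Optional[str] = None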

The output must always be a Delta table.
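To make the Spark path concrete, here is a sketch of micro-batch scoring with Structured Streaming, where each micro-batch becomes one Delta transaction. The endpoint URL, table and column names, and checkpoint path are placeholders, and the row-at-a-time UDF is illustrative only; a real implementation would batch and parallelize requests:

import httpx
from pyspark.sql import SparkSession, DataFrame, functions as F

spark = SparkSession.builder.getOrCreate()

@F.udf("string")
def score(text: str) -> str:
    # hypothetical: POST a single row to the serving endpoint
    resp = httpx.post(
        "https://<workspace-url>/serving-endpoints/<endpoint>/invocations",
        json={"inputs": [text]},
    )
    return resp.json()["predictions"][0]

def score_and_append(micro_batch: DataFrame, batch_id: int) -> None:
    # each micro-batch commits as one append transaction to Delta
    scored = micro_batch.withColumn("prediction", score("input"))
    scored.write.format("delta").mode("append").saveAsTable("scored_output")

(spark.readStream
    .option("maxFilesPerTrigger", 1)  # tune so a micro-batch is ~1000 rows
    .table("source_table")
    .writeStream
    .foreachBatch(score_and_append)
    .option("checkpointLocation", "/tmp/_ckpt/scored_output")
    .trigger(availableNow=True)
    .start()
    .awaitTermination())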

If you need Ray, refer to this; it is very outdated but may be helpful: https://github.com/stikkireddy/llm-batch-inference/blob/main/01_batch_scoring_single_node.py
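For the Ray path, a minimal sketch of GPU-pinned actors; it assumes vLLM as the local engine and at least one GPU in the cluster, and the model name, chunk size, and workload are stand-ins:

import ray

ray.init(ignore_reinit_error=True)

@ray.remote(num_gpus=1)
class ScoringActor:
    def __init__(self, model_name: str):
        # assumption: vLLM is installed; any local engine would work here
        from vllm import LLM
        self.llm = LLM(model=model_name)

    def score(self, prompts: list) -> list:
        from vllm import SamplingParams
        # large chunks per call keep the GPU saturated
        outs = self.llm.generate(prompts, SamplingParams(max_tokens=256))
        return [o.outputs[0].text for o in outs]

prompts = ["What is MLflow?"] * 4000  # stand-in workload
num_gpus = int(ray.cluster_resources().get("GPU", 0))  # assumes >= 1 GPU
actors = [ScoringActor.remote("meta-llama/Llama-3.1-8B-Instruct")
          for _ in range(num_gpus)]

# round-robin ~1000-row chunks across the actor pool
chunks = [prompts[i:i + 1000] for i in range(0, len(prompts), 1000)]
futures = [actors[i % len(actors)].score.remote(c) for i, c in enumerate(chunks)]
results = [row for chunk in ray.get(futures) for row in chunk]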

stikkireddy commented 1 week ago

Using Ray + GPU VMs works phenomenally.

stikkireddy commented 1 week ago

Batch inference with SGLang requires this: https://github.com/sgl-project/sglang/pull/1127