Initially we can stick to this ideal interface:
def perform_batch(table, ez_deploy_config, batch_config) -> bool
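A minimal sketch of what batch_config could look like as a plain dataclass; the field names (`target`, `engine`, `parallelism`, `checkpoint_every_n_rows`) are illustrative assumptions, not a settled API:

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class BatchConfig:
    # where inference runs: a served endpoint or a local model
    target: Literal["endpoint", "local"] = "endpoint"
    # which engine drives the batch: ray actors, spark (streaming), or plain httpx calls
    engine: Literal["ray", "spark", "httpx"] = "httpx"
    # number of concurrent in-flight requests per worker
    parallelism: int = 8
    # commit results to the Delta table roughly every N rows
    checkpoint_every_n_rows: int = 1000

def perform_batch(table: str, ez_deploy_config: dict, batch_config: BatchConfig) -> bool:
    """Score `table` and write results to a Delta table; return True on success."""
    raise NotImplementedError  # engine dispatch (ray/spark/httpx) goes here
```

Keeping the config to these few knobs means the engine choice stays an implementation detail behind `perform_batch`.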
The batch_config should expose only a few knobs: execution target (endpoint or local), engine (ray vs spark vs httpx), parallelism, and checkpointing strategy. For Spark we can use streaming, but partitions need to be tuned so that we transact to the Delta table roughly every 1000 rows. GPUs MUST stay saturated with requests (this may be easier to achieve with Ray actors).
The output must always be a Delta table.
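To make the "saturate the GPUs, checkpoint every ~1000 rows" idea concrete, here is a hedged asyncio sketch: `score_row` is a stand-in for a real endpoint call (httpx or a Ray actor) and `commit_to_delta` is a stand-in for a real Delta table transaction; a semaphore keeps a fixed number of requests in flight:

```python
import asyncio

CHECKPOINT_EVERY = 1000   # commit to the Delta table roughly every 1000 rows
PARALLELISM = 64          # in-flight requests; tune until the GPU is saturated

async def score_row(row):
    # stand-in for a real model/endpoint call (e.g. via httpx or a ray actor)
    await asyncio.sleep(0)
    return {"input": row, "output": f"scored:{row}"}

def commit_to_delta(batch):
    # stand-in for a real Delta table transaction
    print(f"committed {len(batch)} rows")

async def run_batch(rows):
    sem = asyncio.Semaphore(PARALLELISM)

    async def bounded(row):
        async with sem:
            return await score_row(row)

    pending, committed = [], 0
    # drain results as they finish so the request pipeline stays full
    for fut in asyncio.as_completed([bounded(r) for r in rows]):
        pending.append(await fut)
        if len(pending) >= CHECKPOINT_EVERY:
            commit_to_delta(pending)
            committed += len(pending)
            pending = []
    if pending:  # flush the final partial batch
        commit_to_delta(pending)
        committed += len(pending)
    return committed
```

Running `asyncio.run(run_batch(range(2500)))` commits three batches (1000, 1000, 500); the same commit cadence is what the Spark streaming path would need partition tuning to achieve.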
If you need Ray, refer to this example (very outdated, but it may be helpful): https://github.com/stikkireddy/llm-batch-inference/blob/main/01_batch_scoring_single_node.py