There are discrepancies in how features like CUDA graphs and FlashAttention behave across hardware, so supporting GPU- and cloud-specific configuration in EzDeployConfig is going to be important.
COHERE_FOR_AYA_23_35B = EzDeployConfig(
    name="cohere_aya_35b",
    engine_proc=_ENGINE,
    engine_config=_ENGINE_CONFIG(
        model="CohereForAI/aya-23-35B",
        guided_decoding_backend="outlines",
        vllm_command_flags={
            "--max-num-seqs": 128,
            "--gpu-memory-utilization": 0.98,
            "--distributed-executor-backend": "ray",
        },
    ),
    gpu_specific_config={...},
    cloud_specific_config={...},
    serving_config=ServingConfig(
        # Rough estimate for the engine: model weights + KV cache
        # + overhead + intermediate activations.
        minimum_memory_in_gb=76,
    ),
)
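One way this could work (a hypothetical sketch — `gpu_specific_config`'s real schema, the `resolve_flags` helper, and the per-GPU override values here are all assumptions, not the actual EzDeployConfig API): keep the base vLLM flags, then merge in overrides keyed by GPU type, e.g. forcing eager mode where CUDA graph capture is unreliable or lowering memory utilization on smaller cards. The flags shown (`--enforce-eager`, `--enable-chunked-prefill`) are real vLLM CLI flags.

```python
# Hypothetical sketch of merging gpu_specific_config overrides into
# the base vLLM flags; names and values are illustrative assumptions.

BASE_FLAGS = {
    "--max-num-seqs": 128,
    "--gpu-memory-utilization": 0.98,
    "--distributed-executor-backend": "ray",
}

# Per-GPU overrides (illustrative): disable CUDA graphs via eager mode
# on one card, enable chunked prefill on another.
GPU_OVERRIDES = {
    "A10G": {"--enforce-eager": True, "--gpu-memory-utilization": 0.90},
    "H100": {"--enable-chunked-prefill": True},
}

def resolve_flags(gpu_name: str) -> dict:
    """Return base flags with any GPU-specific overrides applied."""
    flags = dict(BASE_FLAGS)
    flags.update(GPU_OVERRIDES.get(gpu_name, {}))
    return flags
```

Cloud-specific overrides could layer on the same way, with the cloud's entry merged after the GPU's so the most specific setting wins.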