Hi! I see two issues in your code:
With vllm you can't use the torchrun launcher, because vllm uses a different distribution library (Ray) instead of torchrun. To run distributed inference with vllm you need to use a simple launcher (like process) and configure the backend parameter tensor_parallel_size.
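For the multi-GPU case, here is a minimal sketch of what the launcher/backend part could look like under that advice; it assumes tensor_parallel_size is passed directly to VLLMConfig, and the GPU ids/count are placeholders:

from optimum_benchmark import ProcessConfig, VLLMConfig

# simple process launcher instead of torchrun
launcher_config = ProcessConfig()
# let vllm handle the sharding itself via tensor parallelism
backend_config = VLLMConfig(
    model="THUDM/glm-4-9b-chat",
    task="text-generation",
    device="cuda",
    device_ids="0,1",           # placeholder: two visible GPUs
    tensor_parallel_size=2,     # shard the model across both GPUs
)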
I was able to benchmark the model on a single GPU. You need to pass {"trust_remote_code": True} in both model_kwargs and processor_kwargs, and to add enforce_eager=True to the backend config.
from optimum_benchmark import (
Benchmark,
BenchmarkConfig,
InferenceConfig,
ProcessConfig,
VLLMConfig,
)
from optimum_benchmark.logging_utils import setup_logging
setup_logging(level="INFO")
if __name__ == "__main__":
    # simple process launcher (torchrun is not compatible with vllm)
    launcher_config = ProcessConfig()
    # measure latency and memory during inference
    scenario_config = InferenceConfig(latency=True, memory=True)
    backend_config = VLLMConfig(
        model="THUDM/glm-4-9b-chat",
        task="text-generation",
        device="cuda",
        device_ids="0",
        no_weights=True,
        enforce_eager=True,
        library="transformers",
        model_kwargs={"trust_remote_code": True},
        processor_kwargs={"trust_remote_code": True},
    )
    benchmark_config = BenchmarkConfig(
        name="vllm_glm_4",
        scenario=scenario_config,
        launcher=launcher_config,
        backend=backend_config,
    )
    benchmark_report = Benchmark.launch(benchmark_config)
    benchmark = Benchmark(config=benchmark_config, report=benchmark_report)
    benchmark.save_json("benchmark.json")
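Once the run completes you can inspect the saved report directly; here is a minimal sketch using only the standard json module, since the exact layout of the report may differ between optimum-benchmark versions:

import json

# load the report written by benchmark.save_json and pretty-print it;
# the nesting of the latency/memory sections depends on the library version
with open("benchmark.json") as f:
    report = json.load(f)
print(json.dumps(report, indent=2))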
I was running the benchmark examples with the vllm backend using the following script.
But I got an error:
My optimum-benchmark version is 0.3.1 and my vllm version is 0.5.2. Do you have any suggestions on this? Thanks.