Hi! I see two issues in your code:
With vllm you can't use the torchrun launcher, because vllm uses a different distribution library (Ray) instead of torchrun. To run distributed inference with vllm you need to use a simple launcher (like process) and configure the backend parameter tensor_parallel_size.
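For the multi-GPU case, here is a minimal sketch of what the launcher/backend part could look like under that advice; it assumes tensor_parallel_size is passed directly to VLLMConfig, and the GPU ids/count are placeholders:

from optimum_benchmark import ProcessConfig, VLLMConfig

# simple process launcher instead of torchrun
launcher_config = ProcessConfig()
# let vllm handle the sharding itself via tensor parallelism
backend_config = VLLMConfig(
    model="THUDM/glm-4-9b-chat",
    task="text-generation",
    device="cuda",
    device_ids="0,1",           # placeholder: two visible GPUs
    tensor_parallel_size=2,     # shard the model across both GPUs
)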
I was able to benchmark the model on a single GPU. You need to pass {"trust_remote_code": True} in both model_kwargs and processor_kwargs, and to add enforce_eager=True to the backend config.
from optimum_benchmark import (
Benchmark,
BenchmarkConfig,
InferenceConfig,
ProcessConfig,
VLLMConfig,
)
from optimum_benchmark.logging_utils import setup_logging
setup_logging(level="INFO")
if __name__ == "__main__":
    # simple process launcher (torchrun is not compatible with vllm)
    launcher_config = ProcessConfig()
    # measure latency and memory during inference
    scenario_config = InferenceConfig(latency=True, memory=True)
    backend_config = VLLMConfig(
        model="THUDM/glm-4-9b-chat",
        task="text-generation",
        device="cuda",
        device_ids="0",
        no_weights=True,
        enforce_eager=True,
        library="transformers",
        model_kwargs={"trust_remote_code": True},
        processor_kwargs={"trust_remote_code": True},
    )
    benchmark_config = BenchmarkConfig(
        name="vllm_glm_4",
        scenario=scenario_config,
        launcher=launcher_config,
        backend=backend_config,
    )
    benchmark_report = Benchmark.launch(benchmark_config)
    benchmark = Benchmark(config=benchmark_config, report=benchmark_report)
    benchmark.save_json("benchmark.json")
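Once the run completes you can inspect the saved report directly; here is a minimal sketch using only the standard json module, since the exact layout of the report may differ between optimum-benchmark versions:

import json

# load the report written by benchmark.save_json and pretty-print it;
# the nesting of the latency/memory sections depends on the library version
with open("benchmark.json") as f:
    report = json.load(f)
print(json.dumps(report, indent=2))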
I was running the benchmark examples with the vllm backend using the following script.
But I got an error:
My optimum-benchmark version is 0.3.1 and my vllm version is 0.5.2. Do you have any suggestions on this? Thanks.