vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

amazon/FalconLite2 #7040

Open sadrafh opened 2 months ago

sadrafh commented 2 months ago

I want to use vLLM with the model amazon/FalconLite2 (https://huggingface.co/amazon/FalconLite2) to benchmark throughput and latency. However, the model is not supported by vLLM. What should I do to make this run? Thanks

mgoin commented 2 months ago

@sadrafh I was able to get this model to load with a small fix to the config:

diff --git a/vllm/transformers_utils/configs/falcon.py b/vllm/transformers_utils/configs/falcon.py
index c82cc606..43eb0438 100644
--- a/vllm/transformers_utils/configs/falcon.py
+++ b/vllm/transformers_utils/configs/falcon.py
@@ -69,6 +69,7 @@ class RWConfig(PretrainedConfig):
         self.bias = bias
         self.parallel_attn = parallel_attn
         self.new_decoder_architecture = new_decoder_architecture
+        self.num_ln_in_parallel_attn = None

         if self.hidden_size == 8192:
             # Hack for falcon-40b

However, it seems to produce gibberish:

>>> from vllm import LLM
>>> model = LLM("amazon/FalconLite2", trust_remote_code=True, quantization="gptq")
>>> model.generate("<|prompter|>What are the main challenges to support a long context for LLM?<|endoftext|><|assistant|>")
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  3.64it/s, est. speed input: 62.00 toks/s, output: 58.35 toks/s]
[RequestOutput(request_id=0, prompt='<|prompter|>What are the main challenges to support a long context for LLM?<|endoftext|><|assistant|>', prompt_token_ids=[65028, 1562, 362, 248, 1316, 4922, 271, 1164, 241, 916, 4436, 312, 31370, 56, 42, 11, 65027], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=' Ern Solar approachedThomas kost lol igual Alarm creatorvenue交Verified Debor soybean circulatecaster', token_ids=(53224, 15197, 15377, 24174, 33011, 12929, 26976, 41954, 17087, 4255, 20098, 48622, 37688, 52315, 54369, 30414), cumulative_logprob=None, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1722542626.8144124, last_token_time=1722542626.8144124, first_scheduled_time=1722542626.8371897, first_token_time=1722542626.9011965, time_in_queue=0.022777318954467773, finished_time=1722542627.1092346), lora_request=None)]
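
For reference, the same one-line config change can also be applied at runtime, without editing the installed package, by patching RWConfig before constructing the LLM. This is only a sketch, assuming a vLLM ~v0.4.x install where vllm/transformers_utils/configs/falcon.py defines RWConfig and that class is used for this model (as the diff above suggests); it only supplies the missing attribute and does not address the gibberish output.

# Sketch: runtime equivalent of the one-line config diff above.
from vllm.transformers_utils.configs.falcon import RWConfig

_orig_init = RWConfig.__init__

def _patched_init(self, *args, **kwargs):
    _orig_init(self, *args, **kwargs)
    # Same effect as the diff: make sure the attribute the Falcon decoder
    # layer looks up exists, defaulting to None.
    if not hasattr(self, "num_ln_in_parallel_attn"):
        self.num_ln_in_parallel_attn = None

RWConfig.__init__ = _patched_init

from vllm import LLM
model = LLM("amazon/FalconLite2", trust_remote_code=True, quantization="gptq")
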
sadrafh commented 2 months ago

Thanks for spending the time to help me. Could you please elaborate a bit more on your modifications? And do you know how we can fix the results?

Specifically, I made the modifications but got the error below:

ubuntu@sadra-test-a10:~/vllm/benchmarks$ python3 -u benchmark_latency.py --model /home/ubuntu/FalconLite2/ --batch-size 1 --input-len 128 --output-len 128 --num-iters-warmup 2 --num-iters 4 --quantization 'gptq' --trust-remote-code
Namespace(model='/home/ubuntu/FalconLite2/', tokenizer=None, quantization='gptq', tensor_parallel_size=1, input_len=128, output_len=128, batch_size=1, n=1, use_beam_search=False, num_iters_warmup=2, num_iters=4, trust_remote_code=True, dtype='auto', enforce_eager=False, kv_cache_dtype='auto', quantization_param_path=None, profile=False, profile_result_dir=None, device='cuda', block_size=16, enable_chunked_prefill=False, ray_workers_use_nsight=False, download_dir=None)
You are using a model of type RefinedWeb to instantiate a model of type falcon. This is not supported for all configurations of models and can yield errors.
--- Logging error ---
Traceback (most recent call last):
  File "/usr/lib/python3.10/logging/__init__.py", line 1100, in emit
    msg = self.format(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 943, in format
    return fmt.format(record)
  File "/usr/local/lib/python3.10/dist-packages/vllm/logging/formatter.py", line 11, in format
    msg = logging.Formatter.format(self, record)
  File "/usr/lib/python3.10/logging/__init__.py", line 678, in format
    record.message = record.getMessage()
  File "/usr/lib/python3.10/logging/__init__.py", line 368, in getMessage
    msg = msg % self.args
TypeError: %d format: a real number is required, not list
Call stack:
  File "/home/ubuntu/vllm/benchmarks/benchmark_latency.py", line 197, in <module>
    main(args)
  File "/home/ubuntu/vllm/benchmarks/benchmark_latency.py", line 20, in main
    llm = LLM(model=args.model,
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py", line 144, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 335, in from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/arg_utils.py", line 559, in create_engine_config
    model_config = ModelConfig(
  File "/usr/local/lib/python3.10/dist-packages/vllm/config.py", line 133, in __init__
    self.max_model_len = _get_and_verify_max_len(
  File "/usr/local/lib/python3.10/dist-packages/vllm/config.py", line 1208, in _get_and_verify_max_len
    logger.warning(
Message: "The model's config.json does not contain any of the following keys to determine the original maximum length of the model: %d. Assuming the model's maximum length is %d."
Arguments: (['max_position_embeddings', 'n_positions', 'max_seq_len', 'seq_length', 'model_max_length', 'max_sequence_length', 'max_seq_length', 'seq_len'], 2048)
WARNING 08-01 20:38:23 config.py:213] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 08-01 20:38:23 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='/home/ubuntu/FalconLite2/', speculative_config=None, tokenizer='/home/ubuntu/FalconLite2/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/home/ubuntu/FalconLite2/)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/ubuntu/vllm/benchmarks/benchmark_latency.py", line 197, in <module>
[rank0]:   File "/home/ubuntu/vllm/benchmarks/benchmark_latency.py", line 20, in main
[rank0]:     llm = LLM(model=args.model,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py", line 144, in __init__
[rank0]:     self.llm_engine = LLMEngine.from_engine_args(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 359, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 222, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 41, in __init__
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 24, in _init_executor
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 121, in load_model
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 134, in load_model
[rank0]:     self.model = get_model(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
[rank0]:     return loader.load_model(model_config=model_config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 240, in load_model
[rank0]:     model = _initialize_model(model_config, self.load_config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 91, in _initialize_model
[rank0]:     return model_class(config=model_config.hf_config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/falcon.py", line 389, in __init__
[rank0]:     self.transformer = FalconModel(config, cache_config, quant_config)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/falcon.py", line 350, in __init__
[rank0]:     self.h = nn.ModuleList(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/falcon.py", line 351, in <listcomp>
[rank0]:     FalconDecoderLayer(config, cache_config, quant_config)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/falcon.py", line 249, in __init__
[rank0]:     if (config.num_ln_in_parallel_attn is None
[rank0]:   File "/home/ubuntu/.local/lib/python3.10/site-packages/transformers/configuration_utils.py", line 264, in __getattribute__
[rank0]:     return super().__getattribute__(key)

I made the modifications here: vllm/vllm/transformers_utils/configs/falcon.py

    self.bias = bias
    self.parallel_attn = parallel_attn
    self.new_decoder_architecture = new_decoder_architecture
    self.num_ln_in_parallel_attn = None

    if self.hidden_size == 8192:
        # Hack for falcon-40b
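
One possible cause of the AttributeError: the traceback above imports vllm from /usr/local/lib/python3.10/dist-packages, not from the edited checkout under ~/vllm, so the modified falcon.py may never be the copy that actually runs. A minimal check (a sketch, assuming a standard pip-installed vLLM) to confirm which files Python imports:

# Sketch: verify which vLLM installation and which falcon.py Python loads.
import inspect
import vllm
from vllm.transformers_utils.configs.falcon import RWConfig

print(vllm.__version__)                 # e.g. 0.4.3 per the log above
print(vllm.__file__)                    # dist-packages vs. the ~/vllm checkout
print(inspect.getsourcefile(RWConfig))  # this must be the file containing the edit

If the printed path points at dist-packages, the edit needs to be made there instead, or vLLM reinstalled from the modified source (e.g. an editable install) so the checkout is the copy that gets imported.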