sadrafh opened this issue 2 months ago:

I want to use vLLM with the model amazon/FalconLite2 (https://huggingface.co/amazon/FalconLite2) to benchmark throughput and latency. However, the model is not supported by vLLM. What should I do to make it run? Thanks
@sadrafh I was able to get this model to load with a small fix to the config:
```diff
diff --git a/vllm/transformers_utils/configs/falcon.py b/vllm/transformers_utils/configs/falcon.py
index c82cc606..43eb0438 100644
--- a/vllm/transformers_utils/configs/falcon.py
+++ b/vllm/transformers_utils/configs/falcon.py
@@ -69,6 +69,7 @@ class RWConfig(PretrainedConfig):
         self.bias = bias
         self.parallel_attn = parallel_attn
         self.new_decoder_architecture = new_decoder_architecture
+        self.num_ln_in_parallel_attn = None
         if self.hidden_size == 8192:
             # Hack for falcon-40b
```
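For context on why this attribute matters: the Falcon model code reads `num_ln_in_parallel_attn` off the config (apparently to decide how many layer norms each decoder layer gets), and the custom `RWConfig` used for RefinedWeb-style checkpoints never defines it, so model construction crashes. A defensive variant of the same fix would tolerate the missing attribute at the read site instead; a sketch, assuming the value is only ever compared against 2 (the actual access in `vllm/model_executor/models/falcon.py` may differ):

```python
def resolve_num_ln(config) -> int:
    """How many layer norms a Falcon decoder layer should build.

    Sketch only: assumes num_ln_in_parallel_attn is merely compared
    against 2 in the model code, and that new-decoder-architecture
    checkpoints (falcon-40b style) default to two layer norms
    (ln_attn + ln_mlp) while older ones use a single input_layernorm.
    """
    num_ln = getattr(config, "num_ln_in_parallel_attn", None)
    if num_ln is None:
        num_ln = 2 if getattr(config, "new_decoder_architecture", False) else 1
    return num_ln
```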
However, it seems to produce gibberish:

```
>>> from vllm import LLM
>>> model = LLM("amazon/FalconLite2", trust_remote_code=True, quantization="gptq")
>>> model.generate("<|prompter|>What are the main challenges to support a long context for LLM?<|endoftext|><|assistant|>")
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 3.64it/s, est. speed input: 62.00 toks/s, output: 58.35 toks/s]
[RequestOutput(request_id=0, prompt='<|prompter|>What are the main challenges to support a long context for LLM?<|endoftext|><|assistant|>', prompt_token_ids=[65028, 1562, 362, 248, 1316, 4922, 271, 1164, 241, 916, 4436, 312, 31370, 56, 42, 11, 65027], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=' Ern Solar approachedThomas kost lol igual Alarm creatorvenue交Verified Debor soybean circulatecaster', token_ids=(53224, 15197, 15377, 24174, 33011, 12929, 26976, 41954, 17087, 4255, 20098, 48622, 37688, 52315, 54369, 30414), cumulative_logprob=None, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1722542626.8144124, last_token_time=1722542626.8144124, first_scheduled_time=1722542626.8371897, first_token_time=1722542626.9011965, time_in_queue=0.022777318954467773, finished_time=1722542627.1092346), lora_request=None)]
```
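One quick check before digging into the quantized weights is to rule out sampling randomness by forcing greedy decoding; a sketch using vLLM's public `SamplingParams` API, with the same model and prompt as above:

```python
from vllm import LLM, SamplingParams

llm = LLM("amazon/FalconLite2", trust_remote_code=True, quantization="gptq")

# temperature=0 selects the argmax token at every step, so repeated runs are
# deterministic; persistent gibberish then points at weights or config
# rather than at an unlucky sample.
greedy = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(
    "<|prompter|>What are the main challenges to support a long context "
    "for LLM?<|endoftext|><|assistant|>",
    greedy,
)
print(outputs[0].outputs[0].text)
```

If the greedy output is still word salad, the problem is in how the GPTQ weights or the config are being interpreted, not in the sampler.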
Thanks for spending the time to help me. Could you please elaborate a little more on your modifications? And do you know how we can fix the results?
Specifically, I applied the modifications but got the error below:
```
ubuntu@sadra-test-a10:~/vllm/benchmarks$ python3 -u benchmark_latency.py --model /home/ubuntu/FalconLite2/ --batch-size 1 --input-len 128 --output-len 128 --num-iters-warmup 2 --num-iters 4 --quantization 'gptq' --trust-remote-code
Namespace(model='/home/ubuntu/FalconLite2/', tokenizer=None, quantization='gptq', tensor_parallel_size=1, input_len=128, output_len=128, batch_size=1, n=1, use_beam_search=False, num_iters_warmup=2, num_iters=4, trust_remote_code=True, dtype='auto', enforce_eager=False, kv_cache_dtype='auto', quantization_param_path=None, profile=False, profile_result_dir=None, device='cuda', block_size=16, enable_chunked_prefill=False, ray_workers_use_nsight=False, download_dir=None)
You are using a model of type RefinedWeb to instantiate a model of type falcon. This is not supported for all configurations of models and can yield errors.
--- Logging error ---
Traceback (most recent call last):
  File "/usr/lib/python3.10/logging/__init__.py", line 1100, in emit
    msg = self.format(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 943, in format
    return fmt.format(record)
  File "/usr/local/lib/python3.10/dist-packages/vllm/logging/formatter.py", line 11, in format
    msg = logging.Formatter.format(self, record)
  File "/usr/lib/python3.10/logging/__init__.py", line 678, in format
    record.message = record.getMessage()
  File "/usr/lib/python3.10/logging/__init__.py", line 368, in getMessage
    msg = msg % self.args
TypeError: %d format: a real number is required, not list
Call stack:
  File "/home/ubuntu/vllm/benchmarks/benchmark_latency.py", line 197, in <module>
[rank0]:   File "/home/ubuntu/vllm/benchmarks/benchmark_latency.py", line 20, in main
[rank0]:     llm = LLM(model=args.model,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py", line 144, in __init__
[rank0]:     self.llm_engine = LLMEngine.from_engine_args(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 359, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 222, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 41, in __init__
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 24, in _init_executor
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 121, in load_model
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 134, in load_model
[rank0]:     self.model = get_model(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
[rank0]:     return loader.load_model(model_config=model_config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 240, in load_model
[rank0]:     model = _initialize_model(model_config, self.load_config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 91, in _initialize_model
[rank0]:     return model_class(config=model_config.hf_config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/falcon.py", line 389, in __init__
[rank0]:     self.transformer = FalconModel(config, cache_config, quant_config)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/falcon.py", line 350, in __init__
[rank0]:     self.h = nn.ModuleList([
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/falcon.py", line 351, in <listcomp>
```
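As an aside, the "--- Logging error ---" block above is a secondary failure that hides the real exception: some log call hands a list to a %d placeholder. That TypeError is easy to reproduce on its own (illustrative snippet, not vLLM code):

```python
import logging

logging.basicConfig()
log = logging.getLogger("demo")

# %d expects a real number; passing a list makes the logging module print
# "--- Logging error ---" followed by
# "TypeError: %d format: a real number is required, not list",
# and the message that triggered the log call is lost.
log.error("got %d items", [1, 2, 3])
```

So the interesting part here (whatever vLLM was trying to log right before the crash) never makes it to the console.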
I made the modifications here: vllm/transformers_utils/configs/falcon.py

```python
        self.bias = bias
        self.parallel_attn = parallel_attn
        self.new_decoder_architecture = new_decoder_architecture
        self.num_ln_in_parallel_attn = None
        if self.hidden_size == 8192:
            # Hack for falcon-40b
```
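If it helps, a slightly more flexible placement of the same fix is to accept the value as a constructor argument with a None default, so a checkpoint's config.json can override it. A trimmed sketch (only the fields relevant here are shown; the real RWConfig takes many more parameters):

```python
from transformers import PretrainedConfig

class RWConfig(PretrainedConfig):
    """Trimmed sketch of the RefinedWeb config; not the full class."""

    model_type = "falcon"

    def __init__(self,
                 hidden_size=4544,
                 new_decoder_architecture=False,
                 num_ln_in_parallel_attn=None,
                 **kwargs):
        self.hidden_size = hidden_size
        self.new_decoder_architecture = new_decoder_architecture
        # Taking the value as a kwarg instead of hard-coding None lets a
        # checkpoint's config.json override it without further code edits.
        self.num_ln_in_parallel_attn = num_ln_in_parallel_attn
        super().__init__(**kwargs)
```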