Open KrisWongz opened 11 months ago
Supporting the Alibaba open-source Qwen model would be wonderful.
Hey @KrisWongz @felixstander, #103 should add support for Qwen. The base model appears to generate results consistent with the example on Huggingface Hub. Do you have an adapter I can use to test that the adapter loading works as expected?
Note that you'll need to run with --trust-remote-code when launching LoRAX, as the tokenizer is custom and hosted on HF.
> Note that you'll need to run with --trust-remote-code when launching LoRAX, as the tokenizer is custom and hosted on HF.
I pulled the latest Docker image and set --trust-remote-code on startup. Startup command:
sudo docker run --gpus all \
  --shm-size 10g \
  -p 8081:80 \
  -v /home/shaohongen/Temp/Models/Qwen:/data \
  ghcr.io/predibase/lorax:latest \
  --model-id /data/Tongyi-Finance-14B-Chat \
  --trust-remote-code
But it still reports an error:
2023-12-06T06:11:56.409578Z INFO lorax_launcher: Args { model_id: "/data/Tongyi-Finance-14B-Chat", adapter_id: "", source: "hub", adapter_source: "hub", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, dtype: None, trust_remote_code: true, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, max_active_adapters: 128, adapter_cycle_time_s: 2, hostname: "4d4c7a004768", port: 80, shard_uds_path: "/tmp/lorax-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false, download_only: false }
2023-12-06T06:11:56.409692Z WARN lorax_launcher: trust_remote_code is set. Trusting that model /data/Tongyi-Finance-14B-Chat do not contain malicious code.
2023-12-06T06:11:56.409982Z INFO download: lorax_launcher: Starting download process.
2023-12-06T06:12:04.991708Z WARN lorax_launcher: cli.py:143 No safetensors weights found for model /data/Tongyi-Finance-14B-Chat at revision None. Converting PyTorch weights to safetensors.
2023-12-06T06:12:25.164707Z ERROR download: lorax_launcher: Download encountered an error: Traceback (most recent call last):
File "/opt/conda/bin/lorax-server", line 8, in
File "/opt/conda/lib/python3.9/site-packages/lorax_server/cli.py", line 199, in download_weights _download_weights(model_id, revision, extension, auto_convert, source)
File "/opt/conda/lib/python3.9/site-packages/lorax_server/cli.py", line 173, in _download_weights utils.convert_files(local_pt_files, local_st_files, discard_names)
File "/opt/conda/lib/python3.9/site-packages/lorax_server/utils/convert.py", line 112, in convert_files convert_file(pt_file, sf_file, discard_names)
File "/opt/conda/lib/python3.9/site-packages/lorax_server/utils/convert.py", line 71, in convert_file loaded = torch.load(pt_file, map_location="cpu")
File "/opt/conda/lib/python3.9/site-packages/torch/serialization.py", line 809, in load return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
File "/opt/conda/lib/python3.9/site-packages/torch/serialization.py", line 1172, in _load result = unpickler.load()
File "/opt/conda/lib/python3.9/site-packages/torch/serialization.py", line 1142, in persistent_load typed_storage = load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
File "/opt/conda/lib/python3.9/site-packages/torch/serialization.py", line 1112, in load_tensor storage = zip_file.get_storage_from_record(name, numel, torch.UntypedStorage)._typed_storage()._untyped_storage
File "/opt/conda/lib/python3.9/site-packages/transformers/dynamic_module_utils.py", line 579, in _raise_timeout_error raise ValueError(
ValueError: Loading this model requires you to execute custom code contained in the model repository on your local machine. Please set the option trust_remote_code=True
to permit loading of this model.
Error: DownloadError
> Hey @KrisWongz @felixstander, #103 should add support for Qwen. The base model appears to generate results consistent with the example on Huggingface Hub. Do you have an adapter I can use to test that the adapter loading works as expected?
Really appreciate your work! I haven't tested yet, but I will upload a couple of my fine-tuned adapters to the Hugging Face Hub soon for you to test.
Thanks @felixstander!
@KrisWongz it looks like the model weights .bin file is trying to execute some code on deserialization. I wasn't able to repro this using the base model from Huggingface here: https://huggingface.co/jxy/Tongyi-Finance-14B-Chat.
This is one of the issues with pickle: it can do unpredictable things like this. Can you try converting the weights to safetensors format and trying again?
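If it helps, a local conversion can be done along these lines (a rough sketch, assuming a single pytorch_model.bin shard; the paths are placeholders, and multi-shard checkpoints need each shard converted plus an updated index file):

```python
# Rough sketch: convert a PyTorch .bin checkpoint shard to safetensors locally.
import torch
from safetensors.torch import save_file

state_dict = torch.load("pytorch_model.bin", map_location="cpu")

# safetensors rejects tensors that share storage, so clone each tensor
# into its own contiguous buffer before saving.
state_dict = {name: t.clone().contiguous() for name, t in state_dict.items()}

save_file(state_dict, "model.safetensors", metadata={"format": "pt"})
```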
Does LoRAX support the Qwen 4-bit GPTQ version without needing flash attention v2?
As far as I can see, all the models you currently support are built on top of flash attention by default. Unfortunately, some of our inference GPUs are still V100s, which flash-attn doesn't support. :(
Currently we rely on flash attention, but we can definitely explore alternatives, like falling back to paged attention during prefill if needed.
@tgaddair I'm testing with Qwen-14B-Chat-Int4 (GPTQ) on an RTX 3090 right now. My launch parameters are as follows:

lorax-launcher --model-id /root/autodl-tmp/Qwen-14B-Chat-Int4 --quantize gptq --trust-remote-code --port 6006

But I got the following error:

2023-12-07T06:58:35.659747Z INFO lorax_launcher: server.py:263 Server started at unix:///tmp/lorax-server-0
2023-12-07T06:58:35.748970Z INFO shard-manager: lorax_launcher: Shard ready in 8.511468629s rank=0
2023-12-07T06:58:35.843122Z INFO lorax_launcher: Starting Webserver
2023-12-07T06:58:35.855480Z WARN lorax_router: router/src/main.rs:169: Could not find a fast tokenizer implementation for /root/autodl-tmp/Qwen-14B-Chat-Int4
2023-12-07T06:58:35.855561Z WARN lorax_router: router/src/main.rs:172: Rust input length validation and truncation is disabled
2023-12-07T06:58:35.855586Z WARN lorax_router: router/src/main.rs:197: no pipeline tag found for model /root/autodl-tmp/Qwen-14B-Chat-Int4
2023-12-07T06:58:35.876373Z INFO lorax_router: router/src/main.rs:216: Warming up model
2023-12-07T06:58:37.188252Z ERROR lorax_launcher: interceptor.py:41 Method Warmup encountered an error.
Traceback (most recent call last):
  File "/root/lorax/server/lorax_server/models/flash_causal_lm.py", line 843, in warmup
    _, batch = self.generate_token(batch)
  File "/root/miniconda3/envs/lorax/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/root/lorax/server/lorax_server/models/flash_causal_lm.py", line 939, in generate_token
    raise e
  File "/root/lorax/server/lorax_server/models/flash_causal_lm.py", line 936, in generate_token
    out = self.forward(batch, adapter_data)
  File "/root/lorax/server/lorax_server/models/flash_causal_lm.py", line 895, in forward
    return self.model.forward(
  File "/root/lorax/server/lorax_server/models/custom_modeling/flash_qwen_modeling.py", line 471, in forward
    hidden_states = self.transformer(
  File "/root/miniconda3/envs/lorax/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/lorax/server/lorax_server/models/custom_modeling/flash_qwen_modeling.py", line 427, in forward
    hidden_states, residual = layer(
  File "/root/miniconda3/envs/lorax/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/lorax/server/lorax_server/models/custom_modeling/flash_qwen_modeling.py", line 352, in forward
    attn_output = self.attn(
  File "/root/miniconda3/envs/lorax/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/lorax/server/lorax_server/models/custom_modeling/flash_qwen_modeling.py", line 235, in forward
    paged_attn.reshape_and_cache(
  File "/root/lorax/server/lorax_server/utils/paged_attn.py", line 23, in reshape_and_cache
    cache_ops.reshape_and_cache(key, value, key_cache, value_cache, slot_mapping)
RuntimeError: expected scalar type Int but found Long
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/root/miniconda3/envs/lorax/bin/lorax-server", line 8, in
  File "/root/lorax/server/lorax_server/interceptor.py", line 38, in intercept
    return await response
  File "/root/miniconda3/envs/lorax/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
    raise error
  File "/root/miniconda3/envs/lorax/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/root/lorax/server/lorax_server/server.py", line 74, in Warmup
    max_supported_total_tokens = self.model.warmup(batch)
  File "/root/lorax/server/lorax_server/models/flash_causal_lm.py", line 845, in warmup
    raise RuntimeError(
RuntimeError: Not enough memory to handle 4096 prefill tokens. You need to decrease --max-batch-prefill-tokens
2023-12-07T06:58:37.188701Z ERROR warmup{max_input_length=1024 max_prefill_tokens=4096}:warmup: lorax_client: router/client/src/lib.rs:33: Server error: Not enough memory to handle 4096 prefill tokens. You need to decrease --max-batch-prefill-tokens
Error: Warmup(Generation("Not enough memory to handle 4096 prefill tokens. You need to decrease --max-batch-prefill-tokens"))
2023-12-07T06:58:37.245694Z ERROR lorax_launcher: Webserver Crashed
2023-12-07T06:58:37.245719Z INFO lorax_launcher: Shutting down shards
2023-12-07T06:58:37.510784Z INFO shard-manager: lorax_launcher: Shard terminated rank=0
Error: WebserverFailed
Setting max_prefill_tokens to the same value as max_input_length doesn't work either.
2023-12-07T07:09:11.100552Z ERROR warmup{max_input_length=1024 max_prefill_tokens=1024}:warmup: lorax_client: router/client/src/lib.rs:33: Server error: Not enough memory to handle 1024 prefill tokens. You need to decrease --max-batch-prefill-tokens
Error: Warmup(Generation("Not enough memory to handle 1024 prefill tokens. You need to decrease --max-batch-prefill-tokens"))
2023-12-07T07:09:11.154507Z ERROR lorax_launcher: Webserver Crashed
2023-12-07T07:09:11.154534Z INFO lorax_launcher: Shutting down shards
2023-12-07T07:09:11.441994Z INFO shard-manager: lorax_launcher: Shard terminated rank=0
Error: WebserverFailed
Even after lowering max-input-length and max-batch-prefill-tokens to 100 tokens, it still reports the not-enough-memory error. I also noticed that GPU memory utilization jumps to 100% right before it crashes.
> Thanks @felixstander!
> @KrisWongz it looks like the model weights .bin file is trying to execute some code on deserialization. I wasn't able to repro this using the base model from Huggingface here: https://huggingface.co/jxy/Tongyi-Finance-14B-Chat. This is one of the issues with pickle: it can do unpredictable things like this. Can you try converting the weights to safetensors format and trying again?
Thanks a lot! I successfully ran Qwen with multi-LoRA. My solution was to convert the Qwen weights to .safetensors locally.
But there is still a small problem I've been dealing with for a while: when I run inference with Qwen, generation never stops until it hits the maximum length limit. I suspect this is a stop-word issue. In generate() I found a stop=[] parameter, but I got an error when I tried to pass it:
client.generate(prompt, max_new_tokens=32,temperature=0.7,stop=["<|im_end|>"]).generated_text
error:
Traceback (most recent call last):
  File "/home/shaohongen/Temp/WZ_test/lorax/test_lorax_qwen.py", line 20, in
    print(client.generate(prompt, max_new_tokens=32,temperature=0.7,stop=["<|im_end|>"]).generated_text)
TypeError: generate() got an unexpected keyword argument 'stop'
But other parameters worked fine; everything succeeded except stop:
generate{parameters=GenerateParameters { adapter_id: None, adapter_source: None, best_of: None, temperature: Some(0.7), repetition_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: 32, return_full_text: Some(false), stop: [], truncate: None, watermark: false, details: true, decoder_input_details: false, seed: None }
By the way, do_sample=True makes the output better, though not every time.
Hey @KrisWongz, can you try using the param stop_sequences instead of stop?
Example:
client.generate(prompt, max_new_tokens=32, temperature=0.7, stop_sequences=["<|im_end|>"]).generated_text
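For completeness, here's a minimal end-to-end sketch with the Python client (the endpoint URL and prompt are placeholders; adjust them to your deployment):

```python
# Minimal sketch: call a running LoRAX server with a stop sequence.
# Assumes a server like the docker command above is listening on port 8081.
from lorax import Client

client = Client("http://127.0.0.1:8081")

prompt = "Briefly explain what a LoRA adapter is."  # placeholder prompt
response = client.generate(
    prompt,
    max_new_tokens=32,
    temperature=0.7,
    stop_sequences=["<|im_end|>"],  # stop at the end of the Qwen chat turn
)
print(response.generated_text)
```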
Hey @felixstander, it looks like the error about decreasing the max batch size is misleading; the actual error here is:
File "/root/lorax/server/lorax_server/utils/paged_attn.py", line 23, in reshape_and_cache
cache_ops.reshape_and_cache(key, value, key_cache, value_cache, slot_mapping)
RuntimeError: expected scalar type Int but found Long
Let me see if I can reproduce this error on my side.
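In the meantime, my guess at what's going on (purely an illustration, not a confirmed fix): the paged-attention CUDA kernel expects int32 slot indices, while the GPTQ path appears to hand it int64 (Long) values, so a cast along these lines would avoid the type error.

```python
# Hypothetical illustration of the dtype mismatch behind
# "expected scalar type Int but found Long".
import torch

slot_mapping = torch.tensor([0, 1, 2, 3], dtype=torch.int64)  # what the batch builds (Long)
slot_mapping = slot_mapping.to(torch.int32)                   # what the CUDA kernel expects (Int)
```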
Hey @felixstander, I wasn't able to reproduce the error using the most recent Docker image. Can you try pulling the latest Docker and trying again? There were a couple recent changes for GPT-Q that may have fixed this issue.
> Hey @KrisWongz, can you try using the param stop_sequences instead of stop?
> Example: client.generate(prompt, max_new_tokens=32, temperature=0.7, stop_sequences=["<|im_end|>"]).generated_text
It works on the base model, but it seems to have no effect after adding the LoRA adapter. It may be a problem with the template settings I used when fine-tuning, since it runs fine under Qwen's original web_demo. I'll keep trying, thanks for your help!
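For context, Qwen's chat models expect a ChatML-style prompt, and the <|im_end|> stop sequence only shows up in the output if the prompt (and the fine-tuning data) follow that template. A minimal sketch of building such a prompt (the system text and user message are just placeholders):

```python
# Sketch of a ChatML-formatted prompt for Qwen chat models.
def build_chatml_prompt(user_message: str, system: str = "You are a helpful assistant.") -> str:
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

prompt = build_chatml_prompt("Summarize the quarterly report in three bullet points.")
```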
Any plans for ChatGLM model support? Thanks.
Hey @thincal, we can definitely add ChatGLM support. I can create a separate issue to track that.
@tgaddair it seems that the Qwen model type is qwen2 now, so which version does the current LoRAX implementation support? Ref: https://modelscope.cn/models/qwen/Qwen1.5-14B-Chat-AWQ/file/view/master/config.json
Feature request
Cool job! I have successfully run multi-LoRA with Llama-2-70B. I would like to ask if the author has any plans to support other models, such as Qwen, which would be very helpful.