huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

4-bit quantized model using bnb not able to run inference #2025

Open abadjatya opened 1 month ago

abadjatya commented 1 month ago

System Info

TGI version: latest. The model is Cohere Aya 35B, 4-bit bnb quantized. Originally I quantized the base model and then merged fine-tuned adapters into it.
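The preparation step presumably looked something like the following sketch (the model id and adapter path are placeholders, and the exact quantization settings are an assumption; the issue only says the model was 4-bit bnb quantized and merged with fine-tuned adapters):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# Load the base model in 4-bit with bitsandbytes (NF4 settings assumed).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "CohereForAI/aya-23-35B",  # placeholder: the issue only says "Cohere Aya 35B"
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach the fine-tuned LoRA adapters and merge them into the base weights.
model = PeftModel.from_pretrained(base, "my-org/aya-35b-adapters")  # placeholder path
model = model.merge_and_unload()  # peft dequantizes 4-bit layers to apply the merge
model.save_pretrained("aya-35b-4bit-merged")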

Information

Tasks

Reproduction

I am using this command to spawn a Docker container on RunPod:

pod = runpod.create_pod(
    name="Testing pod",
    image_name="ghcr.io/huggingface/text-generation-inference:latest",
    gpu_type_id="NVIDIA A100 80GB PCIe",
    cloud_type="SECURE",
    docker_args=f"--model-id {model_id} --num-shard {num_shard}",
    gpu_count=num_shard,
    volume_in_gb=195,
    container_disk_in_gb=5,
    ports="80/http,29500/http",
    volume_mount_path="/data",
)
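(Once the pod is up, I would exercise the server with a request against TGI's /generate endpoint on the exposed port 80, roughly like this; the proxy URL is a placeholder:)

import requests

pod_url = "https://<pod-id>-80.proxy.runpod.net"  # placeholder pod URL

resp = requests.post(
    f"{pod_url}/generate",
    json={"inputs": "Hello", "parameters": {"max_new_tokens": 16}},
    timeout=120,
)
print(resp.status_code, resp.json())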

I get this error in the container logs:

2024-06-05T18:08:23.230282Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 91, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 261, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
    return await response
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 122, in Warmup
    max_supported_total_tokens = self.model.warmup(batch)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 862, in warmup
    _, batch, _ = self.generate_token(batch)
  File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1094, in generate_token
    out, speculative_logits = self.forward(batch)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1047, in forward
    logits, speculative_logits = self.model.forward(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py", line 518, in forward
    hidden_states = self.model(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py", line 468, in forward
    hidden_states, residual = layer(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py", line 396, in forward
    attn_output = self.self_attn(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py", line 266, in forward
    qkv = self.query_key_value(hidden_states)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/tensor_parallel.py", line 33, in forward
    return self.linear.forward(x)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/linear.py", line 36, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (4145x8192 and 1x50331648)

2024-06-05T18:08:23.417880Z ERROR warmup{max_input_length=4095 max_prefill_tokens=4145 max_total_tokens=4096 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
Error: WebServer(Warmup(Generation("CANCELLED")))

The error is a matrix-dimension mismatch during warmup: F.linear gets the input activations (4145x8192) together with a weight that is effectively a flat 1x50331648 buffer, which looks like bnb's packed 4-bit storage being passed to the linear layer without dequantization.
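The failing call can be reproduced in isolation with the shapes from the log (a sketch; interpreting the 1x50331648 buffer as bnb's packed 4-bit storage is my assumption, not confirmed in the logs):

import torch
import torch.nn.functional as F

# Shapes from the log: 4145 prefill tokens with hidden size 8192. The weight
# arrives as a flat (50_331_648, 1) column, which F.linear transposes to
# 1x50331648. bitsandbytes packs 4-bit weights two-per-byte into exactly this
# kind of flat buffer (50_331_648 bytes would fit a 12288x8192 matrix), so the
# packed payload appears to reach F.linear undequantized.
x = torch.randn(4145, 8192)
w_packed = torch.empty(50_331_648, 1)
F.linear(x, w_packed)
# RuntimeError: mat1 and mat2 shapes cannot be multiplied (4145x8192 and 1x50331648)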

Expected behavior

The model should be available for inference.

LysandreJik commented 3 weeks ago

Hey @arihant-neohuman, the docs recommend using --quantize bitsandbytes as an argument to docker run in order to use bitsandbytes.

Have you tried that setting?
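Concretely, with the RunPod launch from your reproduction, that would mean adding the flag to docker_args, e.g. (a sketch; if the checkpoint was quantized with NF4, the bitsandbytes-nf4 variant of the flag may be the better match):

pod = runpod.create_pod(
    name="Testing pod",
    image_name="ghcr.io/huggingface/text-generation-inference:latest",
    gpu_type_id="NVIDIA A100 80GB PCIe",
    cloud_type="SECURE",
    # --quantize bitsandbytes per the docs; bitsandbytes-nf4 / bitsandbytes-fp4
    # also exist for 4-bit checkpoints.
    docker_args=f"--model-id {model_id} --num-shard {num_shard} --quantize bitsandbytes",
    gpu_count=num_shard,
    volume_in_gb=195,
    container_disk_in_gb=5,
    ports="80/http,29500/http",
    volume_mount_path="/data",
)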