freegheist opened this issue 6 days ago
Deepseek v2 does not currently have an implementation in TGI, so it will revert to the upstream implementation which does not support sharding. We are currently working on native support for DeepSeek V2.
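The fallback behaviour described above can be illustrated with a small sketch. This is not TGI's actual code — the registry entries and function name are made up for illustration — but it shows the dispatch pattern: architectures with a native implementation are served by it, and anything else falls back to a generic upstream path that cannot be sharded across GPUs.

```python
# Illustrative sketch only -- not TGI's actual source. Hypothetical registry
# mapping model_type (from config.json) to a native implementation class name.
NATIVE_IMPLEMENTATIONS = {
    "llama": "FlashLlamaForCausalLM",
    "mistral": "FlashMistralForCausalLM",
}


def resolve_implementation(model_type: str, num_shard: int) -> str:
    """Pick a native implementation if one exists, else fall back upstream."""
    if model_type in NATIVE_IMPLEMENTATIONS:
        # Native implementations support tensor-parallel sharding.
        return NATIVE_IMPLEMENTATIONS[model_type]
    if num_shard > 1:
        # The generic fallback cannot shard weights across GPUs,
        # which is the situation hit here with deepseek_v2 on 8 shards.
        raise ValueError(
            f"{model_type!r} has no native implementation; "
            "the upstream fallback does not support sharding"
        )
    return "AutoModelForCausalLM"  # generic single-device fallback
```

With `num_shard=1` an unsupported architecture would still load via the fallback; with `num_shard=8`, as in this report, it fails.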
that would be amazing! thanks 🙏
System Info
ghcr.io/huggingface/text-generation-inference:2.0.4 & 2.1.0 Ubuntu 22.04 server, 8xA6000.
Information
Tasks
Reproduction
{
    model_id: "deepseek-ai/DeepSeek-Coder-V2-Instruct",
    revision: None,
    validation_workers: 15,
    sharded: None,
    num_shard: Some(8),
    quantize: None,
    speculate: None,
    dtype: None,
    trust_remote_code: false,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: None,
    max_input_length: Some(28672),
    max_total_tokens: Some(32768),
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: Some(28672),
    max_batch_total_tokens: Some(32768),
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: Some([1, 2, 4, 8, 16, 32]),
    hostname: "0.0.0.0",
    port: 80,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: Some("/data"),
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 0.99,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    otlp_service_name: "text-generation-inference.router",
    cors_allow_origin: [],
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: false,
    max_client_batch_size: 4,
    lora_adapters: None,
}
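For reference, a launcher invocation roughly matching the configuration above (the host cache path, port mapping, and shared-memory size are assumptions; the flag names follow the `text-generation-launcher` CLI):

```shell
# Approximate reproduction command; adjust volume, port, and --shm-size
# to your environment.
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v /data:/data \
    ghcr.io/huggingface/text-generation-inference:2.1.0 \
    --model-id deepseek-ai/DeepSeek-Coder-V2-Instruct \
    --num-shard 8 \
    --max-input-length 28672 \
    --max-total-tokens 32768 \
    --max-batch-prefill-tokens 28672 \
    --cuda-memory-fraction 0.99
```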
Expected behavior
2024-06-30T02:56:11.196310Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 106, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)