microsoft / DeepSpeed-MII

MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.

fail to run llama-2-7B and llama-2-13B #327

Open xzzWZY opened 11 months ago

xzzWZY commented 11 months ago

When I use

  import mii
  client = mii.serve("/metaai/Llama-2-13b-chat-hf")
  response = client.generate(["Deepspeed is", "Seattle is"], max_new_tokens=128)
  print(response)

to serve Llama-2 with DeepSpeed-MII, I run into the following error:

[2023-11-27 19:11:43,131] [INFO] [kv_cache.py:125:__init__] Allocating KV-cache with shape: (40, 382, 64, 2, 40, 128) consisting of 382 blocks.
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/DeepSpeed-MII/mii/launch/multi_gpu_server.py", line 95, in <module>
    main()
  File "/home/DeepSpeed-MII/mii/launch/multi_gpu_server.py", line 88, in main
    inference_pipeline = async_pipeline(args.model_config)
  File "/home/DeepSpeed-MII/mii/api.py", line 171, in async_pipeline
    tokenizer = load_tokenizer(model_config)
  File "/home/DeepSpeed-MII/mii/modeling/tokenizers.py", line 66, in load_tokenizer
    tokenizer = HFTokenizer(model_config.tokenizer)
  File "/home/DeepSpeed-MII/mii/modeling/tokenizers.py", line 44, in __init__
    tokenizer = AutoTokenizer.from_pretrained(tokenizer)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py", line 768, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 2024, in from_pretrained
    return cls._from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 2256, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/tokenization_llama_fast.py", line 124, in __init__
    super().__init__(
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_fast.py", line 120, in __init__
    raise ValueError(
ValueError: Couldn't instantiate the backend tokenizer from one of: (1) a tokenizers library serialization file, (2) a slow tokenizer instance to convert or (3) an equivalent slow tokenizer class to instantiate and convert. You need to have sentencepiece installed to convert a slow tokenizer to a fast one.
[2023-11-27 19:11:44,218] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 6775
[2023-11-27 19:11:44,219] [ERROR] [launch.py:321:sigkill_handler] ['/usr/bin/python', '-m', 'mii.launch.multi_gpu_server', '--deployment-name', 'Llama-2-13b-chat-hf-mii-deployment', '--load-balancer-port', '50050', '--restful-gateway-port', '51080', '--server-port', '50051', '--zmq-port', '25555', '--model-config', 'eyJtb2RlbF9uYW1lX29yX3BhdGgiOiAiL21ldGFhaS9MbGFtYS0yLTEzYi1jaGF0LWhmIiwgInRva2VuaXplciI6ICIvbWV0YWFpL0xsYW1hLTItMTNiLWNoYXQtaGYiLCAidGFzayI6ICJ0ZXh0LWdlbmVyYXRpb24iLCAidGVuc29yX3BhcmFsbGVsIjogMSwgImluZmVyZW5jZV9lbmdpbmVfY29uZmlnIjogeyJ0ZW5zb3JfcGFyYWxsZWwiOiB7InRwX3NpemUiOiAxfSwgInN0YXRlX21hbmFnZXIiOiB7Im1heF90cmFja2VkX3NlcXVlbmNlcyI6IDIwNDgsICJtYXhfcmFnZ2VkX2JhdGNoX3NpemUiOiA3NjgsICJtYXhfcmFnZ2VkX3NlcXVlbmNlX2NvdW50IjogNTEyLCAibWF4X2NvbnRleHQiOiA4MTkyLCAibWVtb3J5X2NvbmZpZyI6IHsibW9kZSI6ICJyZXNlcnZlIiwgInNpemUiOiAxMDAwMDAwMDAwfSwgIm9mZmxvYWQiOiBmYWxzZX19LCAidG9yY2hfZGlzdF9wb3J0IjogMjk1MDAsICJ6bXFfcG9ydF9udW1iZXIiOiAyNTU1NSwgInJlcGxpY2FfbnVtIjogMSwgInJlcGxpY2FfY29uZmlncyI6IFt7Imhvc3RuYW1lIjogImxvY2FsaG9zdCIsICJ0ZW5zb3JfcGFyYWxsZWxfcG9ydHMiOiBbNTAwNTFdLCAidG9yY2hfZGlzdF9wb3J0IjogMjk1MDAsICJncHVfaW5kaWNlcyI6IFswXSwgInptcV9wb3J0IjogMjU1NTV9XSwgIm1heF9sZW5ndGgiOiBudWxsLCAiYWxsX3Jhbmtfb3V0cHV0IjogZmFsc2UsICJzeW5jX2RlYnVnIjogZmFsc2UsICJwcm9maWxlX21vZGVsX3RpbWUiOiBmYWxzZX0='] exits with return code = 1
[2023-11-27 19:11:44,343] [INFO] [server.py:65:_wait_until_server_is_live] waiting for server to start...
[2023-11-27 19:11:44,343] [INFO] [server.py:65:_wait_until_server_is_live] waiting for server to start...
Traceback (most recent call last):
  File "/home/DeepSpeed-MII/client.py", line 2, in <module>
    client = mii.serve("/metaai/Llama-2-13b-chat-hf")
  File "/home/DeepSpeed-MII/mii/api.py", line 127, in serve
    import_score_file(mii_config.deployment_name, DeploymentType.LOCAL).init()
  File "/tmp/mii_cache/Llama-2-13b-chat-hf-mii-deployment/score.py", line 33, in init
    mii.backend.MIIServer(mii_config)
  File "/home/DeepSpeed-MII/mii/backend/server.py", line 47, in __init__
    self._wait_until_server_is_live(processes,
  File "/home/DeepSpeed-MII/mii/backend/server.py", line 62, in _wait_until_server_is_live
    raise RuntimeError(
RuntimeError: server crashed for some reason, unable to proceed
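
The 13B traceback bottoms out in transformers' fast-tokenizer conversion, which requires the sentencepiece package for Llama's SentencePiece tokenizer. A minimal check outside of MII (a sketch, assuming the same local checkpoint path as in the report) would be:

  # Sketch: confirm the tokenizer loads on its own after `pip install sentencepiece`.
  # The path below is the local checkpoint path from the report above.
  from transformers import AutoTokenizer

  tok = AutoTokenizer.from_pretrained("/metaai/Llama-2-13b-chat-hf")
  print(type(tok).__name__)  # a fast Llama tokenizer is expected once conversion succeeds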

When I then replace Llama-2-13B-chat with Llama-2-7B-chat, I encounter a different error:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/DeepSpeed-MII/mii/launch/multi_gpu_server.py", line 95, in <module>
    main()
  File "/home/DeepSpeed-MII/mii/launch/multi_gpu_server.py", line 88, in main
    inference_pipeline = async_pipeline(args.model_config)
  File "/home/DeepSpeed-MII/mii/api.py", line 172, in async_pipeline
    inference_pipeline = MIIAsyncPipeline(
  File "/home/DeepSpeed-MII/mii/batching/ragged_batching.py", line 541, in __init__
    super().__init__(*args, **kwargs)
  File "/home/DeepSpeed-MII/mii/batching/ragged_batching.py", line 77, in __init__
    self.scheduled_req_blocks = torch.zeros(inference_engine.n_kv_cache_groups,
AttributeError: 'InferenceEngineV2' object has no attribute 'n_kv_cache_groups'
[2023-11-27 19:15:37,906] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 7591
[2023-11-27 19:15:37,906] [ERROR] [launch.py:321:sigkill_handler] ['/usr/bin/python', '-m', 'mii.launch.multi_gpu_server', '--deployment-name', 'Llama-2-7b-chat-hf-mii-deployment', '--load-balancer-port', '50050', '--restful-gateway-port', '51080', '--server-port', '50051', '--zmq-port', '25555', '--model-config', 'eyJtb2RlbF9uYW1lX29yX3BhdGgiOiAiL21ldGFhaS9MbGFtYS0yLTdiLWNoYXQtaGYiLCAidG9rZW5pemVyIjogIi9tZXRhYWkvTGxhbWEtMi03Yi1jaGF0LWhmIiwgInRhc2siOiAidGV4dC1nZW5lcmF0aW9uIiwgInRlbnNvcl9wYXJhbGxlbCI6IDEsICJpbmZlcmVuY2VfZW5naW5lX2NvbmZpZyI6IHsidGVuc29yX3BhcmFsbGVsIjogeyJ0cF9zaXplIjogMX0sICJzdGF0ZV9tYW5hZ2VyIjogeyJtYXhfdHJhY2tlZF9zZXF1ZW5jZXMiOiAyMDQ4LCAibWF4X3JhZ2dlZF9iYXRjaF9zaXplIjogNzY4LCAibWF4X3JhZ2dlZF9zZXF1ZW5jZV9jb3VudCI6IDUxMiwgIm1heF9jb250ZXh0IjogODE5MiwgIm1lbW9yeV9jb25maWciOiB7Im1vZGUiOiAicmVzZXJ2ZSIsICJzaXplIjogMTAwMDAwMDAwMH0sICJvZmZsb2FkIjogZmFsc2V9fSwgInRvcmNoX2Rpc3RfcG9ydCI6IDI5NTAwLCAiem1xX3BvcnRfbnVtYmVyIjogMjU1NTUsICJyZXBsaWNhX251bSI6IDEsICJyZXBsaWNhX2NvbmZpZ3MiOiBbeyJob3N0bmFtZSI6ICJsb2NhbGhvc3QiLCAidGVuc29yX3BhcmFsbGVsX3BvcnRzIjogWzUwMDUxXSwgInRvcmNoX2Rpc3RfcG9ydCI6IDI5NTAwLCAiZ3B1X2luZGljZXMiOiBbMF0sICJ6bXFfcG9ydCI6IDI1NTU1fV0sICJtYXhfbGVuZ3RoIjogbnVsbCwgImFsbF9yYW5rX291dHB1dCI6IGZhbHNlLCAic3luY19kZWJ1ZyI6IGZhbHNlLCAicHJvZmlsZV9tb2RlbF90aW1lIjogZmFsc2V9'] exits with return code = 1
[2023-11-27 19:15:39,023] [INFO] [server.py:65:_wait_until_server_is_live] waiting for server to start...
[2023-11-27 19:15:39,023] [INFO] [server.py:65:_wait_until_server_is_live] waiting for server to start...
Traceback (most recent call last):
  File "/home/DeepSpeed-MII/client.py", line 2, in <module>
    client = mii.serve("/metaai/Llama-2-7b-chat-hf")
  File "/home/DeepSpeed-MII/mii/api.py", line 127, in serve
    import_score_file(mii_config.deployment_name, DeploymentType.LOCAL).init()
  File "/tmp/mii_cache/Llama-2-7b-chat-hf-mii-deployment/score.py", line 33, in init
    mii.backend.MIIServer(mii_config)
  File "/home/DeepSpeed-MII/mii/backend/server.py", line 47, in __init__
    self._wait_until_server_is_live(processes,
  File "/home/DeepSpeed-MII/mii/backend/server.py", line 62, in _wait_until_server_is_live
    raise RuntimeError(
RuntimeError: server crashed for some reason, unable to proceed

Both Llama-2-13B-chat and Llama-2-7B-chat were downloaded from Hugging Face (e.g. meta-llama/Llama-2-13b-chat-hf).

mrwyattii commented 11 months ago

@xzzWZY This looks like a bug that we fixed in the latest DeepSpeed. We will push a release soon. In the meantime, please install with:

  pip install git+https://github.com/Microsoft/DeepSpeed.git git+https://github.com/Microsoft/DeepSpeed-MII.git
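
After installing from those git URLs, a quick way to confirm the source builds are actually the ones being imported (a sketch, not part of the original comment) is:

  # Sketch: print the installed versions of the source builds.
  import deepspeed
  import mii

  print("deepspeed:", deepspeed.__version__)
  print("deepspeed-mii:", mii.__version__)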

yechong316 commented 6 months ago

> @xzzWZY This looks like a bug that we fixed in the latest DeepSpeed. We will push a release soon. In the meantime, please install with: pip install git+https://github.com/Microsoft/DeepSpeed.git git+https://github.com/Microsoft/DeepSpeed-MII.git


My Python environment is:

  deepspeed          0.14.1+a8b82153
  deepspeed-kernels  0.0.1.dev1698255861
  deepspeed-mii      0.2.4+26a853d

but the error still shows up:

[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
[WARNING] using untested triton version (2.2.0), only 1.0.0 is known to be compatible
usage: multi_gpu_server.py [-h] [--deployment-name DEPLOYMENT_NAME] [--model-config MODEL_CONFIG] [--server-port SERVER_PORT] [--zmq-port ZMQ_PORT] [--load-balancer] [--load-balancer-port LOAD_BALANCER_PORT] [--restful-gateway] [--restful-gateway-port RESTFUL_GATEWAY_PORT] [--restful-gateway-host RESTFUL_GATEWAY_HOST] [--restful-gateway-procs RESTFUL_GATEWAY_PROCS]
multi_gpu_server.py: error: argument --deployment-name: expected one argument
[2024-04-12 00:45:57,354] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 2282
[2024-04-12 00:45:57,354] [ERROR] [launch.py:322:sigkill_handler] ['/root/miniconda3/bin/python', '-m', 'mii.launch.multi_gpu_server', '--deployment-name', '-mii-deployment', '--load-balancer-port', '50050', '--restful-gateway-port', '51080', '--restful-gateway-host', 'localhost', '--restful-gateway-procs', '32', '--server-port', '50051', '--zmq-port', '25555', '--model-config', 'eyJtb2RlbF9uYW1lX29yX3BhdGgiOiAiL3Jvb3QvLmNhY2hlL21vZGVsc2NvcGUvaHViL0FJLU1vZGVsU2NvcGUvcGhpLTIvIiwgInRva2VuaXplciI6ICIvcm9vdC8uY2FjaGUvbW9kZWxzY29wZS9odWIvQUktTW9kZWxTY29wZS9waGktMi8iLCAidGFzayI6ICJ0ZXh0LWdlbmVyYXRpb24iLCAidGVuc29yX3BhcmFsbGVsIjogMSwgInF1YW50aXphdGlvbl9tb2RlIjogbnVsbCwgImluZmVyZW5jZV9lbmdpbmVfY29uZmlnIjogeyJ0ZW5zb3JfcGFyYWxsZWwiOiB7InRwX3NpemUiOiAxfSwgInN0YXRlX21hbmFnZXIiOiB7Im1heF90cmFja2VkX3NlcXVlbmNlcyI6IDIwNDgsICJtYXhfcmFnZ2VkX2JhdGNoX3NpemUiOiA3NjgsICJtYXhfcmFnZ2VkX3NlcXVlbmNlX2NvdW50IjogNTEyLCAibWF4X2NvbnRleHQiOiA4MTkyLCAibWVtb3J5X2NvbmZpZyI6IHsibW9kZSI6ICJyZXNlcnZlIiwgInNpemUiOiAxMDAwMDAwMDAwfSwgIm9mZmxvYWQiOiBmYWxzZX0sICJxdWFudGl6YXRpb24iOiB7InF1YW50aXphdGlvbl9tb2RlIjogbnVsbH19LCAidG9yY2hfZGlzdF9wb3J0IjogMjk1MDAsICJ6bXFfcG9ydF9udW1iZXIiOiAyNTU1NSwgInJlcGxpY2FfbnVtIjogMSwgInJlcGxpY2FfY29uZmlncyI6IFt7Imhvc3RuYW1lIjogImxvY2FsaG9zdCIsICJ0ZW5zb3JfcGFyYWxsZWxfcG9ydHMiOiBbNTAwNTFdLCAidG9yY2hfZGlzdF9wb3J0IjogMjk1MDAsICJncHVfaW5kaWNlcyI6IFswXSwgInptcV9wb3J0IjogMjU1NTV9XSwgImRldmljZV9tYXAiOiAiYXV0byIsICJtYXhfbGVuZ3RoIjogbnVsbCwgInN5bmNfZGVidWciOiBmYWxzZSwgInByb2ZpbGVfbW9kZWxfdGltZSI6IGZhbHNlfQ=='] exits with return code = 2
[2024-04-12 00:45:57,633] [INFO] [server.py:65:_wait_until_server_is_live] waiting for server to start...
[2024-04-12 00:45:57,633] [INFO] [server.py:65:_wait_until_server_is_live] waiting for server to start...
[2024-04-12 00:46:02,636] [INFO] [server.py:65:_wait_until_server_is_live] waiting for server to start...
[2024-04-12 00:46:02,636] [INFO] [server.py:65:_wait_until_server_is_live] waiting for server to start...
Traceback (most recent call last):
  File "", line 1, in <module>
  File "/root/miniconda3/lib/python3.10/site-packages/mii/api.py", line 155, in serve
    import_score_file(mii_config.deployment_name, DeploymentType.LOCAL).init()
  File "/tmp/mii_cache/-mii-deployment/score.py", line 33, in init
    mii.backend.MIIServer(mii_config)
  File "/root/miniconda3/lib/python3.10/site-packages/mii/backend/server.py", line 47, in __init__
    self._wait_until_server_is_live(processes,
  File "/root/miniconda3/lib/python3.10/site-packages/mii/backend/server.py", line 62, in _wait_until_server_is_live
    raise RuntimeError(
RuntimeError: server crashed for some reason, unable to proceed
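
Note that the failing subprocess command contains '--deployment-name', '-mii-deployment', i.e. an empty model name: the model path in this config ends with a trailing slash, so the deployment name derived from it comes out empty and argparse then sees '-mii-deployment' as a flag. A possible workaround (a sketch under that assumption, not a confirmed fix) is to drop the trailing slash and/or pass an explicit deployment name:

  # Sketch: avoid the empty "--deployment-name" argument by naming the
  # deployment explicitly and passing the model path without a trailing slash.
  import mii

  client = mii.serve(
      "/root/.cache/modelscope/hub/AI-ModelScope/phi-2",
      deployment_name="phi-2-mii-deployment",
  )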