microsoft / DeepSpeed-MII

MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.
Apache License 2.0

Return code -9 for OPT with 8x40GB A100 GPUs #198

Open Mutinifni opened 1 year ago

Mutinifni commented 1 year ago

Hello,

I'm running the following code snippet in opt.py.

import mii
mii_configs = {"tensor_parallel": 8, "dtype": "fp16", "load_with_sys_mem": True}
mii.deploy(task="text-generation", model="facebook/opt-66b", deployment_name="opt", mii_config=mii_configs)

However, it runs into the following issue, where the deployment just exits:

❯ python opt.py
[2023-06-01 00:07:01,072] [INFO] [deployment.py:87:deploy] ************* MII is using DeepSpeed Optimizations to accelerate your model *************
[2023-06-01 00:07:01,147] [INFO] [server_client.py:219:_initialize_service] MII using multi-gpu deepspeed launcher:
 ------------------------------------------------------------
 task-name .................... text-generation
 model ........................ facebook/opt-66b
 model-path ................... /tmp/mii_models
 port ......................... 50050
 provider ..................... hugging-face
 ------------------------------------------------------------
[2023-06-01 00:07:02,641] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-06-01 00:07:02,711] [INFO] [runner.py:541:main] cmd = /home/azureuser/miniconda3/envs/gptneox/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --no_python --no_local_rank --enable_each_rank_log=None /home/azureuser/miniconda3/envs/gptneox/bin/python -m mii.launch.multi_gpu_server --task-name text-generation --model facebook/opt-66b --model-path /tmp/mii_models --port 50050 --ds-optimize --provider hugging-face --config eyJ0ZW5zb3JfcGFyYWxsZWwiOiA4LCAicG9ydF9udW1iZXIiOiA1MDA1MCwgImR0eXBlIjogInRvcmNoLmZsb2F0MTYiLCAibG9hZF93aXRoX3N5c19tZW0iOiB0cnVlLCAiZW5hYmxlX2N1ZGFfZ3JhcGgiOiBmYWxzZSwgImNoZWNrcG9pbnRfZGljdCI6IG51bGwsICJkZXBsb3lfcmFuayI6IFswLCAxLCAyLCAzLCA0LCA1LCA2LCA3XSwgInRvcmNoX2Rpc3RfcG9ydCI6IDI5NTAwLCAiaGZfYXV0aF90b2tlbiI6IG51bGwsICJyZXBsYWNlX3dpdGhfa2VybmVsX2luamVjdCI6IHRydWUsICJwcm9maWxlX21vZGVsX3RpbWUiOiBmYWxzZSwgInNraXBfbW9kZWxfY2hlY2siOiBmYWxzZX0=
[2023-06-01 00:07:04,194] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-06-01 00:07:04,194] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-06-01 00:07:04,194] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-06-01 00:07:04,194] [INFO] [launch.py:247:main] dist_world_size=8
[2023-06-01 00:07:04,194] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2023-06-01 00:07:06,163] [INFO] [server_client.py:117:_wait_until_server_is_live] waiting for server to start...
[edited out spam]
[2023-06-01 00:25:32,433] [INFO] [server_client.py:117:_wait_until_server_is_live] waiting for server to start...
[2023-06-01 00:28:11,415] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 63867
[2023-06-01 00:28:13,382] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 63868
[2023-06-01 00:28:14,431] [INFO] [server_client.py:117:_wait_until_server_is_live] waiting for server to start...
[2023-06-01 00:28:15,556] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 63869
[2023-06-01 00:28:17,729] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 63870
[2023-06-01 00:28:19,436] [INFO] [server_client.py:117:_wait_until_server_is_live] waiting for server to start...
[2023-06-01 00:28:19,823] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 63871
[2023-06-01 00:28:21,743] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 63872
[2023-06-01 00:28:23,145] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 63873
[2023-06-01 00:28:24,440] [INFO] [server_client.py:117:_wait_until_server_is_live] waiting for server to start...
[2023-06-01 00:28:24,467] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 63874
[2023-06-01 00:28:24,468] [ERROR] [launch.py:434:sigkill_handler] ['/home/azureuser/miniconda3/envs/gptneox/bin/python', '-m', 'mii.launch.multi_gpu_server', '--task-name', 'text-generation', '--model', 'facebook/opt-66b', '--model-path', '/tmp/mii_models', '--port', '50050', '--ds-optimize', '--provider', 'hugging-face', '--config', 'eyJ0ZW5zb3JfcGFyYWxsZWwiOiA4LCAicG9ydF9udW1iZXIiOiA1MDA1MCwgImR0eXBlIjogInRvcmNoLmZsb2F0MTYiLCAibG9hZF93aXRoX3N5c19tZW0iOiB0cnVlLCAiZW5hYmxlX2N1ZGFfZ3JhcGgiOiBmYWxzZSwgImNoZWNrcG9pbnRfZGljdCI6IG51bGwsICJkZXBsb3lfcmFuayI6IFswLCAxLCAyLCAzLCA0LCA1LCA2LCA3XSwgInRvcmNoX2Rpc3RfcG9ydCI6IDI5NTAwLCAiaGZfYXV0aF90b2tlbiI6IG51bGwsICJyZXBsYWNlX3dpdGhfa2VybmVsX2luamVjdCI6IHRydWUsICJwcm9maWxlX21vZGVsX3RpbWUiOiBmYWxzZSwgInNraXBfbW9kZWxfY2hlY2siOiBmYWxzZX0='] exits with return code = -9
Traceback (most recent call last):
  File "opt.py", line 3, in <module>
    mii.deploy(task="text-generation", model="facebook/opt-66b", deployment_name="bloom", mii_config=mii_configs)
  File "/home/azureuser/miniconda3/envs/gptneox/lib/python3.8/site-packages/mii/deployment.py", line 114, in deploy
    return _deploy_local(deployment_name, model_path=model_path)
  File "/home/azureuser/miniconda3/envs/gptneox/lib/python3.8/site-packages/mii/deployment.py", line 120, in _deploy_local
    mii.utils.import_score_file(deployment_name).init()
  File "/tmp/mii_cache/bloom/score.py", line 30, in init
    model = mii.MIIServerClient(task,
  File "/home/azureuser/miniconda3/envs/gptneox/lib/python3.8/site-packages/mii/server_client.py", line 92, in __init__
    self._wait_until_server_is_live()
  File "/home/azureuser/miniconda3/envs/gptneox/lib/python3.8/site-packages/mii/server_client.py", line 115, in _wait_until_server_is_live
    raise RuntimeError("server crashed for some reason, unable to proceed")
RuntimeError: server crashed for some reason, unable to proceed

Through monitoring, I've found that this likely happens because the host machine runs out of memory (though that might be a symptom rather than the cause). However, the host machine has 885GB of RAM, so I'm not sure why loading the OPT-66B model uses up so much memory -- it should consume far less. I also run into the same issue with the int8 version of bloom-175b.

Could someone please help me resolve this?

I am on the following library versions:

transformers: '4.30.0.dev0'
deepspeed: 0.9.2
mii: 0.0.4

Thanks!

mrwyattii commented 1 year ago

@Mutinifni the cause of this is that we load the model tensor_parallel times in system memory before sharding and moving it to the GPUs. This means it will eat up ~130GB x 8 for your use case. We do have a method for avoiding this via meta tensors, but currently MII only supports loading the BLOOM model with meta tensors. I think it should be a quick change to enable this for other models.
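
For a rough sense of the numbers (a back-of-the-envelope sketch; the exact figure depends on the checkpoint):

# Approximate host RAM needed when every rank loads a full fp16 copy of OPT-66B.
params = 66e9                # ~66 billion parameters
bytes_per_param = 2          # fp16
tensor_parallel = 8
total_gb = params * bytes_per_param * tensor_parallel / 1e9
print(f"~{total_gb:.0f} GB of system memory")  # ~1056 GB, which exceeds the 885 GB available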

Let me get a PR for this change and I'll ask you to test it!

Mutinifni commented 1 year ago

Sounds great, thank you!

Could you let me know how to use the meta tensors approach for BLOOM?

mrwyattii commented 1 year ago

@Mutinifni can you try #199 and let me know if you are able to load the OPT-66B model?

Setup env:

pip install git+https://github.com/microsoft/DeepSpeed
pip install git+https://github.com/microsoft/DeepSpeed-MII@mrwyattii/extend-llm-provider

Example script:

import mii
mii_configs = {"tensor_parallel": 8, "dtype": "fp16", "meta_tensor": True}
mii.deploy(task="text-generation", model="facebook/opt-66b", deployment_name="opt", mii_config=mii_configs)

Mutinifni commented 1 year ago

Thanks for the quick turnaround!

I tried deploying the model, but it runs into a downloading issue: it keeps re-downloading the model and populates a bunch of tmp files. It did this even though I already had opt-66b downloaded; I tried deleting my local copy and re-downloading, but I ran into the same issue until my disk space ran out (215+GB).

Is this expected to take much more disk space?

mrwyattii commented 1 year ago

You can provide the path to where the checkpoints are already downloaded (it will likely be in ~/.cache/huggingface/hub/models--facebook--opt-66b/snapshots/<hash value>/) to mii.deploy(model_path="path/to/ckpt/")
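
For example (a sketch; the snapshot path is just the placeholder from above, so substitute the actual directory on your machine):

import mii
mii_configs = {"tensor_parallel": 8, "dtype": "fp16", "meta_tensor": True}
mii.deploy(task="text-generation",
           model="facebook/opt-66b",
           deployment_name="opt",
           model_path="~/.cache/huggingface/hub/models--facebook--opt-66b/snapshots/<hash value>/",
           mii_config=mii_configs)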

I'm not 100% sure if you should provide the path above or perhaps ~/.cache/huggingface/hub/models--facebook--opt-66b. I would suggest playing around with different paths in that directory or deleting that cached model and allowing MII to download the model again.

The reason that MII is trying to download the model again is due to our use of huggingface_hub.snapshot_download to get all the necessary files for loading with meta tensors.

mrwyattii commented 1 year ago

Actually, I think there may be a bug in #199 - I'm currently using the following to filter which files to get with snapshot_download: allow_patterns=["*"], ignore_patterns=["*.safetensors"],

After looking at the facebook/opt-66b repo on HuggingFace, I think I will need to update the *_patterns to avoid downloading the *.h5 and *.msgpack files as well. I will update the PR shortly.
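
Roughly along these lines (a sketch of the kind of filter I mean; the exact patterns in the PR may differ):

from huggingface_hub import snapshot_download

# Keep the PyTorch *.bin shards and config/tokenizer files; skip the
# safetensors, TensorFlow (*.h5), and Flax (*.msgpack) copies of the weights.
snapshot_download("facebook/opt-66b",
                  allow_patterns=["*"],
                  ignore_patterns=["*.safetensors", "*.h5", "*.msgpack"])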

Mutinifni commented 1 year ago

I had passed the model directory. And yes, it is indeed downloading all those files as you mentioned:

❯ ls mii_models/models--facebook--opt-66b/snapshots/7259969061237fe940036d22bea0fd349e4485e9/
LICENSE.md                         generation_config.json            tf_model-00001-of-00014.h5
README.md                          merges.txt                        tf_model-00002-of-00014.h5
config.json                        pytorch_model-00001-of-00014.bin  tf_model-00003-of-00014.h5
flax_model-00001-of-00014.msgpack  pytorch_model-00002-of-00014.bin  tf_model-00004-of-00014.h5
flax_model-00002-of-00014.msgpack  pytorch_model-00003-of-00014.bin  tf_model-00005-of-00014.h5
flax_model-00003-of-00014.msgpack  pytorch_model-00004-of-00014.bin  tf_model-00006-of-00014.h5
flax_model-00004-of-00014.msgpack  pytorch_model-00005-of-00014.bin  tf_model-00007-of-00014.h5
flax_model-00005-of-00014.msgpack  pytorch_model-00006-of-00014.bin  tf_model-00008-of-00014.h5
flax_model-00006-of-00014.msgpack  pytorch_model-00007-of-00014.bin  tf_model-00009-of-00014.h5
flax_model-00007-of-00014.msgpack  pytorch_model-00008-of-00014.bin  tf_model-00010-of-00014.h5
flax_model-00008-of-00014.msgpack  pytorch_model-00009-of-00014.bin  tf_model-00011-of-00014.h5
flax_model-00009-of-00014.msgpack  pytorch_model-00010-of-00014.bin  tf_model-00012-of-00014.h5
flax_model-00010-of-00014.msgpack  pytorch_model-00011-of-00014.bin  tf_model-00013-of-00014.h5
flax_model-00011-of-00014.msgpack  pytorch_model-00012-of-00014.bin  tf_model-00014-of-00014.h5
flax_model-00012-of-00014.msgpack  pytorch_model-00013-of-00014.bin  tf_model.h5.index.json
flax_model-00013-of-00014.msgpack  pytorch_model-00014-of-00014.bin  tokenizer_config.json
flax_model-00014-of-00014.msgpack  pytorch_model.bin.index.json      vocab.json
flax_model.msgpack.index.json      special_tokens_map.json

mrwyattii commented 1 year ago

@Mutinifni I've updated #199 - could you please try again and verify that only the necessary pytorch model files are downloaded when using meta tensor to load models? Thanks!

Mutinifni commented 1 year ago

It doesn't download the extra files anymore.

However, loading still crashes with the following error (exits with return code = 1):

Traceback (most recent call last):
  File "/home/azureuser/miniconda3/envs/metamii/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/azureuser/miniconda3/envs/metamii/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/azureuser/miniconda3/envs/metamii/lib/python3.8/site-packages/mii/launch/multi_gpu_server.py", line 104, in <module>
    main()
  File "/home/azureuser/miniconda3/envs/metamii/lib/python3.8/site-packages/mii/launch/multi_gpu_server.py", line 83, in main
    inference_pipeline = load_models(task_name=args.task_name,
  File "/home/azureuser/miniconda3/envs/metamii/lib/python3.8/site-packages/mii/models/load_models.py", line 82, in load_models
    engine = deepspeed.init_inference(getattr(inference_pipeline,
  File "/home/azureuser/miniconda3/envs/metamii/lib/python3.8/site-packages/deepspeed/__init__.py", line 331, in init_inference
    ds_inference_config = DeepSpeedInferenceConfig(**config_dict)
  File "/home/azureuser/miniconda3/envs/metamii/lib/python3.8/site-packages/deepspeed/runtime/config_utils.py", line 56, in __init__
    super().__init__(**data)
  File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for DeepSpeedInferenceConfig
checkpoint
  str type expected (type=type_error.str)

mrwyattii commented 1 year ago

Could you update DeepSpeed to use the latest master branch? I added a fix for this in https://github.com/microsoft/DeepSpeed/pull/3007

pip install git+https://github.com/microsoft/DeepSpeed

Mutinifni commented 1 year ago

Sorry, my bad! I had installed the latest deepspeed in a different conda environment by mistake.

I ran into this other error (code = 1) after trying with the new DeepSpeed:

Traceback (most recent call last):
  File "/home/azureuser/miniconda3/envs/metamii/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/azureuser/miniconda3/envs/metamii/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/azureuser/miniconda3/envs/metamii/lib/python3.8/site-packages/mii/launch/multi_gpu_server.py", line 104, in <module>
    main()
  File "/home/azureuser/miniconda3/envs/metamii/lib/python3.8/site-packages/mii/launch/multi_gpu_server.py", line 83, in main
    inference_pipeline = load_models(task_name=args.task_name,
  File "/home/azureuser/miniconda3/envs/metamii/lib/python3.8/site-packages/mii/models/load_models.py", line 82, in load_models
    engine = deepspeed.init_inference(getattr(inference_pipeline,
  File "/home/azureuser/miniconda3/envs/metamii/lib/python3.8/site-packages/deepspeed/__init__.py", line 333, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/home/azureuser/miniconda3/envs/metamii/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 192, in __init__
    self._apply_injection_policy(config)
  File "/home/azureuser/miniconda3/envs/metamii/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 426, in _apply_injection_policy
    replace_transformer_layer(client_module, self.module, checkpoint, config, self.config)
  File "/home/azureuser/miniconda3/envs/metamii/lib/python3.8/site-packages/deepspeed/module_inject/replace_module.py", line 561, in replace_transformer_layer
    load_model_with_checkpoint(replaced_module,
  File "/home/azureuser/miniconda3/envs/metamii/lib/python3.8/site-packages/deepspeed/module_inject/load_checkpoint.py", line 252, in load_model_with_checkpoint
    load_module_recursive(r_module)
  File "/home/azureuser/miniconda3/envs/metamii/lib/python3.8/site-packages/deepspeed/module_inject/load_checkpoint.py", line 246, in load_module_recursive
    load_module_recursive(
  File "/home/azureuser/miniconda3/envs/metamii/lib/python3.8/site-packages/deepspeed/module_inject/load_checkpoint.py", line 246, in load_module_recursive
    load_module_recursive(
  File "/home/azureuser/miniconda3/envs/metamii/lib/python3.8/site-packages/deepspeed/module_inject/load_checkpoint.py", line 246, in load_module_recursive
    load_module_recursive(
  File "/home/azureuser/miniconda3/envs/metamii/lib/python3.8/site-packages/deepspeed/module_inject/load_checkpoint.py", line 244, in load_module_recursive
    layer_policies[child.__class__](child, prefix + name + '.')
  File "/home/azureuser/miniconda3/envs/metamii/lib/python3.8/site-packages/deepspeed/module_inject/load_checkpoint.py", line 173, in load_transformer_layer
    container.load_params(module, sd[0], weight_quantizer, mp_replace, prefix)
  File "/home/azureuser/miniconda3/envs/metamii/lib/python3.8/site-packages/deepspeed/module_inject/containers/opt.py", line 79, in load_params
    maybe_copy_qkv(module.attention,
  File "/home/azureuser/miniconda3/envs/metamii/lib/python3.8/site-packages/deepspeed/module_inject/policy.py", line 178, in maybe_copy_qkv
    k = sd[src_names[1]]
KeyError: 'model.decoder.layers.9.self_attn.k_proj.weight'

It seemed as if the model was not downloaded correctly, since DeepSpeed is unable to find the layer weights. However, retrying the download still yielded the same error, so I'm not sure. Is that key unexpected?
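
For reference, here's a quick check I could run against the shard index (a hypothetical sketch using the snapshot path from earlier; pytorch_model.bin.index.json maps parameter names to shard files):

import json

index_path = "mii_models/models--facebook--opt-66b/snapshots/7259969061237fe940036d22bea0fd349e4485e9/pytorch_model.bin.index.json"
with open(index_path) as f:
    weight_map = json.load(f)["weight_map"]

# Does the checkpoint index know about the key DeepSpeed failed to find?
key = "model.decoder.layers.9.self_attn.k_proj.weight"
print(key in weight_map, weight_map.get(key))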

mrwyattii commented 1 year ago

@Mutinifni I have not come across this error with OPT models in the past. If it's not too much trouble, could you do a quick test with a smaller OPT variant and tell me if you see the same thing? Perhaps facebook/opt-13b?
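
Something like this should work (same config as the earlier example, just with the smaller model swapped in; the deployment name is arbitrary):

import mii
mii_configs = {"tensor_parallel": 8, "dtype": "fp16", "meta_tensor": True}
mii.deploy(task="text-generation", model="facebook/opt-13b", deployment_name="opt", mii_config=mii_configs)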

Mutinifni commented 1 year ago

I can try this out tomorrow. Will loading a smaller model help, given that I expect it can fit on a single GPU?

Mutinifni commented 1 year ago

Update: it did work with opt-13b loaded on 8 GPUs.

Weirdly, nvidia-smi showed all eight processes allocating memory on GPU 0. Is that expected?

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      8944      C   .../miniconda3/envs/metamii/bin/python     5138MiB |
|    0   N/A  N/A      8945      C   .../miniconda3/envs/metamii/bin/python      846MiB |
|    0   N/A  N/A      8946      C   .../miniconda3/envs/metamii/bin/python      846MiB |
|    0   N/A  N/A      8947      C   .../miniconda3/envs/metamii/bin/python      846MiB |
|    0   N/A  N/A      8948      C   .../miniconda3/envs/metamii/bin/python      846MiB |
|    0   N/A  N/A      8949      C   .../miniconda3/envs/metamii/bin/python      846MiB |
|    0   N/A  N/A      8950      C   .../miniconda3/envs/metamii/bin/python      846MiB |
|    0   N/A  N/A      8951      C   .../miniconda3/envs/metamii/bin/python      846MiB |
|    1   N/A  N/A      8945      C   .../miniconda3/envs/metamii/bin/python     5282MiB |
|    2   N/A  N/A      8946      C   .../miniconda3/envs/metamii/bin/python     5282MiB |
|    3   N/A  N/A      8947      C   .../miniconda3/envs/metamii/bin/python     5282MiB |
|    4   N/A  N/A      8948      C   .../miniconda3/envs/metamii/bin/python     5282MiB |
|    5   N/A  N/A      8949      C   .../miniconda3/envs/metamii/bin/python     5282MiB |
|    6   N/A  N/A      8950      C   .../miniconda3/envs/metamii/bin/python     5282MiB |
|    7   N/A  N/A      8951      C   .../miniconda3/envs/metamii/bin/python     5138MiB |
+---------------------------------------------------------------------------------------+

mrwyattii commented 1 year ago

> I can try this out tomorrow. Will loading a smaller model help, given that I expect it can fit on a single GPU?

I've been testing with smaller models since my local system doesn't have enough memory for the 66B model. I was curious if the error you shared was only happening with the larger model. It looks like it is, so I'll spin up a larger instance and debug the 66B issue. Thank you for verifying!