pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/

Failure in loading Deepspeed large model example #2569

Open sachanub opened 1 year ago

sachanub commented 1 year ago

🐛 Describe the bug

I am trying to perform inference with the OPT-30B model by following this example: https://github.com/pytorch/serve/tree/master/examples/large_models/deepspeed

However, as specified in the model-config.yaml file, a checkpoints.json file is required. This file gets used here: https://github.com/pytorch/serve/blob/master/ts/handler_utils/distributed/deepspeed.py#L40

It is not clear how this file is supposed to be provided, and the model fails to load. The error logs are attached below.

Error logs

2023-09-05T23:22:14,652 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Failed to load model opt, exception Cannot copy out of meta tensor; no data!
2023-09-05T23:22:14,652 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Traceback (most recent call last):
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -   File "/opt/conda/lib/python3.10/site-packages/ts/model_service_worker.py", line 131, in load_model
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -     service = model_loader.load(
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -   File "/opt/conda/lib/python3.10/site-packages/ts/model_loader.py", line 135, in load
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -     initialize_fn(service.context)
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -   File "/home/model-server/tmp/models/c1130e4b01c345b9be913ef8414518cb/custom_handler.py", line 55, in initialize
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -     ds_engine = get_ds_engine(self.model, ctx)
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -   File "/opt/conda/lib/python3.10/site-packages/ts/handler_utils/distributed/deepspeed.py", line 35, in get_ds_engine
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -     ds_engine = deepspeed.init_inference(
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -   File "/opt/conda/lib/python3.10/site-packages/deepspeed/__init__.py", line 342, in init_inference
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -     engine = InferenceEngine(model, config=ds_inference_config)
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -   File "/opt/conda/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 154, in __init__
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -     self.module.to(device)
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -   File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2053, in to
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -     return super().to(*args, **kwargs)
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1145, in to
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -     return self._apply(convert)
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -     module._apply(fn)
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -     module._apply(fn)
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -     module._apply(fn)
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 820, in _apply
2023-09-05T23:22:14,653 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -     param_applied = fn(param)
2023-09-05T23:22:14,654 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1143, in convert
2023-09-05T23:22:14,654 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -     return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
2023-09-05T23:22:14,654 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - NotImplementedError: Cannot copy out of meta tensor; no data!
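
For context, this NotImplementedError is what PyTorch raises whenever a module whose parameters live on the meta device (i.e. have shapes but no data) is moved to a real device. A tiny standalone illustration, unrelated to the example code:

    import torch

    # Parameters created on the "meta" device have shapes but no storage.
    layer = torch.nn.Linear(4, 4, device="meta")

    try:
        # Moving the module to a real device (cpu or cuda) has nothing to copy.
        layer.to("cpu")
    except NotImplementedError as err:
        print(err)  # Cannot copy out of meta tensor; no data!

In this example, the traceback shows the same thing happening inside deepspeed.init_inference: the meta-initialized model is moved to GPU before any real weights have been loaded into it.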

Installation instructions

Docker image URI: 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.0.1-gpu-py310-cu118-ubuntu20.04-ec2
EC2 instance: g5.24xlarge

Model Packaging

Created model artifact by following this example: https://github.com/pytorch/serve/tree/master/examples/large_models/deepspeed

config.properties

No response

Versions

------------------------------------------------------------------------------------------
Environment headers
------------------------------------------------------------------------------------------
Torchserve branch: 

torchserve==0.8.1
torch-model-archiver==0.8.1

Python version: 3.10 (64-bit runtime)
Python executable: /opt/conda/bin/python3

Versions of relevant python libraries:
captum==0.6.0
numpy==1.22.4
nvgpu==0.10.0
psutil==5.9.5
requests==2.31.0
torch==2.0.1+cu118
torch-model-archiver==0.8.1
torchaudio==2.0.2+cu118
torchdata==0.6.1+cu118
torchserve==0.8.1
torchtext==0.15.2+cu118
torchvision==0.15.2+cu118
wheel==0.38.4

Java Version:

OS: N/A
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: N/A
CMake version: version 3.27.2

Is CUDA available: Yes
CUDA runtime version: 11.8.89
GPU models and configuration: 
GPU 0: NVIDIA A10G
GPU 1: NVIDIA A10G
GPU 2: NVIDIA A10G
GPU 3: NVIDIA A10G
Nvidia driver version: 535.54.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.8.0

Repro instructions

Please follow the instructions as mentioned here to reproduce this error: https://github.com/pytorch/serve/tree/master/examples/large_models/deepspeed

Possible Solution

No response

agunapal commented 1 year ago

Hi @sachanub This file is generated at run time here: https://github.com/pytorch/serve/blob/master/ts/handler_utils/distributed/deepspeed.py#L11C14-L11C14
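
For anyone following along, a rough sketch of what such runtime generation and its use can look like is below. This is only an illustration, not the actual TorchServe helper: the shard glob, the JSON keys ("type", "checkpoints", "version") and the init_inference arguments are assumptions based on the general DeepSpeed meta-tensor loading workflow.

    import json
    from pathlib import Path

    def write_checkpoints_json(model_path, out_file="checkpoints.json"):
        """Describe the downloaded HF shards so DeepSpeed can load them itself."""
        # Shard pattern and JSON keys are assumptions; the real helper may
        # differ (e.g. it may also pick up *.safetensors files).
        shards = sorted(str(p) for p in Path(model_path).glob("*.bin"))
        data = {"type": "ds_model", "checkpoints": shards, "version": 1.0}
        with open(out_file, "w") as f:
            json.dump(data, f)
        return out_file

    # The handler builds the model on the "meta" device (no weights in memory)
    # and hands the generated file to DeepSpeed, roughly:
    #
    #   with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
    #       model = AutoModelForCausalLM.from_config(config)
    #   ds_engine = deepspeed.init_inference(
    #       model,
    #       mp_size=4,                       # tensor-parallel degree
    #       dtype=torch.float16,
    #       checkpoint=write_checkpoints_json(model_path),
    #       replace_with_kernel_inject=True,
    #   )

If that file is missing or points at the wrong paths, DeepSpeed has nothing to load into the meta tensors, which would produce exactly the error above.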

sachanub commented 1 year ago

Hi @agunapal If this file gets generated at run time, then what might be the reason for this error?

Failed to load model opt, exception Cannot copy out of meta tensor; no data!

Is this because of problematic versions of transformers or deepspeed?

agunapal commented 1 year ago

I have seen this issue before. Let me try tomorrow and get back to you.

sachanub commented 1 year ago

Thank you for the assistance @agunapal !

agunapal commented 1 year ago

Basic question: have you made sure the snapshot name/folder/path matches the one in model-config.yaml?
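
For reference, a quick check along these lines can confirm the snapshot path inside the container before it is copied into model-config.yaml (this is a hypothetical helper, not something shipped with the example):

    from pathlib import Path

    def find_snapshot(model_root="/opt/ml/model", repo="models--facebook--opt-30b"):
        # Download_model.py stores the model in the Hugging Face cache layout:
        # <root>/<repo>/snapshots/<commit-hash>/
        snapshots = sorted((Path(model_root) / repo / "snapshots").iterdir())
        assert snapshots, f"no snapshots found under {model_root}/{repo}"
        snap = snapshots[-1]
        assert (snap / "config.json").exists(), f"config.json missing in {snap}"
        return snap  # this is the path to put into model-config.yaml

    print(find_snapshot())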

sachanub commented 1 year ago

Yes. I followed these steps:

  1. Ran the following to download the model files:
    python ../utils/Download_model.py --model_path model --model_name facebook/opt-30b --revision main
  2. Started the container with this command inside /serve/examples/large_models/deepspeed:
    docker run -it --rm --gpus=all -v$(pwd):/workspace -v$(pwd)/model:/opt/ml/model 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.0.1-gpu-py310-cu118-ubuntu20.04-ec2 /bin/bash
  3. I set the model path in model-config.yaml to /opt/ml/model/models--facebook--opt-30b/snapshots/ceea0a90ac0f6fae7c2c34bcb40477438c152546 after verifying that this folder exists in the container.
  4. Created the opt.tar.gz file by running:
    torch-model-archiver --model-name opt --version 1.0 --handler custom_handler.py --extra-files ds-config.json -r requirements.txt --config-file opt/model-config.yaml --archive-format tgz
  5. Moved opt.tar.gz to model_store.
  6. Started TorchServe with:
    torchserve --start --ncs --model-store model_store --models opt.tar.gz

After following these steps exactly, I still get the above-mentioned error:

2023-09-06T04:41:57,904 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Failed to load model opt, exception Cannot copy out of meta tensor; no data!
2023-09-06T04:41:57,904 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Traceback (most recent call last):
2023-09-06T04:41:57,904 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -   File "/opt/conda/lib/python3.10/site-packages/ts/model_service_worker.py", line 131, in load_model
2023-09-06T04:41:57,904 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -     service = model_loader.load(
2023-09-06T04:41:57,904 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -   File "/opt/conda/lib/python3.10/site-packages/ts/model_loader.py", line 135, in load
2023-09-06T04:41:57,904 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -     initialize_fn(service.context)
2023-09-06T04:41:57,904 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -   File "/home/model-server/tmp/models/8e87b4f37759452ca1f376b0c1acb10a/custom_handler.py", line 55, in initialize
2023-09-06T04:41:57,904 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -     ds_engine = get_ds_engine(self.model, ctx)
2023-09-06T04:41:57,904 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -   File "/opt/conda/lib/python3.10/site-packages/ts/handler_utils/distributed/deepspeed.py", line 35, in get_ds_engine
2023-09-06T04:41:57,904 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -     ds_engine = deepspeed.init_inference(
2023-09-06T04:41:57,904 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -   File "/opt/conda/lib/python3.10/site-packages/deepspeed/__init__.py", line 342, in init_inference
2023-09-06T04:41:57,904 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -     engine = InferenceEngine(model, config=ds_inference_config)
2023-09-06T04:41:57,904 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -   File "/opt/conda/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 154, in __init__
2023-09-06T04:41:57,904 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -     self.module.to(device)
2023-09-06T04:41:57,904 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -   File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2053, in to
2023-09-06T04:41:57,904 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -     return super().to(*args, **kwargs)
2023-09-06T04:41:57,905 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1145, in to
2023-09-06T04:41:57,905 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -     return self._apply(convert)
2023-09-06T04:41:57,905 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
2023-09-06T04:41:57,905 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -     module._apply(fn)
2023-09-06T04:41:57,905 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
2023-09-06T04:41:57,905 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -     module._apply(fn)
2023-09-06T04:41:57,905 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
2023-09-06T04:41:57,905 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -     module._apply(fn)
2023-09-06T04:41:57,905 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 820, in _apply
2023-09-06T04:41:57,905 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -     param_applied = fn(param)
2023-09-06T04:41:57,905 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -   File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1143, in convert
2023-09-06T04:41:57,905 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -     return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
2023-09-06T04:41:57,905 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - NotImplementedError: Cannot copy out of meta tensor; no data!
2023-09-06T04:41:57,911 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Backend worker process died.
2023-09-06T04:41:57,911 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Traceback (most recent call last):
2023-09-06T04:41:57,911 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -   File "/opt/conda/lib/python3.10/site-packages/ts/model_service_worker.py", line 253, in <module>
2023-09-06T04:41:57,911 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -     worker.run_server()
2023-09-06T04:41:57,911 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -   File "/opt/conda/lib/python3.10/site-packages/ts/model_service_worker.py", line 221, in run_server
2023-09-06T04:41:57,911 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -     self.handle_connection(cl_socket)
2023-09-06T04:41:57,911 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -   File "/opt/conda/lib/python3.10/site-packages/ts/model_service_worker.py", line 189, in handle_connection
2023-09-06T04:41:57,911 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG -     raise RuntimeError("{} - {}".format(code, result))
2023-09-06T04:41:57,911 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - RuntimeError: 500 - Unknown exception

sachanub commented 1 year ago

Hi @agunapal. I had a chat with @lxning and she mentioned that I should run DS_BUILD_OPS=1 pip3 install deepspeed to install deepspeed. When I tried that in a container based on 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.0.1-gpu-py310-cu118-ubuntu20.04-ec2, I get the following error:

Collecting deepspeed
  Using cached deepspeed-0.10.2.tar.gz (858 kB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [14 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/home/model-server/tmp/pip-install-2n_tfsbb/deepspeed_908f51b8c7b5451285bff9683517c68d/setup.py", line 182, in <module>
          abort(f"Unable to pre-compile {op_name}")
        File "/home/model-server/tmp/pip-install-2n_tfsbb/deepspeed_908f51b8c7b5451285bff9683517c68d/setup.py", line 52, in abort
          assert False, msg
      AssertionError: Unable to pre-compile async_io
      DS_BUILD_OPS=1
       [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
       [WARNING]  async_io: please install the libaio-dev package with apt
       [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
       [WARNING]  One can disable async_io with DS_BUILD_AIO=0
       [ERROR]  Unable to pre-compile async_io
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

Following the recommendations in the above warnings and error, I tried installing deepspeed by running DS_BUILD_AIO=0 DS_BUILD_OPS=1 pip3 install deepspeed. That leads to the following error:

Collecting deepspeed
  Using cached deepspeed-0.10.2.tar.gz (858 kB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [20 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/home/model-server/tmp/pip-install-iwok_6_k/deepspeed_46b40e3ea3254642ba833898b4f23061/setup.py", line 182, in <module>
          abort(f"Unable to pre-compile {op_name}")
        File "/home/model-server/tmp/pip-install-iwok_6_k/deepspeed_46b40e3ea3254642ba833898b4f23061/setup.py", line 52, in abort
          assert False, msg
      AssertionError: Unable to pre-compile sparse_attn
      DS_BUILD_OPS=1
       [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
       [WARNING]  async_io: please install the libaio-dev package with apt
       [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
       [WARNING]  cpu_adam attempted to use `py-cpuinfo` but failed (exception type: <class 'UnboundLocalError'>, local variable 'get_cpu_info' referenced before assignment), falling back to `lscpu` to get this information.
       [WARNING]  cpu_adam attempted to use `py-cpuinfo` but failed (exception type: <class 'UnboundLocalError'>, local variable 'get_cpu_info' referenced before assignment), falling back to `lscpu` to get this information.
       [WARNING]  cpu_adagrad attempted to use `py-cpuinfo` but failed (exception type: <class 'UnboundLocalError'>, local variable 'get_cpu_info' referenced before assignment), falling back to `lscpu` to get this information.
       [WARNING]  cpu_adagrad attempted to use `py-cpuinfo` but failed (exception type: <class 'UnboundLocalError'>, local variable 'get_cpu_info' referenced before assignment), falling back to `lscpu` to get this information.
       [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
       [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
       [WARNING]  One can disable sparse_attn with DS_BUILD_SPARSE_ATTN=0
       [ERROR]  Unable to pre-compile sparse_attn
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

Finally, I tried installing deepspeed by running DS_BUILD_SPARSE_ATTN=0 DS_BUILD_AIO=0 DS_BUILD_OPS=1 pip3 install deepspeed, and it again fails with the error captured in the attached file (the message was too long to include in this comment): error_log.txt

What would be the recommended procedure for installing deepspeed in this case? Should I use older TorchServe and PyTorch versions (<2.0.1), since one of the warning messages is the following:

[WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0

agunapal commented 1 year ago

For reference, when I updated the example, this was my config:

torch                   2.0.1+cu117
torch-model-archiver    0.8.1b20230622
torch-workflow-archiver 0.2.9b20230622
torchaudio              2.0.2+cu117
torchdata               0.6.1
torchpippy              0.1.1
torchserve              0.8.1b20230622
torchtext               0.15.2+cpu
torchvision             0.15.2+cu117
deepspeed               0.9.4
transformers            4.30.0

agunapal commented 1 year ago

I did a fresh install with these updated versions. It's working for me:

torch                   2.0.1+cu117
torch-model-archiver    0.8.2
torch-workflow-archiver 0.2.10
torchaudio              2.0.2+cu117
torchdata               0.6.1
torchserve              0.8.2
torchtext               0.15.2+cpu
torchvision             0.15.2+cu117
transformers            4.33.1
deepspeed               0.10.2

@sachanub The only difference I see is the CUDA version. Can you please try with CUDA 11.7?
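
A minimal sanity-check script along these lines (nothing from the example, just a suggestion) prints the versions that matter here, including the CUDA version torch was built against:

    import torch
    import torchvision
    import transformers
    import deepspeed

    # torch.version.cuda is the CUDA toolkit torch was built against
    # (11.7 vs 11.8 is the difference being discussed here).
    print("torch        :", torch.__version__)
    print("torch CUDA   :", torch.version.cuda)
    print("torchvision  :", torchvision.__version__)
    print("transformers :", transformers.__version__)
    print("deepspeed    :", deepspeed.__version__)
    print("GPUs visible :", torch.cuda.device_count())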

sachanub commented 1 year ago

My container has the following versions:

torch                     2.0.1+cu117
torch-model-archiver      0.8.2
torch-workflow-archiver   0.2.10
torchaudio                2.0.2+cu117
torchdata                 0.6.1
torchserve                0.8.2
torchtext                 0.15.2+cpu
torchvision               0.15.2+cu117
transformers              4.33.1
deepspeed                 0.10.2

With these, the old error went away. However, the worker stops after the checkpoint shards are loaded. Here are the logs:

2023-09-07T07:20:05,064 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - Loading 7 checkpoint shards:  86%|████████▌ | 6/7 [07:44<01:17, 77.50s/it]
2023-09-07T07:20:05,064 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - Loading 7 checkpoint shards: 100%|██████████| 7/7 [09:14<00:00, 81.55s/it]
2023-09-07T07:20:05,064 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - Loading 7 checkpoint shards: 100%|██████████| 7/7 [09:14<00:00, 79.25s/it]
2023-09-07T07:20:05,075 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - 
2023-09-07T07:20:05,075 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - checkpoint loading time at rank 3: 554.7325437068939 sec
2023-09-07T07:20:05,075 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - Loading 7 checkpoint shards: 100%|██████████| 7/7 [09:14<00:00, 81.59s/it]
2023-09-07T07:20:05,075 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - Loading 7 checkpoint shards: 100%|██████████| 7/7 [09:14<00:00, 79.25s/it]
2023-09-07T07:20:05,499 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - checkpoint loading time at rank 1: 555.2292792797089 sec
2023-09-07T07:20:05,499 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - 
2023-09-07T07:20:05,499 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - Loading 7 checkpoint shards: 100%|██████████| 7/7 [09:15<00:00, 81.69s/it]
2023-09-07T07:20:05,500 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - Loading 7 checkpoint shards: 100%|██████████| 7/7 [09:15<00:00, 79.32s/it]
2023-09-07T07:20:05,501 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - checkpoint loading time at rank 0: 555.1914641857147 sec
2023-09-07T07:20:05,501 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - 
2023-09-07T07:20:05,501 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - Loading 7 checkpoint shards: 100%|██████████| 7/7 [09:15<00:00, 81.74s/it]
2023-09-07T07:20:05,502 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - Loading 7 checkpoint shards: 100%|██████████| 7/7 [09:15<00:00, 79.31s/it]
2023-09-07T07:20:06,735 [INFO ] epollEventLoopGroup-5-3 org.pytorch.serve.wlm.WorkerThread - 29500 Worker disconnected. WORKER_STARTED
2023-09-07T07:20:06,735 [DEBUG] W-29500-opt_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2023-09-07T07:20:06,735 [DEBUG] W-29500-opt_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: null
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056) ~[?:?]
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133) ~[?:?]
        at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432) ~[?:?]
        at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:213) [model-server.jar:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]
2023-09-07T07:20:06,745 [WARN ] W-29500-opt_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: opt, error: Worker died.
2023-09-07T07:20:06,745 [DEBUG] W-29500-opt_1.0 org.pytorch.serve.wlm.WorkerThread - W-29500-opt_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2023-09-07T07:20:06,746 [INFO ] W-29500-opt_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery start timestamp: 1694071206746
2023-09-07T07:20:06,746 [WARN ] W-29500-opt_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-29500-opt_1.0-stderr
2023-09-07T07:20:06,746 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 29500 Worker disconnected. WORKER_STARTED
2023-09-07T07:20:06,746 [INFO ] epollEventLoopGroup-5-2 org.pytorch.serve.wlm.WorkerThread - 29500 Worker disconnected. WORKER_STARTED
2023-09-07T07:20:06,746 [INFO ] epollEventLoopGroup-5-4 org.pytorch.serve.wlm.WorkerThread - 29500 Worker disconnected. WORKER_STOPPED
2023-09-07T07:20:06,746 [WARN ] W-29500-opt_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-29500-opt_1.0-stdout
2023-09-07T07:20:06,747 [INFO ] W-29500-opt_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 29500 in 1 seconds.
2023-09-07T07:20:06,760 [INFO ] W-29500-opt_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-29500-opt_1.0-stderr
2023-09-07T07:20:06,760 [INFO ] W-29500-opt_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-29500-opt_1.0-stdout
2023-09-07T07:20:07,747 [DEBUG] W-29500-opt_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [torchrun, --nnodes, 1, --nproc-per-node, 4, --log-dir, /workspace/logs/torchelastic_ts, --rdzv-backend, c10d, --rdzv-id, opt_29500, --max-restarts, 1, /opt/conda/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.29500, --metrics-config, /opt/conda/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2023-09-07T07:20:09,267 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
2023-09-07T07:20:09,268 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
2023-09-07T07:20:09,268 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG -   entrypoint       : /opt/conda/lib/python3.9/site-packages/ts/model_service_worker.py
2023-09-07T07:20:09,268 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG -   min_nodes        : 1
2023-09-07T07:20:09,268 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG -   max_nodes        : 1
2023-09-07T07:20:09,268 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG -   nproc_per_node   : 4
2023-09-07T07:20:09,268 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG -   run_id           : opt_29500
2023-09-07T07:20:09,268 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG -   rdzv_backend     : c10d
2023-09-07T07:20:09,268 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG -   rdzv_endpoint    : 
2023-09-07T07:20:09,268 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG -   rdzv_configs     : {'timeout': 900}
2023-09-07T07:20:09,268 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG -   max_restarts     : 1
2023-09-07T07:20:09,268 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG -   monitor_interval : 5
2023-09-07T07:20:09,268 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG -   log_dir          : /workspace/logs/torchelastic_ts
2023-09-07T07:20:09,268 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG -   metrics_cfg      : {}
2023-09-07T07:20:09,270 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - 
2023-09-07T07:20:09,270 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /workspace/logs/torchelastic_ts/opt_29500_7ke_yeu7
2023-09-07T07:20:09,270 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python3.9
2023-09-07T07:20:09,270 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
2023-09-07T07:20:09,298 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
2023-09-07T07:20:09,298 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG -   restart_count=0
2023-09-07T07:20:09,298 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG -   master_addr=05b5b93d08ea
2023-09-07T07:20:09,298 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG -   master_port=56755
2023-09-07T07:20:09,298 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG -   group_rank=0
2023-09-07T07:20:09,298 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG -   group_world_size=1
2023-09-07T07:20:09,298 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG -   local_ranks=[0, 1, 2, 3]
2023-09-07T07:20:09,298 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG -   role_ranks=[0, 1, 2, 3]
2023-09-07T07:20:09,299 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG -   global_ranks=[0, 1, 2, 3]
2023-09-07T07:20:09,299 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG -   role_world_sizes=[4, 4, 4, 4]
2023-09-07T07:20:09,299 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG -   global_world_sizes=[4, 4, 4, 4]
2023-09-07T07:20:09,299 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - 
2023-09-07T07:20:09,299 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
2023-09-07T07:20:09,299 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - INFO:torch.distributed.elastic.agent.server.local_elastic_agent:Environment variable 'TORCHELASTIC_ENABLE_FILE_TIMER' not found. Do not start FileTimerServer.
2023-09-07T07:20:09,299 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /workspace/logs/torchelastic_ts/opt_29500_7ke_yeu7/attempt_0/0/error.json
2023-09-07T07:20:09,299 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /workspace/logs/torchelastic_ts/opt_29500_7ke_yeu7/attempt_0/1/error.json
2023-09-07T07:20:09,299 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /workspace/logs/torchelastic_ts/opt_29500_7ke_yeu7/attempt_0/2/error.json
2023-09-07T07:20:09,299 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /workspace/logs/torchelastic_ts/opt_29500_7ke_yeu7/attempt_0/3/error.json
2023-09-07T07:20:10,532 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=29500, pid=1226
2023-09-07T07:20:10,533 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.29500
2023-09-07T07:20:10,543 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Successfully loaded /opt/conda/lib/python3.9/site-packages/ts/configs/metrics.yaml.
2023-09-07T07:20:10,543 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - [PID]1226
2023-09-07T07:20:10,543 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Torch worker started.
2023-09-07T07:20:10,543 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Python runtime: 3.9.13
2023-09-07T07:20:10,543 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=29500, pid=1229
2023-09-07T07:20:10,544 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.29503
2023-09-07T07:20:10,551 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=29500, pid=1228
2023-09-07T07:20:10,551 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.29502
2023-09-07T07:20:10,553 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=29500, pid=1227
2023-09-07T07:20:10,553 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.29501
2023-09-07T07:20:10,554 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Successfully loaded /opt/conda/lib/python3.9/site-packages/ts/configs/metrics.yaml.
2023-09-07T07:20:10,554 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - [PID]1229
2023-09-07T07:20:10,554 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Torch worker started.
2023-09-07T07:20:10,554 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Python runtime: 3.9.13
2023-09-07T07:20:10,561 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Successfully loaded /opt/conda/lib/python3.9/site-packages/ts/configs/metrics.yaml.
2023-09-07T07:20:10,561 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - [PID]1228
2023-09-07T07:20:10,561 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Torch worker started.
2023-09-07T07:20:10,561 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Python runtime: 3.9.13
2023-09-07T07:20:10,563 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Successfully loaded /opt/conda/lib/python3.9/site-packages/ts/configs/metrics.yaml.
2023-09-07T07:20:10,564 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - [PID]1227
2023-09-07T07:20:10,564 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Torch worker started.
2023-09-07T07:20:10,564 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Python runtime: 3.9.13
2023-09-07T07:20:10,564 [DEBUG] W-29500-opt_1.0 org.pytorch.serve.wlm.WorkerThread - W-29500-opt_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2023-09-07T07:20:10,564 [INFO ] W-29500-opt_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.29500
2023-09-07T07:20:10,565 [INFO ] W-29500-opt_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.29501
2023-09-07T07:20:10,565 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 29500 Worker disconnected. WORKER_STARTED
2023-09-07T07:20:10,566 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.29500.
2023-09-07T07:20:10,566 [DEBUG] W-29500-opt_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2023-09-07T07:20:10,566 [DEBUG] W-29500-opt_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: AbstractBootstrap$PendingRegistrationPromise@36fdc4c2(incomplete)
        at io.netty.util.concurrent.DefaultPromise.await(DefaultPromise.java:243) ~[model-server.jar:?]
        at io.netty.channel.DefaultChannelPromise.await(DefaultChannelPromise.java:131) ~[model-server.jar:?]
        at io.netty.channel.DefaultChannelPromise.await(DefaultChannelPromise.java:30) ~[model-server.jar:?]
        at io.netty.util.concurrent.DefaultPromise.sync(DefaultPromise.java:403) ~[model-server.jar:?]
        at io.netty.channel.DefaultChannelPromise.sync(DefaultChannelPromise.java:119) ~[model-server.jar:?]
        at io.netty.channel.DefaultChannelPromise.sync(DefaultChannelPromise.java:30) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.WorkerThread.connect(WorkerThread.java:367) ~[model-server.jar:?]
        at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:183) [model-server.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]

agunapal commented 1 year ago

It's hard to say from these logs why it's failing. The worker is dying. Maybe monitor the memory/GPU memory.

I am using A10G and it loads faster than what I see in your logs. I haven't tested this on T4 or V100.

checkpoint loading time at rank 0: 307.3796021938324 sec
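
To watch GPU memory while the shards load, something like the sketch below can run on the host in a separate shell. It assumes the pynvml package is installed; polling nvidia-smi in a loop works just as well:

    import time

    import pynvml  # assumes pynvml (nvidia-ml-py) is installed

    pynvml.nvmlInit()
    handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
               for i in range(pynvml.nvmlDeviceGetCount())]

    # Print per-GPU memory use every few seconds to see whether a device
    # runs out of memory right before the worker dies.
    while True:
        used = [pynvml.nvmlDeviceGetMemoryInfo(h).used // 2**20 for h in handles]
        print("used MiB per GPU:", used)
        time.sleep(5)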

sachanub commented 1 year ago

I am also using the A10G (on a g5.24xlarge EC2 instance). In the beginning, while the checkpoints are loading, I get the following output from nvidia-smi:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    On  | 00000000:00:1B.0 Off |                    0 |
|  0%   31C    P0              59W / 300W |  15404MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A10G                    On  | 00000000:00:1C.0 Off |                    0 |
|  0%   30C    P0              58W / 300W |  15404MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A10G                    On  | 00000000:00:1D.0 Off |                    0 |
|  0%   31C    P0              59W / 300W |  15404MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A10G                    On  | 00000000:00:1E.0 Off |                    0 |
|  0%   31C    P0              61W / 300W |  15404MiB / 23028MiB |      4%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

However, after the checkpoints load and the worker dies, I get this output from nvidia-smi:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    On  | 00000000:00:1B.0 Off |                    0 |
|  0%   25C    P8              15W / 300W |      5MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A10G                    On  | 00000000:00:1C.0 Off |                    0 |
|  0%   25C    P8              15W / 300W |      5MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A10G                    On  | 00000000:00:1D.0 Off |                    0 |
|  0%   30C    P0              59W / 300W |  15474MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A10G                    On  | 00000000:00:1E.0 Off |                    0 |
|  0%   31C    P0              61W / 300W |  15474MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

Can you please share your exact setup with me? Which EC2 instance are you using? Are you running this example from a Docker container? If yes, can you share the URI of the base container on top of which you installed the other dependencies like deepspeed, transformers, etc.? Thanks!

sachanub commented 1 year ago

Hi @agunapal. I was able to run the DeepSpeed example on the g5.24xlarge EC2 instance by installing all the required dependencies directly on Ubuntu 20.04, including CUDA toolkit 11.7 and the other dependencies listed in your comment above. Please note that running the example with the same setup inside a Docker container still leads to the workers dying after the model and the checkpoints are loaded (as shown in the logs above). I think it is important to investigate why this example does not work inside the Docker container. It would also be very helpful to update the GitHub example (via a PR) to list the exact versions of all dependencies, not just transformers and deepspeed; for instance, the issues I hit with CUDA 11.8 were resolved by running with CUDA 11.7. This could help other customers avoid the issues I faced while running this example. Thanks!