sachanub opened this issue 1 year ago
Hi @sachanub This file is being generated at run time here https://github.com/pytorch/serve/blob/master/ts/handler_utils/distributed/deepspeed.py#L11C14-L11C14
Hi @agunapal If this file gets generated at run time, then what might be the reason for this error?
Failed to load model opt, exception Cannot copy out of meta tensor; no data!
Is this because of problematic versions of `transformers` and `deepspeed`?
I have seen this issue before. Let me try tomorrow and get back to you.
Thank you for the assistance @agunapal !
Basic question: have you made sure the snapshot name/folder/path matches the one in model-config.yaml?
Yes. I followed these steps:

1. Downloaded the model in `/serve/examples/large_models/deepspeed` by running `python ../utils/Download_model.py --model_path model --model_name facebook/opt-30b --revision main`.
2. Launched the container with `docker run -it --rm --gpus=all -v$(pwd):/workspace -v$(pwd)/model:/opt/ml/model 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.0.1-gpu-py310-cu118-ubuntu20.04-ec2 /bin/bash`.
3. Verified that the snapshot folder `/opt/ml/model/models--facebook--opt-30b/snapshots/ceea0a90ac0f6fae7c2c34bcb40477438c152546` exists in the container.
4. Created the `opt.tar.gz` file by running `torch-model-archiver --model-name opt --version 1.0 --handler custom_handler.py --extra-files ds-config.json -r requirements.txt --config-file opt/model-config.yaml --archive-format tgz`.
5. Moved `opt.tar.gz` to `model_store`.
6. Started TorchServe with `torchserve --start --ncs --model-store model_store --models opt.tar.gz`.
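As a quick sanity check for the snapshot question, a small sketch like the one below (the path is the one from my container; `snapshot_exists` is just a hypothetical helper name) could confirm the folder referenced in model-config.yaml is actually present before archiving:

```python
import os

def snapshot_exists(path: str) -> bool:
    """Return True if the Hugging Face snapshot directory is present on disk."""
    return os.path.isdir(path)

# Path from my container; this must match the model path in model-config.yaml.
SNAPSHOT = ("/opt/ml/model/models--facebook--opt-30b/snapshots/"
            "ceea0a90ac0f6fae7c2c34bcb40477438c152546")

if __name__ == "__main__":
    print("found" if snapshot_exists(SNAPSHOT) else "missing")
```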
After following these steps exactly, I still get the above-mentioned error:
2023-09-06T04:41:57,904 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Failed to load model opt, exception Cannot copy out of meta tensor; no data!
2023-09-06T04:41:57,904 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Traceback (most recent call last):
2023-09-06T04:41:57,904 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/ts/model_service_worker.py", line 131, in load_model
2023-09-06T04:41:57,904 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - service = model_loader.load(
2023-09-06T04:41:57,904 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/ts/model_loader.py", line 135, in load
2023-09-06T04:41:57,904 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - initialize_fn(service.context)
2023-09-06T04:41:57,904 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - File "/home/model-server/tmp/models/8e87b4f37759452ca1f376b0c1acb10a/custom_handler.py", line 55, in initialize
2023-09-06T04:41:57,904 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - ds_engine = get_ds_engine(self.model, ctx)
2023-09-06T04:41:57,904 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/ts/handler_utils/distributed/deepspeed.py", line 35, in get_ds_engine
2023-09-06T04:41:57,904 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - ds_engine = deepspeed.init_inference(
2023-09-06T04:41:57,904 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/deepspeed/__init__.py", line 342, in init_inference
2023-09-06T04:41:57,904 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - engine = InferenceEngine(model, config=ds_inference_config)
2023-09-06T04:41:57,904 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 154, in __init__
2023-09-06T04:41:57,904 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - self.module.to(device)
2023-09-06T04:41:57,904 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2053, in to
2023-09-06T04:41:57,904 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - return super().to(*args, **kwargs)
2023-09-06T04:41:57,905 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1145, in to
2023-09-06T04:41:57,905 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - return self._apply(convert)
2023-09-06T04:41:57,905 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
2023-09-06T04:41:57,905 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - module._apply(fn)
2023-09-06T04:41:57,905 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
2023-09-06T04:41:57,905 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - module._apply(fn)
2023-09-06T04:41:57,905 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
2023-09-06T04:41:57,905 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - module._apply(fn)
2023-09-06T04:41:57,905 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 820, in _apply
2023-09-06T04:41:57,905 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - param_applied = fn(param)
2023-09-06T04:41:57,905 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1143, in convert
2023-09-06T04:41:57,905 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
2023-09-06T04:41:57,905 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - NotImplementedError: Cannot copy out of meta tensor; no data!
2023-09-06T04:41:57,911 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Backend worker process died.
2023-09-06T04:41:57,911 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Traceback (most recent call last):
2023-09-06T04:41:57,911 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/ts/model_service_worker.py", line 253, in <module>
2023-09-06T04:41:57,911 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - worker.run_server()
2023-09-06T04:41:57,911 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/ts/model_service_worker.py", line 221, in run_server
2023-09-06T04:41:57,911 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - self.handle_connection(cl_socket)
2023-09-06T04:41:57,911 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.10/site-packages/ts/model_service_worker.py", line 189, in handle_connection
2023-09-06T04:41:57,911 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - raise RuntimeError("{} - {}".format(code, result))
2023-09-06T04:41:57,911 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - RuntimeError: 500 - Unknown exception
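For context, the `Cannot copy out of meta tensor; no data!` error at the bottom of this traceback comes from PyTorch itself: a tensor on the `meta` device records only shape and dtype, with no backing storage, so it cannot be copied to a real device. A minimal reproduction, independent of TorchServe and DeepSpeed:

```python
import torch

# A "meta" tensor has shape and dtype but carries no data.
t = torch.empty(3, device="meta")

try:
    # Moving a tensor requires copying its data, which a meta tensor lacks.
    t.to("cpu")
except NotImplementedError as e:
    print(type(e).__name__)  # same error class as in the worker log
```

Since DeepSpeed typically builds large models on the meta device and materializes the real weights from the checkpoint, the traceback suggests `init_inference` tried to move the module before the weights were actually loaded.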
Hi @agunapal. I had a chat with @lxning and she mentioned that I should run `DS_BUILD_OPS=1 pip3 install deepspeed` to install DeepSpeed. When I tried that in a container based on the 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.0.1-gpu-py310-cu118-ubuntu20.04-ec2 image, I get the following error:
Collecting deepspeed
Using cached deepspeed-0.10.2.tar.gz (858 kB)
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [14 lines of output]
Traceback (most recent call last):
File "<string>", line 2, in <module>
File "<pip-setuptools-caller>", line 34, in <module>
File "/home/model-server/tmp/pip-install-2n_tfsbb/deepspeed_908f51b8c7b5451285bff9683517c68d/setup.py", line 182, in <module>
abort(f"Unable to pre-compile {op_name}")
File "/home/model-server/tmp/pip-install-2n_tfsbb/deepspeed_908f51b8c7b5451285bff9683517c68d/setup.py", line 52, in abort
assert False, msg
AssertionError: Unable to pre-compile async_io
DS_BUILD_OPS=1
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] One can disable async_io with DS_BUILD_AIO=0
[ERROR] Unable to pre-compile async_io
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
× Encountered error while generating package metadata.
╰─> See above for output.
note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
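Following the first warning literally, the `async_io` op build needs the libaio development headers, so an alternative (a sketch I have not verified on this exact image) would be to install them before retrying the pre-compiled install:

```shell
# Provide the libaio headers the async_io op build is asking for,
# then retry installing DeepSpeed with pre-compiled ops.
apt-get update && apt-get install -y libaio-dev
DS_BUILD_OPS=1 pip3 install deepspeed
```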
Following the recommendations in the warnings and error above, I then tried installing deepspeed by running `DS_BUILD_AIO=0 DS_BUILD_OPS=1 pip3 install deepspeed`. That leads to the following error:
Collecting deepspeed
Using cached deepspeed-0.10.2.tar.gz (858 kB)
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [20 lines of output]
Traceback (most recent call last):
File "<string>", line 2, in <module>
File "<pip-setuptools-caller>", line 34, in <module>
File "/home/model-server/tmp/pip-install-iwok_6_k/deepspeed_46b40e3ea3254642ba833898b4f23061/setup.py", line 182, in <module>
abort(f"Unable to pre-compile {op_name}")
File "/home/model-server/tmp/pip-install-iwok_6_k/deepspeed_46b40e3ea3254642ba833898b4f23061/setup.py", line 52, in abort
assert False, msg
AssertionError: Unable to pre-compile sparse_attn
DS_BUILD_OPS=1
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] cpu_adam attempted to use `py-cpuinfo` but failed (exception type: <class 'UnboundLocalError'>, local variable 'get_cpu_info' referenced before assignment), falling back to `lscpu` to get this information.
[WARNING] cpu_adam attempted to use `py-cpuinfo` but failed (exception type: <class 'UnboundLocalError'>, local variable 'get_cpu_info' referenced before assignment), falling back to `lscpu` to get this information.
[WARNING] cpu_adagrad attempted to use `py-cpuinfo` but failed (exception type: <class 'UnboundLocalError'>, local variable 'get_cpu_info' referenced before assignment), falling back to `lscpu` to get this information.
[WARNING] cpu_adagrad attempted to use `py-cpuinfo` but failed (exception type: <class 'UnboundLocalError'>, local variable 'get_cpu_info' referenced before assignment), falling back to `lscpu` to get this information.
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
[WARNING] One can disable sparse_attn with DS_BUILD_SPARSE_ATTN=0
[ERROR] Unable to pre-compile sparse_attn
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
× Encountered error while generating package metadata.
╰─> See above for output.
note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
Finally, I tried installing deepspeed by running `DS_BUILD_SPARSE_ATTN=0 DS_BUILD_AIO=0 DS_BUILD_OPS=1 pip3 install deepspeed`, and it again fails with the error in the attached file (the message was too long to put in the comment): error_log.txt
What would be the recommended procedure for installing deepspeed in this case? Should I use older TorchServe and PyTorch versions (<2.0.1) since one of the warning messages is this:
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
For reference, when I updated the example, this was my config:
torch 2.0.1+cu117
torch-model-archiver 0.8.1b20230622
torch-workflow-archiver 0.2.9b20230622
torchaudio 2.0.2+cu117
torchdata 0.6.1
torchpippy 0.1.1
torchserve 0.8.1b20230622
torchtext 0.15.2+cpu
torchvision 0.15.2+cu117
deepspeed 0.9.4
transformers 4.30.0
Did a fresh install with these updated versions. It's working for me:
torch 2.0.1+cu117
torch-model-archiver 0.8.2
torch-workflow-archiver 0.2.10
torchaudio 2.0.2+cu117
torchdata 0.6.1
torchserve 0.8.2
torchtext 0.15.2+cpu
torchvision 0.15.2+cu117
transformers 4.33.1
deepspeed 0.10.2
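For anyone wanting to reproduce this working environment, pinning the pip-installable pieces from the list above would look roughly like the sketch below (torch/torchvision/torchaudio come from the cu117 wheel index, the rest from PyPI; torchdata/torchtext can be pinned similarly if needed):

```shell
# Pin the CUDA 11.7 builds of the core PyTorch packages.
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 \
    --index-url https://download.pytorch.org/whl/cu117

# Pin the serving stack to the versions reported as working.
pip install torchserve==0.8.2 torch-model-archiver==0.8.2 \
    torch-workflow-archiver==0.2.10 transformers==4.33.1 deepspeed==0.10.2
```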
@sachanub The only difference I see is the CUDA version. Can you please try with CUDA 11.7?
My container has the following versions:
torch 2.0.1+cu117
torch-model-archiver 0.8.2
torch-workflow-archiver 0.2.10
torchaudio 2.0.2+cu117
torchdata 0.6.1
torchserve 0.8.2
torchtext 0.15.2+cpu
torchvision 0.15.2+cu117
transformers 4.33.1
deepspeed 0.10.2
With these, the old error went away. However, the worker stops after the checkpoint shards are loaded. Here are the logs:
2023-09-07T07:20:05,064 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - Loading 7 checkpoint shards: 86%|████████▌ | 6/7 [07:44<01:17, 77.50s/it]
2023-09-07T07:20:05,064 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - Loading 7 checkpoint shards: 100%|██████████| 7/7 [09:14<00:00, 81.55s/it]
2023-09-07T07:20:05,064 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - Loading 7 checkpoint shards: 100%|██████████| 7/7 [09:14<00:00, 79.25s/it]
2023-09-07T07:20:05,075 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG -
2023-09-07T07:20:05,075 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - checkpoint loading time at rank 3: 554.7325437068939 sec
2023-09-07T07:20:05,075 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - Loading 7 checkpoint shards: 100%|██████████| 7/7 [09:14<00:00, 81.59s/it]
2023-09-07T07:20:05,075 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - Loading 7 checkpoint shards: 100%|██████████| 7/7 [09:14<00:00, 79.25s/it]
2023-09-07T07:20:05,499 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - checkpoint loading time at rank 1: 555.2292792797089 sec
2023-09-07T07:20:05,499 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG -
2023-09-07T07:20:05,499 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - Loading 7 checkpoint shards: 100%|██████████| 7/7 [09:15<00:00, 81.69s/it]
2023-09-07T07:20:05,500 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - Loading 7 checkpoint shards: 100%|██████████| 7/7 [09:15<00:00, 79.32s/it]
2023-09-07T07:20:05,501 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - checkpoint loading time at rank 0: 555.1914641857147 sec
2023-09-07T07:20:05,501 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG -
2023-09-07T07:20:05,501 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - Loading 7 checkpoint shards: 100%|██████████| 7/7 [09:15<00:00, 81.74s/it]
2023-09-07T07:20:05,502 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - Loading 7 checkpoint shards: 100%|██████████| 7/7 [09:15<00:00, 79.31s/it]
2023-09-07T07:20:06,735 [INFO ] epollEventLoopGroup-5-3 org.pytorch.serve.wlm.WorkerThread - 29500 Worker disconnected. WORKER_STARTED
2023-09-07T07:20:06,735 [DEBUG] W-29500-opt_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2023-09-07T07:20:06,735 [DEBUG] W-29500-opt_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: null
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056) ~[?:?]
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133) ~[?:?]
at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432) ~[?:?]
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:213) [model-server.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
2023-09-07T07:20:06,745 [WARN ] W-29500-opt_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: opt, error: Worker died.
2023-09-07T07:20:06,745 [DEBUG] W-29500-opt_1.0 org.pytorch.serve.wlm.WorkerThread - W-29500-opt_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2023-09-07T07:20:06,746 [INFO ] W-29500-opt_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery start timestamp: 1694071206746
2023-09-07T07:20:06,746 [WARN ] W-29500-opt_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-29500-opt_1.0-stderr
2023-09-07T07:20:06,746 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 29500 Worker disconnected. WORKER_STARTED
2023-09-07T07:20:06,746 [INFO ] epollEventLoopGroup-5-2 org.pytorch.serve.wlm.WorkerThread - 29500 Worker disconnected. WORKER_STARTED
2023-09-07T07:20:06,746 [INFO ] epollEventLoopGroup-5-4 org.pytorch.serve.wlm.WorkerThread - 29500 Worker disconnected. WORKER_STOPPED
2023-09-07T07:20:06,746 [WARN ] W-29500-opt_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-29500-opt_1.0-stdout
2023-09-07T07:20:06,747 [INFO ] W-29500-opt_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 29500 in 1 seconds.
2023-09-07T07:20:06,760 [INFO ] W-29500-opt_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-29500-opt_1.0-stderr
2023-09-07T07:20:06,760 [INFO ] W-29500-opt_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-29500-opt_1.0-stdout
2023-09-07T07:20:07,747 [DEBUG] W-29500-opt_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [torchrun, --nnodes, 1, --nproc-per-node, 4, --log-dir, /workspace/logs/torchelastic_ts, --rdzv-backend, c10d, --rdzv-id, opt_29500, --max-restarts, 1, /opt/conda/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.29500, --metrics-config, /opt/conda/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2023-09-07T07:20:09,267 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
2023-09-07T07:20:09,268 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
2023-09-07T07:20:09,268 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - entrypoint : /opt/conda/lib/python3.9/site-packages/ts/model_service_worker.py
2023-09-07T07:20:09,268 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - min_nodes : 1
2023-09-07T07:20:09,268 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - max_nodes : 1
2023-09-07T07:20:09,268 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - nproc_per_node : 4
2023-09-07T07:20:09,268 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - run_id : opt_29500
2023-09-07T07:20:09,268 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - rdzv_backend : c10d
2023-09-07T07:20:09,268 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - rdzv_endpoint :
2023-09-07T07:20:09,268 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - rdzv_configs : {'timeout': 900}
2023-09-07T07:20:09,268 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - max_restarts : 1
2023-09-07T07:20:09,268 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - monitor_interval : 5
2023-09-07T07:20:09,268 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - log_dir : /workspace/logs/torchelastic_ts
2023-09-07T07:20:09,268 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - metrics_cfg : {}
2023-09-07T07:20:09,270 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG -
2023-09-07T07:20:09,270 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /workspace/logs/torchelastic_ts/opt_29500_7ke_yeu7
2023-09-07T07:20:09,270 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python3.9
2023-09-07T07:20:09,270 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
2023-09-07T07:20:09,298 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
2023-09-07T07:20:09,298 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - restart_count=0
2023-09-07T07:20:09,298 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - master_addr=05b5b93d08ea
2023-09-07T07:20:09,298 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - master_port=56755
2023-09-07T07:20:09,298 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - group_rank=0
2023-09-07T07:20:09,298 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - group_world_size=1
2023-09-07T07:20:09,298 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - local_ranks=[0, 1, 2, 3]
2023-09-07T07:20:09,298 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - role_ranks=[0, 1, 2, 3]
2023-09-07T07:20:09,299 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - global_ranks=[0, 1, 2, 3]
2023-09-07T07:20:09,299 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - role_world_sizes=[4, 4, 4, 4]
2023-09-07T07:20:09,299 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - global_world_sizes=[4, 4, 4, 4]
2023-09-07T07:20:09,299 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG -
2023-09-07T07:20:09,299 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
2023-09-07T07:20:09,299 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - INFO:torch.distributed.elastic.agent.server.local_elastic_agent:Environment variable 'TORCHELASTIC_ENABLE_FILE_TIMER' not found. Do not start FileTimerServer.
2023-09-07T07:20:09,299 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /workspace/logs/torchelastic_ts/opt_29500_7ke_yeu7/attempt_0/0/error.json
2023-09-07T07:20:09,299 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /workspace/logs/torchelastic_ts/opt_29500_7ke_yeu7/attempt_0/1/error.json
2023-09-07T07:20:09,299 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /workspace/logs/torchelastic_ts/opt_29500_7ke_yeu7/attempt_0/2/error.json
2023-09-07T07:20:09,299 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /workspace/logs/torchelastic_ts/opt_29500_7ke_yeu7/attempt_0/3/error.json
2023-09-07T07:20:10,532 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=29500, pid=1226
2023-09-07T07:20:10,533 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.29500
2023-09-07T07:20:10,543 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Successfully loaded /opt/conda/lib/python3.9/site-packages/ts/configs/metrics.yaml.
2023-09-07T07:20:10,543 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - [PID]1226
2023-09-07T07:20:10,543 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Torch worker started.
2023-09-07T07:20:10,543 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Python runtime: 3.9.13
2023-09-07T07:20:10,543 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=29500, pid=1229
2023-09-07T07:20:10,544 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.29503
2023-09-07T07:20:10,551 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=29500, pid=1228
2023-09-07T07:20:10,551 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.29502
2023-09-07T07:20:10,553 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=29500, pid=1227
2023-09-07T07:20:10,553 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.29501
2023-09-07T07:20:10,554 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Successfully loaded /opt/conda/lib/python3.9/site-packages/ts/configs/metrics.yaml.
2023-09-07T07:20:10,554 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - [PID]1229
2023-09-07T07:20:10,554 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Torch worker started.
2023-09-07T07:20:10,554 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Python runtime: 3.9.13
2023-09-07T07:20:10,561 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Successfully loaded /opt/conda/lib/python3.9/site-packages/ts/configs/metrics.yaml.
2023-09-07T07:20:10,561 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - [PID]1228
2023-09-07T07:20:10,561 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Torch worker started.
2023-09-07T07:20:10,561 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Python runtime: 3.9.13
2023-09-07T07:20:10,563 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Successfully loaded /opt/conda/lib/python3.9/site-packages/ts/configs/metrics.yaml.
2023-09-07T07:20:10,564 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - [PID]1227
2023-09-07T07:20:10,564 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Torch worker started.
2023-09-07T07:20:10,564 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Python runtime: 3.9.13
2023-09-07T07:20:10,564 [DEBUG] W-29500-opt_1.0 org.pytorch.serve.wlm.WorkerThread - W-29500-opt_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2023-09-07T07:20:10,564 [INFO ] W-29500-opt_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.29500
2023-09-07T07:20:10,565 [INFO ] W-29500-opt_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.29501
2023-09-07T07:20:10,565 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 29500 Worker disconnected. WORKER_STARTED
2023-09-07T07:20:10,566 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.29500.
2023-09-07T07:20:10,566 [DEBUG] W-29500-opt_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2023-09-07T07:20:10,566 [DEBUG] W-29500-opt_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: AbstractBootstrap$PendingRegistrationPromise@36fdc4c2(incomplete)
at io.netty.util.concurrent.DefaultPromise.await(DefaultPromise.java:243) ~[model-server.jar:?]
at io.netty.channel.DefaultChannelPromise.await(DefaultChannelPromise.java:131) ~[model-server.jar:?]
at io.netty.channel.DefaultChannelPromise.await(DefaultChannelPromise.java:30) ~[model-server.jar:?]
at io.netty.util.concurrent.DefaultPromise.sync(DefaultPromise.java:403) ~[model-server.jar:?]
at io.netty.channel.DefaultChannelPromise.sync(DefaultChannelPromise.java:119) ~[model-server.jar:?]
at io.netty.channel.DefaultChannelPromise.sync(DefaultChannelPromise.java:30) ~[model-server.jar:?]
at org.pytorch.serve.wlm.WorkerThread.connect(WorkerThread.java:367) ~[model-server.jar:?]
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:183) [model-server.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
It's hard to say from these logs why it's failing; the worker is dying. Maybe monitor the memory/GPU memory.
I am using an A10G and it loads faster than what I see in your logs. I haven't tested this on T4 or V100.
checkpoint loading time at rank 0: 307.3796021938324 sec
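The memory-monitoring suggestion above can be scripted. One option (a sketch; `parse_used_mib` and `gpu_memory_used` are hypothetical helper names, and it assumes a driver recent enough to support `nvidia-smi --query-gpu`) is to poll and parse the used-memory column per GPU:

```python
import subprocess

def parse_used_mib(csv_text: str) -> list:
    """Parse `nvidia-smi --query-gpu=memory.used --format=csv` output
    into a list of per-GPU used-memory values in MiB."""
    lines = csv_text.strip().splitlines()[1:]  # skip the header row
    return [int(line.split()[0]) for line in lines]

def gpu_memory_used() -> list:
    """Query nvidia-smi once and return used memory (MiB) for each GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_used_mib(out)

if __name__ == "__main__":
    sample = "memory.used [MiB]\n15404 MiB\n15404 MiB\n"
    print(parse_used_mib(sample))  # -> [15404, 15404]
```

Polling this in a loop while the worker loads checkpoint shards would show whether a GPU runs out of memory right before the worker dies.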
I am also using the A10G (on a g5.24xlarge EC2 instance). In the beginning, while the checkpoints are loading, I get the following output from `nvidia-smi`:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A10G On | 00000000:00:1B.0 Off | 0 |
| 0% 31C P0 59W / 300W | 15404MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A10G On | 00000000:00:1C.0 Off | 0 |
| 0% 30C P0 58W / 300W | 15404MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A10G On | 00000000:00:1D.0 Off | 0 |
| 0% 31C P0 59W / 300W | 15404MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
| 0% 31C P0 61W / 300W | 15404MiB / 23028MiB | 4% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
However, after the checkpoints load and the worker dies, I get this output from `nvidia-smi`:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A10G On | 00000000:00:1B.0 Off | 0 |
| 0% 25C P8 15W / 300W | 5MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A10G On | 00000000:00:1C.0 Off | 0 |
| 0% 25C P8 15W / 300W | 5MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A10G On | 00000000:00:1D.0 Off | 0 |
| 0% 30C P0 59W / 300W | 15474MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
| 0% 31C P0 61W / 300W | 15474MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
Can you please share your exact setup with me? Which EC2 instance are you using? Are you running this example from a Docker container? If yes, can you share the URI of the base container on top of which you installed the other dependencies like `deepspeed`, `transformers`, etc.? Thanks!
Hi @agunapal. I was able to run the DeepSpeed example on the g5.24xlarge EC2 instance by installing all the required dependencies on Ubuntu 20.04, including CUDA toolkit 11.7 and the other dependencies listed in your comment above. Please note that running the example with the same setup in a Docker container still leads to the workers dying after the model and checkpoints are loaded (as shown in the logs in my earlier comment). I think it is important to investigate why this example does not work from the Docker container. Also, it would be very helpful to update the PR to list the exact versions of all the dependencies in the GitHub example, not just transformers and deepspeed (there were issues with this example on CUDA 11.8 that were resolved by running with CUDA 11.7). This could help other customers avoid the issues I faced while running this example. Thanks!
🐛 Describe the bug
I am trying to follow the example to perform inference with the OPT-30B model according to this example: https://github.com/pytorch/serve/tree/master/examples/large_models/deepspeed
However, as specified in the model-config.yaml file, a `checkpoints.json` file is required. This file gets used here: https://github.com/pytorch/serve/blob/master/ts/handler_utils/distributed/deepspeed.py#L40. As a result, the model fails to load. The error logs are attached below.
Error logs
Installation instructions
Docker image URI:
763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.0.1-gpu-py310-cu118-ubuntu20.04-ec2
EC2 instance: g5dn.24xlarge
Model Packaging
Created model artifact by following this example: https://github.com/pytorch/serve/tree/master/examples/large_models/deepspeed
config.properties
No response
Versions
Repro instructions
Please follow the instructions as mentioned here to reproduce this error: https://github.com/pytorch/serve/tree/master/examples/large_models/deepspeed
Possible Solution
No response