Open consciousgaze opened 1 year ago
@HamidShojanazeri Have we tried pippy in a docker container? I am wondering if it is initializing local rank at all?
Is there any update?
@consciousgaze according to the log, torchrun is not started.
Hi,
I have updated the model-config.yaml and tried again. The model-config.yaml I use is:
#frontend settings
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 200
responseTimeout: 300
parallelType: "pp"
deviceType: "gpu"
torchrun:
    nproc-per-node: 1
#backend settings
pippy:
    rpc_timeout: 1800
    model_type: "HF"
    chunks: 1
    input_names: ["input_ids"]
    num_worker_threads: 128
handler:
    model_path: "/app/serve/examples/large_models/Huggingface_pippy/model/models--facebook--opt-30b/snapshots/ceea0a90ac0f6fae7c2c34bcb40477438c152546"
    index_filename: 'pytorch_model.bin.index.json'
    max_length: 50
    max_new_tokens: 60
    manual_seed: 40
    dtype: fp16
It still fails with KeyError: 'LOCAL_RANK'.
But I found a way to get it to go through. If I change nproc-per-node: 1 to nproc-per-node: 2, the model preparation finishes. Does nproc-per-node also control whether torchrun is used?
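For context on the KeyError above: torchrun exports LOCAL_RANK (along with RANK and WORLD_SIZE) to each process it launches, so a direct os.environ["LOCAL_RANK"] lookup raises KeyError in a worker that was started without torchrun. The following is a minimal sketch of a more defensive lookup, not the actual handler code; the helper names get_local_rank and require_local_rank are hypothetical.

```python
import os


def get_local_rank(default=None):
    """Return LOCAL_RANK as an int, or `default` when the variable is unset.

    Hypothetical helper: torchrun sets LOCAL_RANK for each process it spawns,
    so an unset value suggests the worker was not started through torchrun.
    """
    value = os.environ.get("LOCAL_RANK")
    if value is None:
        return default
    return int(value)


def require_local_rank():
    """Return LOCAL_RANK, or raise an error that points at the likely cause."""
    rank = get_local_rank()
    if rank is None:
        raise RuntimeError(
            "LOCAL_RANK is not set; the worker was probably not launched "
            "by torchrun. Check parallelType and the torchrun section of "
            "model-config.yaml."
        )
    return rank
```

Calling require_local_rank() at the top of a handler's initialization would replace the bare KeyError with a message that names the configuration to check.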
@consciousgaze I'm also looking into a similar issue where LOCAL_RANK is unavailable in a worker, and I found this PR: https://github.com/pytorch/serve/pull/2608, which looks related to this behavior. v0.9.0 contains the PR's change. Did you try that version?
[UPDATED] I could get LOCAL_RANK with v0.9.0. In my case, the config.yaml was missing parallelType. As a result, parallelLevel was not set (code).
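The two observations in this thread (a missing parallelType, and nproc-per-node: 1 not starting torchrun on 0.8.2) can be turned into a quick pre-flight check before archiving the model. This is only a sketch under those assumptions: check_pipeline_config is a hypothetical helper, it expects model-config.yaml to be already parsed into a dict, and its rules encode what was reported here rather than TorchServe's actual launch logic.

```python
def check_pipeline_config(config):
    """Return a list of warnings for a parsed model-config.yaml dict.

    The rules mirror observations from this thread only; they are not
    taken from TorchServe's source.
    """
    warnings = []

    # Reported above: without parallelType, parallelLevel was not set
    # and LOCAL_RANK never reached the worker.
    if not config.get("parallelType"):
        warnings.append("parallelType is missing")

    torchrun = config.get("torchrun") or {}
    nproc = torchrun.get("nproc-per-node")
    if nproc is None:
        warnings.append("torchrun.nproc-per-node is missing")
    elif int(nproc) < 2:
        # Reported above for the 0.8.2 image: 1 failed, 2 worked.
        warnings.append("nproc-per-node is below 2; torchrun may not start")

    return warnings


if __name__ == "__main__":
    example = {"parallelType": "pp", "torchrun": {"nproc-per-node": 1}}
    for message in check_pipeline_config(example):
        print("warning:", message)
```

An empty list means neither of the two conditions reported in this thread applies; it does not guarantee that torchrun will be used.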
🐛 Describe the bug
I used the Docker image of the latest TorchServe (https://hub.docker.com/layers/pytorch/torchserve/0.8.2-gpu/images/sha256-563e3d46b33091cdf1751e56387dfcc07fe8a8360343235d13489eb60c41f1f5?context=explore) to run the Hugging Face PiPPy OPT large-model example. I followed exactly the process described in the example, but I got a 'LOCAL_RANK' not found error.
Error logs
Installation instructions
I am using Docker. My Dockerfile looks like:
Model Packaging
I used the command in the example to package the model:
torch-model-archiver --model-name opt --version 1.0 --handler pippy_handler.py -r requirements.txt --config-file model-config.yaml --archive-format tgz
The only thing that may be worth mentioning is that I set nproc-per-node: 1 in model-config.yaml since I only want to use one GPU.
config.properties
I didn't specify config.properties; it's the default one in the Docker image. It looks like:
Versions
Repro instructions
Possible Solution
No response