Closed: muellerzr closed this issue 1 year ago
Same issue with transformers==4.28.1, deepspeed==0.9.1 and accelerate==0.18.0
In the main transformers folder, run:
CUDA_VISIBLE_DEVICES="0,1" RUN_SLOW="yes" pytest -sv tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_basic_distributed_zero3_bf16
@pacman100 The test you shared passes for me. I've matched the version of each package you listed. Could you double check that this test is failing for you?
CUDA_VISIBLE_DEVICES="0,1" RUN_SLOW="yes" pytest -sv tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_basic_distributed_zero3_bf16
UPDATE:
Never mind, I didn't see there was a transformers branch I should be on: muellerzr-bring-deepspeed-back
Able to replicate on my side - working on a fix.
pip list
absl-py 1.4.0
accelerate 0.18.0
addict 2.4.0
aiofiles 22.1.0
aiohttp 3.8.3
aiosignal 1.2.0
aiosqlite 0.18.0
altair 4.2.0
ansible 7.1.0
ansible-core 2.14.1
ansible-vault 2.1.0
antlr4-python3-runtime 4.9.3
anyio 3.6.2
apex 0.1
appdirs 1.4.4
argon2-cffi 21.3.0
argon2-cffi-bindings 21.2.0
arrow 1.2.3
asttokens 2.0.8
async-timeout 4.0.2
attrs 22.1.0
audioread 3.0.0
Babel 2.11.0
backcall 0.2.0
backoff 2.2.1
base58 2.1.1
beautifulsoup4 4.11.1
bertviz 1.4.0
binaryornot 0.4.4
bitsandbytes 0.37.0
black 23.3.0
bleach 5.0.1
blessed 1.20.0
bokeh 2.4.3
boto3 1.26.64
botocore 1.29.64
Brotli 1.0.9
brotlipy 0.7.0
cachetools 5.2.0
certifi 2022.12.7
cffi 1.15.1
chardet 5.1.0
charset-normalizer 2.1.1
click 8.1.3
cmake 3.25.0
codecov 2.1.12
colorama 0.4.4
colorcet 3.0.1
coloredlogs 15.0.1
commonmark 0.9.1
contourpy 1.0.5
cookiecutter 2.1.1
coverage 6.5.0
coveralls 3.3.1
cryptography 37.0.1
cycler 0.11.0
cytoolz 0.12.0
datasets 2.9.0
debugpy 1.6.3
decorator 5.1.1
deepspeed 0.9.1
defusedxml 0.7.1
diffusers 0.12.0.dev0 /home/sourab/diffusers
dill 0.3.5.1
docker-pycreds 0.4.0
docopt 0.6.2
docutils 0.16
ecdsa 0.18.0
entrypoints 0.4
eth-hash 0.5.1
eth-keys 0.4.0
eth-typing 3.2.0
eth-utils 2.1.0
evaluate 0.2.2
exceptiongroup 1.0.4
execnet 1.9.0
executing 1.1.0
fastapi 0.89.1
fastjsonschema 2.16.2
ffmpy 0.3.0
filelock 3.9.0
fire 0.5.0
flatbuffers 23.1.21
flexgen 0.1.7
flit_core 3.6.0
fonttools 4.37.3
fqdn 1.5.1
frozenlist 1.3.1
fsspec 2022.8.2
ftfy 6.1.1
fuzzywuzzy 0.18.0
gitdb 4.0.9
GitPython 3.1.27
gmpy2 2.1.2
google-api-core 2.8.2
google-api-python-client 2.69.0
google-auth 2.15.0
google-auth-httplib2 0.1.0
googleapis-common-protos 1.56.4
gpustat 1.1
gradio 3.25.0
gradio_client 0.1.3
grpcio 1.42.0
grpcio-tools 1.42.0
h11 0.14.0
hjson 3.1.0
holoviews 1.15.4
httpcore 0.16.3
httplib2 0.21.0
httpx 0.23.3
huggingface-hub 0.13.4
humanfriendly 10.0
hydra-core 1.3.0
hypothesis 6.61.0
idna 3.4
importlib-metadata 6.0.0
inflate64 0.3.1
iniconfig 1.1.1
ipykernel 6.16.0
ipython 8.5.0
ipython-genutils 0.2.0
ipywidgets 8.0.2
isoduration 20.11.0
jedi 0.18.1
Jinja2 3.1.2
jinja2-time 0.2.0
jiwer 2.5.1
jmespath 1.0.1
joblib 1.2.0
json5 0.9.11
jsonpointer 2.3
jsonschema 4.17.3
jupyter 1.0.0
jupyter_client 8.0.2
jupyter-console 6.4.4
jupyter_core 5.2.0
jupyter-events 0.5.0
jupyter_server 2.2.1
jupyter_server_fileid 0.6.0
jupyter_server_terminals 0.4.4
jupyter_server_ydoc 0.6.1
jupyter-ydoc 0.2.2
jupyterlab 3.6.1
jupyterlab-pygments 0.2.2
jupyterlab_server 2.19.0
jupyterlab-widgets 3.0.3
kiwisolver 1.4.4
Levenshtein 0.20.2
librosa 0.9.2
linkify-it-py 1.0.3
lit 15.0.7
llama-cpp-python 0.1.34
llvmlite 0.39.1
loguru 0.6.0
loralib 0.1.1
lxml 4.9.1
Markdown 3.4.3
markdown-it-py 2.1.0
MarkupSafe 2.1.1
matplotlib 3.6.0
matplotlib-inline 0.1.6
mdit-py-plugins 0.3.3
mdurl 0.1.2
megatron-lm 3.0.0 /home/sourab/Megatron-LM
miniupnpc 2.0.2
mistune 2.0.4
mkl-fft 1.3.1
mkl-random 1.2.2
mkl-service 2.4.0
more-itertools 9.0.0
mpmath 1.2.1
msgpack 1.0.4
msgpack-numpy 0.4.7.1
multidict 6.0.2
multiprocess 0.70.13
multivolumefile 0.2.3
munch 2.5.0
mypy-extensions 1.0.0
nbclassic 0.5.1
nbclient 0.6.8
nbconvert 7.0.0
nbformat 5.6.1
nest-asyncio 1.5.5
netaddr 0.8.0
networkx 3.0rc1
ninja 1.10.2.3
nltk 3.8.1
notebook 6.4.12
notebook_shim 0.2.2
numba 0.56.4
numpy 1.24.1
nvidia-ml-py 11.525.112
omegaconf 2.3.0
onnx 1.13.0
onnxruntime-gpu 1.13.1
optimum 1.8.2
orjson 3.8.5
packaging 23.1
pandarallel 1.6.3
pandas 1.5.0
pandocfilters 1.5.0
panel 0.14.4
param 1.13.0
parameterized 0.8.1
parso 0.8.3
password-strength 0.0.3.post2
pathspec 0.11.1
pathtools 0.1.2
peft 0.3.0.dev0 /home/sourab/pet
pexpect 4.8.0
pickleshare 0.7.5
Pillow 9.5.0
pip 23.0.1
platformdirs 2.6.2
pluggy 1.0.0
pooch 1.6.0
portalocker 2.5.1
prometheus-client 0.14.1
promise 2.3
prompt-toolkit 3.0.31
protobuf 3.20.2
psutil 5.9.2
ptyprocess 0.7.0
PuLP 2.7.0
pure-eval 0.2.2
py 1.11.0
py-bip39-bindings 0.1.10
py-cpuinfo 8.0.0
py-ed25519-bindings 1.0.2
py-sr25519-bindings 0.2.0
py7zr 0.20.2
pyarrow 9.0.0
pyasn1 0.4.8
pyasn1-modules 0.2.8
pybcj 1.0.1
pybind11 2.10.0
pycparser 2.21
pycryptodome 3.11.0
pycryptodomex 3.16.0
pyct 0.5.0
pydantic 1.10.2
pydub 0.25.1
Pygments 2.14.0
pyOpenSSL 22.0.0
pyparsing 3.0.9
pyppmd 1.0.0
pyrsistent 0.18.1
PySocks 1.7.1
pytesseract 0.3.10
pytest 7.2.0
pytest-cov 4.0.0
pytest-rerunfailures 10.3
pytest-split 0.8.0
pytest-xdist 3.1.0
python-dateutil 2.8.2
python-json-logger 2.0.4
python-Levenshtein 0.12.1
python-multipart 0.0.5
python-slugify 7.0.0
pytorch-triton 2.0.0+b8b470bc59
pytz 2022.2.1
pyviz-comms 2.2.1
PyYAML 5.4.1
pyzmq 24.0.1
pyzstd 0.15.3
qqdm 0.0.7
qtconsole 5.3.2
QtPy 2.2.0
rapidfuzz 2.13.7
regex 2022.9.13
requests 2.28.1
resampy 0.4.2
resolvelib 0.8.1
responses 0.18.0
retry 0.9.2
rfc3339-validator 0.1.4
rfc3986 1.5.0
rfc3986-validator 0.1.1
rich 13.3.1
rouge-score 0.1.2
rsa 4.7.2
rwkv 0.7.3
s3transfer 0.6.0
sacrebleu 2.2.1
safetensors 0.3.0
scalecodec 1.0.48
scikit-learn 1.1.3
scipy 1.9.1
seaborn 0.12.2
semantic-version 2.10.0
Send2Trash 1.8.0
sentencepiece 0.1.97
sentry-sdk 1.9.9
seqeval 1.2.2
setproctitle 1.3.2
setuptools 63.4.1
shortuuid 1.0.9
six 1.16.0
smmap 5.0.0
sniffio 1.3.0
sortedcontainers 2.4.0
soundfile 0.11.0
soupsieve 2.3.2.post1
stack-data 0.5.1
starlette 0.22.0
substrate-interface 1.2.4
sympy 1.11.1
tabulate 0.8.10
termcolor 2.1.1
terminado 0.15.0
text-unidecode 1.3
texttable 1.6.7
threadpoolctl 3.1.0
tinycss2 1.1.1
tokenize-rt 5.0.0
tokenizer 3.4.2
tokenizers 0.13.3
tomli 2.0.1
toolz 0.12.0
torch 2.0.0
torchaudio 2.0.0
torchvision 0.15.0
tornado 6.2
tqdm 4.64.1
traitlets 5.9.0
transformers 4.28.1
triton 2.0.0
trl 0.2.2.dev0 /home/sourab/trl
typer 0.7.0
typing_extensions 4.5.0
uc-micro-py 1.0.1
uri-template 1.2.0
uritemplate 4.1.1
urllib3 1.26.14
uvicorn 0.20.0
wandb 0.13.3
wcwidth 0.2.5
webcolors 1.12
webencodings 0.5.1
websocket-client 1.4.2
websockets 10.4
wheel 0.37.1
widgetsnbextension 4.0.3
xxhash 2.0.2
y-py 0.5.5
yarl 1.8.1
ypy-websocket 0.8.2
zipp 3.11.0
Command and output:
CUDA_VISIBLE_DEVICES="0,1" RUN_SLOW="yes" pytest -sv tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_basic_distributed_zero3_bf16
======================================= test session starts ========================================
platform linux -- Python 3.10.4, pytest-7.2.0, pluggy-1.0.0 -- /home/sourab/miniconda3/envs/ml/bin/python
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/home/sourab/transformers/.hypothesis/examples')
rootdir: /home/sourab/transformers, configfile: setup.cfg
plugins: anyio-3.6.2, rerunfailures-10.3, xdist-3.1.0, hypothesis-6.61.0, split-0.8.0, cov-4.0.0, hydra-core-1.3.0
collecting ...
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
collected 1 item
tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_basic_distributed_zero3_bf16
Running: deepspeed --num_nodes 1 --num_gpus 2 --master_port 10999 /home/sourab/transformers/examples/pytorch/translation/run_translation.py --model_name_or_path t5-small --train_file /home/sourab/transformers/tests/deepspeed/../fixtures/tests_samples/wmt_en_ro/train.json --validation_file /home/sourab/transformers/tests/deepspeed/../fixtures/tests_samples/wmt_en_ro/val.json --output_dir /tmp/tmpfc7yfgub --overwrite_output_dir --max_source_length 32 --max_target_length 32 --val_max_target_length 32 --warmup_steps 8 --predict_with_generate --save_steps 0 --eval_steps 10 --group_by_length --label_smoothing_factor 0.1 --source_lang en --target_lang ro --report_to none --source_prefix "translate English to Romanian: " --bf16 --do_train --num_train_epochs 1 --max_train_samples 16 --per_device_train_batch_size 2 --learning_rate 3e-3 --do_eval --max_eval_samples 16 --per_device_eval_batch_size 2 --deepspeed /home/sourab/transformers/tests/deepspeed/ds_config_zero3.json
stdout: [2023-04-21 23:15:37,900] [WARNING] [runner.py:190:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
stdout: Detected CUDA_VISIBLE_DEVICES=0,1 but ignoring it because one or several of --include/--exclude/--num_gpus/--num_nodes cl args were used. If you want to use CUDA_VISIBLE_DEVICES don't pass any of these arguments to deepspeed.
stdout: [2023-04-21 23:15:37,940] [INFO] [runner.py:540:main] cmd = /home/sourab/miniconda3/envs/ml/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=10999 --enable_each_rank_log=None /home/sourab/transformers/examples/pytorch/translation/run_translation.py --model_name_or_path t5-small --train_file /home/sourab/transformers/tests/deepspeed/../fixtures/tests_samples/wmt_en_ro/train.json --validation_file /home/sourab/transformers/tests/deepspeed/../fixtures/tests_samples/wmt_en_ro/val.json --output_dir /tmp/tmpfc7yfgub --overwrite_output_dir --max_source_length 32 --max_target_length 32 --val_max_target_length 32 --warmup_steps 8 --predict_with_generate --save_steps 0 --eval_steps 10 --group_by_length --label_smoothing_factor 0.1 --source_lang en --target_lang ro --report_to none --source_prefix "translate English to Romanian: " --bf16 --do_train --num_train_epochs 1 --max_train_samples 16 --per_device_train_batch_size 2 --learning_rate 3e-3 --do_eval --max_eval_samples 16 --per_device_eval_batch_size 2 --deepspeed /home/sourab/transformers/tests/deepspeed/ds_config_zero3.json
stdout: [2023-04-21 23:15:40,168] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1]}
stdout: [2023-04-21 23:15:40,168] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=2, node_rank=0
stdout: [2023-04-21 23:15:40,168] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
stdout: [2023-04-21 23:15:40,168] [INFO] [launch.py:247:main] dist_world_size=2
stdout: [2023-04-21 23:15:40,168] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1
stdout: 04/21/2023 23:15:44 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: False
stdout: 04/21/2023 23:15:44 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False
stdout: 04/21/2023 23:15:44 - INFO - __main__ - Training/evaluation parameters Seq2SeqTrainingArguments(
stdout: _n_gpu=1,
stdout: adafactor=False,
stdout: adam_beta1=0.9,
stdout: adam_beta2=0.999,
stdout: adam_epsilon=1e-08,
stdout: auto_find_batch_size=False,
stdout: bf16=True,
stdout: bf16_full_eval=False,
stdout: data_seed=None,
stdout: dataloader_drop_last=False,
stdout: dataloader_num_workers=0,
stdout: dataloader_pin_memory=True,
stdout: ddp_bucket_cap_mb=None,
stdout: ddp_find_unused_parameters=None,
stdout: ddp_timeout=1800,
stdout: debug=[],
stdout: deepspeed=/home/sourab/transformers/tests/deepspeed/ds_config_zero3.json,
stdout: disable_tqdm=False,
stdout: do_eval=True,
stdout: do_predict=False,
stdout: do_train=True,
stdout: eval_accumulation_steps=None,
stdout: eval_delay=0,
stdout: eval_steps=10,
stdout: evaluation_strategy=no,
stdout: fp16=False,
stdout: fp16_backend=auto,
stdout: fp16_full_eval=False,
stdout: fp16_opt_level=O1,
stdout: fsdp=[],
stdout: fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
stdout: fsdp_min_num_params=0,
stdout: fsdp_transformer_layer_cls_to_wrap=None,
stdout: full_determinism=False,
stdout: generation_config=None,
stdout: generation_max_length=None,
stdout: generation_num_beams=None,
stdout: gradient_accumulation_steps=1,
stdout: gradient_checkpointing=False,
stdout: greater_is_better=None,
stdout: group_by_length=True,
stdout: half_precision_backend=auto,
stdout: hub_model_id=None,
stdout: hub_private_repo=False,
stdout: hub_strategy=every_save,
stdout: hub_token=<HUB_TOKEN>,
stdout: ignore_data_skip=False,
stdout: include_inputs_for_metrics=False,
stdout: jit_mode_eval=False,
stdout: label_names=None,
stdout: label_smoothing_factor=0.1,
stdout: learning_rate=0.003,
stdout: length_column_name=length,
stdout: load_best_model_at_end=False,
stdout: local_rank=0,
stdout: log_level=passive,
stdout: log_level_replica=warning,
stdout: log_on_each_node=True,
stdout: logging_dir=/tmp/tmpfc7yfgub/runs/Apr21_23-15-43_hf-dgx-01,
stdout: logging_first_step=False,
stdout: logging_nan_inf_filter=True,
stdout: logging_steps=500,
stdout: logging_strategy=steps,
stdout: lr_scheduler_type=linear,
stdout: max_grad_norm=1.0,
stdout: max_steps=-1,
stdout: metric_for_best_model=None,
stdout: mp_parameters=,
stdout: no_cuda=False,
stdout: num_train_epochs=1.0,
stdout: optim=adamw_hf,
stdout: optim_args=None,
stdout: output_dir=/tmp/tmpfc7yfgub,
stdout: overwrite_output_dir=True,
stdout: past_index=-1,
stdout: per_device_eval_batch_size=2,
stdout: per_device_train_batch_size=2,
stdout: predict_with_generate=True,
stdout: prediction_loss_only=False,
stdout: push_to_hub=False,
stdout: push_to_hub_model_id=None,
stdout: push_to_hub_organization=None,
stdout: push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
stdout: ray_scope=last,
stdout: remove_unused_columns=True,
stdout: report_to=[],
stdout: resume_from_checkpoint=None,
stdout: run_name=/tmp/tmpfc7yfgub,
stdout: save_on_each_node=False,
stdout: save_safetensors=False,
stdout: save_steps=0,
stdout: save_strategy=steps,
stdout: save_total_limit=None,
stdout: seed=42,
stdout: sharded_ddp=[],
stdout: skip_memory_metrics=True,
stdout: sortish_sampler=False,
stdout: tf32=None,
stdout: torch_compile=False,
stdout: torch_compile_backend=None,
stdout: torch_compile_mode=None,
stdout: torchdynamo=None,
stdout: tpu_metrics_debug=False,
stdout: tpu_num_cores=None,
stdout: use_ipex=False,
stdout: use_legacy_prediction_loop=False,
stdout: use_mps_device=False,
stdout: warmup_ratio=0.0,
stdout: warmup_steps=8,
stdout: weight_decay=0.0,
stdout: xpu_backend=None,
stdout: )
stdout: 04/21/2023 23:15:44 - WARNING - datasets.builder - Using custom data configuration default-9598b3d69cbcd432
stdout: 04/21/2023 23:15:44 - WARNING - datasets.builder - Found cached dataset json (/home/sourab/.cache/huggingface/datasets/json/default-9598b3d69cbcd432/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
100%|██████████| 2/2 [00:00<00:00, 1229.82it/s]
stdout: 04/21/2023 23:15:44 - WARNING - datasets.builder - Using custom data configuration default-9598b3d69cbcd432
stdout: 04/21/2023 23:15:44 - INFO - datasets.info - Loading Dataset Infos from /home/sourab/miniconda3/envs/ml/lib/python3.10/site-packages/datasets/packaged_modules/json
stdout: 04/21/2023 23:15:44 - INFO - datasets.builder - Overwrite dataset info from restored data version.
stdout: 04/21/2023 23:15:44 - INFO - datasets.info - Loading Dataset info from /home/sourab/.cache/huggingface/datasets/json/default-9598b3d69cbcd432/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51
stdout: 04/21/2023 23:15:44 - WARNING - datasets.builder - Found cached dataset json (/home/sourab/.cache/huggingface/datasets/json/default-9598b3d69cbcd432/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
stdout: 04/21/2023 23:15:44 - INFO - datasets.info - Loading Dataset info from /home/sourab/.cache/huggingface/datasets/json/default-9598b3d69cbcd432/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51
100%|██████████| 2/2 [00:00<00:00, 1200.60it/s]
stderr: [INFO|configuration_utils.py:669] 2023-04-21 23:15:45,037 >> loading configuration file config.json from cache at /home/sourab/.cache/huggingface/hub/models--t5-small/snapshots/5bf53e1f76b1430d9302d735c613c5f5677e32a6/config.json
stderr: [INFO|configuration_utils.py:725] 2023-04-21 23:15:45,040 >> Model config T5Config {
stderr: "_name_or_path": "t5-small",
stderr: "architectures": [
stderr: "T5ForConditionalGeneration"
stderr: ],
stderr: "d_ff": 2048,
stderr: "d_kv": 64,
stderr: "d_model": 512,
stderr: "decoder_start_token_id": 0,
stderr: "dense_act_fn": "relu",
stderr: "dropout_rate": 0.1,
stderr: "eos_token_id": 1,
stderr: "feed_forward_proj": "relu",
stderr: "initializer_factor": 1.0,
stderr: "is_encoder_decoder": true,
stderr: "is_gated_act": false,
stderr: "layer_norm_epsilon": 1e-06,
stderr: "model_type": "t5",
stderr: "n_positions": 512,
stderr: "num_decoder_layers": 6,
stderr: "num_heads": 8,
stderr: "num_layers": 6,
stderr: "output_past": true,
stderr: "pad_token_id": 0,
stderr: "relative_attention_max_distance": 128,
stderr: "relative_attention_num_buckets": 32,
stderr: "task_specific_params": {
stderr: "summarization": {
stderr: "early_stopping": true,
stderr: "length_penalty": 2.0,
stderr: "max_length": 200,
stderr: "min_length": 30,
stderr: "no_repeat_ngram_size": 3,
stderr: "num_beams": 4,
stderr: "prefix": "summarize: "
stderr: },
stderr: "translation_en_to_de": {
stderr: "early_stopping": true,
stderr: "max_length": 300,
stderr: "num_beams": 4,
stderr: "prefix": "translate English to German: "
stderr: },
stderr: "translation_en_to_fr": {
stderr: "early_stopping": true,
stderr: "max_length": 300,
stderr: "num_beams": 4,
stderr: "prefix": "translate English to French: "
stderr: },
stderr: "translation_en_to_ro": {
stderr: "early_stopping": true,
stderr: "max_length": 300,
stderr: "num_beams": 4,
stderr: "prefix": "translate English to Romanian: "
stderr: }
stderr: },
stderr: "transformers_version": "4.29.0.dev0",
stderr: "use_cache": true,
stderr: "vocab_size": 32128
stderr: }
stderr:
stderr: [INFO|tokenization_auto.py:502] 2023-04-21 23:15:45,160 >> Could not locate the tokenizer configuration file, will try to use the model config instead.
stderr: [INFO|configuration_utils.py:669] 2023-04-21 23:15:45,272 >> loading configuration file config.json from cache at /home/sourab/.cache/huggingface/hub/models--t5-small/snapshots/5bf53e1f76b1430d9302d735c613c5f5677e32a6/config.json
stderr: [INFO|configuration_utils.py:725] 2023-04-21 23:15:45,273 >> Model config T5Config {
stderr: "_name_or_path": "t5-small",
stderr: "architectures": [
stderr: "T5ForConditionalGeneration"
stderr: ],
stderr: "d_ff": 2048,
stderr: "d_kv": 64,
stderr: "d_model": 512,
stderr: "decoder_start_token_id": 0,
stderr: "dense_act_fn": "relu",
stderr: "dropout_rate": 0.1,
stderr: "eos_token_id": 1,
stderr: "feed_forward_proj": "relu",
stderr: "initializer_factor": 1.0,
stderr: "is_encoder_decoder": true,
stderr: "is_gated_act": false,
stderr: "layer_norm_epsilon": 1e-06,
stderr: "model_type": "t5",
stderr: "n_positions": 512,
stderr: "num_decoder_layers": 6,
stderr: "num_heads": 8,
stderr: "num_layers": 6,
stderr: "output_past": true,
stderr: "pad_token_id": 0,
stderr: "relative_attention_max_distance": 128,
stderr: "relative_attention_num_buckets": 32,
stderr: "task_specific_params": {
stderr: "summarization": {
stderr: "early_stopping": true,
stderr: "length_penalty": 2.0,
stderr: "max_length": 200,
stderr: "min_length": 30,
stderr: "no_repeat_ngram_size": 3,
stderr: "num_beams": 4,
stderr: "prefix": "summarize: "
stderr: },
stderr: "translation_en_to_de": {
stderr: "early_stopping": true,
stderr: "max_length": 300,
stderr: "num_beams": 4,
stderr: "prefix": "translate English to German: "
stderr: },
stderr: "translation_en_to_fr": {
stderr: "early_stopping": true,
stderr: "max_length": 300,
stderr: "num_beams": 4,
stderr: "prefix": "translate English to French: "
stderr: },
stderr: "translation_en_to_ro": {
stderr: "early_stopping": true,
stderr: "max_length": 300,
stderr: "num_beams": 4,
stderr: "prefix": "translate English to Romanian: "
stderr: }
stderr: },
stderr: "transformers_version": "4.29.0.dev0",
stderr: "use_cache": true,
stderr: "vocab_size": 32128
stderr: }
stderr:
stderr: /home/sourab/transformers/src/transformers/models/t5/tokenization_t5_fast.py:155: FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5.
stderr: For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
stderr: - Be aware that you SHOULD NOT rely on t5-small automatically truncating your input to 512 when padding/encoding.
stderr: - If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
stderr: - To avoid this warning, please instantiate this tokenizer with `model_max_length` set to your preferred value.
stderr: warnings.warn(
stderr: /home/sourab/transformers/src/transformers/modeling_utils.py:429: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
stderr: with safe_open(checkpoint_file, framework="pt") as f:
stderr: /home/sourab/miniconda3/envs/ml/lib/python3.10/site-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
stderr: return self.fget.__get__(instance, owner)()
stderr: /home/sourab/miniconda3/envs/ml/lib/python3.10/site-packages/torch/storage.py:899: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
stderr: storage = cls(wrap_storage=untyped_storage)
stderr: /home/sourab/miniconda3/envs/ml/lib/python3.10/site-packages/safetensors/torch.py:99: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
stderr: with safe_open(filename, framework="pt", device=device) as f:
stderr: [INFO|tokenization_utils_base.py:1810] 2023-04-21 23:15:45,539 >> loading file spiece.model from cache at /home/sourab/.cache/huggingface/hub/models--t5-small/snapshots/5bf53e1f76b1430d9302d735c613c5f5677e32a6/spiece.model
stderr: [INFO|tokenization_utils_base.py:1810] 2023-04-21 23:15:45,539 >> loading file tokenizer.json from cache at /home/sourab/.cache/huggingface/hub/models--t5-small/snapshots/5bf53e1f76b1430d9302d735c613c5f5677e32a6/tokenizer.json
stderr: [INFO|tokenization_utils_base.py:1810] 2023-04-21 23:15:45,539 >> loading file added_tokens.json from cache at None
stderr: [INFO|tokenization_utils_base.py:1810] 2023-04-21 23:15:45,539 >> loading file special_tokens_map.json from cache at None
stderr: [INFO|tokenization_utils_base.py:1810] 2023-04-21 23:15:45,539 >> loading file tokenizer_config.json from cache at None
stderr: [INFO|configuration_utils.py:669] 2023-04-21 23:15:45,539 >> loading configuration file config.json from cache at /home/sourab/.cache/huggingface/hub/models--t5-small/snapshots/5bf53e1f76b1430d9302d735c613c5f5677e32a6/config.json
stderr: [INFO|configuration_utils.py:725] 2023-04-21 23:15:45,540 >> Model config T5Config {
stderr: "_name_or_path": "t5-small",
stderr: "architectures": [
stderr: "T5ForConditionalGeneration"
stderr: ],
stderr: "d_ff": 2048,
stderr: "d_kv": 64,
stderr: "d_model": 512,
stderr: "decoder_start_token_id": 0,
stderr: "dense_act_fn": "relu",
stderr: "dropout_rate": 0.1,
stderr: "eos_token_id": 1,
stderr: "feed_forward_proj": "relu",
stderr: "initializer_factor": 1.0,
stderr: "is_encoder_decoder": true,
stderr: "is_gated_act": false,
stderr: "layer_norm_epsilon": 1e-06,
stderr: "model_type": "t5",
stderr: "n_positions": 512,
stderr: "num_decoder_layers": 6,
stderr: "num_heads": 8,
stderr: "num_layers": 6,
stderr: "output_past": true,
stderr: "pad_token_id": 0,
stderr: "relative_attention_max_distance": 128,
stderr: "relative_attention_num_buckets": 32,
stderr: "task_specific_params": {
stderr: "summarization": {
stderr: "early_stopping": true,
stderr: "length_penalty": 2.0,
stderr: "max_length": 200,
stderr: "min_length": 30,
stderr: "no_repeat_ngram_size": 3,
stderr: "num_beams": 4,
stderr: "prefix": "summarize: "
stderr: },
stderr: "translation_en_to_de": {
stderr: "early_stopping": true,
stderr: "max_length": 300,
stderr: "num_beams": 4,
stderr: "prefix": "translate English to German: "
stderr: },
stderr: "translation_en_to_fr": {
stderr: "early_stopping": true,
stderr: "max_length": 300,
stderr: "num_beams": 4,
stderr: "prefix": "translate English to French: "
stderr: },
stderr: "translation_en_to_ro": {
stderr: "early_stopping": true,
stderr: "max_length": 300,
stderr: "num_beams": 4,
stderr: "prefix": "translate English to Romanian: "
stderr: }
stderr: },
stderr: "transformers_version": "4.29.0.dev0",
stderr: "use_cache": true,
stderr: "vocab_size": 32128
stderr: }
stderr:
stderr: /home/sourab/transformers/src/transformers/models/t5/tokenization_t5_fast.py:155: FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5.
stderr: For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
stderr: - Be aware that you SHOULD NOT rely on t5-small automatically truncating your input to 512 when padding/encoding.
stderr: - If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
stderr: - To avoid this warning, please instantiate this tokenizer with `model_max_length` set to your preferred value.
stderr: warnings.warn(
stderr: [INFO|modeling_t5.py:268] 2023-04-21 23:15:45,583 >> Discovered apex.normalization.FusedRMSNorm - will use it instead of T5LayerNorm
stderr: [INFO|modeling_utils.py:2534] 2023-04-21 23:15:45,585 >> loading weights file model.safetensors from cache at /home/sourab/.cache/huggingface/hub/models--t5-small/snapshots/5bf53e1f76b1430d9302d735c613c5f5677e32a6/model.safetensors
stderr: /home/sourab/transformers/src/transformers/modeling_utils.py:429: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
stderr: with safe_open(checkpoint_file, framework="pt") as f:
stderr: /home/sourab/miniconda3/envs/ml/lib/python3.10/site-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
stderr: return self.fget.__get__(instance, owner)()
stderr: /home/sourab/miniconda3/envs/ml/lib/python3.10/site-packages/torch/storage.py:899: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
stderr: storage = cls(wrap_storage=untyped_storage)
stderr: /home/sourab/miniconda3/envs/ml/lib/python3.10/site-packages/safetensors/torch.py:99: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
stderr: with safe_open(filename, framework="pt", device=device) as f:
stderr: [INFO|modeling_utils.py:2623] 2023-04-21 23:15:45,592 >> Detected DeepSpeed ZeRO-3: activating zero.init() for this model
stderr: ╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
stderr: │ /home/sourab/transformers/examples/pytorch/translation/run_translation.py:666 in <module> │
stderr: │ │
stderr: │ 663 │
stderr: │ 664 │
stderr: │ 665 if __name__ == "__main__": │
stderr: │ ❱ 666 │ main() │
stderr: │ 667 │
stderr: │ │
stderr: │ /home/sourab/transformers/examples/pytorch/translation/run_translation.py:378 in main │
stderr: │ │
stderr: │ 375 │ │ revision=model_args.model_revision, │
stderr: │ 376 │ │ use_auth_token=True if model_args.use_auth_token else None, │
stderr: │ 377 │ ) │
stderr: │ ❱ 378 │ model = AutoModelForSeq2SeqLM.from_pretrained( │
stderr: │ 379 │ │ model_args.model_name_or_path, │
stderr: │ 380 │ │ from_tf=bool(".ckpt" in model_args.model_name_or_path), │
stderr: │ 381 │ │ config=config, │
stderr: │ │
stderr: │ /home/sourab/transformers/src/transformers/models/auto/auto_factory.py:468 in from_pretrained │
stderr: │ │
stderr: │ 465 │ │ │ ) │
stderr: │ 466 │ │ elif type(config) in cls._model_mapping.keys(): │
stderr: │ 467 │ │ │ model_class = _get_model_class(config, cls._model_mapping) │
stderr: │ ❱ 468 │ │ │ return model_class.from_pretrained( │
stderr: │ 469 │ │ │ │ pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs, │
stderr: │ 470 │ │ │ ) │
stderr: │ 471 │ │ raise ValueError( │
stderr: │ │
stderr: │ /home/sourab/transformers/src/transformers/modeling_utils.py:2624 in from_pretrained │
stderr: │ │
stderr: │ 2621 │ │ │ import deepspeed │
stderr: │ 2622 │ │ │ │
stderr: │ 2623 │ │ │ logger.info("Detected DeepSpeed ZeRO-3: activating zero.init() for this mode │
stderr: │ ❱ 2624 │ │ │ init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config()) │
stderr: │ 2625 │ │ elif load_in_8bit or low_cpu_mem_usage: │
stderr: │ 2626 │ │ │ init_contexts.append(init_empty_weights()) │
stderr: │ 2627 │
stderr: │ │
stderr: │ /home/sourab/miniconda3/envs/ml/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_pa │
stderr: │ rameters.py:722 in __init__ │
stderr: │ │
stderr: │ 719 │ │ │ config_dict_or_path = config │
stderr: │ 720 │ │ │ logger.warning( │
stderr: │ 721 │ │ │ │ f'zero.Init: the `config` argument is deprecated. Please use `config_dic │
stderr: │ ❱ 722 │ │ _ds_config = deepspeed.runtime.config.DeepSpeedConfig(config_dict_or_path, │
stderr: │ 723 │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ mpu) if config_dict_or_pat │
stderr: │ 724 │ │ if _ds_config is not None: │
stderr: │ 725 │ │ │ mem_efficient_linear = _ds_config.zero_config.memory_efficient_linear │
stderr: │ │
stderr: │ /home/sourab/miniconda3/envs/ml/lib/python3.10/site-packages/deepspeed/runtime/config.py:764 in │
stderr: │ __init__ │
stderr: │ │
stderr: │ 761 │ │ │
stderr: │ 762 │ │ # Pass a copy so that user json is unmodified, e.g. for logging │
stderr: │ 763 │ │ self._initialize_params(copy.copy(self._param_dict)) │
stderr: │ ❱ 764 │ │ self._configure_train_batch_size() │
stderr: │ 765 │ │ self._do_sanity_check() │
stderr: │ 766 │ │
stderr: │ 767 │ def _initialize_params(self, param_dict): │
stderr: │ │
stderr: │ /home/sourab/miniconda3/envs/ml/lib/python3.10/site-packages/deepspeed/runtime/config.py:935 in │
stderr: │ _configure_train_batch_size │
stderr: │ │
stderr: │ 932 │ │
stderr: │ 933 │ def _configure_train_batch_size(self): │
stderr: │ 934 │ │ self._set_batch_related_parameters() │
stderr: │ ❱ 935 │ │ self._batch_assertion() │
stderr: │ 936 │ │
stderr: │ 937 │ def _do_sanity_check(self): │
stderr: │ 938 │ │ self._do_error_check() │
stderr: │ │
stderr: │ /home/sourab/miniconda3/envs/ml/lib/python3.10/site-packages/deepspeed/runtime/config.py:883 in │
stderr: │ _batch_assertion │
stderr: │ │
stderr: │ 880 │ │ │
stderr: │ 881 │ │ assert (grad_acc > 0), f"Gradient accumulation steps: {grad_acc} has to be great │
stderr: │ 882 │ │ │
stderr: │ ❱ 883 │ │ assert train_batch == micro_batch * grad_acc * self.world_size, ( │
stderr: │ 884 │ │ │ f"Check batch related parameters. train_batch_size is not equal " │
stderr: │ 885 │ │ │ "to micro_batch_per_gpu * gradient_acc_step * world_size " │
stderr: │ 886 │ │ │ f"{train_batch} != {micro_batch} * {grad_acc} * {self.world_size}") │
stderr: ╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
stderr: AssertionError: Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu
stderr: * gradient_acc_step * world_size 4 != 2 * 1 * 1
@pacman100 / @muellerzr do either of you know when this test was last passing? I thought this might be related to changes we made in v0.9.0, but that does not seem to be the case. The test is failing with deepspeed==0.8.3 as well:
deepspeed 0.8.3
torch 2.0.0.dev20230301+cu118
transformers 4.29.0.dev0
@mrwyattii, I don't know that offhand, but looking at git blame I could tell it was an issue with earlier DeepSpeed versions too; since the fix will be in the latest version, I just reported it against that.
Got it. In the branch that @muellerzr has created (muellerzr-bring-deepspeed-back), the call to deepspeed.init_distributed is removed. Why was that done?
@mrwyattii this is because we're integrating with Accelerate to handle all the distributed code in Trainer. You can see the code we use to set everything up on the Accelerate side here: https://github.com/huggingface/accelerate/blob/main/src/accelerate/state.py#L112-L129 (That env var was an oversight on our part, apologies!)
@pacman100 correct me if I'm wrong here, but with the Accelerate integration, if we're starting from Python etc. like we have, we need to use ACCELERATE_USE_DEEPSPEED="true" when launching the test, no? (To note, IIRC when I did this it still failed; rerunning now.)
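(Side note on the env var spelling, since it matters for the commands below: as @mrwyattii observes further down, ACCELERATE_USE_DEEPSPEED="yes" does not trigger the DeepSpeed branch while "true" does. A simplified illustration of that gate follows; the exact comparison is an assumption, not Accelerate's verbatim source.)

```python
# Simplified illustration (assumption, not Accelerate's exact source): only a
# literal "true" takes the DeepSpeed path in accelerate/state.py.
import os

def takes_deepspeed_path() -> bool:
    return os.environ.get("ACCELERATE_USE_DEEPSPEED", "false") == "true"

os.environ["ACCELERATE_USE_DEEPSPEED"] = "yes"
print(takes_deepspeed_path())  # False -> the process group is never initialized

os.environ["ACCELERATE_USE_DEEPSPEED"] = "true"
print(takes_deepspeed_path())  # True  -> init_process_group is attempted
```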
Doing so will make tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_basic_distributed_zero3_fp16 fail with the same error as stated. To replicate, please try running:
CUDA_VISIBLE_DEVICES="0,1" RUN_SLOW="yes" ACCELERATE_USE_DEEPSPEED="yes" pytest -sv tests/deepspeed/test_deepspeed.py -k test_basic_distributed
@muellerzr, yes, but in this case that is not the cause. deepspeed.launcher.launch is initialising the dist setup by creating n processes (world_size=n), but zero.Init is trying to load the DS config, which runs the train_batch validation, before it updates its global dist state.
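(A rough illustration of that ordering follows; it is a simplified stand-in, not DeepSpeed's actual code path: the launcher has already exported per-rank environment variables, but a runtime world-size query only reflects them once the process group exists, and zero.Init's config validation runs before that.)

```python
# Rough illustration (simplified stand-in, not DeepSpeed's actual code path).
import os
import torch.distributed as dist

def world_size_seen_by_config_validation() -> int:
    # deepspeed.launcher.launch has already set WORLD_SIZE/RANK/LOCAL_RANK in
    # each child process's environment (WORLD_SIZE=2 for this test)...
    _ = os.environ.get("WORLD_SIZE")
    # ...but until init_process_group() has run, a runtime query has to fall
    # back to a single-process view of the world.
    return dist.get_world_size() if dist.is_initialized() else 1

# Called from inside zero.Init, before the process group is up: returns 1,
# which is why train_batch_size=4 fails the 4 == 2 * 1 * world_size check.
```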
> Doing so will make tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_basic_distributed_zero3_fp16 fail with the same error as stated. To replicate, please try running: CUDA_VISIBLE_DEVICES="0,1" RUN_SLOW="yes" ACCELERATE_USE_DEEPSPEED="yes" pytest -sv tests/deepspeed/test_deepspeed.py -k test_basic_distributed
With ACCELERATE_USE_DEEPSPEED="yes" the test fails with the same batch size error, but I think that's because dist is never initialized. Setting ACCELERATE_USE_DEEPSPEED="true" will cause the DeepSpeed initialization to happen:
https://github.com/huggingface/accelerate/blob/565152183334f709ac955204ef663023d1f63b7a/src/accelerate/state.py#L112
In this case, the error I see is with the torch.distributed.init_process_group call on line 121:
../venv/lib/python3.8/site-packages/accelerate/state.py:122: in __init__
torch.distributed.init_process_group(backend=self.backend, **kwargs)
../../../.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:899: in init_process_group
store, rank, world_size = next(rendezvous_iterator)
../../../.local/lib/python3.8/site-packages/torch/distributed/rendezvous.py:235: in _env_rendezvous_handler
rank = int(_get_env_or_raise("RANK"))
../../../.local/lib/python3.8/site-packages/torch/distributed/rendezvous.py:220: in _get_env_or_raise
raise _env_error(env_var)
E ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
It seems the problem lies there. I don't think we have ever initialized dist from deepspeed.zero.Init, so I don't think a change on our side caused this error.
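(For context on the ordering being discussed: the pattern that avoids the assertion is to bring up the process group before the model is constructed, which is roughly what the removed Trainer call to deepspeed.init_distributed did. A minimal sketch, assuming a script started by the deepspeed launcher so the rank/world-size env vars are already set, and assuming the HF DeepSpeed ZeRO-3 config is active as in the failing test:)

```python
# Minimal sketch, assuming the script was started by the deepspeed launcher
# (which exports RANK/LOCAL_RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT).
import deepspeed
from transformers import AutoModelForSeq2SeqLM

# 1. Bring up the process group first; deepspeed.init_distributed() reads the
#    launcher's env vars and calls torch.distributed.init_process_group.
deepspeed.init_distributed()

# 2. Only then construct the model, so that when transformers' ZeRO-3
#    integration wraps from_pretrained in deepspeed.zero.Init (it only does so
#    when a ZeRO-3 config is active, as in the failing test), the config
#    validation sees the real world size (2) instead of falling back to 1.
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
```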
@mrwyattii, how so? It is failing with a previous version of transformers too, as I posted above.
@pacman100 The tests pass for me with transformers==4.28.1 and on latest main. It looks like the breaking change was this PR, where deepspeed.init_distributed was removed:
https://github.com/huggingface/transformers/pull/22752/files#diff-bfceaff300c851b8e24fc50dc6638482abaec8f7d2a718e877c3828c166bcf79L1626
And then that change was reverted here: https://github.com/huggingface/transformers/pull/22899/files#diff-bfceaff300c851b8e24fc50dc6638482abaec8f7d2a718e877c3828c166bcf79R1554
Hello @mrwyattii, thank you for the pointers 😄. In my env, even though pip list was showing transformers==4.28.1, it was actually using Zach's branch, which is weird. I can confirm that this isn't an issue with DeepSpeed, and this Accelerate PR, https://github.com/huggingface/accelerate/pull/1352, should fix the issues with the Trainer and DeepSpeed. We can close this issue.
Describe the bug
The same issue as https://github.com/microsoft/DeepSpeed/issues/3228, except for stage 3 with zero init.
To Reproduce
Steps to reproduce the behavior:
1. Install accelerate and transformers from source w/ the new Accelerate trainer integration (pip install git+https://github.com/huggingface/accelerate git+https://github.com/huggingface/transformers@muellerzr-bring-deepspeed-back)
2. In the transformers repo: CUDA_VISIBLE_DEVICES="0,1" RUN_SLOW="yes" pytest -sv tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_clm_from_config_zero3_fp16
3. The test fails with: stderr: AssertionError: Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size 4 != 2 * 1 * 1
Expected behavior
Integration test should pass.
ds_report output
Please run ds_report to give us details about your setup.
Screenshots
If applicable, add screenshots to help explain your problem.
System info (please complete the following information):
Launcher context
Are you launching your experiment with the deepspeed launcher, MPI, or something else? It uses the deepspeed launcher, as shown here for the test: https://github.com/huggingface/transformers/blob/main/tests/deepspeed/test_deepspeed.py#L119-L126
cc @pacman100 @stas00