microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] Zero3 with zero_init will error if the config is created before dist init #3341

Closed by muellerzr 1 year ago

muellerzr commented 1 year ago

Describe the bug: The same issue as https://github.com/microsoft/DeepSpeed/issues/3228, except for stage 3 with zero_init.

To Reproduce:

  1. Install accelerate and transformers from source with the new Accelerate trainer integration (pip install git+https://github.com/huggingface/accelerate git+https://github.com/huggingface/transformers@muellerzr-bring-deepspeed-back)
  2. Run the following test inside the transformers repo: CUDA_VISIBLE_DEVICES="0,1" RUN_SLOW="yes" pytest -sv tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_clm_from_config_zero3_fp16
  3. The test will fail with stderr: AssertionError: Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size 4 != 2 * 1 * 1
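
For anyone reading the assertion in step 3: the numbers are consistent with the batch check being evaluated with a world size of 1 rather than 2. The sketch below only illustrates that arithmetic. The function name `batch_params_consistent` is hypothetical and this is not DeepSpeed's actual code; the values 4, 2 and 1 come from the error message, and the world size of 2 is assumed from the two GPUs in CUDA_VISIBLE_DEVICES="0,1". That the world size falls back to 1 when the config is built before dist init is an assumption based on the issue title.

```python
def batch_params_consistent(train_batch_size, micro_batch_per_gpu,
                            gradient_acc_step, world_size):
    """Hypothetical helper mirroring the relation quoted in the assertion:
    train_batch_size == micro_batch_per_gpu * gradient_acc_step * world_size
    """
    return train_batch_size == micro_batch_per_gpu * gradient_acc_step * world_size


# Values taken from the error message: 4 != 2 * 1 * 1
TRAIN_BATCH_SIZE = 4
MICRO_BATCH_PER_GPU = 2
GRADIENT_ACC_STEP = 1

# Assumed: two processes once torch.distributed is initialized (GPUs 0 and 1).
ok_after_init = batch_params_consistent(
    TRAIN_BATCH_SIZE, MICRO_BATCH_PER_GPU, GRADIENT_ACC_STEP, world_size=2
)

# Assumed: world size seen as 1 when the config is created before dist init.
ok_before_init = batch_params_consistent(
    TRAIN_BATCH_SIZE, MICRO_BATCH_PER_GPU, GRADIENT_ACC_STEP, world_size=1
)

print(f"world_size=2 -> consistent: {ok_after_init}")
print(f"world_size=1 -> consistent: {ok_before_init}")
```

With a world size of 2 the relation holds (2 * 1 * 2 = 4); with a world size of 1 it does not (2 * 1 * 1 = 2), which matches the reported failure.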

Expected behavior: The integration test should pass.

ds_report output:

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/zach_mueller_huggingface_co/miniconda3/envs/accelerate/lib/python3.9/site-packages/torch']
torch version .................... 2.0.0+cu117
deepspeed install path ........... ['/home/zach_mueller_huggingface_co/miniconda3/envs/accelerate/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.9.1, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7

Launcher context: The test uses the deepspeed launcher, as shown here: https://github.com/huggingface/transformers/blob/main/tests/deepspeed/test_deepspeed.py#L119-L126

cc @pacman100 @stas00

pacman100 commented 1 year ago

Same issue with transformers==4.28.1, deepspeed==0.9.1 and accelerate==0.18.0

In the main transformers folder, run:

CUDA_VISIBLE_DEVICES="0,1" RUN_SLOW="yes" pytest -sv tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_basic_distributed_zero3_bf16

mrwyattii commented 1 year ago

@pacman100 The test you shared passes for me. I've matched the version of each package you listed. Could you double-check that this test is failing for you?

CUDA_VISIBLE_DEVICES="0,1" RUN_SLOW="yes" pytest -sv tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_basic_distributed_zero3_bf16

UPDATE: Never mind, I missed that I should be on the transformers branch muellerzr-bring-deepspeed-back.

Able to replicate on my side - working on a fix.

pacman100 commented 1 year ago

  1. pip list
    absl-py                  1.4.0
    accelerate               0.18.0
    addict                   2.4.0
    aiofiles                 22.1.0
    aiohttp                  3.8.3
    aiosignal                1.2.0
    aiosqlite                0.18.0
    altair                   4.2.0
    ansible                  7.1.0
    ansible-core             2.14.1
    ansible-vault            2.1.0
    antlr4-python3-runtime   4.9.3
    anyio                    3.6.2
    apex                     0.1
    appdirs                  1.4.4
    argon2-cffi              21.3.0
    argon2-cffi-bindings     21.2.0
    arrow                    1.2.3
    asttokens                2.0.8
    async-timeout            4.0.2
    attrs                    22.1.0
    audioread                3.0.0
    Babel                    2.11.0
    backcall                 0.2.0
    backoff                  2.2.1
    base58                   2.1.1
    beautifulsoup4           4.11.1
    bertviz                  1.4.0
    binaryornot              0.4.4
    bitsandbytes             0.37.0
    black                    23.3.0
    bleach                   5.0.1
    blessed                  1.20.0
    bokeh                    2.4.3
    boto3                    1.26.64
    botocore                 1.29.64
    Brotli                   1.0.9
    brotlipy                 0.7.0
    cachetools               5.2.0
    certifi                  2022.12.7
    cffi                     1.15.1
    chardet                  5.1.0
    charset-normalizer       2.1.1
    click                    8.1.3
    cmake                    3.25.0
    codecov                  2.1.12
    colorama                 0.4.4
    colorcet                 3.0.1
    coloredlogs              15.0.1
    commonmark               0.9.1
    contourpy                1.0.5
    cookiecutter             2.1.1
    coverage                 6.5.0
    coveralls                3.3.1
    cryptography             37.0.1
    cycler                   0.11.0
    cytoolz                  0.12.0
    datasets                 2.9.0
    debugpy                  1.6.3
    decorator                5.1.1
    deepspeed                0.9.1
    defusedxml               0.7.1
    diffusers                0.12.0.dev0      /home/sourab/diffusers
    dill                     0.3.5.1
    docker-pycreds           0.4.0
    docopt                   0.6.2
    docutils                 0.16
    ecdsa                    0.18.0
    entrypoints              0.4
    eth-hash                 0.5.1
    eth-keys                 0.4.0
    eth-typing               3.2.0
    eth-utils                2.1.0
    evaluate                 0.2.2
    exceptiongroup           1.0.4
    execnet                  1.9.0
    executing                1.1.0
    fastapi                  0.89.1
    fastjsonschema           2.16.2
    ffmpy                    0.3.0
    filelock                 3.9.0
    fire                     0.5.0
    flatbuffers              23.1.21
    flexgen                  0.1.7
    flit_core                3.6.0
    fonttools                4.37.3
    fqdn                     1.5.1
    frozenlist               1.3.1
    fsspec                   2022.8.2
    ftfy                     6.1.1
    fuzzywuzzy               0.18.0
    gitdb                    4.0.9
    GitPython                3.1.27
    gmpy2                    2.1.2
    google-api-core          2.8.2
    google-api-python-client 2.69.0
    google-auth              2.15.0
    google-auth-httplib2     0.1.0
    googleapis-common-protos 1.56.4
    gpustat                  1.1
    gradio                   3.25.0
    gradio_client            0.1.3
    grpcio                   1.42.0
    grpcio-tools             1.42.0
    h11                      0.14.0
    hjson                    3.1.0
    holoviews                1.15.4
    httpcore                 0.16.3
    httplib2                 0.21.0
    httpx                    0.23.3
    huggingface-hub          0.13.4
    humanfriendly            10.0
    hydra-core               1.3.0
    hypothesis               6.61.0
    idna                     3.4
    importlib-metadata       6.0.0
    inflate64                0.3.1
    iniconfig                1.1.1
    ipykernel                6.16.0
    ipython                  8.5.0
    ipython-genutils         0.2.0
    ipywidgets               8.0.2
    isoduration              20.11.0
    jedi                     0.18.1
    Jinja2                   3.1.2
    jinja2-time              0.2.0
    jiwer                    2.5.1
    jmespath                 1.0.1
    joblib                   1.2.0
    json5                    0.9.11
    jsonpointer              2.3
    jsonschema               4.17.3
    jupyter                  1.0.0
    jupyter_client           8.0.2
    jupyter-console          6.4.4
    jupyter_core             5.2.0
    jupyter-events           0.5.0
    jupyter_server           2.2.1
    jupyter_server_fileid    0.6.0
    jupyter_server_terminals 0.4.4
    jupyter_server_ydoc      0.6.1
    jupyter-ydoc             0.2.2
    jupyterlab               3.6.1
    jupyterlab-pygments      0.2.2
    jupyterlab_server        2.19.0
    jupyterlab-widgets       3.0.3
    kiwisolver               1.4.4
    Levenshtein              0.20.2
    librosa                  0.9.2
    linkify-it-py            1.0.3
    lit                      15.0.7
    llama-cpp-python         0.1.34
    llvmlite                 0.39.1
    loguru                   0.6.0
    loralib                  0.1.1
    lxml                     4.9.1
    Markdown                 3.4.3
    markdown-it-py           2.1.0
    MarkupSafe               2.1.1
    matplotlib               3.6.0
    matplotlib-inline        0.1.6
    mdit-py-plugins          0.3.3
    mdurl                    0.1.2
    megatron-lm              3.0.0            /home/sourab/Megatron-LM
    miniupnpc                2.0.2
    mistune                  2.0.4
    mkl-fft                  1.3.1
    mkl-random               1.2.2
    mkl-service              2.4.0
    more-itertools           9.0.0
    mpmath                   1.2.1
    msgpack                  1.0.4
    msgpack-numpy            0.4.7.1
    multidict                6.0.2
    multiprocess             0.70.13
    multivolumefile          0.2.3
    munch                    2.5.0
    mypy-extensions          1.0.0
    nbclassic                0.5.1
    nbclient                 0.6.8
    nbconvert                7.0.0
    nbformat                 5.6.1
    nest-asyncio             1.5.5
    netaddr                  0.8.0
    networkx                 3.0rc1
    ninja                    1.10.2.3
    nltk                     3.8.1
    notebook                 6.4.12
    notebook_shim            0.2.2
    numba                    0.56.4
    numpy                    1.24.1
    nvidia-ml-py             11.525.112
    omegaconf                2.3.0
    onnx                     1.13.0
    onnxruntime-gpu          1.13.1
    optimum                  1.8.2
    orjson                   3.8.5
    packaging                23.1
    pandarallel              1.6.3
    pandas                   1.5.0
    pandocfilters            1.5.0
    panel                    0.14.4
    param                    1.13.0
    parameterized            0.8.1
    parso                    0.8.3
    password-strength        0.0.3.post2
    pathspec                 0.11.1
    pathtools                0.1.2
    peft                     0.3.0.dev0       /home/sourab/pet
    pexpect                  4.8.0
    pickleshare              0.7.5
    Pillow                   9.5.0
    pip                      23.0.1
    platformdirs             2.6.2
    pluggy                   1.0.0
    pooch                    1.6.0
    portalocker              2.5.1
    prometheus-client        0.14.1
    promise                  2.3
    prompt-toolkit           3.0.31
    protobuf                 3.20.2
    psutil                   5.9.2
    ptyprocess               0.7.0
    PuLP                     2.7.0
    pure-eval                0.2.2
    py                       1.11.0
    py-bip39-bindings        0.1.10
    py-cpuinfo               8.0.0
    py-ed25519-bindings      1.0.2
    py-sr25519-bindings      0.2.0
    py7zr                    0.20.2
    pyarrow                  9.0.0
    pyasn1                   0.4.8
    pyasn1-modules           0.2.8
    pybcj                    1.0.1
    pybind11                 2.10.0
    pycparser                2.21
    pycryptodome             3.11.0
    pycryptodomex            3.16.0
    pyct                     0.5.0
    pydantic                 1.10.2
    pydub                    0.25.1
    Pygments                 2.14.0
    pyOpenSSL                22.0.0
    pyparsing                3.0.9
    pyppmd                   1.0.0
    pyrsistent               0.18.1
    PySocks                  1.7.1
    pytesseract              0.3.10
    pytest                   7.2.0
    pytest-cov               4.0.0
    pytest-rerunfailures     10.3
    pytest-split             0.8.0
    pytest-xdist             3.1.0
    python-dateutil          2.8.2
    python-json-logger       2.0.4
    python-Levenshtein       0.12.1
    python-multipart         0.0.5
    python-slugify           7.0.0
    pytorch-triton           2.0.0+b8b470bc59
    pytz                     2022.2.1
    pyviz-comms              2.2.1
    PyYAML                   5.4.1
    pyzmq                    24.0.1
    pyzstd                   0.15.3
    qqdm                     0.0.7
    qtconsole                5.3.2
    QtPy                     2.2.0
    rapidfuzz                2.13.7
    regex                    2022.9.13
    requests                 2.28.1
    resampy                  0.4.2
    resolvelib               0.8.1
    responses                0.18.0
    retry                    0.9.2
    rfc3339-validator        0.1.4
    rfc3986                  1.5.0
    rfc3986-validator        0.1.1
    rich                     13.3.1
    rouge-score              0.1.2
    rsa                      4.7.2
    rwkv                     0.7.3
    s3transfer               0.6.0
    sacrebleu                2.2.1
    safetensors              0.3.0
    scalecodec               1.0.48
    scikit-learn             1.1.3
    scipy                    1.9.1
    seaborn                  0.12.2
    semantic-version         2.10.0
    Send2Trash               1.8.0
    sentencepiece            0.1.97
    sentry-sdk               1.9.9
    seqeval                  1.2.2
    setproctitle             1.3.2
    setuptools               63.4.1
    shortuuid                1.0.9
    six                      1.16.0
    smmap                    5.0.0
    sniffio                  1.3.0
    sortedcontainers         2.4.0
    soundfile                0.11.0
    soupsieve                2.3.2.post1
    stack-data               0.5.1
    starlette                0.22.0
    substrate-interface      1.2.4
    sympy                    1.11.1
    tabulate                 0.8.10
    termcolor                2.1.1
    terminado                0.15.0
    text-unidecode           1.3
    texttable                1.6.7
    threadpoolctl            3.1.0
    tinycss2                 1.1.1
    tokenize-rt              5.0.0
    tokenizer                3.4.2
    tokenizers               0.13.3
    tomli                    2.0.1
    toolz                    0.12.0
    torch                    2.0.0
    torchaudio               2.0.0
    torchvision              0.15.0
    tornado                  6.2
    tqdm                     4.64.1
    traitlets                5.9.0
    transformers             4.28.1
    triton                   2.0.0
    trl                      0.2.2.dev0       /home/sourab/trl
    typer                    0.7.0
    typing_extensions        4.5.0
    uc-micro-py              1.0.1
    uri-template             1.2.0
    uritemplate              4.1.1
    urllib3                  1.26.14
    uvicorn                  0.20.0
    wandb                    0.13.3
    wcwidth                  0.2.5
    webcolors                1.12
    webencodings             0.5.1
    websocket-client         1.4.2
    websockets               10.4
    wheel                    0.37.1
    widgetsnbextension       4.0.3
    xxhash                   2.0.2
    y-py                     0.5.5
    yarl                     1.8.1
    ypy-websocket            0.8.2
    zipp                     3.11.0

Command and output:

CUDA_VISIBLE_DEVICES="0,1" RUN_SLOW="yes" pytest -sv tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_basic_distributed_zero3_bf16
======================================= test session starts ========================================
platform linux -- Python 3.10.4, pytest-7.2.0, pluggy-1.0.0 -- /home/sourab/miniconda3/envs/ml/bin/python
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/home/sourab/transformers/.hypothesis/examples')
rootdir: /home/sourab/transformers, configfile: setup.cfg
plugins: anyio-3.6.2, rerunfailures-10.3, xdist-3.1.0, hypothesis-6.61.0, split-0.8.0, cov-4.0.0, hydra-core-1.3.0
collecting ... 
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
collected 1 item                                                                                   

tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_basic_distributed_zero3_bf16 
Running:  deepspeed --num_nodes 1 --num_gpus 2 --master_port 10999 /home/sourab/transformers/examples/pytorch/translation/run_translation.py --model_name_or_path t5-small --train_file /home/sourab/transformers/tests/deepspeed/../fixtures/tests_samples/wmt_en_ro/train.json --validation_file /home/sourab/transformers/tests/deepspeed/../fixtures/tests_samples/wmt_en_ro/val.json --output_dir /tmp/tmpfc7yfgub --overwrite_output_dir --max_source_length 32 --max_target_length 32 --val_max_target_length 32 --warmup_steps 8 --predict_with_generate --save_steps 0 --eval_steps 10 --group_by_length --label_smoothing_factor 0.1 --source_lang en --target_lang ro --report_to none --source_prefix "translate English to Romanian: " --bf16 --do_train --num_train_epochs 1 --max_train_samples 16 --per_device_train_batch_size 2 --learning_rate 3e-3 --do_eval --max_eval_samples 16 --per_device_eval_batch_size 2 --deepspeed /home/sourab/transformers/tests/deepspeed/ds_config_zero3.json
stdout: [2023-04-21 23:15:37,900] [WARNING] [runner.py:190:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
stdout: Detected CUDA_VISIBLE_DEVICES=0,1 but ignoring it because one or several of --include/--exclude/--num_gpus/--num_nodes cl args were used. If you want to use CUDA_VISIBLE_DEVICES don't pass any of these arguments to deepspeed.
stdout: [2023-04-21 23:15:37,940] [INFO] [runner.py:540:main] cmd = /home/sourab/miniconda3/envs/ml/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=10999 --enable_each_rank_log=None /home/sourab/transformers/examples/pytorch/translation/run_translation.py --model_name_or_path t5-small --train_file /home/sourab/transformers/tests/deepspeed/../fixtures/tests_samples/wmt_en_ro/train.json --validation_file /home/sourab/transformers/tests/deepspeed/../fixtures/tests_samples/wmt_en_ro/val.json --output_dir /tmp/tmpfc7yfgub --overwrite_output_dir --max_source_length 32 --max_target_length 32 --val_max_target_length 32 --warmup_steps 8 --predict_with_generate --save_steps 0 --eval_steps 10 --group_by_length --label_smoothing_factor 0.1 --source_lang en --target_lang ro --report_to none --source_prefix "translate English to Romanian: " --bf16 --do_train --num_train_epochs 1 --max_train_samples 16 --per_device_train_batch_size 2 --learning_rate 3e-3 --do_eval --max_eval_samples 16 --per_device_eval_batch_size 2 --deepspeed /home/sourab/transformers/tests/deepspeed/ds_config_zero3.json
stdout: [2023-04-21 23:15:40,168] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1]}
stdout: [2023-04-21 23:15:40,168] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=2, node_rank=0
stdout: [2023-04-21 23:15:40,168] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
stdout: [2023-04-21 23:15:40,168] [INFO] [launch.py:247:main] dist_world_size=2
stdout: [2023-04-21 23:15:40,168] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1
stdout: 04/21/2023 23:15:44 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: False
stdout: 04/21/2023 23:15:44 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False
stdout: 04/21/2023 23:15:44 - INFO - __main__ - Training/evaluation parameters Seq2SeqTrainingArguments(
stdout: _n_gpu=1,
stdout: adafactor=False,
stdout: adam_beta1=0.9,
stdout: adam_beta2=0.999,
stdout: adam_epsilon=1e-08,
stdout: auto_find_batch_size=False,
stdout: bf16=True,
stdout: bf16_full_eval=False,
stdout: data_seed=None,
stdout: dataloader_drop_last=False,
stdout: dataloader_num_workers=0,
stdout: dataloader_pin_memory=True,
stdout: ddp_bucket_cap_mb=None,
stdout: ddp_find_unused_parameters=None,
stdout: ddp_timeout=1800,
stdout: debug=[],
stdout: deepspeed=/home/sourab/transformers/tests/deepspeed/ds_config_zero3.json,
stdout: disable_tqdm=False,
stdout: do_eval=True,
stdout: do_predict=False,
stdout: do_train=True,
stdout: eval_accumulation_steps=None,
stdout: eval_delay=0,
stdout: eval_steps=10,
stdout: evaluation_strategy=no,
stdout: fp16=False,
stdout: fp16_backend=auto,
stdout: fp16_full_eval=False,
stdout: fp16_opt_level=O1,
stdout: fsdp=[],
stdout: fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
stdout: fsdp_min_num_params=0,
stdout: fsdp_transformer_layer_cls_to_wrap=None,
stdout: full_determinism=False,
stdout: generation_config=None,
stdout: generation_max_length=None,
stdout: generation_num_beams=None,
stdout: gradient_accumulation_steps=1,
stdout: gradient_checkpointing=False,
stdout: greater_is_better=None,
stdout: group_by_length=True,
stdout: half_precision_backend=auto,
stdout: hub_model_id=None,
stdout: hub_private_repo=False,
stdout: hub_strategy=every_save,
stdout: hub_token=<HUB_TOKEN>,
stdout: ignore_data_skip=False,
stdout: include_inputs_for_metrics=False,
stdout: jit_mode_eval=False,
stdout: label_names=None,
stdout: label_smoothing_factor=0.1,
stdout: learning_rate=0.003,
stdout: length_column_name=length,
stdout: load_best_model_at_end=False,
stdout: local_rank=0,
stdout: log_level=passive,
stdout: log_level_replica=warning,
stdout: log_on_each_node=True,
stdout: logging_dir=/tmp/tmpfc7yfgub/runs/Apr21_23-15-43_hf-dgx-01,
stdout: logging_first_step=False,
stdout: logging_nan_inf_filter=True,
stdout: logging_steps=500,
stdout: logging_strategy=steps,
stdout: lr_scheduler_type=linear,
stdout: max_grad_norm=1.0,
stdout: max_steps=-1,
stdout: metric_for_best_model=None,
stdout: mp_parameters=,
stdout: no_cuda=False,
stdout: num_train_epochs=1.0,
stdout: optim=adamw_hf,
stdout: optim_args=None,
stdout: output_dir=/tmp/tmpfc7yfgub,
stdout: overwrite_output_dir=True,
stdout: past_index=-1,
stdout: per_device_eval_batch_size=2,
stdout: per_device_train_batch_size=2,
stdout: predict_with_generate=True,
stdout: prediction_loss_only=False,
stdout: push_to_hub=False,
stdout: push_to_hub_model_id=None,
stdout: push_to_hub_organization=None,
stdout: push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
stdout: ray_scope=last,
stdout: remove_unused_columns=True,
stdout: report_to=[],
stdout: resume_from_checkpoint=None,
stdout: run_name=/tmp/tmpfc7yfgub,
stdout: save_on_each_node=False,
stdout: save_safetensors=False,
stdout: save_steps=0,
stdout: save_strategy=steps,
stdout: save_total_limit=None,
stdout: seed=42,
stdout: sharded_ddp=[],
stdout: skip_memory_metrics=True,
stdout: sortish_sampler=False,
stdout: tf32=None,
stdout: torch_compile=False,
stdout: torch_compile_backend=None,
stdout: torch_compile_mode=None,
stdout: torchdynamo=None,
stdout: tpu_metrics_debug=False,
stdout: tpu_num_cores=None,
stdout: use_ipex=False,
stdout: use_legacy_prediction_loop=False,
stdout: use_mps_device=False,
stdout: warmup_ratio=0.0,
stdout: warmup_steps=8,
stdout: weight_decay=0.0,
stdout: xpu_backend=None,
stdout: )
stdout: 04/21/2023 23:15:44 - WARNING - datasets.builder - Using custom data configuration default-9598b3d69cbcd432
stdout: 04/21/2023 23:15:44 - WARNING - datasets.builder - Found cached dataset json (/home/sourab/.cache/huggingface/datasets/json/default-9598b3d69cbcd432/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
100%|██████████| 2/2 [00:00<00:00, 1229.82it/s]
stdout: 04/21/2023 23:15:44 - WARNING - datasets.builder - Using custom data configuration default-9598b3d69cbcd432
stdout: 04/21/2023 23:15:44 - INFO - datasets.info - Loading Dataset Infos from /home/sourab/miniconda3/envs/ml/lib/python3.10/site-packages/datasets/packaged_modules/json
stdout: 04/21/2023 23:15:44 - INFO - datasets.builder - Overwrite dataset info from restored data version.
stdout: 04/21/2023 23:15:44 - INFO - datasets.info - Loading Dataset info from /home/sourab/.cache/huggingface/datasets/json/default-9598b3d69cbcd432/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51
stdout: 04/21/2023 23:15:44 - WARNING - datasets.builder - Found cached dataset json (/home/sourab/.cache/huggingface/datasets/json/default-9598b3d69cbcd432/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
stdout: 04/21/2023 23:15:44 - INFO - datasets.info - Loading Dataset info from /home/sourab/.cache/huggingface/datasets/json/default-9598b3d69cbcd432/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51
100%|██████████| 2/2 [00:00<00:00, 1200.60it/s]
stderr: [INFO|configuration_utils.py:669] 2023-04-21 23:15:45,037 >> loading configuration file config.json from cache at /home/sourab/.cache/huggingface/hub/models--t5-small/snapshots/5bf53e1f76b1430d9302d735c613c5f5677e32a6/config.json
stderr: [INFO|configuration_utils.py:725] 2023-04-21 23:15:45,040 >> Model config T5Config {
stderr:   "_name_or_path": "t5-small",
stderr:   "architectures": [
stderr:     "T5ForConditionalGeneration"
stderr:   ],
stderr:   "d_ff": 2048,
stderr:   "d_kv": 64,
stderr:   "d_model": 512,
stderr:   "decoder_start_token_id": 0,
stderr:   "dense_act_fn": "relu",
stderr:   "dropout_rate": 0.1,
stderr:   "eos_token_id": 1,
stderr:   "feed_forward_proj": "relu",
stderr:   "initializer_factor": 1.0,
stderr:   "is_encoder_decoder": true,
stderr:   "is_gated_act": false,
stderr:   "layer_norm_epsilon": 1e-06,
stderr:   "model_type": "t5",
stderr:   "n_positions": 512,
stderr:   "num_decoder_layers": 6,
stderr:   "num_heads": 8,
stderr:   "num_layers": 6,
stderr:   "output_past": true,
stderr:   "pad_token_id": 0,
stderr:   "relative_attention_max_distance": 128,
stderr:   "relative_attention_num_buckets": 32,
stderr:   "task_specific_params": {
stderr:     "summarization": {
stderr:       "early_stopping": true,
stderr:       "length_penalty": 2.0,
stderr:       "max_length": 200,
stderr:       "min_length": 30,
stderr:       "no_repeat_ngram_size": 3,
stderr:       "num_beams": 4,
stderr:       "prefix": "summarize: "
stderr:     },
stderr:     "translation_en_to_de": {
stderr:       "early_stopping": true,
stderr:       "max_length": 300,
stderr:       "num_beams": 4,
stderr:       "prefix": "translate English to German: "
stderr:     },
stderr:     "translation_en_to_fr": {
stderr:       "early_stopping": true,
stderr:       "max_length": 300,
stderr:       "num_beams": 4,
stderr:       "prefix": "translate English to French: "
stderr:     },
stderr:     "translation_en_to_ro": {
stderr:       "early_stopping": true,
stderr:       "max_length": 300,
stderr:       "num_beams": 4,
stderr:       "prefix": "translate English to Romanian: "
stderr:     }
stderr:   },
stderr:   "transformers_version": "4.29.0.dev0",
stderr:   "use_cache": true,
stderr:   "vocab_size": 32128
stderr: }
stderr: 
stderr: [INFO|tokenization_auto.py:502] 2023-04-21 23:15:45,160 >> Could not locate the tokenizer configuration file, will try to use the model config instead.
stderr: [INFO|configuration_utils.py:669] 2023-04-21 23:15:45,272 >> loading configuration file config.json from cache at /home/sourab/.cache/huggingface/hub/models--t5-small/snapshots/5bf53e1f76b1430d9302d735c613c5f5677e32a6/config.json
stderr: [INFO|configuration_utils.py:725] 2023-04-21 23:15:45,273 >> Model config T5Config {
stderr:   "_name_or_path": "t5-small",
stderr:   "architectures": [
stderr:     "T5ForConditionalGeneration"
stderr:   ],
stderr:   "d_ff": 2048,
stderr:   "d_kv": 64,
stderr:   "d_model": 512,
stderr:   "decoder_start_token_id": 0,
stderr:   "dense_act_fn": "relu",
stderr:   "dropout_rate": 0.1,
stderr:   "eos_token_id": 1,
stderr:   "feed_forward_proj": "relu",
stderr:   "initializer_factor": 1.0,
stderr:   "is_encoder_decoder": true,
stderr:   "is_gated_act": false,
stderr:   "layer_norm_epsilon": 1e-06,
stderr:   "model_type": "t5",
stderr:   "n_positions": 512,
stderr:   "num_decoder_layers": 6,
stderr:   "num_heads": 8,
stderr:   "num_layers": 6,
stderr:   "output_past": true,
stderr:   "pad_token_id": 0,
stderr:   "relative_attention_max_distance": 128,
stderr:   "relative_attention_num_buckets": 32,
stderr:   "task_specific_params": {
stderr:     "summarization": {
stderr:       "early_stopping": true,
stderr:       "length_penalty": 2.0,
stderr:       "max_length": 200,
stderr:       "min_length": 30,
stderr:       "no_repeat_ngram_size": 3,
stderr:       "num_beams": 4,
stderr:       "prefix": "summarize: "
stderr:     },
stderr:     "translation_en_to_de": {
stderr:       "early_stopping": true,
stderr:       "max_length": 300,
stderr:       "num_beams": 4,
stderr:       "prefix": "translate English to German: "
stderr:     },
stderr:     "translation_en_to_fr": {
stderr:       "early_stopping": true,
stderr:       "max_length": 300,
stderr:       "num_beams": 4,
stderr:       "prefix": "translate English to French: "
stderr:     },
stderr:     "translation_en_to_ro": {
stderr:       "early_stopping": true,
stderr:       "max_length": 300,
stderr:       "num_beams": 4,
stderr:       "prefix": "translate English to Romanian: "
stderr:     }
stderr:   },
stderr:   "transformers_version": "4.29.0.dev0",
stderr:   "use_cache": true,
stderr:   "vocab_size": 32128
stderr: }
stderr: 
stderr: /home/sourab/transformers/src/transformers/models/t5/tokenization_t5_fast.py:155: FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5.
stderr: For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
stderr: - Be aware that you SHOULD NOT rely on t5-small automatically truncating your input to 512 when padding/encoding.
stderr: - If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
stderr: - To avoid this warning, please instantiate this tokenizer with `model_max_length` set to your preferred value.
stderr:   warnings.warn(
stderr: /home/sourab/transformers/src/transformers/modeling_utils.py:429: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
stderr:   with safe_open(checkpoint_file, framework="pt") as f:
stderr: /home/sourab/miniconda3/envs/ml/lib/python3.10/site-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
stderr:   return self.fget.__get__(instance, owner)()
stderr: /home/sourab/miniconda3/envs/ml/lib/python3.10/site-packages/torch/storage.py:899: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
stderr:   storage = cls(wrap_storage=untyped_storage)
stderr: /home/sourab/miniconda3/envs/ml/lib/python3.10/site-packages/safetensors/torch.py:99: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
stderr:   with safe_open(filename, framework="pt", device=device) as f:
stderr: [INFO|tokenization_utils_base.py:1810] 2023-04-21 23:15:45,539 >> loading file spiece.model from cache at /home/sourab/.cache/huggingface/hub/models--t5-small/snapshots/5bf53e1f76b1430d9302d735c613c5f5677e32a6/spiece.model
stderr: [INFO|tokenization_utils_base.py:1810] 2023-04-21 23:15:45,539 >> loading file tokenizer.json from cache at /home/sourab/.cache/huggingface/hub/models--t5-small/snapshots/5bf53e1f76b1430d9302d735c613c5f5677e32a6/tokenizer.json
stderr: [INFO|tokenization_utils_base.py:1810] 2023-04-21 23:15:45,539 >> loading file added_tokens.json from cache at None
stderr: [INFO|tokenization_utils_base.py:1810] 2023-04-21 23:15:45,539 >> loading file special_tokens_map.json from cache at None
stderr: [INFO|tokenization_utils_base.py:1810] 2023-04-21 23:15:45,539 >> loading file tokenizer_config.json from cache at None
stderr: [INFO|configuration_utils.py:669] 2023-04-21 23:15:45,539 >> loading configuration file config.json from cache at /home/sourab/.cache/huggingface/hub/models--t5-small/snapshots/5bf53e1f76b1430d9302d735c613c5f5677e32a6/config.json
stderr: [INFO|configuration_utils.py:725] 2023-04-21 23:15:45,540 >> Model config T5Config {
stderr:   "_name_or_path": "t5-small",
stderr:   "architectures": [
stderr:     "T5ForConditionalGeneration"
stderr:   ],
stderr:   "d_ff": 2048,
stderr:   "d_kv": 64,
stderr:   "d_model": 512,
stderr:   "decoder_start_token_id": 0,
stderr:   "dense_act_fn": "relu",
stderr:   "dropout_rate": 0.1,
stderr:   "eos_token_id": 1,
stderr:   "feed_forward_proj": "relu",
stderr:   "initializer_factor": 1.0,
stderr:   "is_encoder_decoder": true,
stderr:   "is_gated_act": false,
stderr:   "layer_norm_epsilon": 1e-06,
stderr:   "model_type": "t5",
stderr:   "n_positions": 512,
stderr:   "num_decoder_layers": 6,
stderr:   "num_heads": 8,
stderr:   "num_layers": 6,
stderr:   "output_past": true,
stderr:   "pad_token_id": 0,
stderr:   "relative_attention_max_distance": 128,
stderr:   "relative_attention_num_buckets": 32,
stderr:   "task_specific_params": {
stderr:     "summarization": {
stderr:       "early_stopping": true,
stderr:       "length_penalty": 2.0,
stderr:       "max_length": 200,
stderr:       "min_length": 30,
stderr:       "no_repeat_ngram_size": 3,
stderr:       "num_beams": 4,
stderr:       "prefix": "summarize: "
stderr:     },
stderr:     "translation_en_to_de": {
stderr:       "early_stopping": true,
stderr:       "max_length": 300,
stderr:       "num_beams": 4,
stderr:       "prefix": "translate English to German: "
stderr:     },
stderr:     "translation_en_to_fr": {
stderr:       "early_stopping": true,
stderr:       "max_length": 300,
stderr:       "num_beams": 4,
stderr:       "prefix": "translate English to French: "
stderr:     },
stderr:     "translation_en_to_ro": {
stderr:       "early_stopping": true,
stderr:       "max_length": 300,
stderr:       "num_beams": 4,
stderr:       "prefix": "translate English to Romanian: "
stderr:     }
stderr:   },
stderr:   "transformers_version": "4.29.0.dev0",
stderr:   "use_cache": true,
stderr:   "vocab_size": 32128
stderr: }
stderr: 
stderr: /home/sourab/transformers/src/transformers/models/t5/tokenization_t5_fast.py:155: FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5.
stderr: For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
stderr: - Be aware that you SHOULD NOT rely on t5-small automatically truncating your input to 512 when padding/encoding.
stderr: - If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
stderr: - To avoid this warning, please instantiate this tokenizer with `model_max_length` set to your preferred value.
stderr:   warnings.warn(
stderr: [INFO|modeling_t5.py:268] 2023-04-21 23:15:45,583 >> Discovered apex.normalization.FusedRMSNorm - will use it instead of T5LayerNorm
stderr: [INFO|modeling_utils.py:2534] 2023-04-21 23:15:45,585 >> loading weights file model.safetensors from cache at /home/sourab/.cache/huggingface/hub/models--t5-small/snapshots/5bf53e1f76b1430d9302d735c613c5f5677e32a6/model.safetensors
stderr: /home/sourab/transformers/src/transformers/modeling_utils.py:429: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
stderr:   with safe_open(checkpoint_file, framework="pt") as f:
stderr: /home/sourab/miniconda3/envs/ml/lib/python3.10/site-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
stderr:   return self.fget.__get__(instance, owner)()
stderr: /home/sourab/miniconda3/envs/ml/lib/python3.10/site-packages/torch/storage.py:899: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
stderr:   storage = cls(wrap_storage=untyped_storage)
stderr: /home/sourab/miniconda3/envs/ml/lib/python3.10/site-packages/safetensors/torch.py:99: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
stderr:   with safe_open(filename, framework="pt", device=device) as f:
stderr: [INFO|modeling_utils.py:2623] 2023-04-21 23:15:45,592 >> Detected DeepSpeed ZeRO-3: activating zero.init() for this model
stderr: ╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
stderr: │ /home/sourab/transformers/examples/pytorch/translation/run_translation.py:666 in <module>        │
stderr: │                                                                                                  │
stderr: │   663                                                                                            │
stderr: │   664                                                                                            │
stderr: │   665 if __name__ == "__main__":                                                                 │
stderr: │ ❱ 666 │   main()                                                                                 │
stderr: │   667                                                                                            │
stderr: │                                                                                                  │
stderr: │ /home/sourab/transformers/examples/pytorch/translation/run_translation.py:378 in main            │
stderr: │                                                                                                  │
stderr: │   375 │   │   revision=model_args.model_revision,                                                │
stderr: │   376 │   │   use_auth_token=True if model_args.use_auth_token else None,                        │
stderr: │   377 │   )                                                                                      │
stderr: │ ❱ 378 │   model = AutoModelForSeq2SeqLM.from_pretrained(                                         │
stderr: │   379 │   │   model_args.model_name_or_path,                                                     │
stderr: │   380 │   │   from_tf=bool(".ckpt" in model_args.model_name_or_path),                            │
stderr: │   381 │   │   config=config,                                                                     │
stderr: │                                                                                                  │
stderr: │ /home/sourab/transformers/src/transformers/models/auto/auto_factory.py:468 in from_pretrained    │
stderr: │                                                                                                  │
stderr: │   465 │   │   │   )                                                                              │
stderr: │   466 │   │   elif type(config) in cls._model_mapping.keys():                                    │
stderr: │   467 │   │   │   model_class = _get_model_class(config, cls._model_mapping)                     │
stderr: │ ❱ 468 │   │   │   return model_class.from_pretrained(                                            │
stderr: │   469 │   │   │   │   pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs,   │
stderr: │   470 │   │   │   )                                                                              │
stderr: │   471 │   │   raise ValueError(                                                                  │
stderr: │                                                                                                  │
stderr: │ /home/sourab/transformers/src/transformers/modeling_utils.py:2624 in from_pretrained             │
stderr: │                                                                                                  │
stderr: │   2621 │   │   │   import deepspeed                                                              │
stderr: │   2622 │   │   │                                                                                 │
stderr: │   2623 │   │   │   logger.info("Detected DeepSpeed ZeRO-3: activating zero.init() for this mode  │
stderr: │ ❱ 2624 │   │   │   init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config())  │
stderr: │   2625 │   │   elif load_in_8bit or low_cpu_mem_usage:                                           │
stderr: │   2626 │   │   │   init_contexts.append(init_empty_weights())                                    │
stderr: │   2627                                                                                           │
stderr: │                                                                                                  │
stderr: │ /home/sourab/miniconda3/envs/ml/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_pa │
stderr: │ rameters.py:722 in __init__                                                                      │
stderr: │                                                                                                  │
stderr: │    719 │   │   │   config_dict_or_path = config                                                  │
stderr: │    720 │   │   │   logger.warning(                                                               │
stderr: │    721 │   │   │   │   f'zero.Init: the `config` argument is deprecated. Please use `config_dic  │
stderr: │ ❱  722 │   │   _ds_config = deepspeed.runtime.config.DeepSpeedConfig(config_dict_or_path,        │
stderr: │    723 │   │   │   │   │   │   │   │   │   │   │   │   │   │   │     mpu) if config_dict_or_pat  │
stderr: │    724 │   │   if _ds_config is not None:                                                        │
stderr: │    725 │   │   │   mem_efficient_linear = _ds_config.zero_config.memory_efficient_linear         │
stderr: │                                                                                                  │
stderr: │ /home/sourab/miniconda3/envs/ml/lib/python3.10/site-packages/deepspeed/runtime/config.py:764 in  │
stderr: │ __init__                                                                                         │
stderr: │                                                                                                  │
stderr: │   761 │   │                                                                                      │
stderr: │   762 │   │   # Pass a copy so that user json is unmodified, e.g. for logging                    │
stderr: │   763 │   │   self._initialize_params(copy.copy(self._param_dict))                               │
stderr: │ ❱ 764 │   │   self._configure_train_batch_size()                                                 │
stderr: │   765 │   │   self._do_sanity_check()                                                            │
stderr: │   766 │                                                                                          │
stderr: │   767 │   def _initialize_params(self, param_dict):                                              │
stderr: │                                                                                                  │
stderr: │ /home/sourab/miniconda3/envs/ml/lib/python3.10/site-packages/deepspeed/runtime/config.py:935 in  │
stderr: │ _configure_train_batch_size                                                                      │
stderr: │                                                                                                  │
stderr: │   932 │                                                                                          │
stderr: │   933 │   def _configure_train_batch_size(self):                                                 │
stderr: │   934 │   │   self._set_batch_related_parameters()                                               │
stderr: │ ❱ 935 │   │   self._batch_assertion()                                                            │
stderr: │   936 │                                                                                          │
stderr: │   937 │   def _do_sanity_check(self):                                                            │
stderr: │   938 │   │   self._do_error_check()                                                             │
stderr: │                                                                                                  │
stderr: │ /home/sourab/miniconda3/envs/ml/lib/python3.10/site-packages/deepspeed/runtime/config.py:883 in  │
stderr: │ _batch_assertion                                                                                 │
stderr: │                                                                                                  │
stderr: │   880 │   │                                                                                      │
stderr: │   881 │   │   assert (grad_acc > 0), f"Gradient accumulation steps: {grad_acc} has to be great   │
stderr: │   882 │   │                                                                                      │
stderr: │ ❱ 883 │   │   assert train_batch == micro_batch * grad_acc * self.world_size, (                  │
stderr: │   884 │   │   │   f"Check batch related parameters. train_batch_size is not equal "              │
stderr: │   885 │   │   │   "to micro_batch_per_gpu * gradient_acc_step * world_size "                     │
stderr: │   886 │   │   │   f"{train_batch} != {micro_batch} * {grad_acc} * {self.world_size}")            │
stderr: ╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
stderr: AssertionError: Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu
stderr: * gradient_acc_step * world_size 4 != 2 * 1 * 1
mrwyattii commented 1 year ago

@pacman100 / @muellerzr do either of you know when this test was last passing? I thought this might be related to changes we made in v0.9.0, but that does not seem to be the case. The test is failing with deepspeed==0.8.3 as well:

deepspeed                               0.8.3
torch                                   2.0.0.dev20230301+cu118
transformers                            4.29.0.dev0
pacman100 commented 1 year ago

@mrwyattii, I don't know, but looking at git blame I could see that this was an issue with earlier DeepSpeed versions too. Since the fix will land in the latest version, I just reported it against that.

mrwyattii commented 1 year ago

Got it, and in the branch that @muellerzr has created (muellerzr-bring-deepspeed-back) - the call to deepspeed.init_distributed is removed. Why is that done?

muellerzr commented 1 year ago

@mrwyattii this is because we're integrating with Accelerate to handle all the distributed code in Trainer. You can see the code we use to set everything up on the Accelerate side here: https://github.com/huggingface/accelerate/blob/main/src/accelerate/state.py#L112-L129 (That env var was an oversight, apologies!)

@pacman100 correct me if I'm wrong here, but with the accelerate integration if we're starting from python etc like we have, we need to use ACCELERATE_USE_DEEPSPEED="true" when launching the test, no? (To note, iirc when I did this it still failed, rerunning now)

muellerzr commented 1 year ago

Doing so will make tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_basic_distributed_zero3_fp16 fail, with the same error as stated. Please try running with: CUDA_VISIBLE_DEVICES="0,1" RUN_SLOW="yes" ACCELERATE_USE_DEEPSPEED="yes" pytest -sv tests/deepspeed/test_deepspeed.py -k test_basic_distributed to replicate

pacman100 commented 1 year ago

@muellerzr, yes, but in this case that is not the cause. deepspeed.launcher.launch sets up the distributed run by creating n processes (world_size=n), but zero_init builds the DS config, which validates train_batch_size before the global dist state has been updated.
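
To make the arithmetic behind the assertion concrete, here is a minimal sketch of the check shown in the traceback. The function names are hypothetical, and the assumption that world_size falls back to 1 when the config is built before dist is initialized is inferred from the "4 != 2 * 1 * 1" message, not taken from the DeepSpeed source.

```python
def check_batch_params(train_batch, micro_batch, grad_acc, world_size):
    """Hypothetical stand-in for the consistency check seen in the traceback."""
    assert train_batch == micro_batch * grad_acc * world_size, (
        "Check batch related parameters. train_batch_size is not equal "
        "to micro_batch_per_gpu * gradient_acc_step * world_size "
        f"{train_batch} != {micro_batch} * {grad_acc} * {world_size}"
    )


def check_passes(train_batch, micro_batch, grad_acc, world_size):
    """Return True if the check holds, False if it raises AssertionError."""
    try:
        check_batch_params(train_batch, micro_batch, grad_acc, world_size)
    except AssertionError:
        return False
    return True


# Config written for 2 GPUs: 2 samples per GPU * 1 accumulation step * 2 ranks = 4.
# Assumption: before dist init the config sees world_size == 1.
before_dist_init = check_passes(4, 2, 1, world_size=1)
# Once dist is initialized, the real world size (2) is visible.
after_dist_init = check_passes(4, 2, 1, world_size=2)
```

With these values the same config fails the check when the world size is seen as 1 and passes once it is seen as 2, which matches the ordering problem described above.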

mrwyattii commented 1 year ago

> Doing so will make tests/deepspeed/test_deepspeed.py::TestDeepSpeedWithLauncher::test_basic_distributed_zero3_fp16 fail, with the same error as stated. Please try running with: CUDA_VISIBLE_DEVICES="0,1" RUN_SLOW="yes" ACCELERATE_USE_DEEPSPEED="yes" pytest -sv tests/deepspeed/test_deepspeed.py -k test_basic_distributed to replicate

With ACCELERATE_USE_DEEPSPEED="yes" the test fails with the same batch size error, but I think that's because dist is never initialized. Setting ACCELERATE_USE_DEEPSPEED="true" will cause the DeepSpeed initialization to happen: https://github.com/huggingface/accelerate/blob/565152183334f709ac955204ef663023d1f63b7a/src/accelerate/state.py#L112

In this case, the error I see is with the torch.distributed.init_process_group call on line 121:

../venv/lib/python3.8/site-packages/accelerate/state.py:122: in __init__
    torch.distributed.init_process_group(backend=self.backend, **kwargs)
../../../.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:899: in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
../../../.local/lib/python3.8/site-packages/torch/distributed/rendezvous.py:235: in _env_rendezvous_handler
    rank = int(_get_env_or_raise("RANK"))
../../../.local/lib/python3.8/site-packages/torch/distributed/rendezvous.py:220: in _get_env_or_raise
    raise _env_error(env_var)
E   ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set

It seems the problem lies there. I don't think we have ever initialized dist from deepspeed.zero.Init - so I don't think this is a change on our side that caused this error.
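
For anyone hitting the same ValueError: PyTorch's env:// rendezvous reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE from the environment, and a plain pytest invocation sets none of them. A small sketch of a pre-flight check follows; the helper name and the example environments are made up for illustration and are not part of Accelerate or DeepSpeed.

```python
# Variables that torch.distributed's env:// init method expects to find.
REQUIRED_ENV_VARS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")


def missing_rendezvous_vars(env):
    """Return the required rendezvous variables that are absent from env."""
    return [name for name in REQUIRED_ENV_VARS if name not in env]


# Hypothetical environment of a process started directly by pytest.
launched_by_pytest = {"CUDA_VISIBLE_DEVICES": "0,1", "RUN_SLOW": "yes"}

# Hypothetical environment of a process started by a distributed launcher.
launched_by_launcher = {
    "MASTER_ADDR": "127.0.0.1",
    "MASTER_PORT": "29500",
    "RANK": "0",
    "WORLD_SIZE": "2",
}

missing_under_pytest = missing_rendezvous_vars(launched_by_pytest)
missing_under_launcher = missing_rendezvous_vars(launched_by_launcher)
```

In a real script one could call missing_rendezvous_vars(os.environ) before init_process_group to get a clearer message than the rendezvous error.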

pacman100 commented 1 year ago

@mrwyattii, how so? It is failing with the previous version of transformers too, as I posted above.

mrwyattii commented 1 year ago

@pacman100 The tests pass for me with transformers==4.28.1 and on latest main. It looks like the breaking change was this PR where deepspeed.init_distributed was removed: https://github.com/huggingface/transformers/pull/22752/files#diff-bfceaff300c851b8e24fc50dc6638482abaec8f7d2a718e877c3828c166bcf79L1626

And then that change was reverted here: https://github.com/huggingface/transformers/pull/22899/files#diff-bfceaff300c851b8e24fc50dc6638482abaec8f7d2a718e877c3828c166bcf79R1554

pacman100 commented 1 year ago

Hello @mrwyattii, thank you for the pointers 😄. In my env, even though the pip list was showing transformers==4.28.1, it was actually using Zach's branch, which is weird. I can confirm that this isn't an issue with DeepSpeed and this PR of Accelerate https://github.com/huggingface/accelerate/pull/1352 should fix the issues with the trainer and DeepSpeed.

pacman100 commented 1 year ago

we can close this issue