~I found that on a new docker container it works well as I expected, but weird things happen when running in a conda env.~
Sorry for the confusion, it behaves weirdly when using multiple GPUs.
Can you do a `pip freeze` in both for us please? :)
And the output of `accelerate env` (it tells us more than just what you entered when doing `accelerate config`! :))
Sorry for the confusion 🥲 - I only tested on a docker container running on CPU only, not multi-GPU :( If I run the test code above on the docker container with multi-GPU, it behaves weirdly as I reported...
- `nvcr.io/nvidia/pytorch:23.12-py3` image
- `apt-get update`, `apt-get upgrade`
- `pip install transformers accelerate timm`
I'm sharing my `pip freeze` results and `accelerate env` output, and I'll also share the working environment if I solve this issue...
At first I thought my machine was the problem, but when I ran the same test code on another machine with 2x 3090 GPUs, the same problem occurred (the LR scheduler behaves as if each GPU calls it).
I think I'm missing something (are there any possible bugs in my test code???)
Oh wait. Just reread this issue.
> (it behaves like all the GPUs call the learning rate scheduler - warmup 8 / 4 = 2, 32 / 4 = 8)
yes. That’s exactly how our scheduler wrapper behaves (and how you should step in multi-GPU)
I hope adding this information to the docs or somewhere may be useful for newbies like me! (I kept wondering which part of my machine was causing these errors.)
I looked at the Accelerate CV example, the Accelerate NLP example, the transformers scheduler docs, and the Accelerate tutorials, but it was hard to reach this conclusion 🥲
Accelerate works by splitting up the dataloader between all GPUs, so one epoch is faster (every GPU sees a different subset). In this same vein, we can then also step the LR scheduler by `n_gpus` steps per call, since we are doing those steps all at once.
Check out the debugging guide, which talks about this: https://huggingface.co/docs/accelerate/concept_guides/performance
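To make this concrete, here is a minimal sketch (not the original `test.py`; the dummy model, data, and loop are assumptions for illustration) showing that a scheduler passed through `accelerator.prepare` is advanced `num_processes` times per `scheduler.step()` call, and how you can scale the step counts if you want the same curve as a single-GPU run:

```python
# Minimal sketch, assuming 4 processes and a dummy model (not the original test.py).
import torch
from accelerate import Accelerator
from transformers import get_cosine_schedule_with_warmup

accelerator = Accelerator()

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# The prepared scheduler is stepped `num_processes` times per call,
# so scale the counts to reproduce the single-GPU curve (8 warmup / 32 total).
num_warmup = 8 * accelerator.num_processes
num_total = 32 * accelerator.num_processes
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=num_warmup, num_training_steps=num_total
)

model, optimizer, scheduler = accelerator.prepare(model, optimizer, scheduler)

for step in range(32):
    optimizer.zero_grad()
    loss = model(torch.randn(16, 10)).sum()
    accelerator.backward(loss)
    optimizer.step()
    scheduler.step()  # advances the wrapped scheduler num_processes times
    if accelerator.is_main_process:
        print(step, scheduler.get_last_lr())
```

Alternatively, if you want exactly one underlying scheduler step per call, you can keep the scheduler out of `accelerator.prepare` and step your own unwrapped instance; since every process calls it the same number of times, the learning rates stay in sync.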
System Info

Information

Tasks

- `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)

Reproduction
I used the test code `test.py` below on my machine with 4 A6000 GPUs. I run `NCCL_P2P_DISABLE=1 accelerate launch test.py` in the terminal. (If I do not use `NCCL_P2P_DISABLE=1`, training doesn't work, so I add it.)

Expected behavior
Since I used `warmup_steps=8, num_training_steps=32`, I expected to get a learning rate graph similar to the one above (captured from the Hugging Face Optimization docs). But when I run and track the learning rate, it does not behave as expected.
Warmup takes only 2 steps, and the cosine cycles do not work as I expected. (It behaves like all the GPUs call the learning rate scheduler - warmup 8 / 4 = 2, 32 / 4 = 8.)
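For reference, since the original `test.py` is not reproduced here, the sketch below is a hypothetical minimal reproduction under the same settings (4 processes, `warmup_steps=8`, `num_training_steps=32`, dummy model and data) that exhibits the reported behavior when the scheduler is passed through `accelerator.prepare`:

```python
# Hypothetical minimal reproduction (not the original test.py): dummy model and data,
# warmup_steps=8, num_training_steps=32, launched with `accelerate launch` on 4 GPUs.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator
from transformers import get_cosine_schedule_with_warmup

accelerator = Accelerator()

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=8, num_training_steps=32
)

# 128 samples / batch size 4 = 32 batches in total, split across the GPUs (8 per process).
dataset = TensorDataset(torch.randn(128, 10), torch.randint(0, 2, (128,)))
loader = DataLoader(dataset, batch_size=4)

model, optimizer, loader, scheduler = accelerator.prepare(
    model, optimizer, loader, scheduler
)

for step, (x, y) in enumerate(loader):
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    accelerator.backward(loss)
    optimizer.step()
    scheduler.step()
    if accelerator.is_main_process:
        # With 4 processes, the printed LR finishes warmup after 2 local steps
        # and completes the cosine schedule after 8, matching the report above.
        print(step, scheduler.get_last_lr())
```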