huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

IndexError: pop from an empty deque #2304

Closed: c3ianwu closed this issue 1 week ago

c3ianwu commented 1 week ago

System Info

Code based on a forked version of trl.

Package Version


accelerate 0.33.0 aiohappyeyeballs 2.4.3 aiohttp 3.10.10 aiosignal 1.3.1 annotated-types 0.7.0 anyio 4.6.2.post1 asttokens 2.0.5 async-timeout 4.0.3 attrs 24.2.0 bitsandbytes 0.44.1 certifi 2024.8.30 charset-normalizer 3.4.0 click 8.1.7 cloudpickle 3.1.0 comm 0.2.1 compressed-tensors 0.6.0 contourpy 1.3.0 cycler 0.12.1 datasets 3.0.2 debugpy 1.6.7 decorator 5.1.1 deepspeed 0.15.3 dill 0.3.8 diskcache 5.6.3 distro 1.9.0 docker-pycreds 0.4.0 docstring_parser 0.16 einops 0.8.0 exceptiongroup 1.2.0 executing 0.8.3 fastapi 0.115.4 filelock 3.16.1 fonttools 4.54.1 frozenlist 1.5.0 fsspec 2024.9.0 gguf 0.10.0 gitdb 4.0.11 GitPython 3.1.43 h11 0.14.0 hjson 3.1.0 httpcore 1.0.6 httptools 0.6.4 httpx 0.27.2 huggingface-hub 0.26.1 idna 3.10 importlib_metadata 8.5.0 interegular 0.3.3 ipykernel 6.29.5 ipython 8.27.0 jedi 0.19.1 Jinja2 3.1.4 jiter 0.6.1 jsonschema 4.23.0 jsonschema-specifications 2024.10.1 jupyter_client 8.6.0 jupyter_core 5.7.2 kiwisolver 1.4.7 lark 1.2.2 llvmlite 0.43.0 lm-format-enforcer 0.10.6 markdown-it-py 3.0.0 MarkupSafe 3.0.2 matplotlib 3.9.2 matplotlib-inline 0.1.6 mdurl 0.1.2 mistral_common 1.4.4 mpmath 1.3.0 msgpack 1.1.0 msgspec 0.18.6 multidict 6.1.0 multiprocess 0.70.16 nest-asyncio 1.6.0 networkx 3.4.2 ninja 1.11.1.1 numba 0.60.0 numpy 1.26.4 nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-cupti-cu12 12.1.105 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu12 9.1.0.70 nvidia-cufft-cu12 11.0.2.54 nvidia-curand-cu12 10.3.2.106 nvidia-cusolver-cu12 11.4.5.107 nvidia-cusparse-cu12 12.1.0.106 nvidia-ml-py 12.560.30 nvidia-nccl-cu12 2.20.5 nvidia-nvjitlink-cu12 12.4.127 nvidia-nvtx-cu12 12.1.105 openai 1.52.2 opencv-python-headless 4.10.0.84 outlines 0.0.46 packaging 24.1 pandas 2.2.3 parso 0.8.3 partial-json-parser 0.2.1.1.post4 peft 0.13.2 pexpect 4.8.0 pillow 10.4.0 pip 24.2 platformdirs 3.10.0 prometheus_client 0.21.0 prometheus-fastapi-instrumentator 7.0.0 prompt-toolkit 3.0.43 propcache 0.2.0 protobuf 5.28.3 psutil 5.9.0 
ptyprocess 0.7.0 pure-eval 0.2.2 py-cpuinfo 9.0.0 pyairports 2.1.1 pyarrow 17.0.0 pycountry 24.6.1 pydantic 2.9.2 pydantic_core 2.23.4 Pygments 2.15.1 pyparsing 3.2.0 python-dateutil 2.9.0.post0 python-dotenv 1.0.1 pytz 2024.2 PyYAML 6.0.2 pyzmq 25.1.2 ray 2.38.0 referencing 0.35.1 regex 2024.9.11 requests 2.32.3 rich 13.9.3 rpds-py 0.20.0 safetensors 0.4.5 sentencepiece 0.2.0 sentry-sdk 2.17.0 setproctitle 1.3.3 setuptools 75.1.0 shtab 1.7.1 six 1.16.0 smmap 5.0.1 sniffio 1.3.1 stack-data 0.2.0 starlette 0.41.2 sympy 1.13.1 tiktoken 0.7.0 tokenizers 0.20.1 torch 2.4.0 torchvision 0.19.0 tornado 6.4.1 tqdm 4.66.5 traitlets 5.14.3 transformers 4.46.0 triton 3.0.0 typing_extensions 4.11.0 tyro 0.8.14 tzdata 2024.2 urllib3 2.2.3 uvicorn 0.32.0 uvloop 0.21.0 vllm 0.6.3.post1 wandb 0.18.5 watchfiles 0.24.0 wcwidth 0.2.5 websockets 13.1 wheel 0.44.0 xformers 0.0.27.post2 xxhash 3.5.0 yarl 1.16.0 zipp 3.20.2

Reproduction

Running into this error when using PPOv2.

[rank6]:     accelerator.backward(loss)
[rank6]:   File "/root/miniconda3/envs/trl_custom/lib/python3.10/site-packages/accelerate/accelerator.py", line 2151, in backward
[rank6]:     self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank6]:   File "/root/miniconda3/envs/trl_custom/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 166, in backward
[rank6]:     self.engine.backward(loss, **kwargs)
[rank6]:   File "/root/miniconda3/envs/trl_custom/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank6]:     ret_val = func(*args, **kwargs)
[rank6]:   File "/root/miniconda3/envs/trl_custom/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2020, in backward
[rank6]:     self.optimizer.backward(loss, retain_graph=retain_graph)
[rank6]:   File "/root/miniconda3/envs/trl_custom/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank6]:     ret_val = func(*args, **kwargs)
[rank6]:   File "/root/miniconda3/envs/trl_custom/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2252, in backward
[rank6]:     self._get_param_coordinator(training=True).reset_step()
[rank6]:   File "/root/miniconda3/envs/trl_custom/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 222, in reset_step
[rank6]:     self.construct_parameter_trace_from_module_trace()
[rank6]:   File "/root/miniconda3/envs/trl_custom/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 210, in construct_parameter_trace_from_module_trace
[rank6]:     self.record_parameters(sub_module)
[rank6]:   File "/root/miniconda3/envs/trl_custom/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 202, in record_parameters
[rank6]:     step_id = self.__step_id_module_fetched_for[sub_module.id].popleft()
[rank6]: IndexError: pop from an empty deque
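For context, the final frame of the traceback is just Python's own collections.deque: calling popleft() on an empty deque always raises exactly this IndexError, so the real question is why DeepSpeed's per-module step-id deque ended up empty. A minimal demonstration of the exception itself:

```python
from collections import deque

d = deque()
caught = None
try:
    d.popleft()  # nothing was ever appended to this deque
except IndexError as e:
    caught = str(e)
print(caught)  # pop from an empty deque
```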

As this is based on my own modified version of trl, I realise you might not be able to help much. However, it does seem like this error has been encountered by other users before.

To help me debug, can someone tell me why this error occurs, and in what situations it typically arises?

I was able to run PPOv2 until my most recent changes, and I'm trying to figure out which change is causing this. Any help would be greatly appreciated!

Expected behavior

No error.

c3ianwu commented 1 week ago

OK, I think I have found the issue.

In my code I was entering with unwrap_model_for_generation(model, accelerator) as unwrapped_model multiple times in different places, opening and closing the context manager each time, e.g.

with unwrap_model_for_generation(model, accelerator) as unwrapped_model:
    # generate A

# do some stuff

with unwrap_model_for_generation(model, accelerator) as unwrapped_model:
    # generate B

# do more stuff

The problem seems to go away if the context manager is entered only once:

with unwrap_model_for_generation(model, accelerator) as unwrapped_model:
    # generate A
    # do some stuff
    # generate B
    # do more stuff
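
A rough mental model of why re-entering the context manager can trigger this (a hypothetical simplification for illustration, not DeepSpeed's actual code): ZeRO stage 3 records, per module, a deque of step ids during the forward trace, and replays them during backward. If unwrapping and re-wrapping the model mid-step desynchronises that trace, a replay can find its deque already drained:

```python
from collections import deque


class ToyCoordinator:
    """Hypothetical simplification of a ZeRO-3-style parameter coordinator."""

    def __init__(self):
        # module id -> deque of step ids recorded during the forward trace
        self.fetched = {}

    def record_fetch(self, module_id, step_id):
        self.fetched.setdefault(module_id, deque()).append(step_id)

    def replay(self, module_id):
        # Pops the step id recorded earlier; raises IndexError if the
        # trace is out of sync and nothing was recorded for this replay.
        return self.fetched[module_id].popleft()


coord = ToyCoordinator()
coord.record_fetch("layer0", step_id=0)
assert coord.replay("layer0") == 0  # one record, one replay: fine

# A second replay without a matching record (the trace got out of sync,
# e.g. after the model was unwrapped and re-wrapped mid-step) drains
# an already-empty deque:
caught = None
try:
    coord.replay("layer0")
except IndexError as e:
    caught = str(e)
print(caught)  # pop from an empty deque
```

This is only a sketch of the failure shape; the real coordinator lives in deepspeed/runtime/zero/partitioned_param_coordinator.py, as shown in the traceback above.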

qgallouedec commented 1 week ago

Code based on a forked version of trl.

As this is based on my own modified version of trl, I realise you might not be able to help much

Indeed. Unfortunately we don't have time for tech support. The most we can do is help with the original codebase.

Great that you've found the solution! Thanks for sharing it!