huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

`deepspeed zero stage 3` does not affect the speed and VRAM consumption at all #2784

Closed · JohnConnor123 closed this issue 1 month ago

JohnConnor123 commented 1 month ago

Reproduction

deepspeed zero stage 3 does not affect the speed or VRAM consumption at all: 10695 MiB is consumed with deepspeed zero stage 3, and the same amount is consumed without it.

script:

accelerate launch --config_file deepspeed_zero3.yaml \
    trl/examples/scripts/ppo/ppo.py \
    --dataset_name trl-internal-testing/descriptiveness-sentiment-trl-style \
    --dataset_train_split descriptiveness \
    --output_dir Qwen2.5-0.5B-Instruct-PPO \
    --num_ppo_epochs 1 \
    --num_mini_batches 1 \
    --learning_rate 3e-6 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --total_episodes 1000 \
    --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
    --sft_model_path Qwen/Qwen2.5-0.5B-Instruct \
    --reward_model_path Qwen/Qwen2.5-0.5B-Instruct \
    --local_rollout_forward_batch_size 1 \
    --missing_eos_penalty 1.0 \
    --use_peft \
    --lora_r 32 \
    --lora_alpha 16

config_file:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
distributed_type: 'NO'
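
As an aside (not raised in the thread): the YAML above sets distributed_type twice, first to DEEPSPEED and again to 'NO' on the last line. A minimal check of how such a file is parsed, assuming PyYAML (which accelerate uses to read --config_file); with duplicate keys, the last occurrence silently wins:

import yaml  # PyYAML

config_text = """
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 3
distributed_type: 'NO'
"""

# PyYAML does not reject duplicate mapping keys; the later value overrides the
# earlier one, so this config parses with distributed_type == 'NO'.
print(yaml.safe_load(config_text)["distributed_type"])  # -> NO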

System Info

(research-llm-long-context-py3.12) calibri@devai:~/experiments/rl_finetunning$ trl env
[2025-02-06 16:22:10,920] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)

Copy-paste the following information when reporting an issue:


Superskyyy commented 1 month ago

I believe this is mainly because the model size is not the main bottleneck, so even with DeepSpeed ZeRO-3 there isn't a visible gain (it's just 0.5B sharded across 8 cards vs. each card holding 0.5B; there isn't much difference). Other parts of the PPO setup require far more HBM than the model itself.

JohnConnor123 commented 1 month ago

> I believe this is mainly because the model size is not the main bottleneck, so even with DeepSpeed ZeRO-3 there isn't a visible gain (it's just 0.5B sharded across 8 cards vs. each card holding 0.5B; there isn't much difference). Other parts of the PPO setup require far more HBM than the model itself.

When using the model with PEFT, consumption drops from 16 GB to 10 GB. Given that the model weighs ~970 MB and PPO training creates 4 copies of it, everything else (optimizer states, gradients, etc.) comes to roughly 10 - 4 * 0.97 β‰ˆ 6 GB, i.e. about 6 / 4 = 1.5 GB per model copy (each of which weighs ~1 GB). So there are reasons to expect a decrease in VRAM consumption. (A quick sanity check of this arithmetic is sketched below.)

P.S. I have one GPU, not 8.
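
For reference, the arithmetic above can be sanity-checked in a few lines. The numbers (970 MB of weights, 4 model copies, 10 GB observed) are taken from the comment, not measured here:

# Back-of-envelope check of the VRAM breakdown described above (assumed values).
model_weights_gb = 0.97   # Qwen2.5-0.5B-Instruct weights in bf16, ~970 MB
num_model_copies = 4      # PPO keeps several copies (policy, reference, reward, value)
total_vram_gb = 10        # observed consumption with PEFT enabled

overhead_gb = total_vram_gb - num_model_copies * model_weights_gb
print(f"non-weight overhead: ~{overhead_gb:.1f} GB total, "
      f"~{overhead_gb / num_model_copies:.1f} GB per model copy")
# -> non-weight overhead: ~6.1 GB total, ~1.5 GB per model copy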

JohnConnor123 commented 1 month ago

@Superskyyy, correct me if I'm wrong.

Superskyyy commented 1 month ago

> > I believe this is mainly because the model size is not the main bottleneck, so even with DeepSpeed ZeRO-3 there isn't a visible gain (it's just 0.5B sharded across 8 cards vs. each card holding 0.5B; there isn't much difference). Other parts of the PPO setup require far more HBM than the model itself.
>
> When using the model with PEFT, consumption drops from 16 GB to 10 GB. Given that the model weighs ~970 MB and PPO training creates 4 copies of it, everything else (optimizer states, gradients, etc.) comes to roughly 10 - 4 * 0.97 β‰ˆ 6 GB, i.e. about 6 / 4 = 1.5 GB per model copy (each of which weighs ~1 GB). So there are reasons to expect a decrease in VRAM consumption.
>
> P.S. I have one GPU, not 8.

Your accelerate config shows num_processes as 8; if you have one GPU, are you trying to have 8 shards within the same GPU? That doesn't sound like the intended usage of DeepSpeed. DeepSpeed ZeRO stages 0-3 are only useful with more than one GPU.
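
A quick check (not from the thread) of how many GPUs are actually visible; num_processes in the accelerate config is normally set to this value (one launched process per GPU), not to a CPU thread count:

import torch

# On a single-GPU machine this prints 1, so num_processes: 8 would oversubscribe.
print(torch.cuda.device_count())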

JohnConnor123 commented 1 month ago

> That doesn't sound like the intended usage of DeepSpeed. DeepSpeed ZeRO stages 0-3 are only useful with more than one GPU.

Ah, sorry. I thought num_processes was responsible for the number of CPU threads used during execution.

I was planning to use DeepSpeed to offload the computation of everything except the model itself to the CPU. Apparently my config does not do this. How should I change it to make that happen (if it is possible at all)?

Superskyyy commented 1 month ago

> > That doesn't sound like the intended usage of DeepSpeed. DeepSpeed ZeRO stages 0-3 are only useful with more than one GPU.
>
> Ah, sorry. I thought num_processes was responsible for the number of CPU threads used during execution.
>
> I was planning to use DeepSpeed to offload the computation of everything except the model itself to the CPU. Apparently my config does not do this. How should I change it to make that happen (if it is possible at all)?

Right, num_processes is really the number of GPUs. If you want that effect, I guess you can try ZeRO stage 0 and enable parameter and optimizer offload with pinned memory, with num_processes set to 1? I don't know if it will work, though. You can also use gradient checkpointing (expect a ~20% slowdown). Alternatively you can try PEFT if hardware is the main concern; offloading more stuff = slower training.
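
A minimal sketch of the gradient-checkpointing suggestion, assuming a plain transformers model object (the PPO example script loads its models through transformers, so the same calls should apply there):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# Recompute activations during the backward pass instead of storing them:
# lower activation memory at the cost of roughly 20% extra compute.
model.gradient_checkpointing_enable()
model.config.use_cache = False  # KV caching is incompatible with checkpointing while training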

JohnConnor123 commented 1 month ago

> Alternatively you can try PEFT

Unfortunately, I'm already using LoRA, so I have no other way left to reduce VRAM consumption (the 0.5B model is only for debugging; my target is 3B models, so the VRAM issue matters).

TL;DR: I almost managed to run PPO with only 6 GB of VRAM instead of 10 GB (to do this I had to roll back accelerate to 0.34.2 and correct the config), at almost the same speed, but after some time the error IndexError: pop from an empty deque appears. On the latest version of accelerate, the run fails earlier with:

[rank0]: ValueError: Please make sure to properly initialize your accelerator via `accelerator = Accelerator()` before using any functionality from the `accelerate` library.
[rank0]:[W207 01:12:34.028098829 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
E0207 01:12:35.515000 130698066202752 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 661843) of binary: /home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/bin/python

start-ppo-with-deepspeed.sh:

accelerate launch --config_file single_gpu.yaml \
    trl/examples/scripts/ppo/ppo.py \
    --dataset_name trl-internal-testing/descriptiveness-sentiment-trl-style \
    --dataset_train_split descriptiveness \
    --output_dir Qwen2.5-0.5B-Instruct-PPO \
    --num_ppo_epochs 1 \
    --num_mini_batches 1 \
    --learning_rate 3e-6 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --total_episodes 1000 \
    --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
    --sft_model_path Qwen/Qwen2.5-0.5B-Instruct \
    --reward_model_path Qwen/Qwen2.5-0.5B-Instruct \
    --local_rollout_forward_batch_size 1 \
    --missing_eos_penalty 1.0 \
    --use_peft \
    --lora_r 32 \
    --lora_alpha 16

single_gpu.yaml:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: false
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

With accelerate:latest (==1.3.0):

(skoltech-llm-long-context-py3.12) calibri@devai:~/experiments/rl_finetunning$ source start-ppo-with-deepspeed.sh
[2025-02-07 01:12:20,356] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-02-07 01:12:23,879] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-02-07 01:12:24,770] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-02-07 01:12:24,771] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2025-02-07 01:12:25,947] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 1
[2025-02-07 01:12:26,727] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 291, num_elems = 0.49B
Some weights of Qwen2ForSequenceClassification were not initialized from the model checkpoint at Qwen/Qwen2.5-0.5B-Instruct and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[2025-02-07 01:12:27,329] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 1
[2025-02-07 01:12:27,995] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 582, num_elems = 0.99B
Some weights of Qwen2ForSequenceClassification were not initialized from the model checkpoint at Qwen/Qwen2.5-0.5B-Instruct and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[2025-02-07 01:12:28,826] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 1
[2025-02-07 01:12:29,486] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 873, num_elems = 1.62B
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/calibri/experiments/rl_finetunning/trl/examples/scripts/ppo/ppo.py", line 152, in <module>
[rank0]:     trainer = PPOTrainer(
[rank0]:               ^^^^^^^^^^^
[rank0]:   File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/transformers/utils/deprecation.py", line 165, in wrapped_func
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/transformers/utils/deprecation.py", line 165, in wrapped_func
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/transformers/utils/deprecation.py", line 165, in wrapped_func
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   [Previous line repeated 1 more time]
[rank0]:   File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/trl/trainer/ppo_trainer.py", line 194, in __init__
[rank0]:     accelerator = Accelerator(gradient_accumulation_steps=args.gradient_accumulation_steps)
[rank0]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/accelerate/accelerator.py", line 302, in __init__
[rank0]:     deepspeed_plugins = AcceleratorState().deepspeed_plugins
[rank0]:                         ^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/accelerate/state.py", line 887, in __init__
[rank0]:     raise ValueError(
[rank0]: ValueError: Please make sure to properly initialize your accelerator via `accelerator = Accelerator()` before using any functionality from the `accelerate` library.
[rank0]:[W207 01:12:34.028098829 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
E0207 01:12:35.515000 130698066202752 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 661843) of binary: /home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/bin/python
Traceback (most recent call last):
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1157, in launch_command
    deepspeed_launcher(args)
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/accelerate/commands/launch.py", line 845, in deepspeed_launcher
    distrib_run.run(args)
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
trl/examples/scripts/ppo/ppo.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-02-07_01:12:35
  host      : devai
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 661843)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

With accelerate==0.34.2:

(skoltech-llm-long-context-py3.12) calibri@devai:~/experiments/rl_finetunning$ source start-ppo-with-deepspeed.sh
[2025-02-07 01:13:21,262] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-02-07 01:13:24,750] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-02-07 01:13:25,644] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-02-07 01:13:25,644] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2025-02-07 01:13:26,568] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 1
[2025-02-07 01:13:27,351] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 291, num_elems = 0.49B
Some weights of Qwen2ForSequenceClassification were not initialized from the model checkpoint at Qwen/Qwen2.5-0.5B-Instruct and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[2025-02-07 01:13:27,973] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 1
[2025-02-07 01:13:28,645] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 582, num_elems = 0.99B
Some weights of Qwen2ForSequenceClassification were not initialized from the model checkpoint at Qwen/Qwen2.5-0.5B-Instruct and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[2025-02-07 01:13:29,418] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 1
[2025-02-07 01:13:30,081] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 873, num_elems = 1.62B
Using /home/calibri/.cache/torch_extensions/py312_cu121 as PyTorch extensions root...
Emitting ninja build file /home/calibri/.cache/torch_extensions/py312_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.179290294647217 seconds
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000003, betas=(0.900000, 0.999000), weight_decay=0.010000, adam_w=1
[2025-02-07 01:13:37,954] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed info: version=0.16.3, git-hash=unknown, git-branch=unknown
[2025-02-07 01:13:37,954] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 1
[2025-02-07 01:13:37,973] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2025-02-07 01:13:37,974] [INFO] [logging.py:128:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2025-02-07 01:13:37,974] [INFO] [logging.py:128:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2025-02-07 01:13:37,995] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2025-02-07 01:13:37,995] [INFO] [utils.py:59:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2025-02-07 01:13:37,995] [INFO] [logging.py:128:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
[2025-02-07 01:13:37,995] [INFO] [logging.py:128:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 3 optimizer
[2025-02-07 01:13:38,133] [INFO] [utils.py:781:see_memory_usage] Stage 3 initialize beginning
[2025-02-07 01:13:38,134] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB         Max_MA 0.76 GB         CA 0.77 GB         Max_CA 1 GB
[2025-02-07 01:13:38,134] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 5.32 GB, percent = 17.0%
[2025-02-07 01:13:38,137] [INFO] [stage3.py:169:__init__] Reduce bucket size 500000000
[2025-02-07 01:13:38,137] [INFO] [stage3.py:170:__init__] Prefetch bucket size 50000000
[2025-02-07 01:13:38,263] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2025-02-07 01:13:38,263] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.77 GB         Max_CA 1 GB
[2025-02-07 01:13:38,263] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 5.32 GB, percent = 17.0%
Parameter Offload: Total persistent parameters: 2306688 in 339 params
[2025-02-07 01:13:38,449] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2025-02-07 01:13:38,450] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.77 GB         Max_CA 1 GB
[2025-02-07 01:13:38,450] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 5.32 GB, percent = 17.0%
[2025-02-07 01:13:38,585] [INFO] [utils.py:781:see_memory_usage] Before creating fp16 partitions
[2025-02-07 01:13:38,586] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.77 GB         Max_CA 1 GB
[2025-02-07 01:13:38,586] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 5.31 GB, percent = 17.0%
[2025-02-07 01:13:39,436] [INFO] [utils.py:781:see_memory_usage] After creating fp16 partitions: 2
[2025-02-07 01:13:39,437] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.77 GB         Max_CA 1 GB
[2025-02-07 01:13:39,437] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 6.23 GB, percent = 19.9%
[2025-02-07 01:13:39,579] [INFO] [utils.py:781:see_memory_usage] Before creating fp32 partitions
[2025-02-07 01:13:39,579] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.77 GB         Max_CA 1 GB
[2025-02-07 01:13:39,579] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 6.21 GB, percent = 19.9%
[2025-02-07 01:13:40,037] [INFO] [utils.py:781:see_memory_usage] After creating fp32 partitions
[2025-02-07 01:13:40,037] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.77 GB         Max_CA 1 GB
[2025-02-07 01:13:40,038] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 8.07 GB, percent = 25.8%
[2025-02-07 01:13:40,175] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
[2025-02-07 01:13:40,176] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.77 GB         Max_CA 1 GB
[2025-02-07 01:13:40,176] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 8.07 GB, percent = 25.8%
[2025-02-07 01:13:40,689] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
[2025-02-07 01:13:40,689] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.77 GB         Max_CA 1 GB
[2025-02-07 01:13:40,689] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 9.88 GB, percent = 31.6%
[2025-02-07 01:13:40,690] [INFO] [stage3.py:529:_setup_for_real_optimizer] optimizer state initialized
[2025-02-07 01:13:41,060] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
[2025-02-07 01:13:41,060] [INFO] [utils.py:782:see_memory_usage] MA 0.93 GB         Max_MA 1.44 GB         CA 1.7 GB         Max_CA 2 GB
[2025-02-07 01:13:41,060] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 10.65 GB, percent = 34.1%
[2025-02-07 01:13:41,061] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer_Stage3
[2025-02-07 01:13:41,061] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed using configured LR scheduler = None
[2025-02-07 01:13:41,061] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2025-02-07 01:13:41,061] [INFO] [logging.py:128:log_dist] [Rank 0] step=0, skipped=0, lr=[3e-06, 3e-06], mom=[(0.9, 0.999), (0.9, 0.999)]
[2025-02-07 01:13:41,062] [INFO] [config.py:999:print] DeepSpeedEngine configuration:
[2025-02-07 01:13:41,063] [INFO] [config.py:1003:print]   activation_checkpointing_config  {
    "partition_activations": false,
    "contiguous_memory_optimization": false,
    "cpu_checkpointing": false,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false,
    "profile": false
}
[2025-02-07 01:13:41,063] [INFO] [config.py:1003:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True, 'use_gds': False}
[2025-02-07 01:13:41,063] [INFO] [config.py:1003:print]   amp_enabled .................. False
[2025-02-07 01:13:41,063] [INFO] [config.py:1003:print]   amp_params ................... False
[2025-02-07 01:13:41,063] [INFO] [config.py:1003:print]   autotuning_config ............ {
    "enabled": false,
    "start_step": null,
    "end_step": null,
    "metric_path": null,
    "arg_mappings": null,
    "metric": "throughput",
    "model_info": null,
    "results_dir": "autotuning_results",
    "exps_dir": "autotuning_exps",
    "overwrite": true,
    "fast": true,
    "start_profile_step": 3,
    "end_profile_step": 5,
    "tuner_type": "gridsearch",
    "tuner_early_stopping": 5,
    "tuner_num_trials": 50,
    "model_info_path": null,
    "mp_size": 1,
    "max_train_batch_size": null,
    "min_train_batch_size": 1,
    "max_train_micro_batch_size_per_gpu": 1.024000e+03,
    "min_train_micro_batch_size_per_gpu": 1,
    "num_tuning_micro_batch_sizes": 3
}
[2025-02-07 01:13:41,063] [INFO] [config.py:1003:print]   bfloat16_enabled ............. True
[2025-02-07 01:13:41,063] [INFO] [config.py:1003:print]   bfloat16_immediate_grad_update  False
[2025-02-07 01:13:41,063] [INFO] [config.py:1003:print]   checkpoint_parallel_write_pipeline  False
[2025-02-07 01:13:41,063] [INFO] [config.py:1003:print]   checkpoint_tag_validation_enabled  True
[2025-02-07 01:13:41,063] [INFO] [config.py:1003:print]   checkpoint_tag_validation_fail  False
[2025-02-07 01:13:41,063] [INFO] [config.py:1003:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x78ba035a7d70>
[2025-02-07 01:13:41,063] [INFO] [config.py:1003:print]   communication_data_type ...... None
[2025-02-07 01:13:41,063] [INFO] [config.py:1003:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2025-02-07 01:13:41,063] [INFO] [config.py:1003:print]   curriculum_enabled_legacy .... False
[2025-02-07 01:13:41,063] [INFO] [config.py:1003:print]   curriculum_params_legacy ..... False
[2025-02-07 01:13:41,063] [INFO] [config.py:1003:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2025-02-07 01:13:41,063] [INFO] [config.py:1003:print]   data_efficiency_enabled ...... False
[2025-02-07 01:13:41,063] [INFO] [config.py:1003:print]   dataloader_drop_last ......... False
[2025-02-07 01:13:41,063] [INFO] [config.py:1003:print]   disable_allgather ............ False
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   dump_state ................... False
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   dynamic_loss_scale_args ...... None
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   eigenvalue_enabled ........... False
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   eigenvalue_gas_boundary_resolution  1
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   eigenvalue_layer_num ......... 0
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   eigenvalue_max_iter .......... 100
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   eigenvalue_stability ......... 1e-06
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   eigenvalue_tol ............... 0.01
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   eigenvalue_verbose ........... False
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   elasticity_enabled ........... False
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   flops_profiler_config ........ {
    "enabled": false,
    "recompute_fwd_factor": 0.0,
    "profile_step": 1,
    "module_depth": -1,
    "top_modules": 1,
    "detailed": true,
    "output_file": null
}
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   fp16_auto_cast ............... None
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   fp16_enabled ................. False
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   fp16_master_weights_and_gradients  False
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   global_rank .................. 0
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   grad_accum_dtype ............. None
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   gradient_accumulation_steps .. 1
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   gradient_clipping ............ 1.0
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   gradient_predivide_factor .... 1.0
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   graph_harvesting ............. False
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   initial_dynamic_scale ........ 1
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   load_universal_checkpoint .... False
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   loss_scale ................... 1.0
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   memory_breakdown ............. False
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   mics_hierarchial_params_gather  False
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   mics_shard_size .............. -1
[2025-02-07 01:13:41,064] [INFO] [config.py:1003:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName')
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   nebula_config ................ {
    "enabled": false,
    "persistent_storage_path": null,
    "persistent_time_interval": 100,
    "num_of_version_in_retention": 2,
    "enable_nebula_load": true,
    "load_path": null
}
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   optimizer_legacy_fusion ...... False
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   optimizer_name ............... None
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   optimizer_params ............. None
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   pld_enabled .................. False
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   pld_params ................... False
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   prescale_gradients ........... False
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   scheduler_name ............... None
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   scheduler_params ............. None
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   seq_parallel_communication_data_type  torch.float32
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   sparse_attention ............. None
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   sparse_gradients_enabled ..... False
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   steps_per_print .............. inf
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   timers_config ................ enabled=True synchronized=True
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   train_batch_size ............. 1
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   train_micro_batch_size_per_gpu  1
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   use_data_before_expert_parallel_  False
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   use_node_local_storage ....... False
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   wall_clock_breakdown ......... False
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   weight_quantization_config ... None
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   world_size ................... 1
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   zero_allow_untested_optimizer  True
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100000000, max_in_cpu=1000000000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=False, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False module_granularity_threshold=0 use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False zeropp_loco_param=None mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   zero_enabled ................. True
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   zero_force_ds_cpu_optimizer .. True
[2025-02-07 01:13:41,065] [INFO] [config.py:1003:print]   zero_optimization_stage ...... 3
[2025-02-07 01:13:41,065] [INFO] [config.py:989:print_user_config]   json = {
    "train_batch_size": 1,
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "nvme_path": null
        },
        "offload_param": {
            "device": "cpu",
            "nvme_path": null
        },
        "stage3_gather_16bit_weights_on_model_save": false
    },
    "gradient_clipping": 1.0,
    "steps_per_print": inf,
    "bf16": {
        "enabled": true
    },
    "fp16": {
        "enabled": false
    },
    "zero_allow_untested_optimizer": true
}
[2025-02-07 01:13:41,066] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed info: version=0.16.3, git-hash=unknown, git-branch=unknown
[2025-02-07 01:13:41,066] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 1
[2025-02-07 01:13:41,070] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2025-02-07 01:13:41,071] [INFO] [logging.py:128:log_dist] [Rank 0] Creating ZeRO Offload
[2025-02-07 01:13:41,211] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2025-02-07 01:13:41,212] [INFO] [utils.py:782:see_memory_usage] MA 0.93 GB         Max_MA 0.93 GB         CA 1.7 GB         Max_CA 2 GB
[2025-02-07 01:13:41,212] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 10.65 GB, percent = 34.1%
Parameter Offload: Total persistent parameters: 72448 in 122 params
[2025-02-07 01:13:41,360] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2025-02-07 01:13:41,361] [INFO] [utils.py:782:see_memory_usage] MA 0.93 GB         Max_MA 0.93 GB         CA 1.7 GB         Max_CA 2 GB
[2025-02-07 01:13:41,361] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 10.65 GB, percent = 34.1%
[2025-02-07 01:13:41,362] [INFO] [config.py:999:print] DeepSpeedEngine configuration:
[2025-02-07 01:13:41,362] [INFO] [config.py:1003:print]   activation_checkpointing_config  {
    "partition_activations": false,
    "contiguous_memory_optimization": false,
    "cpu_checkpointing": false,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false,
    "profile": false
}
[2025-02-07 01:13:41,362] [INFO] [config.py:1003:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True, 'use_gds': False}
[2025-02-07 01:13:41,362] [INFO] [config.py:1003:print]   amp_enabled .................. False
[2025-02-07 01:13:41,362] [INFO] [config.py:1003:print]   amp_params ................... False
[2025-02-07 01:13:41,362] [INFO] [config.py:1003:print]   autotuning_config ............ {
    "enabled": false,
    "start_step": null,
    "end_step": null,
    "metric_path": null,
    "arg_mappings": null,
    "metric": "throughput",
    "model_info": null,
    "results_dir": "autotuning_results",
    "exps_dir": "autotuning_exps",
    "overwrite": true,
    "fast": true,
    "start_profile_step": 3,
    "end_profile_step": 5,
    "tuner_type": "gridsearch",
    "tuner_early_stopping": 5,
    "tuner_num_trials": 50,
    "model_info_path": null,
    "mp_size": 1,
    "max_train_batch_size": null,
    "min_train_batch_size": 1,
    "max_train_micro_batch_size_per_gpu": 1.024000e+03,
    "min_train_micro_batch_size_per_gpu": 1,
    "num_tuning_micro_batch_sizes": 3
}
[2025-02-07 01:13:41,362] [INFO] [config.py:1003:print]   bfloat16_enabled ............. True
[2025-02-07 01:13:41,362] [INFO] [config.py:1003:print]   bfloat16_immediate_grad_update  False
[2025-02-07 01:13:41,362] [INFO] [config.py:1003:print]   checkpoint_parallel_write_pipeline  False
[2025-02-07 01:13:41,362] [INFO] [config.py:1003:print]   checkpoint_tag_validation_enabled  True
[2025-02-07 01:13:41,362] [INFO] [config.py:1003:print]   checkpoint_tag_validation_fail  False
[2025-02-07 01:13:41,362] [INFO] [config.py:1003:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x78b8ddb07260>
[2025-02-07 01:13:41,362] [INFO] [config.py:1003:print]   communication_data_type ...... None
[2025-02-07 01:13:41,362] [INFO] [config.py:1003:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2025-02-07 01:13:41,362] [INFO] [config.py:1003:print]   curriculum_enabled_legacy .... False
[2025-02-07 01:13:41,362] [INFO] [config.py:1003:print]   curriculum_params_legacy ..... False
[2025-02-07 01:13:41,362] [INFO] [config.py:1003:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2025-02-07 01:13:41,362] [INFO] [config.py:1003:print]   data_efficiency_enabled ...... False
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   dataloader_drop_last ......... False
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   disable_allgather ............ False
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   dump_state ................... False
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   dynamic_loss_scale_args ...... None
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   eigenvalue_enabled ........... False
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   eigenvalue_gas_boundary_resolution  1
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   eigenvalue_layer_num ......... 0
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   eigenvalue_max_iter .......... 100
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   eigenvalue_stability ......... 1e-06
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   eigenvalue_tol ............... 0.01
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   eigenvalue_verbose ........... False
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   elasticity_enabled ........... False
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   flops_profiler_config ........ {
    "enabled": false,
    "recompute_fwd_factor": 0.0,
    "profile_step": 1,
    "module_depth": -1,
    "top_modules": 1,
    "detailed": true,
    "output_file": null
}
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   fp16_auto_cast ............... None
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   fp16_enabled ................. False
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   fp16_master_weights_and_gradients  False
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   global_rank .................. 0
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   grad_accum_dtype ............. None
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   gradient_accumulation_steps .. 1
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   gradient_clipping ............ 1.0
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   gradient_predivide_factor .... 1.0
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   graph_harvesting ............. False
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   initial_dynamic_scale ........ 1
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   load_universal_checkpoint .... False
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   loss_scale ................... 1.0
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   memory_breakdown ............. False
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   mics_hierarchial_params_gather  False
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   mics_shard_size .............. -1
[2025-02-07 01:13:41,363] [INFO] [config.py:1003:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName')
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   nebula_config ................ {
    "enabled": false,
    "persistent_storage_path": null,
    "persistent_time_interval": 100,
    "num_of_version_in_retention": 2,
    "enable_nebula_load": true,
    "load_path": null
}
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   optimizer_legacy_fusion ...... False
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   optimizer_name ............... None
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   optimizer_params ............. None
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   pld_enabled .................. False
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   pld_params ................... False
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   prescale_gradients ........... False
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   scheduler_name ............... None
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   scheduler_params ............. None
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   seq_parallel_communication_data_type  torch.float32
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   sparse_attention ............. None
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   sparse_gradients_enabled ..... False
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   steps_per_print .............. inf
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   timers_config ................ enabled=True synchronized=True
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   train_batch_size ............. 1
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   train_micro_batch_size_per_gpu  1
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   use_data_before_expert_parallel_  False
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   use_node_local_storage ....... False
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   wall_clock_breakdown ......... False
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   weight_quantization_config ... None
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   world_size ................... 1
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   zero_allow_untested_optimizer  True
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100000000, max_in_cpu=1000000000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=False, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False module_granularity_threshold=0 use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False zeropp_loco_param=None mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   zero_enabled ................. True
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   zero_force_ds_cpu_optimizer .. True
[2025-02-07 01:13:41,364] [INFO] [config.py:1003:print]   zero_optimization_stage ...... 3
[2025-02-07 01:13:41,364] [INFO] [config.py:989:print_user_config]   json = {
    "train_batch_size": 1,
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "nvme_path": null
        },
        "offload_param": {
            "device": "cpu",
            "nvme_path": null
        },
        "stage3_gather_16bit_weights_on_model_save": false
    },
    "gradient_clipping": 1.0,
    "steps_per_print": inf,
    "bf16": {
        "enabled": true
    },
    "fp16": {
        "enabled": false
    },
    "zero_allow_untested_optimizer": true,
    "zero_optimization.reduce_bucket_size": 8.028160e+05,
    "zero_optimization.stage3_param_persistence_threshold": 8.960000e+03,
    "zero_optimization.stage3_prefetch_bucket_size": 0
}
===training policy===
wandb: Currently logged in as: ivan-eudokimoff2014 (ivan-eudokimoff2014-skolkovo-institute) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Tracking run with wandb version 0.19.5
wandb: Run data is saved locally in /home/calibri/experiments/rl_finetunning/wandb/run-20250207_011341-74y2rhd9
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run ppo_config__42__1738890814
wandb: ⭐️ View project at https://wandb.ai/ivan-eudokimoff2014-skolkovo-institute/huggingface
wandb: πŸš€ View run at https://wandb.ai/ivan-eudokimoff2014-skolkovo-institute/huggingface/runs/74y2rhd9
  0%|                                                                                                                                                                                                               | 0/1000 [00:00<?, ?it/s]From v4.47 onwards, when a model cache is to be returned, `generate` will return a `Cache` instance instead by default (as opposed to the legacy tuple of tuples format). If you want to keep returning the legacy format, please set `return_legacy_cache=True`.
/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/trl/trainer/ppo_trainer.py:640: UserWarning: var(): degrees of freedom is <= 0. Correction should be strictly less than the reduction factor (input numel divided by output numel). (Triggered internally at ../aten/src/ATen/native/ReduceOps.cpp:1808.)
  metrics["val/ratio_var"] = self.accelerator.gather_for_metrics(ratio_stats).var().item()
{'eps': 0, 'objective/kl': 0.6529181003570557, 'objective/entropy': 58.82828903198242, 'objective/non_score_reward': -0.032645903527736664, 'objective/rlhf_reward': -2.6420209407806396, 'objective/scores': -2.609375, 'policy/approxkl_avg': 0.0022548267152160406, 'policy/clipfrac_avg': 0.0, 'loss/policy_avg': 0.011362709105014801, 'loss/value_avg': 3.932140827178955, 'val/clipfrac_avg': 0.0, 'policy/entropy_avg': 1.484375, 'val/ratio': 0.9899009466171265, 'val/ratio_var': nan, 'val/num_eos_tokens': 0, 'lr': 3e-06, 'episode': 1, 'epoch': 0.0}
  0%|▏                                                                                                                                                                                                    | 1/1000 [00:05<1:25:03,  5.11s/it]┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ query                                                                                                        ┃ model response                                                                                               ┃ score       ┃
┑━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
β”‚ She couldn't get the saw-player the kid had mentioned out of her mind. Sounds Hawaiian, she thought over and β”‚  She couldn't help but giggle. Eddie was a good man, and she was glad to have him by her side. She had been  β”‚ -1.75       β”‚
β”‚ over again as Eddie pushed her grimly along in the new wheelchair, weaving in and out of the stalled         β”‚ in a car accident a few days ago, and she was in a terrible state of shock. She had been in a car            β”‚             β”‚
β”‚ vehicles. Sounds Hawaiian, doesn't it? Sounds fucking Hawaiian, doesn't it.                                  β”‚                                                                                                              β”‚             β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ "You little piss-ant!" the girl snapped. "Don't tell me I slipped up. She died at seventeen. That's why I    β”‚  "I'm a man. I'm not a child. I'm not a child. I'm a man."                                                   β”‚ -0.58984375 β”‚
β”‚ wasn't there. I was never notified."                                                                         β”‚                                                                                                              β”‚             β”‚
β”‚                                                                                                              β”‚ "Then why are you here?" she demanded. "Why are you here? Why are you here? Why are you here? Why are you    β”‚             β”‚
β”‚ "But I don't do sixteen," he said, his voice going nasty.                                                    β”‚ here                                                                                                         β”‚             β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Daniel flashed one of his own-a real one, this time.                                                         β”‚  The room was empty. I was alone. I didn't know what to say. I didn't know what to do. I didn't know what to β”‚ 0.44140625  β”‚
β”‚                                                                                                              β”‚ do.                                                                                                          β”‚             β”‚
β”‚ "I almost had a heart attack when Mom almost had a heart attack," he said, his voice quiet. Serious.         β”‚                                                                                                              β”‚             β”‚
β”‚ "I'm-I'm happy you're okay."                                                                                 β”‚ I sat down on the floor, my hands in my lap. I didn't know what to do                                        β”‚             β”‚
β”‚                                                                                                              β”‚                                                                                                              β”‚             β”‚
β”‚ I looked around the room.                                                                                    β”‚                                                                                                              β”‚             β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ "Lights," the Oracle announced.Β  Images sprang up on the panels, showing coral and the calm swirling of sea  β”‚  I'm not sure what they are, but I'm not sure what they are."                                                β”‚ 12.6875     β”‚
β”‚ particles.                                                                                                   β”‚ Zook, a 22-year-old computer science student at the University of California, Berkeley, was watching the     β”‚             β”‚
β”‚ "What am I looking at?" asked Zook.                                                                          β”‚ Oracle's display of images on the screen. He was looking at                                                  β”‚             β”‚
β”‚ "The extra eyes of technology that I dropped behind us.                                                      β”‚                                                                                                              β”‚             β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ These people could still smile and frown freely, they were just young. In fact, the only person I had seen   β”‚  I was a bit nervous, but I was determined to get the story out of my head. I was going to tell the story of β”‚ 5.71875     β”‚
β”‚ older than myself in the compound, was Dom. I frowned, and added that point to my agenda to discuss with the β”‚ how I got into the compound, and how I was able to get out. I was going to tell the story of how I           β”‚             β”‚
β”‚ others. Then I launched into the story.                                                                      β”‚                                                                                                              β”‚             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
Traceback (most recent call last):
  File "/home/calibri/experiments/rl_finetunning/trl/examples/scripts/ppo/ppo.py", line 163, in <module>
    trainer.train()
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/trl/trainer/ppo_trainer.py", line 556, in train
    output, vpred_temp = forward(model, mb_query_responses, processing_class.pad_token_id)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/trl/trainer/utils.py", line 1224, in forward
    return model(
           ^^^^^^
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/deepspeed/runtime/engine.py", line 1914, in forward
    loss = self.module(*inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1592, in _call_impl
    args_result = hook(self, args)
                  ^^^^^^^^^^^^^^^^
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 241, in _start_of_forward_hook
    self.get_param_coordinator().reset_step()
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 600, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 235, in reset_step
    self.construct_parameter_trace_from_module_trace()
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 219, in construct_parameter_trace_from_module_trace
    self.record_parameters(sub_module)
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 211, in record_parameters
    step_id = self.__step_id_module_fetched_for[sub_module.id].popleft()
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: pop from an empty deque
wandb:
wandb: πŸš€ View run ppo_config__42__1738890814 at: https://wandb.ai/ivan-eudokimoff2014-skolkovo-institute/huggingface/runs/74y2rhd9
wandb: Find logs at: wandb/run-20250207_011341-74y2rhd9/logs
E0207 01:13:55.827000 138054911975552 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 662027) of binary: /home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/bin/python
Traceback (most recent call last):
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
    deepspeed_launcher(args)
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/accelerate/commands/launch.py", line 852, in deepspeed_launcher
    distrib_run.run(args)
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/calibri/.cache/pypoetry/virtualenvs/skoltech-llm-long-context-LD6GBRk7-py3.12/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
trl/examples/scripts/ppo/ppo.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-02-07_01:13:55
  host      : devai
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 662027)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
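
From the frame names alone, the failure looks like a bookkeeping mismatch inside ZeRO-3's parameter coordinator: record_parameters pops a per-module deque of step ids (__step_id_module_fetched_for[sub_module.id]), and the exception fires when that deque is already empty, i.e. the reconstructed module trace asks for more fetch records than were stored for that sub-module. A toy illustration of the failure mode, hypothetical and not DeepSpeed code:

# Hypothetical illustration of the failure mode seen in the traceback above
# (not DeepSpeed's actual implementation): a per-module deque of step ids is
# popped once per recorded fetch; one pop too many raises this exact IndexError.
from collections import defaultdict, deque

step_id_module_fetched_for = defaultdict(deque)
step_id_module_fetched_for["sub_module_0"].append(0)   # one fetch recorded

step_id_module_fetched_for["sub_module_0"].popleft()   # matches the recorded fetch
step_id_module_fetched_for["sub_module_0"].popleft()   # IndexError: pop from an empty deque
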
JohnConnor123 commented 1 month ago

Okay, I think it would be better to close this issue and open a separate one for the `IndexError: pop from an empty deque` error.
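
For the new issue it may help to keep the two problems apart: the IndexError is raised in DeepSpeed's ZeRO-3 forward hook regardless of how much memory is in play, while the expected savings from stage 3 can be sanity-checked offline with DeepSpeed's built-in estimator. A minimal sketch follows; the Qwen/Qwen2.5-0.5B-Instruct checkpoint and the single-GPU, single-node topology are assumptions, and the estimate covers model states only (parameters, gradients, optimizer states), not activations:

# Minimal sketch (assumptions: the Qwen/Qwen2.5-0.5B-Instruct checkpoint and a
# 1-GPU / 1-node topology; the estimate covers model states only, not activations).
from transformers import AutoModelForCausalLM
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)

This should print per-GPU estimates for the different offload options, which gives a baseline to compare against the observed VRAM before attributing the gap to ZeRO-3 itself.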