Closed: tothemoon96 closed this issue 5 months ago.
Hello, could you please trim it down to a minimal reproducer example instead of the whole codebase? Also, please explain the changes that you are making in train_rm_bug.py vs. train_rm.py.
Thanks for your response. The differences between train_rm_bug.py and train_rm.py are as follows (both construction styles are sketched below):

1. In train_rm_bug.py, I inherit from TrainingArguments and create the instance with HfArgumentParser.
2. In train_rm.py, I create the instance of TrainingArguments directly via its __init__ method.
3. train_rm_bug.py is launched with torchrun, while train_rm.py is launched with accelerate launch.
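For context, here is a minimal sketch of the two construction styles described above (the dataclass name, its extra field, and all argument values are hypothetical, not taken from the repo):

from dataclasses import dataclass, field
from transformers import HfArgumentParser, TrainingArguments

# Style used in train_rm_bug.py: subclass TrainingArguments and let
# HfArgumentParser build the instance from the command line (e.g. under torchrun).
@dataclass
class RMTrainingArguments(TrainingArguments):
    reward_model_name: str = field(default="bert-base-cased")  # hypothetical extra field

parser = HfArgumentParser(RMTrainingArguments)
(cli_args,) = parser.parse_args_into_dataclasses()

# Style used in train_rm.py: call TrainingArguments.__init__ directly
# (e.g. under accelerate launch).
direct_args = TrainingArguments(
    output_dir="rm_output",          # hypothetical
    per_device_train_batch_size=32,  # hypothetical
)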
By the way, I have removed the irrelevant files from the aforementioned GitHub repository; it now contains only minimal reproducers.

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I can confirm this issue still exists.
A related issue was reported by another user (in the wrong repo, though): https://github.com/bentoml/OpenLLM/issues/236
@aiden-leong please give us a minimal reproducer script of what you have going, like asked earlier (unless it's the same code as theirs).
import platform

import torch
from datasets import load_dataset
from transformers import AutoTokenizer, BertForQuestionAnswering, Trainer, TrainingArguments

def run_model():
    device = torch.device("mps")  # NOTE: unused; Trainer selects the device itself
    print(platform.python_version())
    print(torch.cuda.is_available())

    model_checkpoint = "bert-base-cased"
    raw_datasets_squad = load_dataset("squad", split="train")
    squad = raw_datasets_squad.train_test_split(test_size=0.2)
    train_dataset_squad = squad["train"]
    validation_dataset_squad = squad["test"]

    tokenizer_squad = AutoTokenizer.from_pretrained(model_checkpoint)
    aiden_model_qa = BertForQuestionAnswering.from_pretrained(model_checkpoint)

    args_squad = TrainingArguments(
        "bert-finetuned-squad",
        evaluation_strategy="no",
        save_strategy="epoch",
        learning_rate=2e-5,
        num_train_epochs=3,
        per_device_train_batch_size=32,
        max_steps=1000,
        logging_steps=100,
        weight_decay=0.01,
        fp16=False,
        push_to_hub=False,
    )
    # The raw, untokenized dataset is passed straight to Trainer here;
    # the preprocessing step via dataset.map is missing (see discussion below).
    trainer = Trainer(
        model=aiden_model_qa,
        args=args_squad,
        train_dataset=train_dataset_squad,
        # eval_dataset=validation_dataset_squad,
        tokenizer=tokenizer_squad,
        # compute_metrics=compute_metrics,
    )
    trainer.train()

run_model()
Colab: https://colab.research.google.com/drive/1-HySqELhFI4IqaGk7Sv_PFRcmSoZJgTa?usp=sharing
It's pretty clear that the missing dataset.map call is the root cause of this issue, but maybe we can provide some hint for debugging?
ref: https://huggingface.co/docs/transformers/tasks/question_answering#preprocess
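For reference, here is a sketch of the missing preprocessing step, adapted from the linked question-answering docs (tokenizer_squad and train_dataset_squad refer to the reproducer above; max_length=384 is the value used in the docs, not something from this thread):

def preprocess_squad(examples):
    # Tokenize question/context pairs; keep offsets to map answers to token spans.
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer_squad(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )
    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions, end_positions = [], []
    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = start_char + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)
        # Locate the context portion of the tokenized sequence.
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1
        # Label (0, 0) if the answer is not fully inside the context window.
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise map the character span to start/end token positions.
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)
            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)
    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

train_dataset_squad = train_dataset_squad.map(
    preprocess_squad, batched=True, remove_columns=train_dataset_squad.column_names
)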
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
For me, the same error started occurring when I included model = torch.compile(model) before training. Apparently this somehow resets the number of rows seen by the query_table function in datasets/formatting/formatting.py to 0, unless remove_unused_columns=False is also passed via TrainingArguments; see this issue: https://github.com/huggingface/transformers/issues/27106
(The correct way to get torch.compile working in my case seems to be passing torch_compile=True to TrainingArguments, which does not have these weird side effects.)
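A minimal sketch of the two approaches described above (output_dir is illustrative):

from transformers import TrainingArguments

# Preferred: let Trainer compile the model internally.
args = TrainingArguments(
    output_dir="out",
    torch_compile=True,
)

# Workaround when wrapping manually with model = torch.compile(model):
args_manual = TrainingArguments(
    output_dir="out",
    remove_unused_columns=False,  # avoids the 0-row side effect described above
)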
Buggy output

System Info

Information

Tasks

no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)

Reproduction
I have packaged my environment in tothemoon/temp:20230917. After entering the Docker environment, please clone https://github.com/tothemoon96/rlhf.git
Expected behavior
The normal run of train_rm.py, as commented in script/rm_test.sh, should be identical to train_rm_bug.py, without exceptions.