huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

RuntimeError: Storage size calculation overflowed with sizes=[1, 4623015400198258675] #2868

Closed artkpv closed 2 months ago

artkpv commented 3 months ago

System Info

- `Accelerate` version: 0.31.0
- Platform: Linux-3.10.0-1160.83.1.0.1.el7.x86_64-x86_64-with-glibc2.17
- `accelerate` bash location: /data/......./venv/bin/accelerate
- Python version: 3.11.5
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.3.1+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 2015.46 GB
- GPU type: NVIDIA A100-SXM4-80GB
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: DEEPSPEED
        - mixed_precision: bf16
        - use_cpu: False
        - debug: True
        - num_processes: 4
        - machine_rank: 0
        - num_machines: 1
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - deepspeed_config: {'gradient_accumulation_steps': 1, 'offload_optimizer_device': 'none', 'offload_param_device': 'none', 'zero3_init_flag': True, 'zero3_save_16bit_model': True, 'zero_stage': 3}
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Reproduction

I'm running Accelerate with TRL to train Llama 3 70B. It breaks with the exception above when it somehow computes a huge max size inside PPOTrainer.step(). Any hints on what might be wrong? Thanks

Main code:


    accelerator = Accelerator(
        kwargs_handlers=[
            InitProcessGroupKwargs(timeout=timedelta(minutes=30), backend="nccl")
        ]
    )
    torch.cuda.empty_cache()

    device = accelerator.device

    # Load model
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )

    model = AutoModelForCausalLMWithValueHead.from_pretrained(
        context.model_name,
        peft_config=lora_config,
        attn_implementation="sdpa",
    )
    # Tokenizer:
    tokenizer = AutoTokenizer.from_pretrained(context.model_name)
    tokenizer.pad_token_id = tokenizer.eos_token_id

    # Refs: llama-recipes: src/llama_recipes/finetuning.py
    # If there is a mismatch between tokenizer vocab size and embedding matrix,
    # throw a warning and then expand the embedding matrix
    if len(tokenizer) > model.pretrained_model.get_input_embeddings().weight.shape[0]:
        print(
            "WARNING: Resizing the embedding matrix to match the tokenizer vocab size."
        )
        model.pretrained_model.resize_token_embeddings(len(tokenizer))

    context.generation_kwargs |= dict(
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

    model.eval()
    with accelerator.main_process_first():
        steg_ds = build_dataset(context, tokenizer=tokenizer)
    accelerator.wait_for_everyone()

    ppo_trainer = PPOTrainer(
        context.ppo_config,
        model,
        ref_model=None,
        tokenizer=tokenizer,
        dataset=steg_ds["train"],
        data_collator=collator,
    )
    tokenizer = ppo_trainer.tokenizer  # type: ignore # It is changed by PPOTrainer.
    model = ppo_trainer.model  # type: ignore
    accelerator.wait_for_everyone()

    output_length_sampler = LengthSampler(
        context.output_min_length, context.output_max_length
    )
    dataloader: torch.utils.data.DataLoader = ppo_trainer.dataloader  # type: ignore
    for epoch in tqdm(range(context.epoch_num)):
        for batch in dataloader:

            question_tensors = batch["input_ids"]
            batch["response"] = []
            response_tensors = []
            for input_ids in question_tensors:  # TODO: make it batched.
                response_ids = ppo_trainer.model.generate(
                    input_ids.unsqueeze(0),  # Add batch dimension.
                    max_new_tokens=context.output_max_length,
                    **context.generation_kwargs,
                )
                # Take only response:
                response_ids = response_ids[..., input_ids.shape[-1] :]
                decoded = tokenizer.batch_decode(response_ids)
                batch["response"].append(decoded[0])
                response_tensors.append(response_ids[0])

            # Compute reward score:
            rewards, caught_num, decoded_num, success_num = reward_batch(
                batch, ppo_trainer, tokenizer, device, context
            )

            # Log shapes of question_tensors and response_tensors
            for inx, q, response, reward in zip(
                range(len(question_tensors)),
                question_tensors,
                response_tensors,
                rewards,
            ):
                # Illustrative logging body (omitted in the original snippet).
                accelerator.print(
                    f"sample {inx}: query {tuple(q.shape)}, "
                    f"response {tuple(response.shape)}, reward {reward}"
                )

            # Run PPO step
            stats = ppo_trainer.step(question_tensors, response_tensors, rewards)  # type: ignore

            b_len = len(batch["response"])
            stats["train/decoder_rate"] = decoded_num / b_len
            stats["train/caught_rate"] = caught_num / b_len
            stats["train/success_rate"] = success_num / b_len
            stats["dl/epoch"] = epoch
            ppo_trainer.log_stats(  # type: ignore
                stats,
                batch,
                rewards,
                columns_to_log=["bit", "query", "response"],
            )

Accelerate config:

compute_environment: LOCAL_MACHINE
debug: true
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

The bug:

> 2024-06-18 08:27:17,705::40022__main__:DEBUG Before PPO step
> [rank0]:[E618 08:57:17.122997522 ProcessGroupNCCL.cpp:572] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=82098, OpType=_ALLGATHER_BASE, NumelIn=262668288, NumelOut=1050673152, Timeout(ms)=1800000) ran for 1800028 milliseconds before timing out.
> [rank0]:[E618 08:57:17.124263345 ProcessGroupNCCL.cpp:1587] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 82098, last enqueued NCCL work: 82098, last completed NCCL work: 82097.
> [rank3]:[E618 08:57:17.135739202 ProcessGroupNCCL.cpp:572] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=82098, OpType=_ALLGATHER_BASE, NumelIn=262668288, NumelOut=1050673152, Timeout(ms)=1800000) ran for 1800040 milliseconds before timing out.
> [rank3]:[E618 08:57:17.136141133 ProcessGroupNCCL.cpp:1587] [PG 0 (default_pg) Rank 3] Exception (either an error or timeout) detected by watchdog at work: 82098, last enqueued NCCL work: 82098, last completed NCCL work: 82097.
> [rank2]:[E618 08:57:17.145912074 ProcessGroupNCCL.cpp:572] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=82098, OpType=_ALLGATHER_BASE, NumelIn=262668288, NumelOut=1050673152, Timeout(ms)=1800000) ran for 1800050 milliseconds before timing out.
> [rank2]:[E618 08:57:17.146319125 ProcessGroupNCCL.cpp:1587] [PG 0 (default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 82098, last enqueued NCCL work: 82098, last completed NCCL work: 82097.
> [rank2]:[E618 08:57:17.513319285 ProcessGroupNCCL.cpp:1632] [PG 0 (default_pg) Rank 2] Timeout at NCCL work: 82098, last enqueued NCCL work: 82098, last completed NCCL work: 82097.
> [rank2]:[E618 08:57:17.513661404 ProcessGroupNCCL.cpp:586] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
> [rank2]:[E618 08:57:17.513921151 ProcessGroupNCCL.cpp:592] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
> [rank2]:[E618 08:57:17.515218876 ProcessGroupNCCL.cpp:1448] [PG 0 (default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=82098, OpType=_ALLGATHER_BASE, NumelIn=262668288, NumelOut=1050673152, Timeout(ms)=1800000) ran for 1800050 milliseconds before timing out.
> Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:574 (most recent call first):
> frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fc8e0788de6 in /data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/lib/libc10.so)
> frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fc8e1a2f8f2 in /data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
> frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x247 (0x7fc8e1a35f67 in /data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
> frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc8e1a37d6c in /data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
> frame #4: <unknown function> + 0xdbbf4 (0x7fc92e6dfbf4 in /data/artyom_karpov/miniconda3/bin/../lib/libstdc++.so.6)
> frame #5: <unknown function> + 0x7ea5 (0x7fc9368bbea5 in /lib64/libpthread.so.0)
> frame #6: clone + 0x6d (0x7fc935edbb2d in /lib64/libc.so.6)
> 
> [rank1]:[E618 08:57:17.650442757 ProcessGroupNCCL.cpp:572] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=82098, OpType=_ALLGATHER_BASE, NumelIn=2, NumelOut=8, Timeout(ms)=1800000) ran for 1800085 milliseconds before timing out.
> [rank1]:[E618 08:57:17.650866648 ProcessGroupNCCL.cpp:1587] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 82098, last enqueued NCCL work: 82098, last completed NCCL work: 82097.
>   0%|          | 0/200 [31:11<?, ?it/s]
> [rank1]: Traceback (most recent call last):
> [rank1]:   File "/data/artyom_karpov/rl4steg/train.py", line 518, in <module>
> [rank1]:     main(context)
> [rank1]:   File "/data/artyom_karpov/rl4steg/train.py", line 264, in main
> [rank1]:     stats = ppo_trainer.step(question_tensors, response_tensors, rewards)  # type: ignore
> [rank1]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> [rank1]:   File "/data/artyom_karpov/miniconda3/lib/python3.11/contextlib.py", line 81, in inner
> [rank1]:     return func(*args, **kwds)
> [rank1]:            ^^^^^^^^^^^^^^^^^^^
> [rank1]:   File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/trl/trainer/ppo_trainer.py", line 712, in step
> [rank1]:     model_inputs["input_ids"] = self.accelerator.pad_across_processes(
> [rank1]:                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> [rank1]:   File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/accelerate/accelerator.py", line 2473, in pad_across_processes
> [rank1]:     return pad_across_processes(tensor, dim=dim, pad_index=pad_index, pad_first=pad_first)
> [rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> [rank1]:   File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/accelerate/utils/operations.py", line 414, in wrapper
> [rank1]:     return function(*args, **kwargs)
> [rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^
> [rank1]:   File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/accelerate/utils/operations.py", line 681, in pad_across_processes
> [rank1]:     return recursively_apply(
> [rank1]:            ^^^^^^^^^^^^^^^^^^
> [rank1]:   File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/accelerate/utils/operations.py", line 126, in recursively_apply
> [rank1]:     return func(data, *args, **kwargs)
> [rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
> [rank1]:   File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/accelerate/utils/operations.py", line 671, in _pad_across_processes
> [rank1]:     new_tensor = tensor.new_zeros(tuple(new_size)) + pad_index
> [rank1]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> [rank1]: RuntimeError: Storage size calculation overflowed with sizes=[1, 4623015400198258675]

Expected behavior

PPOTrainer.step() completes successfully.
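
For reference, my reading of the traceback: pad_across_processes first gathers every rank's tensor size, then allocates a buffer padded to the largest size seen on any rank. The sketch below is my simplified reconstruction of that gather-then-pad pattern, not the actual accelerate code. If the ranks fall out of sync and enter mismatched collectives (the NCCL log shows rank 1 with NumelIn=2 while the other ranks report NumelIn=262668288 for the same SeqNum), the gathered sizes would be garbage, which might explain the absurd [1, 4623015400198258675] shape.

    # Simplified reconstruction of the code path named in the traceback
    # (accelerate/utils/operations.py::_pad_across_processes), for illustration
    # only -- NOT the actual implementation.
    import torch
    import torch.distributed as dist

    def pad_across_processes_sketch(tensor, dim=0, pad_index=0):
        # 1. Every rank shares the size of its local tensor.
        size = torch.tensor(tensor.shape, device=tensor.device)[None]
        sizes = [torch.empty_like(size) for _ in range(dist.get_world_size())]
        dist.all_gather(sizes, size)  # the collective that desyncs / times out above
        sizes = torch.cat(sizes).cpu()

        # 2. Allocate a buffer padded to the max size across ranks. If the gather
        #    returned corrupted values, max_size is garbage and this allocation
        #    raises "Storage size calculation overflowed" or a huge CUDA OOM.
        max_size = int(sizes[:, dim].max())
        new_size = list(tensor.shape)
        new_size[dim] = max_size
        new_tensor = tensor.new_zeros(tuple(new_size)) + pad_index

        # 3. Copy the local data into the padded buffer.
        slices = [slice(None)] * tensor.dim()
        slices[dim] = slice(0, tensor.shape[dim])
        new_tensor[tuple(slices)] = tensor
        return new_tensor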

artkpv commented 3 months ago

In other runs it fails with a CUDA OOM instead:

[rank1]: Traceback (most recent call last):
[rank1]:   File "/data/artyom_karpov/rl4steg/train.py", line 530, in <module>
[rank1]:     main(context)
[rank1]:   File "/data/artyom_karpov/rl4steg/train.py", line 276, in main
[rank1]:     stats = ppo_trainer.step(question_tensors, response_tensors, rewards)  # type: ignore
[rank1]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/data/artyom_karpov/miniconda3/lib/python3.11/contextlib.py", line 81, in inner
[rank1]:     return func(*args, **kwds)
[rank1]:            ^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/trl/trainer/ppo_trainer.py", line 712, in step
[rank1]:     model_inputs["input_ids"] = self.accelerator.pad_across_processes(
[rank1]:                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/accelerate/accelerator.py", line 2482, in pad_across_processes
[rank1]:     return pad_across_processes(tensor, dim=dim, pad_index=pad_index, pad_first=pad_first)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/accelerate/utils/operations.py", line 414, in wrapper
[rank1]:     return function(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/accelerate/utils/operations.py", line 682, in pad_across_processes
[rank1]:     return recursively_apply(
[rank1]:            ^^^^^^^^^^^^^^^^^^
[rank1]:   File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/accelerate/utils/operations.py", line 126, in recursively_apply
[rank1]:     return func(data, *args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/accelerate/utils/operations.py", line 662, in _pad_across_processes
[rank1]:     sizes = gather(size).cpu()
[rank1]:             ^^^^^^^^^^^^
[rank1]:   File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/accelerate/utils/operations.py", line 390, in wrapper
[rank1]:     output = gather_object([shapes])
[rank1]:              ^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/accelerate/utils/operations.py", line 465, in gather_object
[rank1]:     return _gpu_gather_object(object)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/accelerate/utils/operations.py", line 446, in _gpu_gather_object
[rank1]:     torch.distributed.all_gather_object(output_objects, object)
[rank1]:   File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank1]:     return func(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/data/artyom_karpov/rl4steg/.venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2439, in all_gather_object
[rank1]:     input_tensor.resize_(max_object_size)
[rank1]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate more than 1EB memory.

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Dai0-2 commented 1 month ago

Did you find a solution? I have the same problem.

artkpv commented 1 month ago

> Did you find a solution? I have the same problem.

@Dai0-2 No, I didn't find a solution. I think I should have tried supplying tensors of the same size on every process: pad the queries on the left so they all end at the same position, then concatenate the responses. That might avoid the cross-process padding that triggers this exception. See the sketch below.
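
Something like the untested sketch below is what I had in mind: left-pad every query and right-pad every response to a fixed length before calling ppo_trainer.step(), so all ranks hand identically shaped tensors to pad_across_processes. The pad_to_fixed_length helper and the max_query_len value are hypothetical; I have not verified that this avoids the error.

    # Untested sketch of the workaround described above; the helper name and
    # max_query_len are hypothetical, not taken from my actual run.
    import torch

    def pad_to_fixed_length(tensors, length, pad_token_id, left=False):
        """Pad (or truncate) each 1-D tensor to exactly `length` tokens."""
        padded = []
        for t in tensors:
            t = t[-length:] if left else t[:length]  # truncate if too long
            pad = t.new_full((length - t.shape[-1],), pad_token_id)
            padded.append(torch.cat([pad, t]) if left else torch.cat([t, pad]))
        return padded

    max_query_len = 512  # hypothetical; anything >= the longest prompt
    question_tensors = pad_to_fixed_length(
        question_tensors, max_query_len, tokenizer.pad_token_id, left=True
    )
    response_tensors = pad_to_fixed_length(
        response_tensors, context.output_max_length, tokenizer.pad_token_id, left=False
    )
    stats = ppo_trainer.step(question_tensors, response_tensors, rewards)  # type: ignore

With every rank contributing the same shapes, the size gather inside pad_across_processes is at least consistent, even if the underlying desync has another cause.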