huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

TypeError: unsupported operand type(s) for /: 'NoneType' and 'int' #1366

sam-hieken closed this issue 1 year ago

sam-hieken commented 1 year ago

System Info

- `Accelerate` version: 0.18.0
- Platform: Linux-3.10.0-1160.76.1.el7.x86_64-x86_64-with-glibc2.17
- Python version: 3.9.12
- Numpy version: 1.22.4
- PyTorch version (GPU?): 2.0.0+cu117 (False)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: MULTI_CPU
        - mixed_precision: fp16
        - use_cpu: True
        - num_processes: 24
        - machine_rank: 0
        - num_machines: 1
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Reproduction

Hello,

I've been receiving the following error with Accelerate after using the above configuration:

/home/hiekense/.local/lib/python3.9/site-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
  0%|                                                                                         | 0/10000 [00:00<?, ?it/s]Starting training...
Traceback (most recent call last):
  File "/home/hiekense/model/train-accel.py", line 87, in <module>
    accelerator.backward(loss)
  File "/home/hiekense/.local/lib/python3.9/site-packages/accelerate/accelerator.py", line 1675, in backward
    loss = loss / self.gradient_accumulation_steps
TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'
  0%|                                                                                         | 0/10000 [00:16<?, ?it/s]Traceback (most recent call last):
  File "/home/hiekense/.local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/hiekense/.local/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/hiekense/.local/lib/python3.9/site-packages/accelerate/commands/launch.py", line 923, in launch_command
    simple_launcher(args)
  File "/home/hiekense/.local/lib/python3.9/site-packages/accelerate/commands/launch.py", line 579, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)

Previously, I had run the same script with the exact same configuration except multi-GPU, and received the same error, but with dict and int instead of NoneType and int.

My code is as follows:

from accelerate import Accelerator
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import AdamW, get_scheduler

# `model` and `train_data` are defined elsewhere (see the follow-up comments below)
optimizer = AdamW(model.parameters(), lr=5e-5)

train_dl = DataLoader(
        train_data, shuffle=True, batch_size=1
)

epochs = 1
training_steps = epochs * len(train_dl)
scheduler = get_scheduler(
        "linear",
        optimizer=optimizer,
        num_warmup_steps=0,
        num_training_steps=training_steps
)

progress_bar = tqdm(range(training_steps))

accelerator = Accelerator(gradient_accumulation_steps=2)

model = accelerator.prepare(model)
optimizer, train_dl, scheduler = accelerator.prepare(
        optimizer, train_dl, scheduler
)

print("Starting training...")
model.train()
for epoch in range(epochs):
        for batch in train_dl:
                with accelerator.accumulate(model):
                        # Run a batch through the model
                        outputs = model(**batch)
                        loss = outputs.loss
                        accelerator.backward(loss)

                        optimizer.step()
                        scheduler.step()
                        optimizer.zero_grad()
                        progress_bar.update(1)

I also tried setting ACCELERATE_GRADIENT_ACCUMULATION_STEPS as an environment variable, but that didn't affect anything.

Please note that originally, I didn't use gradient accumulation with accelerate, and still received the same error.

Expected behavior

N/A

sgugger commented 1 year ago

It looks like your model did not return any loss. You should use the forums to help debug your training code.
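
For context, a minimal illustration (not from the thread) of what "did not return any loss" means: transformers models typically only compute outputs.loss when labels is passed to forward; otherwise outputs.loss stays None. The tiny XLNet config below is an assumption chosen purely to keep the sketch fast:

import torch
from transformers import XLNetConfig, XLNetForSequenceClassification

# Deliberately small config so the sketch runs quickly
config = XLNetConfig(vocab_size=100, d_model=32, n_layer=2, n_head=2, d_inner=64, num_labels=2)
model = XLNetForSequenceClassification(config)
input_ids = torch.randint(0, config.vocab_size, (1, 8))

out = model(input_ids=input_ids)                             # no labels passed
print(out.loss)                                              # None

out = model(input_ids=input_ids, labels=torch.tensor([1]))   # labels passed
print(out.loss)                                              # scalar loss tensor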

sam-hieken commented 1 year ago

It looks like your model did not return any loss. You should use the forums to help debug your training code.

@sgugger Right, the loss is None; however, I ran the exact same model with no issues using plain transformers without Accelerate. This definitely looks like an Accelerate-specific issue.

Here's my model (again, I used the exact same model with and without accelerate):

from transformers import XLNetConfig, XLNetForSequenceClassification

# `tokenizer` is defined elsewhere in the script
xl_conf = XLNetConfig(
        vocab_size=tokenizer.vocab_size,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        num_labels=2
)

model = XLNetForSequenceClassification(xl_conf)

... and here's the Trainer / TrainingArguments that were successful without accelerate:

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
        evaluation_strategy="epoch",
        optim="adamw_torch",
        output_dir="./Classifier",
        overwrite_output_dir=True,
        num_train_epochs=1,
        per_device_train_batch_size=1,
        save_steps=10_000,
        save_total_limit=1, # How many "checkpoints" to save at a time
        gradient_accumulation_steps=1
)

trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_data,
#       eval_dataset=valid_hg,
        tokenizer=tokenizer
)
sgugger commented 1 year ago

We cannot reproduce the code you pasted (what is train_data?), so I don't see how you want us to help.

sam-hieken commented 1 year ago

@sgugger I didn't include train_data before because it's just a Dataset in the typical text-classification format, but if it's necessary:

Dataset({
    features: ['text', 'input_ids', 'token_type_ids', 'attention_mask', 'label'],
    num_rows: 10000
})

And here's an example of a row:

{'input_ids': tensor([ 5,  5,  5,  ..., 16,  4,  3]), 'token_type_ids': tensor([3, 3, 3,  ..., 0, 0, 2]), 'attention_mask': tensor([0, 0, 0,  ..., 1, 1, 1]), 'label': tensor(1)}

All text sequences were tokenized and padded to a max length of 1024 (tokenizer(line['text'], padding='max_length', max_length=1024)). Let me know if you need any other details.
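
For illustration only, a hedged sketch of one way such a dataset could be built; the XLNetTokenizerFast and the "xlnet-base-cased" checkpoint are assumptions, not taken from the thread:

from datasets import Dataset
from transformers import XLNetTokenizerFast

tokenizer = XLNetTokenizerFast.from_pretrained("xlnet-base-cased")

raw = Dataset.from_dict({"text": ["an example sentence"] * 4, "label": [1, 0, 1, 0]})

def tokenize(line):
    # XLNet pads on the left by default, matching the example row above;
    # truncation is added here for safety and was not in the original call
    return tokenizer(line["text"], padding="max_length", max_length=1024, truncation=True)

train_data = raw.map(tokenize)
train_data.set_format("torch", columns=["input_ids", "token_type_ids", "attention_mask", "label"])
print(train_data[0])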

sam-hieken commented 1 year ago

So after testing the same model without Accelerate but with a custom training loop, I still received None for outputs.loss. For anyone with the same problem, I believe I solved it by defining a custom loss function (I chose cross entropy):

loss_function = torch.nn.CrossEntropyLoss()

... and replacing the line

loss = outputs.loss

with a call to the newly defined loss function

loss = loss_function(outputs.logits, batch['label'])

I'm not too familiar with lower level PyTorch stuff, so if anyone wants to comment on its efficacy I'd appreciate it. I'm not sure why outputs.loss is None here, but I'm going to close this issue since it seems to be related to transformers.
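
For completeness, a hedged sketch of how that workaround slots into the training loop from the original report; model, train_dl, optimizer, scheduler, accelerator, epochs, and progress_bar are assumed to be set up exactly as shown earlier:

import torch

loss_function = torch.nn.CrossEntropyLoss()

model.train()
for epoch in range(epochs):
    for batch in train_dl:
        with accelerator.accumulate(model):
            outputs = model(**batch)
            # Compute the loss from the logits instead of relying on outputs.loss;
            # batch["label"] is expected to hold integer class indices
            loss = loss_function(outputs.logits, batch["label"])
            accelerator.backward(loss)

            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
            progress_bar.update(1)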

matthewdm0816 commented 4 months ago

After some frustration, I put together a reproducer, validated on both accelerate==0.24.0 and accelerate==0.28.0.

The reproducer

# reprod.py
import transformers
import accelerate
from accelerate import Accelerator, DistributedDataParallelKwargs
from accelerate.state import AcceleratorState
import torch

class DummyDataset(torch.utils.data.Dataset):
    def __init__(self, length, max_length):
        self.length = length
        self.max_length = max_length
    def __len__(self):
        return self.length
    def __getitem__(self, idx):
        return {
            "input_ids": torch.randint(0, 100, (self.max_length,)),
            "attention_mask": torch.ones(self.max_length),
            "labels": torch.randint(0, 100, (self.max_length,)),
        }

class DummyModel(transformers.GPT2LMHeadModel):
    def __init__(self, config):
        super().__init__(config)
        self.cached_inputs = None
        self.linear = torch.nn.Linear(10, 20)

    def forward(self, **kwargs):
        labels = kwargs.get("labels")
        del kwargs["labels"]
        with torch.cuda.amp.autocast(enabled=False):
            outputs = super().forward(**kwargs)

        if self.cached_inputs is None: # some dummy computation
            self.cached_inputs = kwargs["input_ids"].detach().clone()
        else:
            # print(self.cached_inputs)
            pass

        print(outputs.loss)
        loss = None
        lm_logits = outputs.logits
        # labels = kwargs.get("labels")
        if labels is not None:
            # move labels to correct device to enable model parallelism
            labels = labels.to(lm_logits.device)
            # Shift so that tokens < n predict n
            shift_logits = lm_logits[..., :-1, :].contiguous()
            shift_labels = labels[..., 1:].contiguous()
            loss_fct = torch.nn.CrossEntropyLoss()
            shift_logits = shift_logits.view(-1, self.config.vocab_size)
            shift_labels = shift_labels.view(-1)
            shift_labels = shift_labels.to(shift_logits.device)
            # Flatten the tokens
            loss = loss_fct(shift_logits, shift_labels)
        outputs.loss = loss # here it is still a tensor

        return outputs

def main():
    # model: transformers.GPT2LMHeadModel = transformers.GPT2LMHeadModel(transformers.GPT2Config())
    model = DummyModel(transformers.GPT2Config())
    # to bfloat16
    # for param in model.parameters():
        # param.data = param.data.to(torch.bfloat16)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    batch_size = 16

    ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
    accelerator = Accelerator(gradient_accumulation_steps=32, kwargs_handlers=[ddp_kwargs])

    print(accelerator.state.distributed_type)

    dataset = DummyDataset(100, 30)
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size)
    history_losses = []

    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    model.train()
    for batch in dataloader:
        with accelerator.accumulate(model):
            outputs = model(**batch)
            loss = outputs.loss # here it will become None!!!
            accelerator.backward(loss)

            with torch.no_grad():
                loss_mean = accelerator.gather(loss.unsqueeze(0)).mean()
                history_losses.append(loss_mean.item())
                if accelerator.is_local_main_process:
                    print(loss_mean.item())

            print(loss.item())
            optimizer.step()
            optimizer.zero_grad()

if __name__ == "__main__":
    main()

Just run it with a simple DDP config:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 6
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

and launch it with accelerate launch reprod.py.

The error

Traceback (most recent call last):
  File "/scratch/generalvision/xxx/ScanQA/reprod-deepspeed.py", line 115, in <module>
    main()
  File "/scratch/generalvision/xxx/ScanQA/reprod-deepspeed.py", line 91, in main
    accelerator.backward(loss)
  File "/scratch/xxx/.conda/envs/llama/lib/python3.10/site-packages/accelerate/accelerator.py", line 1981, in backward
    loss = loss / self.gradient_accumulation_steps
TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'

The (possible) reason

The only code change that causes the problem is that the loss computation is moved out of the inner GPT2 model (or any other model with a transformers-formatted output).
Removing that computation and passing the labels back to the inner GPT2 forward works fine, but a custom loss inevitably requires such computation unless you hack the code inside the transformers package.

The deeper root cause might be the accelerator.prepare_model method, which wraps the model's forward with convert_outputs_to_fp32; that wrapper somehow causes the loss to become None.

matthewdm0816 commented 4 months ago

To dig deeper, I think this is due to the dict comprehension that re-initializes the transformers model output class from itself, which is triggered inside convert_outputs_to_fp32. A simple snippet that looks like it should work is actually buggy:

from transformers.modeling_outputs import CausalLMOutputWithCrossAttentions
c = CausalLMOutputWithCrossAttentions(loss=1, logits=2, past_key_values=3)
print(f"{c=}")
c.loss = None
print(f"{c=}")
d = CausalLMOutputWithCrossAttentions({k: v for k, v in c.items()})
print(f"{d=}")

and the result is

c=CausalLMOutputWithCrossAttentions(loss=1, logits=2, past_key_values=3, hidden_states=None, attentions=None, cross_attentions=None)
c=CausalLMOutputWithCrossAttentions(loss=None, logits=2, past_key_values=3, hidden_states=None, attentions=None, cross_attentions=None)
d=CausalLMOutputWithCrossAttentions(loss=1, logits=2, past_key_values=3, hidden_states=None, attentions=None, cross_attentions=None)

A correct way to do this would be d = CausalLMOutputWithCrossAttentions({k: v for k, v in c.__dict__.items()})
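
A possible user-side workaround, sketched here as an assumption rather than a confirmed fix: in a custom forward like the reproducer's, build the output with the loss already set instead of assigning outputs.loss after construction. A field that is non-None at construction time ends up in the ModelOutput's internal dict, so it survives the items()-based reconstruction performed by the fp32-conversion wrapper. build_output below is a hypothetical helper:

from transformers.modeling_outputs import CausalLMOutputWithCrossAttentions

def build_output(outputs, loss):
    # Rebuild the output so `loss` is present at construction time and therefore
    # stored in the underlying dict that the wrapper iterates over
    return CausalLMOutputWithCrossAttentions(
        loss=loss,
        logits=outputs.logits,
        past_key_values=outputs.past_key_values,
        hidden_states=outputs.hidden_states,
        attentions=outputs.attentions,
        cross_attentions=outputs.cross_attentions,
    )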

jiangzizi commented 4 months ago

Same problem when running the official example in accelerate/examples/inference/llama.py.

Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████| 2/2 [00:10<00:00,  5.44s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████| 2/2 [00:11<00:00,  5.72s/it]
Traceback (most recent call last):
  File "/data/train/accelerate/examples/inference/llama.py", line 39, in <module>
    model = prepare_pippy(model, split_points="auto", example_args=inputs)
  File "/home/jeeves/.conda/envs/llama/lib/python3.10/site-packages/accelerate/inference.py", line 159, in prepare_pippy
    stage = build_pipeline(model, split_points, example_args, example_kwargs, num_chunks)
  File "/home/jeeves/.conda/envs/llama/lib/python3.10/site-packages/accelerate/inference.py", line 78, in build_pipeline
    args = pad_input_tensors(args, found_batch_size, num_chunks)
  File "/home/jeeves/.conda/envs/llama/lib/python3.10/site-packages/accelerate/utils/operations.py", line 714, in pad_input_tensors
    return recursively_apply(
  File "/home/jeeves/.conda/envs/llama/lib/python3.10/site-packages/accelerate/utils/operations.py", line 127, in recursively_apply
    {
  File "/home/jeeves/.conda/envs/llama/lib/python3.10/site-packages/accelerate/utils/operations.py", line 128, in <dictcomp>
    k: recursively_apply(
  File "/home/jeeves/.conda/envs/llama/lib/python3.10/site-packages/accelerate/utils/operations.py", line 135, in recursively_apply
    return func(data, *args, **kwargs)
  File "/home/jeeves/.conda/envs/llama/lib/python3.10/site-packages/accelerate/utils/operations.py", line 696, in _pad_input_tensors
    remainder = batch_size // num_processes
TypeError: unsupported operand type(s) for //: 'NoneType' and 'int'