Closed: sam-hieken closed this issue 1 year ago
It looks like your model did not return any loss. You should use the forums to help debug your training code.
@sgugger Right, the loss is None; however, I ran the exact same model with no issues when using plain transformers without Accelerate. This definitely looks like an Accelerate-specific issue.
Here's my model (again, I used the exact same model with and without accelerate):
xl_conf = XLNetConfig(
    vocab_size=tokenizer.vocab_size,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    num_labels=2
)
model = XLNetForSequenceClassification(xl_conf)
... and here's the Trainer / TrainingArguments that were successful without accelerate:
training_args = TrainingArguments(
    evaluation_strategy="epoch",
    optim="adamw_torch",
    output_dir="./Classifier",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=1,
    save_steps=10_000,
    save_total_limit=1,  # How many "checkpoints" to save at a time
    gradient_accumulation_steps=1
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    # eval_dataset=valid_hg,
    tokenizer=tokenizer
)
We cannot reproduce the code you pasted (what is train_data?), so I don't see how you want us to help.
@sgugger I didn't include train_data before because it's just a Dataset in the typical text-classification format. But if it's necessary:
Dataset({
    features: ['text', 'input_ids', 'token_type_ids', 'attention_mask', 'label'],
    num_rows: 10000
})
And here's an example of a row:
{'input_ids': tensor([ 5, 5, 5, ..., 16, 4, 3]), 'token_type_ids': tensor([3, 3, 3, ..., 0, 0, 2]), 'attention_mask': tensor([0, 0, 0, ..., 1, 1, 1]), 'label': tensor(1)}
All text sequences were tokenized and padded with a max length of 1024 (tokenizer(line['text'], padding='max_length', max_length=1024)). Let me know if you need any other details.
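For reference, a minimal sketch of how a dataset like that can be produced with datasets.map; the raw data and the truncation=True flag are illustrative assumptions, only the tokenizer call mirrors the one above:

from datasets import Dataset

# Hypothetical raw data; only the 'text'/'label' column names match the issue.
raw = Dataset.from_dict({"text": ["an example sentence"], "label": [1]})

def tokenize(line):
    # Pad (and truncate) every sequence to a fixed length of 1024, as above.
    return tokenizer(line["text"], padding="max_length", truncation=True, max_length=1024)

train_data = raw.map(tokenize)
train_data.set_format("torch")  # yields tensor-valued rows like the example row shown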
So after testing the same model without Accelerate, but with a custom training loop, I still received None for outputs.loss. For anyone with the same problem, I believe I solved it with a custom loss function (I chose cross entropy):
loss_function = torch.nn.CrossEntropyLoss()
... and replacing the line loss = outputs.loss with a call to the newly defined loss function:
loss = loss_function(outputs.logits, batch['label'])
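For context, here is roughly where that replacement sits in a bare-bones loop; this is only a sketch, and dataloader, optimizer, and device are placeholders rather than code from the original post:

import torch

loss_function = torch.nn.CrossEntropyLoss()

model.train()
for batch in dataloader:
    outputs = model(
        input_ids=batch["input_ids"].to(device),
        token_type_ids=batch["token_type_ids"].to(device),
        attention_mask=batch["attention_mask"].to(device),
    )
    # No `labels` argument was passed to forward, so outputs.loss is None;
    # compute the loss explicitly from the logits instead.
    loss = loss_function(outputs.logits, batch["label"].to(device))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()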
I'm not too familiar with lower-level PyTorch stuff, so if anyone wants to comment on its efficacy I'd appreciate it. I'm not sure why outputs.loss is None here, but I'm going to close this issue since it seems to be related to transformers.
accelerate==0.24.0 or accelerate==0.28.0.
# reprod.py
import transformers
import accelerate
from accelerate import Accelerator, DistributedDataParallelKwargs
from accelerate.state import AcceleratorState
import torch


class DummyDataset(torch.utils.data.Dataset):
    def __init__(self, length, max_length):
        self.length = length
        self.max_length = max_length

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        return {
            "input_ids": torch.randint(0, 100, (self.max_length,)),
            "attention_mask": torch.ones(self.max_length),
            "labels": torch.randint(0, 100, (self.max_length,)),
        }


class DummyModel(transformers.GPT2LMHeadModel):
    def __init__(self, config):
        super().__init__(config)
        self.cached_inputs = None
        self.linear = torch.nn.Linear(10, 20)

    def forward(self, **kwargs):
        labels = kwargs.get("labels")
        del kwargs["labels"]
        with torch.cuda.amp.autocast(enabled=False):
            outputs = super().forward(**kwargs)
        if self.cached_inputs is None:  # some dummy computation
            self.cached_inputs = kwargs["input_ids"].detach().clone()
        else:
            # print(self.cached_inputs)
            pass
        print(outputs.loss)
        loss = None
        lm_logits = outputs.logits
        # labels = kwargs.get("labels")
        if labels is not None:
            # move labels to correct device to enable model parallelism
            labels = labels.to(lm_logits.device)
            # Shift so that tokens < n predict n
            shift_logits = lm_logits[..., :-1, :].contiguous()
            shift_labels = labels[..., 1:].contiguous()
            loss_fct = torch.nn.CrossEntropyLoss()
            # Flatten the tokens
            shift_logits = shift_logits.view(-1, self.config.vocab_size)
            shift_labels = shift_labels.view(-1)
            shift_labels = shift_labels.to(shift_logits.device)
            loss = loss_fct(shift_logits, shift_labels)
        outputs.loss = loss  # here it is still a tensor
        return outputs


def main():
    # model: transformers.GPT2LMHeadModel = transformers.GPT2LMHeadModel(transformers.GPT2Config())
    model = DummyModel(transformers.GPT2Config())
    # to bfloat16
    # for param in model.parameters():
    #     param.data = param.data.to(torch.bfloat16)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    batch_size = 16
    ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
    accelerator = Accelerator(gradient_accumulation_steps=32, kwargs_handlers=[ddp_kwargs])
    print(accelerator.state.distributed_type)
    dataset = DummyDataset(100, 30)
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size)
    history_losses = []
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
    model.train()
    for batch in dataloader:
        with accelerator.accumulate(model):
            outputs = model(**batch)
            loss = outputs.loss  # here it will become None!!!
            accelerator.backward(loss)
            with torch.no_grad():
                loss_mean = accelerator.gather(loss.unsqueeze(0)).mean()
                history_losses.append(loss_mean.item())
                if accelerator.is_local_main_process:
                    print(loss_mean.item())
                    print(loss.item())
            optimizer.step()
            optimizer.zero_grad()


if __name__ == "__main__":
    main()
Just run it with a simple DDP config:
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 6
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
with the command line:
accelerate launch reprod.py
Traceback (most recent call last):
File "/scratch/generalvision/xxx/ScanQA/reprod-deepspeed.py", line 115, in <module>
main()
File "/scratch/generalvision/xxx/ScanQA/reprod-deepspeed.py", line 91, in main
accelerator.backward(loss)
File "/scratch/xxx/.conda/envs/llama/lib/python3.10/site-packages/accelerate/accelerator.py", line 1981, in backward
loss = loss / self.gradient_accumulation_steps
TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'
The only code change that causes the problem is that the loss computation is moved out of the inner GPT2 model (or any other model with transformers-formatted output). Removing that computation and adding the labels back for the inner GPT2 forward works fine (see the sketch below), but a custom loss would inevitably require such computation without hacking the code in the transformers package.
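For comparison, a minimal sketch of the variant that works: keep the labels in kwargs and let the inner GPT2LMHeadModel compute the loss itself (the class name here is just illustrative).

import transformers


class DummyModelInnerLoss(transformers.GPT2LMHeadModel):
    def forward(self, **kwargs):
        # `labels` stays in kwargs, so GPT2LMHeadModel computes the loss itself
        # and sets it when constructing its output; that loss then survives
        # Accelerate's output conversion.
        return super().forward(**kwargs)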
The deeper root cause might be related to the accelerator.prepare_model method, which wraps the forward with a convert_outputs_to_fp32 method that somehow lets the loss become None.
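To see the wrapper in isolation, something like the following can be used; this is only a sketch, assuming convert_outputs_to_fp32 wraps a forward-like callable the way accelerator.prepare_model applies it, and the behavior in the comments is what I observe on the affected versions:

import torch
from accelerate.utils import convert_outputs_to_fp32
from transformers.modeling_outputs import CausalLMOutputWithCrossAttentions


def fake_forward():
    out = CausalLMOutputWithCrossAttentions(logits=torch.ones(2, 4))
    out.loss = torch.tensor(1.23)  # assigned after construction, as in DummyModel above
    return out


wrapped = convert_outputs_to_fp32(fake_forward)
print(fake_forward().loss)  # tensor(1.2300)
print(wrapped().loss)       # None on the affected versions: the loss gets dropped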
To dig deeper, I think this is due to the comprehension operation, triggered in convert_outputs_to_fp32, that re-initializes the transformers model output class from itself. A simple snippet (that looks like it should work) is buggy:
from transformers.modeling_outputs import CausalLMOutputWithCrossAttentions
c = CausalLMOutputWithCrossAttentions(loss=1, logits=2, past_key_values=3)
print(f"{c=}")
c.loss = None
print(f"{c=}")
d = CausalLMOutputWithCrossAttentions({k: v for k, v in c.items()})
print(f"{d=}")
and the result is
c=CausalLMOutputWithCrossAttentions(loss=1, logits=2, past_key_values=3, hidden_states=None, attentions=None, cross_attentions=None)
c=CausalLMOutputWithCrossAttentions(loss=None, logits=2, past_key_values=3, hidden_states=None, attentions=None, cross_attentions=None)
d=CausalLMOutputWithCrossAttentions(loss=1, logits=2, past_key_values=3, hidden_states=None, attentions=None, cross_attentions=None)
A correct way to do this is d = CausalLMOutputWithCrossAttentions({k: v for k, v in c.__dict__.items()}).
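A possible workaround on the modelling side (my own assumption, not an official fix) is to build a fresh output object with the loss passed at construction time instead of assigning outputs.loss afterwards. The helper below is hypothetical and would replace the last two lines of DummyModel.forward:

from transformers.modeling_outputs import CausalLMOutputWithCrossAttentions


def build_output(outputs, loss):
    # Passing loss at construction time makes it a real item of the ModelOutput,
    # so the items()-based reconstruction described above keeps it.
    return CausalLMOutputWithCrossAttentions(
        loss=loss,
        logits=outputs.logits,
        past_key_values=outputs.past_key_values,
        hidden_states=outputs.hidden_states,
        attentions=outputs.attentions,
        cross_attentions=outputs.cross_attentions,
    )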
Same problem when running the official example in accelerate/examples/inference/llama.py.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████| 2/2 [00:10<00:00, 5.44s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████| 2/2 [00:11<00:00, 5.72s/it]
Traceback (most recent call last):
File "/data/train/accelerate/examples/inference/llama.py", line 39, in <module>
model = prepare_pippy(model, split_points="auto", example_args=inputs)
File "/home/jeeves/.conda/envs/llama/lib/python3.10/site-packages/accelerate/inference.py", line 159, in prepare_pippy
stage = build_pipeline(model, split_points, example_args, example_kwargs, num_chunks)
File "/home/jeeves/.conda/envs/llama/lib/python3.10/site-packages/accelerate/inference.py", line 78, in build_pipeline
args = pad_input_tensors(args, found_batch_size, num_chunks)
File "/home/jeeves/.conda/envs/llama/lib/python3.10/site-packages/accelerate/utils/operations.py", line 714, in pad_input_tensors
return recursively_apply(
File "/home/jeeves/.conda/envs/llama/lib/python3.10/site-packages/accelerate/utils/operations.py", line 127, in recursively_apply
{
File "/home/jeeves/.conda/envs/llama/lib/python3.10/site-packages/accelerate/utils/operations.py", line 128, in <dictcomp>
k: recursively_apply(
File "/home/jeeves/.conda/envs/llama/lib/python3.10/site-packages/accelerate/utils/operations.py", line 135, in recursively_apply
return func(data, *args, **kwargs)
File "/home/jeeves/.conda/envs/llama/lib/python3.10/site-packages/accelerate/utils/operations.py", line 696, in _pad_input_tensors
remainder = batch_size // num_processes
TypeError: unsupported operand type(s) for //: 'NoneType' and 'int'
System Info
Information
Tasks
no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
Reproduction
Hello,
I've been receiving the following error with Accelerate after using the above configuration:
Previously, I had run the same script on the exact same configuration except multi-GPU, and received the same error, but with dict and int instead of NoneType and int. My code is as follows:
I also tried setting ACCELERATE_GRADIENT_ACCUMULATION_STEPS as an environment variable, but that didn't affect anything. Please note that originally, I didn't use gradient accumulation with Accelerate, and still received the same error.
Expected behavior
N/A