Open · erickrf opened this issue 9 months ago
Hi @erickrf, thanks for raising this issue!
Could you provide some more information about the crashing behaviour? Specifically, are you seeing any error messages, or is the process just killed?
Could you provide a minimal code snippet we can run to reproduce the error, e.g. with a sample of data from a public dataset being passed to the model?
Sure! I basically get the error mentioned above.
This snippet can replicate the problem (it's rather long, but it comes from the tutorial on object detection):
from transformers import DetrImageProcessor, DetrForObjectDetection, TrainingArguments, Trainer
from datasets import load_dataset
import numpy as np

cppe5 = load_dataset("cppe-5")
categories = cppe5['train'].features['objects'].feature['category'].names
id2label = {index: x for index, x in enumerate(categories, start=0)}
label2id = {v: k for k, v in id2label.items()}

model_name = "facebook/detr-resnet-50"
image_processor = DetrImageProcessor.from_pretrained(model_name)
detr = DetrForObjectDetection.from_pretrained(
    model_name,
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True,
    num_queries=5,
)

def formatted_anns(image_id, category, area, bbox):
    # Build COCO-style annotation dicts for a single image
    annotations = []
    for i in range(0, len(category)):
        new_ann = {
            "image_id": image_id,
            "category_id": category[i],
            "isCrowd": 0,
            "area": area[i],
            "bbox": list(bbox[i]),
        }
        annotations.append(new_ann)
    return annotations

def transform_aug_ann(examples):
    image_ids = examples["image_id"]
    images, bboxes, area, categories = [], [], [], []
    for image, objects in zip(examples["image"], examples["objects"]):
        image = np.array(image.convert("RGB"))[:, :, ::-1]  # RGB -> BGR
        area.append(objects["area"])
        images.append(image)
        bboxes.append(objects["bbox"])
        categories.append(objects["category"])

    targets = [
        {"image_id": id_, "annotations": formatted_anns(id_, cat_, ar_, box_)}
        for id_, cat_, ar_, box_ in zip(image_ids, categories, area, bboxes)
    ]
    return image_processor(images=images, annotations=targets, return_tensors="pt")

def collate_fn(batch):
    # Pad images in the batch to a common size and stack them
    pixel_values = [item["pixel_values"] for item in batch]
    encoding = image_processor.pad(pixel_values, return_tensors="pt")
    labels = [item["labels"] for item in batch]
    batch = {}
    batch["pixel_values"] = encoding["pixel_values"]
    batch["pixel_mask"] = encoding["pixel_mask"]
    batch["labels"] = labels
    return batch

cppe5["train"] = cppe5["train"].with_transform(transform_aug_ann)

training_args = TrainingArguments(
    output_dir="model/tests",
    per_device_train_batch_size=4,
    num_train_epochs=10,
    fp16=False,
    save_steps=200,
    logging_steps=200,
    learning_rate=1e-5,
    weight_decay=1e-4,
    save_total_limit=1,
    remove_unused_columns=False,
)
trainer = Trainer(
    model=detr,
    args=training_args,
    data_collator=collate_fn,
    train_dataset=cppe5["train"],
    tokenizer=image_processor,
)
trainer.train()
I have encountered this problem as well. When trying to change the num_queries parameter, it sometimes gives NaNs, and even when it runs it is unable to train. To try it out and test everything before running it on the whole dataset, I tried to overfit on a single image (just giving it the same image and targets on each step), but it couldn't do it in 5000 steps. num_queries=100 worked like a charm, both when starting from the pretrained weights and when training from scratch (again overfitting on a single image).
Also, I found out that using a smaller learning rate fixed the NaN issue.
I have looked a bit more attentively into the original DETR paper, and it says (Section 3.1):
DETR infers a fixed-size set of N predictions, in a single pass through the decoder, where N is set to be significantly larger than the typical number of objects in an image.
I couldn't find any analysis of the impact of this number N, but now I see that lowering it so much is expected to hurt the model.
Still, I would rather expect bad performance than outright NaN values.
I've looked into this quite deeply, training with different num_queries values from scratch, from the finetuned version, etc., and found that copying the pretrained query weights is useful when the model is initialized with <100 queries. So, for example, if it is initialized with num_queries=50, copying the first 50 queries helps with training and doesn't produce NaNs.
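A minimal sketch of the idea, assuming the standard facebook/detr-resnet-50 checkpoint (the attribute path follows modeling_detr.py; treat this as illustrative rather than a final implementation):

import torch
from transformers import DetrForObjectDetection

num_queries = 50
checkpoint = "facebook/detr-resnet-50"

# Reference model with the original 100 queries
pretrained = DetrForObjectDetection.from_pretrained(checkpoint)

# Smaller model; query_position_embeddings is newly initialized
# because its shape changed
model = DetrForObjectDetection.from_pretrained(
    checkpoint,
    num_queries=num_queries,
    ignore_mismatched_sizes=True,
)

with torch.no_grad():
    # Copy the first `num_queries` pretrained query embeddings into the new model
    model.model.query_position_embeddings.weight.copy_(
        pretrained.model.query_position_embeddings.weight[:num_queries]
    )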
@amyeroberts I can submit a PR for this change if possible (when the model is initialized with <100 queries, copy the first n query weights). It greatly speeds up training from what I have tried.
Hi @Isalia20, thanks for digging into the behaviour of num_queries and training!
I don't think this is something we want to add on the transformers side. The reason is that it breaks the convention of how weights are normally loaded for our models: a change in a config value which causes a change in shape results in a new weight being initialized. Changing this would change assumptions about the model loading behaviour in the library.
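To make the convention concrete, a sketch using the checkpoint from the snippet above:

from transformers import DetrForObjectDetection

# Changing num_queries changes the shape of query_position_embeddings, so the
# library initializes that weight from scratch instead of partially copying it,
# and warns that it was newly initialized because the shapes did not match.
model = DetrForObjectDetection.from_pretrained(
    "facebook/detr-resnet-50",
    num_queries=50,
    ignore_mismatched_sizes=True,
)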
It sounds very useful, however, so please feel free to share the code or a link to an example here for the community.
I'm currently finetuning on the SKU110K dataset with num_queries=400. Once training is finished, I'll upload the model and code to HF/GitHub. Should I share the instructions here, or is there somewhere better to share them?
@Isalia20 Wherever you think is best. I'd suggest sharing here, or linking to a relevant blog / repo with example code. Another great place would be on the forums.
I've released the model here: https://huggingface.co/isalia99/detr-resnet-50-sku110k and the code is here: https://github.com/Isalia20/DETR-finetune
Hello @Isalia20, @amyeroberts,
I am facing a similar issue, but the main difference is that the output of the model is not NaN; it just does not respect the x1, y1, x2, y2 format.
Let me add a link to a similar issue found by another user in the Hugging Face discussions here.
Is the same solution suitable for resolving this issue?
I am trying to increase the learning rate to accelerate training. I have the following specifics:
In your opinion:
AFAIK, the model requires x_center, y_center, width, height (in coordinates relative to the image) to train.
@Isalia20 But the error mentioned in this issue is mainly due to bboxes1 (the output of the model) and not to bboxes2 (the target bboxes).
[Not related to this issue] In this case, is the notebook by Niels found here missing a step to convert the input from x1, y1, w, h to x_center, y_center, width, height?
Nevermind, it's actually x1, y1, w, h in relative coords, and that notebook does have it correctly. My best advice would be to train with the pretrained num_queries=100 and a small learning rate (1e-5 for the head, with the backbone frozen). In that case, NaN issues didn't occur for me. If they still occur, maybe sharing your code will help us debug it (if possible).
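A minimal sketch of that setup, assuming the standard facebook/detr-resnet-50 checkpoint and a plain AdamW optimizer (adapt to your own training loop):

import torch
from transformers import DetrForObjectDetection

model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")  # keeps num_queries=100

# Freeze the convolutional backbone so only the transformer and heads train
for param in model.model.backbone.parameters():
    param.requires_grad = False

# Small learning rate for the remaining (trainable) parameters
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5
)
# With Trainer, the same optimizer can be passed via optimizers=(optimizer, None).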
@Isalia20
I am facing the exact same issue mentioned here. You can find below the error I have been having. After further investigation, a high learning rate can reveal this type of error. I will stick, for the time being, with a learning rate of 1e-4, without a lot of warmup.
On the other hand, regarding the box input/output format, it is worth noting that the input of the model is also in cx, cy, w, h format. In the HF notebook, the conversion is done during the transform, in this piece of code:
transform = albumentations.Compose(
    [
        albumentations.Resize(480, 480),
        albumentations.HorizontalFlip(p=1.0),
        albumentations.RandomBrightnessContrast(p=1.0),
    ],
    bbox_params=albumentations.BboxParams(format="coco", label_fields=["category"]),  # here the bboxes are converted from x, y, w, h to cx, cy, w, h
)
ValueError: boxes1 must be in [x0, y0, x1, y1] (corner) format, but got tensor([[0.0067, 0.1018, 0.1296, 0.8076],
[0.3481, 0.0247, 0.7026, 0.2710],
[0.0161, 0.2329, 0.3252, 0.9087],
...,
[0.2112, 0.0206, 0.9541, 0.1913],
[0.3584, 0.0234, 0.9580, **1.0029**],
[0.3655, 0.0252, 0.8555, 0.2568]], device='cuda:0',
dtype=torch.float16)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File <command--1>:8
5 del sys
7 with open(filename, "rb") as f:
----> 8 exec(compile(f.read(), filename, 'exec'))
417 if (
418 active_session_failed
419 or autologging_is_disabled(autologging_integration)
(...)
426 # warning behavior during original function execution, since autologging is being
427 # skipped
428 with set_non_mlflow_warnings_behavior_for_current_thread(
429 disable_warnings=False,
430 reroute_warnings=False,
431 ):
--> 432 return original(*args, **kwargs)
434 # Whether or not the original / underlying function has been called during the
435 # execution of patched code
436 original_has_been_called = False
File /databricks/python/lib/python3.9/site-packages/transformers/trainer.py:1555, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1553 hf_hub_utils.enable_progress_bars()
1554 else:
-> 1555 return inner_training_loop(
1556 args=args,
1557 resume_from_checkpoint=resume_from_checkpoint,
1558 trial=trial,
1559 ignore_keys_for_eval=ignore_keys_for_eval,
1560 )
File /databricks/python/lib/python3.9/site-packages/transformers/trainer.py:1860, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
1857 self.control = self.callback_handler.on_step_begin(args, self.state, self.control)
1859 with self.accelerator.accumulate(model):
-> 1860 tr_loss_step = self.training_step(model, inputs)
1862 if (
1863 args.logging_nan_inf_filter
1864 and not is_torch_tpu_available()
1865 and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
1866 ):
1867 # if loss is nan or inf simply add the average of previous logged losses
1868 tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)
File /databricks/python/lib/python3.9/site-packages/transformers/trainer.py:2725, in Trainer.training_step(self, model, inputs)
2722 return loss_mb.reduce_mean().detach().to(self.args.device)
2724 with self.compute_loss_context_manager():
-> 2725 loss = self.compute_loss(model, inputs)
2727 if self.args.n_gpu > 1:
2728 loss = loss.mean() # mean() to average on multi-gpu parallel training
File /databricks/python/lib/python3.9/site-packages/transformers/trainer.py:2748, in Trainer.compute_loss(self, model, inputs, return_outputs)
2746 else:
2747 labels = None
-> 2748 outputs = model(**inputs)
2749 # Save past state if it exists
2750 # TODO: this needs to be fixed and made cleaner later.
2751 if self.args.past_index >= 0:
File /databricks/python/lib/python3.9/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
1190 # If we don't have any hooks, we want to skip the rest of the logic in
1191 # this function, and just call forward.
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
File /databricks/python/lib/python3.9/site-packages/accelerate/utils/operations.py:687, in convert_outputs_to_fp32.<locals>.forward(*args, **kwargs)
686 def forward(*args, **kwargs):
--> 687 return model_forward(*args, **kwargs)
File /databricks/python/lib/python3.9/site-packages/accelerate/utils/operations.py:675, in ConvertOutputsToFp32.__call__(self, *args, **kwargs)
674 def __call__(self, *args, **kwargs):
--> 675 return convert_to_fp32(self.model_forward(*args, **kwargs))
File /databricks/python/lib/python3.9/site-packages/torch/amp/autocast_mode.py:14, in autocast_decorator.<locals>.decorate_autocast(*args, **kwargs)
11 @functools.wraps(func)
12 def decorate_autocast(*args, **kwargs):
13 with autocast_instance:
---> 14 return func(*args, **kwargs)
....
47 def forward(
48 self,
49 pixel_values: torch.FloatTensor,
(...)
59 format_labels_val=None,
60 ):
---> 62 output = super().forward(
63 pixel_values=pixel_values,
64 pixel_mask=pixel_mask,
65 decoder_attention_mask=decoder_attention_mask,
66 encoder_outputs=encoder_outputs,
67 inputs_embeds=inputs_embeds,
68 decoder_inputs_embeds=decoder_inputs_embeds,
69 labels=labels,
70 output_attentions=output_attentions,
71 output_hidden_states=output_hidden_states,
72 return_dict=return_dict,
73 )
75 return CustomDetrObjectDetectionOutput(
76 **output.__dict__, format_labels_val=format_labels_val
77 )
File /databricks/python/lib/python3.9/site-packages/transformers/models/detr/modeling_detr.py:1603, in DetrForObjectDetection.forward(self, pixel_values, pixel_mask, decoder_attention_mask, encoder_outputs, inputs_embeds, decoder_inputs_embeds, labels, output_attentions, output_hidden_states, return_dict)
1600 auxiliary_outputs = self._set_aux_loss(outputs_class, outputs_coord)
1601 outputs_loss["auxiliary_outputs"] = auxiliary_outputs
-> 1603 loss_dict = criterion(outputs_loss, labels)
1604 # Fourth: compute total loss, as a weighted sum of the various losses
1605 weight_dict = {"loss_ce": 1, "loss_bbox": self.config.bbox_loss_coefficient}
File /databricks/python/lib/python3.9/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
1190 # If we don't have any hooks, we want to skip the rest of the logic in
1191 # this function, and just call forward.
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
File /databricks/python/lib/python3.9/site-packages/transformers/models/detr/modeling_detr.py:2202, in DetrLoss.forward(self, outputs, targets)
2199 outputs_without_aux = {k: v for k, v in outputs.items() if k != "auxiliary_outputs"}
2201 # Retrieve the matching between the outputs of the last layer and the targets
-> 2202 indices = self.matcher(outputs_without_aux, targets)
2204 # Compute the average number of target boxes across all nodes, for normalization purposes
2205 num_boxes = sum(len(t["class_labels"]) for t in targets)
File /databricks/python/lib/python3.9/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
1190 # If we don't have any hooks, we want to skip the rest of the logic in
1191 # this function, and just call forward.
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
File /databricks/python/lib/python3.9/site-packages/torch/autograd/grad_mode.py:27, in _DecoratorContextManager.__call__.<locals>.decorate_context(*args, **kwargs)
24 @functools.wraps(func)
25 def decorate_context(*args, **kwargs):
26 with self.clone():
---> 27 return func(*args, **kwargs)
File /databricks/python/lib/python3.9/site-packages/transformers/models/detr/modeling_detr.py:2323, in DetrHungarianMatcher.forward(self, outputs, targets)
2320 bbox_cost = torch.cdist(out_bbox, target_bbox, p=1)
2322 # Compute the giou cost between boxes
-> 2323 giou_cost = -generalized_box_iou(center_to_corners_format(out_bbox), center_to_corners_format(target_bbox))
2325 # Final cost matrix
2326 cost_matrix = self.bbox_cost * bbox_cost + self.class_cost * class_cost + self.giou_cost * giou_cost
File /databricks/python/lib/python3.9/site-packages/transformers/models/detr/modeling_detr.py:2388, in generalized_box_iou(boxes1, boxes2)
2385 # degenerate boxes gives inf / nan results
2386 # so do an early check
2387 if not (boxes1[:, 2:] >= boxes1[:, :2]).all():
-> 2388 raise ValueError(f"boxes1 must be in [x0, y0, x1, y1] (corner) format, but got {boxes1}")
2389 if not (boxes2[:, 2:] >= boxes2[:, :2]).all():
2390 raise ValueError(f"boxes2 must be in [x0, y0, x1, y1] (corner) format, but got {boxes2}")
ValueError: boxes1 must be in [x0, y0, x1, y1] (corner) format, but got tensor([[0.0067, 0.1018, 0.1296, 0.8076],
[0.3481, 0.0247, 0.7026, 0.2710],
[0.0161, 0.2329, 0.3252, 0.9087],
...,
[0.2112, 0.0206, 0.9541, 0.1913],
[0.3584, 0.0234, 0.9580, **1.0029**],
[0.3655, 0.0252, 0.8555, 0.2568]], device='cuda:0',
dtype=torch.float16)
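For context on why this guard trips: the Hungarian matcher converts the predicted (cx, cy, w, h) boxes to corner format before computing the GIoU cost, and the check only fails when a converted box ends up with x1 < x0 or y1 < y0, i.e. a negative width/height or a NaN (which fp16 with a high learning rate can produce). Note that a value slightly above 1.0, like the highlighted 1.0029, does not by itself violate this particular check, so the offending rows are presumably in the elided part of the tensor or contain NaNs. A rough equivalent of the conversion and the check (an illustration, not the library source):

import torch

def center_to_corners(boxes: torch.Tensor) -> torch.Tensor:
    # (cx, cy, w, h) -> (x0, y0, x1, y1)
    cx, cy, w, h = boxes.unbind(-1)
    return torch.stack(
        [cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h], dim=-1
    )

corners = center_to_corners(torch.tensor([[0.5, 0.5, 0.2, float("nan")]]))
# A NaN (or a negative width/height) makes the corner-format check fail,
# which raises the ValueError above:
print((corners[:, 2:] >= corners[:, :2]).all())  # tensor(False)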
System Info

transformers version: 4.36.2

Who can help?

@amyeroberts

Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction

Changing the num_queries hyperparameter […] I got the following error. The same code works fine without changing the default num_queries.

Expected behavior

I would expect the model to run as normal. I am fine-tuning the model on a custom dataset which should not have more than a couple of objects per image, and expected the number of queries to have no impact other than limiting the maximum number of objects found.