huggingface / autotrain-advanced

🤗 AutoTrain Advanced
https://huggingface.co/autotrain
Apache License 2.0

[BUG] Object Detection AutoTrain Error: iteration over a 0-d tensor #656

Closed rileybolen closed 1 month ago

rileybolen commented 1 month ago

Prerequisites

Backend

Hugging Face Space/Endpoints

Interface Used

UI

CLI Command

No response

UI Screenshots & Parameters

Screenshot 2024-05-22 at 8 01 39 AM

Error Logs

100%|██████████| 13/13 [00:10<00:00, 1.51it/s]
/app/env/lib/python3.10/site-packages/autotrain/trainers/object_detection/utils.py:158: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at /opt/conda/conda-bld/pytorch_1712608935911/work/torch/csrc/utils/tensor_new.cpp:274.)
  batch_image_sizes = torch.tensor([x["orig_size"] for x in batch])
INFO: 10.16.9.183:64413 - "GET /ui/accelerators HTTP/1.1" 200 OK
INFO: 10.16.27.38:51108 - "GET /ui/is_model_training HTTP/1.1" 200 OK
ERROR | 2024-05-24 14:09:32 | autotrain.trainers.common:wrapper:120 - train has failed due to an exception: Traceback (most recent call last):
  File "/app/env/lib/python3.10/site-packages/autotrain/trainers/common.py", line 117, in wrapper
    return func(*args, **kwargs)
  File "/app/env/lib/python3.10/site-packages/autotrain/trainers/object_detection/main.py", line 199, in train
    trainer.train()
  File "/app/env/lib/python3.10/site-packages/transformers/trainer.py", line 1885, in train
    return inner_training_loop(
  File "/app/env/lib/python3.10/site-packages/transformers/trainer.py", line 2311, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
  File "/app/env/lib/python3.10/site-packages/transformers/trainer.py", line 2721, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/app/env/lib/python3.10/site-packages/transformers/trainer.py", line 3572, in evaluate
    output = eval_loop(
  File "/app/env/lib/python3.10/site-packages/transformers/trainer.py", line 3854, in evaluation_loop
    metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
  File "/app/env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/app/env/lib/python3.10/site-packages/autotrain/trainers/object_detection/utils.py", line 188, in object_detection_metrics
    for class_id, class_map, class_mar in zip(classes, map_per_class, mar_100_per_class):
  File "/app/env/lib/python3.10/site-packages/torch/_tensor.py", line 1047, in __iter__
    raise TypeError("iteration over a 0-d tensor")
TypeError: iteration over a 0-d tensor

ERROR | 2024-05-24 14:09:32 | autotrain.trainers.common:wrapper:121 - iteration over a 0-d tensor
INFO | 2024-05-24 14:09:32 | autotrain.trainers.common:pause_space:77 - Pausing space...

33%|███▎ | 100/300 [01:32<03:04, 1.08it/s]

Additional Information

The training starts and makes some progress, but it fails with this error once the first epoch completes, which is when the traceback shows the first evaluation running.

abhishekkrthakur commented 1 month ago

did you also upload validation data or just training data?

rileybolen commented 1 month ago

I only uploaded training data; it looked like it automatically did the train/val split. I did find an image that was listed in my metadata twice, so I am wondering if maybe one of those entries ended up in validation and one in training, causing the image to not be found in the validation set. I fixed this and I am trying again. I can also try manually splitting and uploading my validation data. I will let you know if that fixes the error.

abhishekkrthakur commented 1 month ago

It does auto splitting. That shouldn't be an issue.

I did find an image that was listed in my metadata twice, so I am wondering if maybe one of those entries ended up in validation and one in training, causing the image to not be found in the validation set. I fixed this and I am trying again

please let me know. this case should be caught earlier

rileybolen commented 1 month ago

@abhishekkrthakur I tried removing the duplicated image record from metadata.jsonl and I still got the same error.
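
For anyone who wants to rule out the same thing, a duplicate check on the metadata file can be sketched as below. This is only an illustration: it assumes each line of metadata.jsonl is a JSON object with a `file_name` key (the record layout and the helper name `duplicate_file_names` are assumptions, not something confirmed in this thread).

```python
import json
from collections import Counter


def duplicate_file_names(jsonl_text, key="file_name"):
    """Return the file names that appear more than once in JSONL metadata text.

    Assumes one JSON object per non-empty line, each with a `key` field.
    """
    counts = Counter(
        json.loads(line)[key]
        for line in jsonl_text.splitlines()
        if line.strip()  # skip blank lines
    )
    return sorted(name for name, n in counts.items() if n > 1)


# Hypothetical sample records; "a.jpg" is listed twice on purpose.
sample = "\n".join([
    '{"file_name": "a.jpg", "objects": {"bbox": [[1, 2, 3, 4]], "category": [0]}}',
    '{"file_name": "b.jpg", "objects": {"bbox": [[5, 6, 7, 8]], "category": [0]}}',
    '{"file_name": "a.jpg", "objects": {"bbox": [[1, 2, 3, 4]], "category": [0]}}',
])
duplicates = duplicate_file_names(sample)
```

On a real dataset the text would come from the file itself, for example `duplicate_file_names(open("metadata.jsonl").read())`.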

abhishekkrthakur commented 1 month ago

Okay, so the issue is happening for datasets that have a single class. I'm fixing the issue and will update here ASAP. I really hope it works for you end to end now. And deep apologies.
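
To illustrate why a single class can trigger this: the traceback fails in `zip(classes, map_per_class, mar_100_per_class)`, and a 0-d (scalar) tensor cannot be iterated. The sketch below reproduces the error and shows one possible guard using `torch.atleast_1d`. It is not the fix that was pushed to the repo; the helper name `per_class_rows` and the metric values are made up for the example, and it assumes the per-class metrics arrive as 0-d tensors when there is only one class.

```python
import torch


def per_class_rows(classes, map_per_class, mar_100_per_class):
    """Pair class ids with per-class mAP / mAR values, tolerating 0-d inputs."""
    # torch.atleast_1d reshapes a 0-d tensor to shape (1,) and leaves
    # tensors that already have one or more dimensions unchanged.
    classes, maps, mars = (
        torch.atleast_1d(t) for t in (classes, map_per_class, mar_100_per_class)
    )
    return [(int(c), float(m), float(r)) for c, m, r in zip(classes, maps, mars)]


# Reproduce the failure: zip() calls iter() on each argument,
# and iterating a 0-d tensor raises TypeError.
try:
    list(zip(torch.tensor(0), torch.tensor(0.42)))
    message = None
except TypeError as exc:
    message = str(exc)

# Single-class case with hypothetical metric values: no error with the guard.
single = per_class_rows(torch.tensor(0), torch.tensor(0.42), torch.tensor(0.61))

# Multi-class inputs pass through unchanged.
multi = per_class_rows(
    torch.tensor([0, 1]), torch.tensor([0.1, 0.2]), torch.tensor([0.3, 0.4])
)
```

Normalizing the shapes up front keeps the per-class loop identical for one class or many, rather than branching on the number of classes.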

rileybolen commented 1 month ago

@abhishekkrthakur Sounds good, thanks! And no problem, I'm glad I can help test a new feature.

abhishekkrthakur commented 1 month ago

Just pushed a fix and tried it on my own. Please make sure you are on v0.7.110 or above.
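
For a local install, the version requirement can be checked with a short script like the one below. It is a sketch under assumptions: it assumes the package is installed under the PyPI name `autotrain-advanced` and that version strings are purely numeric dotted parts; the helper names are hypothetical. It does not apply to a Space, where the version depends on the image the Space was built from.

```python
from importlib.metadata import PackageNotFoundError, version

MINIMUM = "0.7.110"  # version named in this thread as containing the fix


def as_tuple(v):
    """Turn '0.7.110' into (0, 7, 110); assumes purely numeric dotted parts."""
    return tuple(int(part) for part in v.split("."))


def has_fix(installed, minimum=MINIMUM):
    """Compare versions numerically, so '0.7.9' counts as older than '0.7.110'."""
    return as_tuple(installed) >= as_tuple(minimum)


def installed_autotrain_version():
    """Return the installed version string, or None if the package is absent."""
    try:
        return version("autotrain-advanced")
    except PackageNotFoundError:
        return None
```

A typical use would be `v = installed_autotrain_version()` followed by `has_fix(v)` when `v` is not `None`. Comparing integer tuples avoids the trap of plain string comparison, which would rank "0.7.9" above "0.7.110".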

abhishekkrthakur commented 1 month ago

please let me know if you still face issues

rileybolen commented 1 month ago

It seems that the training has worked, thanks! I am just facing issues now with the Serverless Inference API, but I think that is separate from this repo. So I think this issue is solved now!

abhishekkrthakur commented 1 month ago

The API won't work immediately. Try a few minutes after training is done :) and thank you so much for all the help :)

abhishekkrthakur commented 1 month ago

@rileybolen thank you very much for helping debug this, and apologies for the inconvenience. As a token of gratitude, we have added a $25 credit to your Hugging Face account that you can use for Spaces, Inference Endpoints, AutoTrain, or other Hugging Face services.

abhishekkrthakur commented 1 month ago

fixed