Closed rileybolen closed 1 month ago
did you also upload validation data or just training data?
I only uploaded training data; it looked like it automatically did the train/val split. I did find an image that was listed in my metadata twice, so I am wondering if maybe one of those entries ended up in validation and one in training, causing the image to not be found in the validation set. I fixed this and I am trying again. I can also try manually splitting and uploading my validation data. I will let you know if that fixes the error.
it does auto splitting. that shouldn't be an issue.
I did find an image that was listed in my metadata twice, so I am wondering if maybe one of those entries ended up in validation and one in training, causing the image to not be found in the validation set. I fixed this and I am trying again
please let me know. this case should be caught earlier
@abhishekkrthakur I tried removing the duplicated image record from metadata.jsonl and I still got the same error.
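Removing the duplicate did not resolve this particular error, but a duplicate check on metadata.jsonl is cheap to run before uploading. The following is a minimal sketch, not part of AutoTrain: the helper `find_duplicate_entries` is hypothetical, and the `file_name` key is an assumption about how each record names its image.

```python
import json
from collections import Counter


def find_duplicate_entries(lines, key="file_name"):
    """Return the sorted values of `key` that occur in more than one JSONL record.

    `key` defaults to "file_name", which is an assumption about the metadata layout.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        counts[json.loads(line)[key]] += 1
    return sorted(name for name, n in counts.items() if n > 1)


# Hypothetical sample records standing in for the lines of metadata.jsonl.
sample = [
    '{"file_name": "a.jpg", "label": "cat"}',
    '{"file_name": "b.jpg", "label": "cat"}',
    '{"file_name": "a.jpg", "label": "cat"}',
]
duplicates = find_duplicate_entries(sample)
print(duplicates)
```

To run it against a real file, pass the open file handle instead of the sample list, for example `find_duplicate_entries(open("metadata.jsonl"))`.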
okay. so the issue is happening for datasets that have a single class. i'm fixing the issue and will update here asap. i really hope it works for you end to end now. and deep apologies.
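For readers wondering why a single class triggers "iteration over a 0-d tensor": the traceback in the error logs fails on a `zip(...)` over per-class metric values, and a 0-d (scalar) tensor cannot be iterated. The sketch below reproduces that and shows one possible guard using `torch.atleast_1d`. It assumes the per-class values arrive as 0-d tensors when there is only one class; the helper `per_class_rows` and the sample numbers are hypothetical, and this is not necessarily the fix that was shipped.

```python
import torch


def per_class_rows(classes, map_per_class, mar_100_per_class):
    """Pair each class id with its mAP and mAR@100 value.

    torch.atleast_1d turns a 0-d tensor into a 1-element tensor, so the
    zip() below also works when there is only a single class.
    """
    classes = torch.atleast_1d(torch.as_tensor(classes))
    map_per_class = torch.atleast_1d(torch.as_tensor(map_per_class))
    mar_100_per_class = torch.atleast_1d(torch.as_tensor(mar_100_per_class))
    return [
        (int(c), float(m), float(r))
        for c, m, r in zip(classes, map_per_class, mar_100_per_class)
    ]


# Hypothetical single-class values, shaped as 0-d tensors (assumption).
single_class = torch.tensor(0)
single_map = torch.tensor(0.5)
single_mar = torch.tensor(0.75)

# Without the guard, zip() raises the same TypeError as in the traceback.
try:
    list(zip(single_class, single_map, single_mar))
    raised = False
except TypeError:
    raised = True

rows = per_class_rows(single_class, single_map, single_mar)
print(raised, rows)
```

The guard leaves multi-class inputs untouched, because `torch.atleast_1d` returns tensors with one or more dimensions unchanged.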
@abhishekkrthakur Sounds good, thanks! And no problem, I'm glad I can help test a new feature.
just pushed a fix and tried it on my own. please make sure you are on v0.7.110 or above.
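A quick way to confirm that the installed version is at least v0.7.110 is sketched below. It assumes the package is installed under the distribution name `autotrain-advanced` and that the version is a plain dotted string of integers; the helpers `parse_version` and `has_fix` are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn '0.7.110' into (0, 7, 110); assumes plain dotted integers."""
    return tuple(int(part) for part in text.split("."))


def has_fix(installed, minimum="0.7.110"):
    """Compare versions numerically, so that 0.7.99 counts as older than 0.7.110."""
    return parse_version(installed) >= parse_version(minimum)


try:
    installed = version("autotrain-advanced")  # assumed distribution name
    print(installed, "has fix:", has_fix(installed))
except PackageNotFoundError:
    print("autotrain-advanced is not installed in this environment")
except ValueError:
    print("installed version is not a plain dotted-integer string")
```

The comparison is done on integer tuples rather than strings because a string comparison would rank "0.7.99" above "0.7.110".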
please let me know if you still face issues
It seems that the training has worked, thanks! I am just facing issues now with the Serverless Inference API, but I think that is separate from this repo. So I think this issue is solved now!
The API won't work immediately. Try a few minutes after training is done :) and thank you so much for all the help :)
@rileybolen thank you very much for helping debug this, and apologies for the inconvenience. As a token of gratitude, we have added a $25 credit to your Hugging Face account that you can use for Spaces, Inference Endpoints, AutoTrain, or other Hugging Face services.
fixed
Prerequisites
Backend
Hugging Face Space/Endpoints
Interface Used
UI
CLI Command
No response
UI Screenshots & Parameters
Error Logs
100%|██████████| 13/13 [00:10<00:00, 1.51it/s]
/app/env/lib/python3.10/site-packages/autotrain/trainers/object_detection/utils.py:158: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at /opt/conda/conda-bld/pytorch_1712608935911/work/torch/csrc/utils/tensor_new.cpp:274.)
  batch_image_sizes = torch.tensor([x["orig_size"] for x in batch])
INFO: 10.16.9.183:64413 - "GET /ui/accelerators HTTP/1.1" 200 OK
INFO: 10.16.27.38:51108 - "GET /ui/is_model_training HTTP/1.1" 200 OK
ERROR | 2024-05-24 14:09:32 | autotrain.trainers.common:wrapper:120 - train has failed due to an exception: Traceback (most recent call last):
  File "/app/env/lib/python3.10/site-packages/autotrain/trainers/common.py", line 117, in wrapper
    return func(*args, **kwargs)
  File "/app/env/lib/python3.10/site-packages/autotrain/trainers/object_detection/main.py", line 199, in train
    trainer.train()
  File "/app/env/lib/python3.10/site-packages/transformers/trainer.py", line 1885, in train
    return inner_training_loop(
  File "/app/env/lib/python3.10/site-packages/transformers/trainer.py", line 2311, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
  File "/app/env/lib/python3.10/site-packages/transformers/trainer.py", line 2721, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/app/env/lib/python3.10/site-packages/transformers/trainer.py", line 3572, in evaluate
    output = eval_loop(
  File "/app/env/lib/python3.10/site-packages/transformers/trainer.py", line 3854, in evaluation_loop
    metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
  File "/app/env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/app/env/lib/python3.10/site-packages/autotrain/trainers/object_detection/utils.py", line 188, in object_detection_metrics
    for class_id, class_map, class_mar in zip(classes, map_per_class, mar_100_per_class):
  File "/app/env/lib/python3.10/site-packages/torch/_tensor.py", line 1047, in __iter__
    raise TypeError("iteration over a 0-d tensor")
TypeError: iteration over a 0-d tensor
ERROR | 2024-05-24 14:09:32 | autotrain.trainers.common:wrapper:121 - iteration over a 0-d tensor
INFO | 2024-05-24 14:09:32 | autotrain.trainers.common:pause_space:77 - Pausing space...
33%|███▎ | 100/300 [01:32<03:04, 1.08it/s]
Additional Information
The training is able to start and make some progress, but it seems that after the first epoch of training is completed the training fails with this error.