Open unrue opened 1 year ago
Hi @unrue
You need to run the training from within the fasterrcnn-pytorch-training-pipeline directory.
I tried, same error.
Oh I see. You are using PyTorch 1.11.
The repository needs PyTorch 1.12 or higher. The weights argument was not present in the previous versions. Previously it was pretrained.
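For illustration, here is a minimal sketch of how a script could pick the right keyword for either version. It assumes the switch happened in torchvision 0.13 (the release paired with PyTorch 1.12); `pretrained_kwargs` is a hypothetical helper, not something the repository provides:

```python
def pretrained_kwargs(torchvision_version: str, use_pretrained: bool = True) -> dict:
    """Return the keyword argument that requests pretrained weights.

    Assumption: torchvision >= 0.13 accepts `weights`, older releases
    only accept the boolean `pretrained`.
    """
    # Drop local build suffixes such as "+cu116" before parsing.
    release = torchvision_version.split("+")[0]
    major, minor = (int(part) for part in release.split(".")[:2])
    if (major, minor) >= (0, 13):
        return {"weights": "DEFAULT" if use_pretrained else None}
    return {"pretrained": use_pretrained}


# Intended usage (needs torchvision installed):
#   import torchvision
#   kwargs = pretrained_kwargs(torchvision.__version__)
#   model = torchvision.models.detection.fasterrcnn_resnet50_fpn(**kwargs)
```

Upgrading to PyTorch 1.12 or higher, as suggested above, remains the simpler fix; the sketch only shows why the 1.11 install rejects the weights keyword.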
Yes, now the run starts, but during the training, on the second epoch I get:
pytorch_1.12_cu11.6/lib/python3.10/site-packages/albumentations/core/bbox_utils.py", line 435, in check_bbox
raise ValueError(f"Expected {name} for bbox {bbox} to be in the range [0.0, 1.0], got {value}.")
ValueError: Expected x_min for bbox (tensor(1.1671), tensor(0.3307), tensor(1.), tensor(0.4695), tensor(35)) to be in the range [0.0, 1.0], got 1.1670734882354736.
What does it mean? Are some bounding boxes (PVOC format) wrong? How can I retrieve the name of the image involved?
Yes. Looks like a bounding box issue.
So, it looks like some xmin values in your dataset fall outside the image. You can upload the dataset to Roboflow and download it in the Pascal VOC format. It will automatically correct these issues.
Ok, thanks. These are the bounding boxes that trigger the error:
`
<height>480</height>
<depth>3</depth>
</size>
<segmented>0</segmented>
<object>
<name>head</name>
<pose>Unspecified</pose>
<truncated>0</truncated>
<difficut>0</difficut>
<bndbox>
<xmin>81.37453</xmin>
<ymin>179.94751</ymin>
<xmax>226.36612</xmax>
<ymax>288.7328</ymax>
</bndbox>
</object>
<object>
<name>prayer</name>
<pose>Unspecified</pose>
<truncated>0</truncated>
<difficut>0</difficut>
<bndbox>
<xmin>562.19525</xmin>
<ymin>211.63783</ymin>
<xmax>615.0446</xmax>
<ymax>300.48517</ymax>
</bndbox>
</object>
<object>
<name>hand</name>
<pose>Unspecified</pose>
<truncated>0</truncated>
<difficut>0</difficut>
<bndbox>
<xmin>384.1045</xmin>
<ymin>285.5305</ymin>
<xmax>456.17444</xmax>
<ymax>355.10376</ymax>
</bndbox>
`
I don't understand what's wrong.
There does not seem to be anything wrong here. How did you determine that this is the file where the boxes are wrong?
I simply put a print in datasets.py, used 1 GPU, and batch_size=1:
`
def load_image_and_labels(self, index):
    image_name = self.all_images[index]
    print("IMAGE: ", image_name)
    image_path = os.path.join(self.images_path, image_name)
`
And I get:
`
Epoch: [0] [19472/19473] eta: 0:00:00 lr: 0.001000 loss: 0.6977 (0.9383) loss_classifier: 0.3446 (0.4838) loss_box_reg: 0.1704 (0.3233) loss_objectness: 0.0314 (0.0657) loss_rpn_box_reg: 0.0203 (0.0655) time: 0.1073 data: 0.0628 max mem: 1456
Epoch: [0] Total time: 0:34:11 (0.1053 s / it)
.....
IMAGE: 402831.JPG
IMAGE: 40357-H.jpg
IMAGE: 40433-H.jpg
IMAGE: 404371.jpg
IMAGE: 404380.jpg
IMAGE: 40674-H.jpg
IMAGE: 407685.JPG
IMAGE: 408.jpg
Traceback (most recent call last):
File "****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/fasterrcnn-pytorch-training-pipeline/fasterrcnn-pytorch-training-pipeline/train.py", line 574, in <module>
main(args)
File "****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/fasterrcnn-pytorch-training-pipeline/fasterrcnn-pytorch-training-pipeline/train.py", line 426, in main
stats, val_pred_image = evaluate(
File "****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/fasterrcnn-pytorch-training-pipeline/fasterrcnn-pytorch-training-pipeline/torch_utils/engine.py", line 131, in evaluate
coco = get_coco_api_from_dataset(data_loader.dataset)
File "****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/fasterrcnn-pytorch-training-pipeline/fasterrcnn-pytorch-training-pipeline/torch_utils/coco_utils.py", line 204, in get_coco_api_from_dataset
return convert_to_coco_api(dataset)
File "****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/fasterrcnn-pytorch-training-pipeline/fasterrcnn-pytorch-training-pipeline/torch_utils/coco_utils.py", line 152, in convert_to_coco_api
img, targets = ds[img_idx]
File "****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/fasterrcnn-pytorch-training-pipeline/fasterrcnn-pytorch-training-pipeline/datasets.py", line 318, in __getitem__
sample = self.transforms(image=image_resized,
File "****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/lib/python3.10/site-packages/albumentations/core/composition.py", line 207, in __call__
p.preprocess(data)
File "****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/lib/python3.10/site-packages/albumentations/core/utils.py", line 83, in preprocess
data[data_name] = self.check_and_convert(data[data_name], rows, cols, direction="to")
File "****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/lib/python3.10/site-packages/albumentations/core/utils.py", line 91, in check_and_convert
return self.convert_to_albumentations(data, rows, cols)
File "****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/lib/python3.10/site-packages/albumentations/core/bbox_utils.py", line 142, in convert_to_albumentations
return convert_bboxes_to_albumentations(data, self.params.format, rows, cols, check_validity=True)
File "****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/lib/python3.10/site-packages/albumentations/core/bbox_utils.py", line 408, in convert_bboxes_to_albumentations
return [convert_bbox_to_albumentations(bbox, source_format, rows, cols, check_validity) for bbox in bboxes]
File "****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/lib/python3.10/site-packages/albumentations/core/bbox_utils.py", line 408, in <listcomp>
return [convert_bbox_to_albumentations(bbox, source_format, rows, cols, check_validity) for bbox in bboxes]
File "****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/lib/python3.10/site-packages/albumentations/core/bbox_utils.py", line 352, in convert_bbox_to_albumentations
check_bbox(bbox)
File "****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/lib/python3.10/site-packages/albumentations/core/bbox_utils.py", line 435, in check_bbox
raise ValueError(f"Expected {name} for bbox {bbox} to be in the range [0.0, 1.0], got {value}.")
ValueError: Expected x_min for bbox (tensor(1.1671), tensor(0.3307), tensor(1.), tensor(0.4695), tensor(35)) to be in the range [0.0, 1.0], got 1.1670734882354736.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 361271) of binary: ****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/bin/python
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.00019693374633789062 seconds
Traceback (most recent call last):
File "/leonardo/prod/spack/03/install/0.19/linux-rhel8-icelake/gcc-8.5.0/python-3.10.8-c7fmxco5bavqi3ye7hrbaxpjpwv6dcxd/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/leonardo/prod/spack/03/install/0.19/linux-rhel8-icelake/gcc-8.5.0/python-3.10.8-c7fmxco5bavqi3ye7hrbaxpjpwv6dcxd/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/lib/python3.10/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/lib/python3.10/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/lib/python3.10/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/lib/python3.10/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
`
Ok. To get the correct image name, you need to set --workers 0 so that data loading runs in the main process. Otherwise the printed file names will be jumbled up, because several worker processes load images in parallel.
Yes, I forgot to mention that I also use --workers 0
If you are on a multi-GPU system, can you pass --device cuda:0?
Because I cannot find any issues with the above XML file, and generally it should print the erroneous file this way. I have done it before.
I used export CUDA_VISIBLE_DEVICES=0
The problem seems to start with the evaluation. No idea why.
Oh. So, was the training loop completed successfully?
If so, please try --device cuda:0 so that the distributed system variables are not enabled.
Do let me know if it prints the correct image path. Else, I will try to add code that will show the erroneous file.
I think the first training epoch finished, because I read:
Epoch: [0] [19472/19473]
and the evaluate method appears in the error stack trace. Anyway, I launched it as you suggested; it takes some time. I'll post the results tomorrow. Thanks for now.
No problem. I will try to figure out a way to identify such files easily.
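One possible way to do that, as a minimal sketch: wrap the transform call so that a failure is re-raised with the file name attached. It assumes the transform raises ValueError, as in the traceback above; `call_with_image_name` is a hypothetical helper and is not part of the repository:

```python
def call_with_image_name(transform, image_name, **kwargs):
    """Call transform(**kwargs); on ValueError, re-raise with the file name."""
    try:
        return transform(**kwargs)
    except ValueError as err:
        # Prefix the original message so the offending file is visible
        # even when several worker processes interleave their output.
        raise ValueError(f"{image_name}: {err}") from err


# Intended usage inside the dataset's __getitem__ (argument names are assumptions):
#   sample = call_with_image_name(self.transforms, image_name,
#                                 image=image_resized, bboxes=boxes, labels=labels)
```

Because the name travels with the exception, this does not depend on the order of print statements.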
Hi,
The test finished: same error on the same image. This is the complete XML in case you want to debug:
`
<filename>408.jpg</filename>
<path>beni_culturali_negative_coords_translated/beni_culturali_orig/dipinto/ritratto/OA_3.00_ICCD0_exp_iccd.021663793507515/408.jpg</path>
<source>Unknown</source>
<size>
<width>640</width>
<height>480</height>
<depth>3</depth>
</size>
<segmented>0</segmented>
<object>
<name>head</name>
<pose>Unspecified</pose>
<truncated>0</truncated>
<difficut>0</difficut>
<bndbox>
<xmin>81.37453</xmin>
<ymin>179.94751</ymin>
<xmax>226.36612</xmax>
<ymax>288.7328</ymax>
</bndbox>
</object>
<object>
<name>prayer</name>
<pose>Unspecified</pose>
<truncated>0</truncated>
<difficut>0</difficut>
<bndbox>
<xmin>562.19525</xmin>
<ymin>211.63783</ymin>
<xmax>615.0446</xmax>
<ymax>300.48517</ymax>
</bndbox>
</object>
<object>
<name>hand</name>
<pose>Unspecified</pose>
<truncated>0</truncated>
<difficut>0</difficut>
<bndbox>
<xmin>384.1045</xmin>
<ymin>285.5305</ymin>
<xmax>456.17444</xmax>
<ymax>355.10376</ymax>
</bndbox>
</object>
`
This is very odd. How many images do you have? Do you think you can create a private project on Roboflow and upload this dataset to correct the issue? It will correct the issue automatically.
No, I can't. The dataset is under some restrictions and it is quite large: the training set has 19473 images and the validation set 6465. I have used it in other object detection tools with no errors, so I'm pretty sure the bounding boxes are correct. The strange thing is that the error starts at the first image to validate. Does this give any hint?
Got it. I guess this will require a bit of manual code checking, so it is kind of a hassle for you. The best way right now is to write a script that parses all the XML files and computes xmin/image_width and ymin/image_height (and the same for xmax and ymax).
If any of these comes out greater than 1 for a file, then there is an issue in that file.
But I am afraid this is mostly a very odd issue on the side of Albumentations. If that is the case, I will need to check it in detail.
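A minimal sketch of such a check, using only the standard library. It assumes Pascal VOC files shaped like the one posted above; `find_bad_boxes` is a hypothetical name. By default it normalizes by the size tag in the XML. If the data loader normalizes by the decoded image's size instead, pass that size as image_size, since a mismatch between the two would not show up otherwise:

```python
import xml.etree.ElementTree as ET


def find_bad_boxes(xml_text, image_size=None):
    """Return (label, coordinate, normalized value) for each out-of-range coordinate.

    image_size is an optional (width, height) tuple; when omitted, the
    <size> element of the annotation is used.
    """
    root = ET.fromstring(xml_text)
    if image_size is None:
        size = root.find("size")
        image_size = (float(size.findtext("width")), float(size.findtext("height")))
    width, height = image_size

    problems = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        for tag, extent in (("xmin", width), ("ymin", height),
                            ("xmax", width), ("ymax", height)):
            value = float(box.findtext(tag)) / extent
            # Albumentations expects normalized coordinates in [0.0, 1.0].
            if not 0.0 <= value <= 1.0:
                problems.append((obj.findtext("name"), tag, value))
    return problems
```

Running it over every annotation (for example by reading each .xml file in the annotations folder and printing the file name whenever the returned list is not empty) should point at the files worth inspecting.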
Ok, I'll try. But when you say "then there is an issue in that file", do you mean the annotations are wrong, or that there is some special case that Albumentations is not able to handle?
From the Albumentations output, it looks like there is an issue in the file. Previously, I had faced this, and there was always an issue in the XML file. This time however, the XML file seems correct. So, this is a bit odd.
For the moment, I simply excluded that file and it seems to work well. The first training and validation loop finished correctly. It is still running. Fingers crossed.
Glad to hear that.
Hi,
using pytorch 1.11 I get:
File "/home/.local/lib/python3.10/site-packages/torchvision/models/detection/faster_rcnn.py", line 396, in fasterrcnn_resnet50_fpn
model = FasterRCNN(backbone, num_classes, **kwargs)
TypeError: FasterRCNN.__init__() got an unexpected keyword argument 'weights'
The launch command is:
python3.10 fasterrcnn-pytorch-training-pipeline/train.py --data /****/pytorch/test_vari/cat_dogs_pvoc_coord_translated/catdog.yaml --epochs 100 --model fasterrcnn_resnet50_fpn --name catdog --batch 16 --disable-wandb --workers 1
Could you help me please? Thanks.