sovit-123 / fasterrcnn-pytorch-training-pipeline

PyTorch Faster R-CNN Object Detection on Custom Dataset
MIT License

Faster RCNN init error #104

Open unrue opened 1 year ago

unrue commented 1 year ago

Hi,

Using PyTorch 1.11, I get:

File "/home/.local/lib/python3.10/site-packages/torchvision/models/detection/faster_rcnn.py", line 396, in fasterrcnn_resnet50_fpn model = FasterRCNN(backbone, num_classes, **kwargs) TypeError: FasterRCNN.__init__() got an unexpected keyword argument 'weights'

The launch command is:

```
python3.10 fasterrcnn-pytorch-training-pipeline/train.py --data /****/pytorch/test_vari/cat_dogs_pvoc_coord_translated/catdog.yaml --epochs 100 --model fasterrcnn_resnet50_fpn --name catdog --batch 16 --disable-wandb --workers 1
```

Could you help me please? Thanks.

sovit-123 commented 1 year ago

Hi @unrue, you need to execute the training from within the fasterrcnn-pytorch-training-pipeline directory.

unrue commented 1 year ago

I tried, same error.

sovit-123 commented 1 year ago

Oh, I see. You are using PyTorch 1.11. The repository needs PyTorch 1.12 or higher. The weights argument is not present in earlier torchvision versions; previously it was called pretrained.
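
For reference, a minimal sketch of the torchvision API difference (illustrative only, not the repository's code):

```python
from torchvision.models.detection import (
    FasterRCNN_ResNet50_FPN_Weights,
    fasterrcnn_resnet50_fpn,
)

# torchvision >= 0.13 (ships with PyTorch 1.12): pretrained weights are
# selected through the 'weights' argument.
model = fasterrcnn_resnet50_fpn(weights=FasterRCNN_ResNet50_FPN_Weights.COCO_V1)

# torchvision <= 0.12 (PyTorch 1.11) only knows the old 'pretrained' flag,
# so passing 'weights' falls through **kwargs and raises the TypeError above:
# model = fasterrcnn_resnet50_fpn(pretrained=True)
```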

unrue commented 1 year ago

Yes, now the run starts, but during training, on the second epoch, I get:

```
pytorch_1.12_cu11.6/lib/python3.10/site-packages/albumentations/core/bbox_utils.py", line 435, in check_bbox
    raise ValueError(f"Expected {name} for bbox {bbox} to be in the range [0.0, 1.0], got {value}.")
ValueError: Expected x_min for bbox (tensor(1.1671), tensor(0.3307), tensor(1.), tensor(0.4695), tensor(35)) to be in the range [0.0, 1.0], got 1.1670734882354736.
```

What does it mean? Are some bounding boxes (Pascal VOC format) wrong? How can I retrieve the image name involved?

sovit-123 commented 1 year ago

Yes, it looks like a bounding box issue.

It looks like some xmin value in your dataset falls outside the image. Albumentations normalizes each coordinate by the image dimension, so a value of 1.167 means x_min lies about 17% beyond the image width. You can upload the dataset to Roboflow and download it in the Pascal VOC format; it will automatically correct these issues.
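
A conceptual sketch of the check behind this error (my own illustration, not the actual Albumentations code): each pascal_voc coordinate is divided by the image dimension and must stay within [0, 1].

```python
# Conceptual sketch only: pascal_voc -> normalized coordinates, the way
# Albumentations validates them. A box whose xmin exceeds the image width
# fails this check with the ValueError shown above.
def to_normalized(xmin, ymin, xmax, ymax, width, height):
    box = (xmin / width, ymin / height, xmax / width, ymax / height)
    for name, value in zip(("x_min", "y_min", "x_max", "y_max"), box):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"Expected {name} to be in [0.0, 1.0], got {value}.")
    return box
```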

unrue commented 1 year ago

Ok, thanks. These are the bounding boxes getting the error:

```xml
<size>
    <width>640</width>
    <height>480</height>
    <depth>3</depth>
</size>
<segmented>0</segmented>
<object>
    <name>head</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficut>0</difficut>
    <bndbox>
        <xmin>81.37453</xmin>
        <ymin>179.94751</ymin>
        <xmax>226.36612</xmax>
        <ymax>288.7328</ymax>
    </bndbox>
</object>
<object>
    <name>prayer</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficut>0</difficut>
    <bndbox>
        <xmin>562.19525</xmin>
        <ymin>211.63783</ymin>
        <xmax>615.0446</xmax>
        <ymax>300.48517</ymax>
    </bndbox>
</object>
<object>
    <name>hand</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficut>0</difficut>
    <bndbox>
        <xmin>384.1045</xmin>
        <ymin>285.5305</ymin>
        <xmax>456.17444</xmax>
        <ymax>355.10376</ymax>
    </bndbox>
</object>
```

I don't understand what's wrong.

sovit-123 commented 1 year ago

There does not seem to be anything wrong here. How did you determine that this is the file where the boxes are wrong?

unrue commented 1 year ago

I simply put a print in datasets.py, used 1 GPU, and batch_size=1:

```python
def load_image_and_labels(self, index):
    image_name = self.all_images[index]
    print("IMAGE: ", image_name)
    image_path = os.path.join(self.images_path, image_name)
```

And I get:

```
Epoch: [0]  [19472/19473]  eta: 0:00:00  lr: 0.001000  loss: 0.6977 (0.9383)  loss_classifier: 0.3446 (0.4838)  loss_box_reg: 0.1704 (0.3233)  loss_objectness: 0.0314 (0.0657)  loss_rpn_box_reg: 0.0203 (0.0655)  time: 0.1073  data: 0.0628  max mem: 1456
Epoch: [0] Total time: 0:34:11 (0.1053 s / it)

.....
IMAGE:  402831.JPG
IMAGE:  40357-H.jpg
IMAGE:  40433-H.jpg
IMAGE:  404371.jpg
IMAGE:  404380.jpg
IMAGE:  40674-H.jpg
IMAGE:  407685.JPG
IMAGE:  408.jpg
Traceback (most recent call last):
  File "****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/fasterrcnn-pytorch-training-pipeline/fasterrcnn-pytorch-training-pipeline/train.py", line 574, in <module>
    main(args)
  File "****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/fasterrcnn-pytorch-training-pipeline/fasterrcnn-pytorch-training-pipeline/train.py", line 426, in main
    stats, val_pred_image = evaluate(
  File "****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/fasterrcnn-pytorch-training-pipeline/fasterrcnn-pytorch-training-pipeline/torch_utils/engine.py", line 131, in evaluate
    coco = get_coco_api_from_dataset(data_loader.dataset)
  File "****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/fasterrcnn-pytorch-training-pipeline/fasterrcnn-pytorch-training-pipeline/torch_utils/coco_utils.py", line 204, in get_coco_api_from_dataset
    return convert_to_coco_api(dataset)
  File "****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/fasterrcnn-pytorch-training-pipeline/fasterrcnn-pytorch-training-pipeline/torch_utils/coco_utils.py", line 152, in convert_to_coco_api
    img, targets = ds[img_idx]
  File "****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/fasterrcnn-pytorch-training-pipeline/fasterrcnn-pytorch-training-pipeline/datasets.py", line 318, in __getitem__
    sample = self.transforms(image=image_resized,
  File "****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/lib/python3.10/site-packages/albumentations/core/composition.py", line 207, in __call__
    p.preprocess(data)
  File "****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/lib/python3.10/site-packages/albumentations/core/utils.py", line 83, in preprocess
    data[data_name] = self.check_and_convert(data[data_name], rows, cols, direction="to")
  File "****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/lib/python3.10/site-packages/albumentations/core/utils.py", line 91, in check_and_convert
    return self.convert_to_albumentations(data, rows, cols)
  File "****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/lib/python3.10/site-packages/albumentations/core/bbox_utils.py", line 142, in convert_to_albumentations
    return convert_bboxes_to_albumentations(data, self.params.format, rows, cols, check_validity=True)
  File "****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/lib/python3.10/site-packages/albumentations/core/bbox_utils.py", line 408, in convert_bboxes_to_albumentations
    return [convert_bbox_to_albumentations(bbox, source_format, rows, cols, check_validity) for bbox in bboxes]
  File "****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/lib/python3.10/site-packages/albumentations/core/bbox_utils.py", line 408, in <listcomp>
    return [convert_bbox_to_albumentations(bbox, source_format, rows, cols, check_validity) for bbox in bboxes]
  File "****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/lib/python3.10/site-packages/albumentations/core/bbox_utils.py", line 352, in convert_bbox_to_albumentations
    check_bbox(bbox)
  File "****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/lib/python3.10/site-packages/albumentations/core/bbox_utils.py", line 435, in check_bbox
    raise ValueError(f"Expected {name} for bbox {bbox} to be in the range [0.0, 1.0], got {value}.")
ValueError: Expected x_min for bbox (tensor(1.1671), tensor(0.3307), tensor(1.), tensor(0.4695), tensor(35)) to be in the range [0.0, 1.0], got 1.1670734882354736.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 361271) of binary: ****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/bin/python
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.00019693374633789062 seconds
Traceback (most recent call last):
  File "/leonardo/prod/spack/03/install/0.19/linux-rhel8-icelake/gcc-8.5.0/python-3.10.8-c7fmxco5bavqi3ye7hrbaxpjpwv6dcxd/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/leonardo/prod/spack/03/install/0.19/linux-rhel8-icelake/gcc-8.5.0/python-3.10.8-c7fmxco5bavqi3ye7hrbaxpjpwv6dcxd/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/lib/python3.10/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/lib/python3.10/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/lib/python3.10/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/lib/python3.10/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "****/Deep_Learning/MIC_DL/test_multinode/pytorch_1.12_cu11.6/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

```

sovit-123 commented 1 year ago

Ok. To get the correct image name, you need to set --workers 0 so that data loading runs on the main process. Otherwise the file names will be jumbled because of the multiprocessing workers.

unrue commented 1 year ago

Yes, I forgot to mention that I also use --workers 0.

sovit-123 commented 1 year ago

If you are on a multi-GPU system, can you pass --device cuda:0? I cannot find any issue with the above XML file, and running this way should normally print the erroneous file; I have done it before.

unrue commented 1 year ago

I used export CUDA_VISIBLE_DEVICES=0

The problem seems to start with the evaluation. No idea why.

sovit-123 commented 1 year ago

Oh. So, did the training loop complete successfully? If so, please try --device cuda:0 so that the distributed environment variables are not enabled. Do let me know if it prints the correct image path. Otherwise, I will try to add code that shows the erroneous file.
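
For example, the launch command from your first post, adapted with both flags, might look like this (paths unchanged, purely illustrative):

```
python3.10 fasterrcnn-pytorch-training-pipeline/train.py \
    --data /****/pytorch/test_vari/cat_dogs_pvoc_coord_translated/catdog.yaml \
    --epochs 100 --model fasterrcnn_resnet50_fpn --name catdog \
    --batch 16 --disable-wandb --workers 0 --device cuda:0
```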

unrue commented 1 year ago

I think the first training epoch finished, because I see:

Epoch: [0] [19472/19473]

and the evaluate method appears in the error stack trace. Anyway, I launched it as you suggested; it takes some time. Tomorrow I'll post the results. Thanks for now.

sovit-123 commented 1 year ago

No problem. I will try to figure out an easy way to identify such files.

unrue commented 1 year ago

Hi,

the test finished with the same error on the same image. This is the complete XML in case you want to debug:

```xml
<annotation>
    <folder>beni_culturali_negative_coords_translated/beni_culturali_orig/dipinto/ritratto/OA_3.00_ICCD0_exp_iccd.021663793507515</folder>
    <filename>408.jpg</filename>
    <path>beni_culturali_negative_coords_translated/beni_culturali_orig/dipinto/ritratto/OA_3.00_ICCD0_exp_iccd.021663793507515/408.jpg</path>
    <source>Unknown</source>
    <size>
        <width>640</width>
        <height>480</height>
        <depth>3</depth>
    </size>
    <segmented>0</segmented>
    <object>
        <name>head</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficut>0</difficut>
        <bndbox>
            <xmin>81.37453</xmin>
            <ymin>179.94751</ymin>
            <xmax>226.36612</xmax>
            <ymax>288.7328</ymax>
        </bndbox>
    </object>
    <object>
        <name>prayer</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficut>0</difficut>
        <bndbox>
            <xmin>562.19525</xmin>
            <ymin>211.63783</ymin>
            <xmax>615.0446</xmax>
            <ymax>300.48517</ymax>
        </bndbox>
    </object>
    <object>
        <name>hand</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficut>0</difficut>
        <bndbox>
            <xmin>384.1045</xmin>
            <ymin>285.5305</ymin>
            <xmax>456.17444</xmax>
            <ymax>355.10376</ymax>
        </bndbox>
    </object>
</annotation>
```

sovit-123 commented 1 year ago

This is very odd. How many images do you have? Do you think you can create a private project on Roboflow and upload this dataset? It should correct the issue automatically.

unrue commented 1 year ago

No, I can't; the dataset is under some restrictions and it is quite large. The training set has 19473 images, the validation set 6465. I have used it in other object detection tools with no errors, so I'm pretty sure the bounding boxes are correct. The strange thing is that the error starts on the first image to be validated. Any hints from this?

sovit-123 commented 1 year ago

Got it. I guess this will require a bit of manual checking, so it is kind of a hassle for you. The best way right now is to write a script that parses all the XML files and divides each xmin by image_width and each ymin by image_height (and likewise xmax and ymax).

If for any file one of these ratios is greater than 1, then there is an issue in that file.
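
A minimal sketch of such a script (it assumes the standard Pascal VOC tags shown in your file; the annotations directory is a placeholder):

```python
import glob
import xml.etree.ElementTree as ET

# Placeholder path: point this at the directory holding the XML annotations.
for xml_file in glob.glob("path/to/annotations/*.xml"):
    root = ET.parse(xml_file).getroot()
    width = float(root.find("size/width").text)
    height = float(root.find("size/height").text)
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        coords = {tag: float(box.find(tag).text)
                  for tag in ("xmin", "ymin", "xmax", "ymax")}
        # Normalize the way Albumentations does; anything outside [0, 1] is suspect.
        ratios = (coords["xmin"] / width, coords["ymin"] / height,
                  coords["xmax"] / width, coords["ymax"] / height)
        if any(r < 0.0 or r > 1.0 for r in ratios):
            print(f"Suspect box in {xml_file}: {coords} (image {width}x{height})")
```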

But I am afraid this may instead be a very odd issue on the Albumentations side. If that is the case, I will need to check it in detail.

unrue commented 1 year ago

Ok, I'll try. But when you say "there is an issue in that file", do you mean the annotations are wrong, or is there some special case that Albumentations is not able to handle?

sovit-123 commented 1 year ago

From the Albumentations output, it looks like there is an issue in the file. Previously, when I faced this, there was always an issue in the XML file. This time, however, the XML file seems correct, so this is a bit odd.

unrue commented 1 year ago

For the moment, I simply excluded that file and it seems to work well. The first training and validation loop finished correctly. It is still running; fingers crossed.

sovit-123 commented 1 year ago

Glad to hear that.