johschmidt42 / PyTorch-Object-Detection-Faster-RCNN-Tutorial

Training - IndexError: Target 1 is out of bounds. #7

Closed taniabuzykina closed 2 years ago

taniabuzykina commented 2 years ago

Hi John,

Thank you very much for your detailed and comprehensive code and description on medium.com. I'm using it for damage detection on wind turbine blades and during the training stage I encountered the following issue:

File ~\Documents\***\***\venv\lib\site-packages\torch\nn\functional.py:2846, in cross_entropy(input, target, weight, size_average, ignore_index, reduce, reduction, label_smoothing)
   2844 if size_average is not None or reduce is not None:
   2845     reduction = _Reduction.legacy_get_string(size_average, reduce)
-> 2846 return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)

IndexError: Target 1 is out of bounds.

I was running the following training code chunk in Jupyter Notebook:

trainer.fit(
    model=task, train_dataloader=dataloader_train, val_dataloaders=dataloader_valid
)

[screenshot: full error traceback]

Could you please suggest a possible solution or do you perhaps know the reason why this error could occur?

Kind regards, Tetiana Buzykina

johschmidt42 commented 2 years ago

Hi, it looks like the target's shape is incorrect or the values are out of bounds. Could you provide a sample of your targets before and after pre-processing?

taniabuzykina commented 2 years ago

Hi, my targets looked like this before passing them to the training code:

{"labels": ["damage", "damage"], "boxes": [[523, 302, 572, 343], [112, 280, 204, 335]]}

And then my targets list before training is just a list of file paths:

[screenshot: list of target file paths]

After splitting:

[screenshot: targets after splitting]

johschmidt42 commented 2 years ago

Could you reduce the batch size to 1? Does this error occur with every input-target pair or with a specific one?

If we can't pinpoint it to a single target, I'd suggest taking a closer look at the data & code together. Feel free to email me pieces of code I should look at. If you can provide a minimal example with e.g. input-target pairs & the pre-processing pipeline, that would be great. @taniabuzykina
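
For example, something along these lines (a sketch only; it assumes each dataset sample is a dict whose "y" entry holds the target, as in the tutorial's ObjectDetectionDataSet, and that num_classes matches Params.CLASSES):

    # sketch: walk the dataset one sample at a time and flag any label
    # outside the valid range [0, num_classes)
    num_classes = 2  # adjust to your Params.CLASSES
    for idx in range(len(dataset_train)):
        sample = dataset_train[idx]
        labels = sample["y"]["labels"]  # key names assume the tutorial's dataset format
        if labels.numel() and (labels.min() < 0 or labels.max() >= num_classes):
            print(f"sample {idx}: labels {labels.tolist()} are out of bounds")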

best, Johannes

taniabuzykina commented 2 years ago

Hi Johannes,

I just tried reducing the batch size to 1, but unfortunately the same issue occurs. Please see the images below with the target labels; I will provide two examples here.

Image 1, 004.png: [image]

Label 1, 004.json {"labels": ["damage", "damage"], "boxes": [[523, 302, 572, 343], [112, 280, 204, 335]]}

Image 2, 005.png: [image]

Label 2, 005.json {"labels": ["damage", "damage", "damage", "damage", "damage"], "boxes": [[12, 305, 55, 342], [98, 325, 133, 365], [154, 342, 195, 371], [257, 357, 291, 371], [79, 270, 113, 292]]}

When I tried debugging the train_script.py code with a breakpoint before the trainer.fit(..., train_dataloader=dataloader_train) call, I could check the input and target PyTorch tensors (batch size 1) before it produces an error:

[screenshot: input and target tensors for batch size 1]

As you can see from the label values, those are for images 004.png and 005.png before they are passed to train_dataloader.

As I step into the trainer.fit(...) call and it produces the error, I can see the following values for the inputs:

[screenshot: input values]

And the targets are set to 0s:

[screenshot: target values, all zeros]

I guess I sort of understand why the targets are set to 0s, but yeah... These are all my variables at this step:

[screenshot: all local variables at this step]

The code that produces the error comes right after the trainer init:

    # trainer init
    trainer = Trainer(
        gpus=params.GPU,
        precision=params.PRECISION,  # try 16 with enable_pl_optimizer=False
        callbacks=[checkpoint_callback, learningrate_callback, early_stopping_callback],
        default_root_dir=save_dir,  # where checkpoints are saved to
        logger=neptune_logger,
        log_every_n_steps=1,
        num_sanity_val_steps=0,
        max_epochs=params.MAXEPOCHS,
    )

This bit:

    # start training
    trainer.fit(
        model=task,
        train_dataloader=dataloader_train,
        val_dataloaders=dataloader_valid
    )

I believe I changed three code chunks in total, two of them about constructing the lists of inputs and targets used for training, validation and testing, but I triple-checked everything and they look fine.

    # root directory
    root = ROOT_PATH

    # input and target files
    inputs = get_filenames_of_path(root / 'Network Train' / 'Images')
    targets = get_filenames_of_path(root / 'Network Train' / 'Labels')

    inputs.sort()
    targets.sort()
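
(Since inputs and targets are matched by position after sorting, a quick pairing check can rule out misaligned files; a hypothetical sketch, assuming these are pathlib.Path objects:)

    # sanity check: every image should line up with the label file of the same stem
    for img, lbl in zip(inputs, targets):
        assert img.stem == lbl.stem, f"mismatched pair: {img.name} vs {lbl.name}"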

You can see how I split my test and training sets below; I shuffled them a bit because the images are slightly different as well - some are PNGs and some are JPGs:

    # test transformations
    transforms_test = ComposeDouble(
        [
            Clip(),
            FunctionWrapperDouble(np.moveaxis, source=-1, destination=0),
            FunctionWrapperDouble(normalize_01),
        ]
    )

    # random seed
    seed_everything(params.SEED)

    # IMPORTANT: TRAINING 10 + 8, VALIDATION 3 + 3, TESTING 3 + 3

    # training: inputs[:10] + inputs[16:24]

    train = inputs[:10]
    train.extend(inputs[16:24])
    # print("train inputs")
    # print(train)

    t_train = targets[:10]
    t_train.extend(targets[16:24])

    validate = inputs[10:13]
    validate.extend(inputs[24:27])

    t_validate = targets[10:13]
    t_validate.extend(targets[24:27])

    test = inputs[13:16]
    test.extend(inputs[27:30])

    t_test = targets[13:16]
    t_test.extend(targets[27:30])

    inputs_train, inputs_valid, inputs_test = train, validate, test
    targets_train, targets_valid, targets_test = t_train, t_validate, t_test
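
(One quick way to verify the manual slicing above: with 30 files in total (18 + 6 + 6, as per the comment), the three splits should be disjoint and cover everything. A small sanity check:)

    # hypothetical check for the manual index slices: no file in two splits, none dropped
    all_files = train + validate + test
    assert len(all_files) == len(set(all_files)) == 30  # 18 train + 6 val + 6 test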

The other thing I changed is Params: I added my name, the Neptune project name, and so on:

# hyper-parameters
@dataclass
class Params:
    BATCH_SIZE: int = 1
    OWNER: str = "taniabuzykina"  # set your name here, e.g. johndoe22
    SAVE_DIR: Optional[
        str
    ] = None  # checkpoints will be saved to cwd (current working directory)
    LOG_MODEL: bool = False  # whether to log the model to neptune after training
    GPU: Optional[int] = None  # set to None for cpu training
    LR: float = 0.001
    PRECISION: int = 32
    CLASSES: int = 1
    SEED: int = 42
    PROJECT: str = "WindBladesRCNN"
    EXPERIMENT: str = "test1"
    MAXEPOCHS: int = 500
    PATIENCE: int = 50
    BACKBONE: ResNetBackbones = ResNetBackbones.RESNET34
    FPN: bool = False
    ANCHOR_SIZE: Tuple[Tuple[int, ...], ...] = ((32, 64, 128, 256, 512),)
    ASPECT_RATIOS: Tuple[Tuple[float, ...]] = ((0.5, 1.0, 2.0),)
    MIN_SIZE: int = 1024
    MAX_SIZE: int = 1025
    IMG_MEAN: List = field(default_factory=lambda: [0.485, 0.456, 0.406])
    IMG_STD: List = field(default_factory=lambda: [0.229, 0.224, 0.225])
    IOU_THRESHOLD: float = 0.5

Initially, I didn't want to touch this bit:

MIN_SIZE: int = 1024
MAX_SIZE: int = 1025

but then I tried experimenting and changed the values to 400 and 600 respectively; the same error occurs regardless. The screenshots I provided above show the error generated with your original values, 1024 and 1025.

I'm sorry, that must be a lot of info; I just wanted to provide as many details as possible. Once again, thank you so much for your help with this issue.

Kind regards, Tetiana

johschmidt42 commented 2 years ago

Hi, thank you for this detailed response. I'll try to reproduce it and come back to you. Are these two images both supposed to be 586 × 371 in size, or were they downscaled by GitHub? @taniabuzykina

johschmidt42 commented 2 years ago

Hi @taniabuzykina, I could successfully train with the images and bounding boxes you provided. I will share the code later and will now investigate why it failed on your end.

taniabuzykina commented 2 years ago

Hi @johschmidt42, my apologies for not replying sooner; I've been very busy, and I'm afraid I will be even busier for the next couple of days. After that, I will look deeper into this matter, and I will also try my best to look into your solution as soon as you upload it. I'm endlessly grateful for your help!

taniabuzykina commented 2 years ago

Also, 586x371 is the original resolution of the images. Thank you once again for your help!

johschmidt42 commented 2 years ago

Hi @taniabuzykina, I found the issue: CLASSES: int = 1 should be CLASSES: int = 2

I know this is counterintuitive from a configuration perspective, but this is because class index 0 is reserved for the background: the network tries to generate an equal amount of negative and positive examples (no bounding box vs. existing bounding box) when matching against the "default boxes", so a single foreground class ("damage") requires num_classes = 2.
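
You can reproduce the exact error outside the model with plain PyTorch: cross-entropy over logits with a single class only accepts target index 0, so the foreground label 1 needs two classes. A minimal sketch:

    import torch
    import torch.nn.functional as F

    targets = torch.tensor([1, 0, 1, 0])  # label 1 = the "damage" class

    logits_1 = torch.randn(4, 1)  # num_classes = 1: only target 0 is valid
    # F.cross_entropy(logits_1, targets)  # raises IndexError: Target 1 is out of bounds.

    logits_2 = torch.randn(4, 2)  # num_classes = 2: background (0) + damage (1)
    loss = F.cross_entropy(logits_2, targets)  # works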

best, Johannes

johschmidt42 commented 2 years ago

Minimal example:

# imports
import os
import pathlib
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

import numpy as np
from pytorch_lightning import Trainer, seed_everything
from torch.utils.data import DataLoader

from pytorch_faster_rcnn_tutorial.backbone_resnet import ResNetBackbones
from pytorch_faster_rcnn_tutorial.datasets import ObjectDetectionDataSet
from pytorch_faster_rcnn_tutorial.faster_RCNN import (FasterRCNNLightning,
                                                      get_faster_rcnn_resnet)
from pytorch_faster_rcnn_tutorial.transformations import (
    Clip, ComposeDouble, FunctionWrapperDouble, normalize_01)
from pytorch_faster_rcnn_tutorial.utils import (collate_double,
                                                get_filenames_of_path)

# hyper-parameters
@dataclass
class Params:
    BATCH_SIZE: int = 2
    OWNER: str = "your_name"  # set your name here, e.g. johndoe22
    SAVE_DIR: Optional[
        str
    ] = None  # checkpoints will be saved to cwd (current working directory)
    LOG_MODEL: bool = False  # whether to log the model to neptune after training
    GPU: Optional[int] = None  # set to None for cpu training
    LR: float = 0.001
    PRECISION: int = 32
    CLASSES: int = 2
    SEED: int = 42
    PROJECT: str = "Project"
    EXPERIMENT: str = "project"
    MAXEPOCHS: int = 500
    PATIENCE: int = 50
    BACKBONE: ResNetBackbones = ResNetBackbones.RESNET34
    FPN: bool = False
    ANCHOR_SIZE: Tuple[Tuple[int, ...], ...] = ((32, 64, 128, 256, 512),)
    ASPECT_RATIOS: Tuple[Tuple[float, ...]] = ((0.5, 1.0, 2.0),)
    MIN_SIZE: int = 1024
    MAX_SIZE: int = 1025
    IMG_MEAN: List = field(default_factory=lambda: [0.485, 0.456, 0.406])
    IMG_STD: List = field(default_factory=lambda: [0.229, 0.224, 0.225])
    IOU_THRESHOLD: float = 0.5

def main():
    params = Params()

    # save directory
    save_dir = os.getcwd() if not params.SAVE_DIR else params.SAVE_DIR

    # root directory
    root = pathlib.Path("/Users/johannes/Documents/test")

    # input and target files
    inputs = get_filenames_of_path(root / "inputs")
    targets = get_filenames_of_path(root / "targets")

    inputs.sort()
    targets.sort()

    # mapping
    mapping = {
        "damage": 1,
    }

    # training transformations and augmentations
    transforms_training = ComposeDouble(
        [
            Clip(),
            FunctionWrapperDouble(np.moveaxis, source=-1, destination=0),
            FunctionWrapperDouble(normalize_01),
        ]
    )

    # random seed
    seed_everything(params.SEED)

    # training validation test split
    inputs_train = inputs
    targets_train = targets

    # dataset training
    dataset_train = ObjectDetectionDataSet(
        inputs=inputs_train,
        targets=targets_train,
        transform=transforms_training,
        use_cache=True,
        convert_to_format=None,
        mapping=mapping,
    )

    # dataloader training
    dataloader_train = DataLoader(
        dataset=dataset_train,
        batch_size=params.BATCH_SIZE,
        shuffle=True,
        num_workers=0,
        collate_fn=collate_double,
    )

    # model init
    model = get_faster_rcnn_resnet(
        num_classes=params.CLASSES,
        backbone_name=params.BACKBONE,
        anchor_size=params.ANCHOR_SIZE,
        aspect_ratios=params.ASPECT_RATIOS,
        fpn=params.FPN,
        min_size=params.MIN_SIZE,
        max_size=params.MAX_SIZE,
    )

    # lightning init
    task = FasterRCNNLightning(
        model=model, lr=params.LR, iou_threshold=params.IOU_THRESHOLD
    )

    # trainer init
    trainer = Trainer(
        gpus=params.GPU,
        precision=params.PRECISION,  # try 16 with enable_pl_optimizer=False
        default_root_dir=save_dir,  # where checkpoints are saved to
        logger=False,
        num_sanity_val_steps=0,
        max_epochs=params.MAXEPOCHS,
    )

    # start training
    trainer.fit(
        model=task, train_dataloader=dataloader_train, val_dataloaders=dataloader_train
    )

    print("Finished")

if __name__ == "__main__":
    main()

taniabuzykina commented 2 years ago

Hi @johschmidt42, I see! Understood now, thank you! Apologies for taking up your time over such a minute, stupid mistake!

johschmidt42 commented 2 years ago

No worries, I'm happy to help! If there's anything else, feel free to let me know. Cheers!