johschmidt42 / PyTorch-Object-Detection-Faster-RCNN-Tutorial

Only AP > 0 for one class out of 36 #4

Closed ozzyou closed 2 years ago

ozzyou commented 3 years ago

Dear John, Thanks for the very clear and nice tutorial. I am currently trying to implement Faster-RCNN on ActionGenome using your implementation. This dataset has 36 classes. I managed to get the model to train by setting the right parameters in training_script.py; however, I'm only getting an AP > 0 for one class, the "person" class (around 0.55). The dataset is a bit unbalanced towards that class, but not severely. I've checked the ground truth annotations and how they look right before being passed to the model, and everything seems fine. Since your tutorial was just for one class, I was wondering if you ever experienced similar issues. If yes, how did you solve them? Thanks!

johschmidt42 commented 3 years ago

Hey, thanks for reaching out. I've trained some models using this repo which had 1-5 classes without problems. Although I must admit that an underrepresented class is more difficult to train, it's certainly strange that only 1 class performs well while the other classes have an AP of around 0. Could you share some details regarding your training script? What sizes are your images and what's the backbone you are using? What is the average size of the bounding box for each class? What is the class distribution of your training dataset?
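
If it helps, here is a minimal sketch of how I would gather those statistics. It assumes your targets are dicts with "boxes" in (x1, y1, x2, y2) format and "labels" tensors (the usual torchvision convention), so adapt the loading to however your ActionGenome annotations are actually stored:

from collections import defaultdict

def per_class_stats(targets):
    """targets: iterable of dicts with 'boxes' (N, 4) and 'labels' (N,) tensors (assumed format)."""
    sizes = defaultdict(list)   # label -> list of (width, height)
    counts = defaultdict(int)   # label -> number of boxes
    for target in targets:
        boxes, labels = target["boxes"], target["labels"]
        widths = boxes[:, 2] - boxes[:, 0]
        heights = boxes[:, 3] - boxes[:, 1]
        for w, h, label in zip(widths, heights, labels):
            sizes[int(label)].append((float(w), float(h)))
            counts[int(label)] += 1
    total = sum(counts.values())
    for label, wh in sorted(sizes.items()):
        mean_w = sum(w for w, _ in wh) / len(wh)
        mean_h = sum(h for _, h in wh) / len(wh)
        print(f"class {label}: mean box {mean_w:.0f}x{mean_h:.0f} px, share {counts[label] / total:.3f}")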

ozzyou commented 3 years ago

Hi, thanks for the quick reply. The dataset is ActionGenome, in which the (min_x, min_y, max_x, max_y) sizes of the images are (266, 264, 480, 480). Below are the parameters I'm using. When trying to overfit on a small subset of 309 training images I get the same behaviour. The class distribution is given below the parameters.

params = {
    "BATCH_SIZE": 32,
    "OWNER": "ozzyou",
    "SAVE_DIR": None,  # checkpoints will be saved to cwd
    "LOG_MODEL": False,  # whether to log the model to neptune after training
    "GPU": 1,  # set to None for cpu training
    "LR": 0.001,
    "PRECISION": 32,
    "CLASSES": 37,
    "SEED": 42,
    "PROJECT": "fastrcnn",
    "EXPERIMENT": "1",
    "MAXEPOCHS": 200,
    "PATIENCE": 200,
    "BACKBONE": "resnet34",
    "FPN": False,
    "ANCHOR_SIZE": ((15, 30, 60, 120, 240),),
    "ASPECT_RATIOS": ((0.5, 1.0, 2.0),),
    "MIN_SIZE": 264,
    "MAX_SIZE": 480,
    "IMG_MEAN": [0.34626717, 0.37275087, 0.41505611],
    "IMG_STD": [0.12789275, 0.12214862, 0.12850415],
    "IOU_THRESHOLD": 0.5,
}

{'bag': 0.0227218021480478,
 'person': 0.29790586227269744,
 'bed': 0.019491176453035573,
 'blanket': 0.02294858252319028,
 'book': 0.027657259259701326,
 'box': 0.016206839616542235,
 'broom': 0.010953094259074696,
 'chair': 0.04322354378154286,
 'closet/cabinet': 0.022584540342040507,
 'clothes': 0.029483438070059223,
 'cup/glass/bottle': 0.04440717819566372,
 'dish': 0.029493384577740908,
 'door': 0.024520130736896968,
 'doorknob': 0.0064015723439343215,
 'doorway': 0.021591878875408057,
 'floor': 0.03296272645711364,
 'food': 0.05258121820847482,
 'groceries': 0.0061389845411377614,
 'laptop': 0.020008394852483343,
 'light': 0.0026238887264292635,
 'medicine': 0.00564762706166238,
 'mirror': 0.010996858892874124,
 'paper/notebook': 0.017080142990994432,
 'phone/camera': 0.02612748637825773,
 'picture': 0.006648245734440181,
 'pillow': 0.014219527381740997,
 'refrigerator': 0.0062444175225636524,
 'sandwich': 0.01793554265161959,
 'shelf': 0.015765214675475293,
 'shoe': 0.011402676406286989,
 'sofa/couch': 0.021625697001525793,
 'table': 0.05414281991449982,
 'television': 0.00545068620956496,
 'towel': 0.021575964463117357,
 'vacuum': 0.0045196930905589735,
 'window': 0.0067119033836029835}
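
For completeness, this is roughly how I understand the ANCHOR_SIZE / ASPECT_RATIOS values above are consumed by torchvision's AnchorGenerator (the exact wiring inside the model builder may differ; the nested-tuple shape is what matters):

from torchvision.models.detection.anchor_utils import AnchorGenerator

anchor_generator = AnchorGenerator(
    sizes=((15, 30, 60, 120, 240),),   # one inner tuple per feature map level
    aspect_ratios=((0.5, 1.0, 2.0),),  # one inner tuple per feature map level
)
# Without an FPN there is only a single feature map, hence a single inner tuple.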

johschmidt42 commented 3 years ago

Did you check the anchor boxes that are created with the AnchorViewer? If yes, could you compare them to your actual bounding boxes? Try setting FPN to True and finding adequate anchor boxes for every level. The rest looks good to me. Are the images already scaled or normalized (before they are normalized here)?
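
A rough way to do that comparison numerically (just a sketch; it ignores the stride offset of the anchor grid, so it's an optimistic upper bound): for each ground-truth box, compute the best IoU an anchor of the configured sizes/aspect ratios could reach when centred on that box. If that value is already well below 0.5, the RPN will hardly find any positive anchors for that box:

import math

ANCHOR_SIZES = (15, 30, 60, 120, 240)
ASPECT_RATIOS = (0.5, 1.0, 2.0)

def best_centred_iou(box_w, box_h):
    # torchvision convention: anchor height = size * sqrt(ratio), width = size / sqrt(ratio)
    best = 0.0
    for size in ANCHOR_SIZES:
        for ratio in ASPECT_RATIOS:
            a_h = size * math.sqrt(ratio)
            a_w = size / math.sqrt(ratio)
            inter = min(box_w, a_w) * min(box_h, a_h)
            union = box_w * box_h + a_w * a_h - inter
            best = max(best, inter / union)
    return best

print(best_centred_iou(40, 25))    # e.g. a small object such as a doorknob
print(best_centred_iou(200, 400))  # e.g. a large object such as a person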

ozzyou commented 3 years ago

Unfortunately I could not visualize in napari, but I used matplotlib on the generated anchors. The first image is what I used for training: anchor_size=(15, 30, 60, 120, 240)/aspect_ratios=(0.5, 1.0, 2.0), while the second picture is with anchor_size=(60, 120, 240)/aspect_ratios=(0.5, 1.0, 2.0). It gives the same 0 AP. Seems like something is off with the anchor points. The images are not scaled or normalized. [two screenshots of the generated anchor boxes were attached]

johschmidt42 commented 3 years ago

I think it's an issue with the anchor values. Can you try to use FPN=True and then something like anchor_sizes=((240,), (120,), (60,), (30,), (15,)) - assuming you have 4 or 5 layers that are used to create the anchor boxes from the FPN? Or go super overkill with ((240, 120, 60, 30, 15),) * 4 to be really sure that the algorithm creates anchor boxes that match well with the actual bounding boxes. Please note that the arg anchor_sizes needs to be of type Tuple[Tuple[int]]. Without FPN you basically take only the last feature map of the backbone (resnet34) to map the anchor positions onto your images. I'm not sure that with images of size 480x480 (h, w) and a resnet34 you'll get the anchor boxes that are shown in the images you posted (I could be wrong!).

Also, because the images are small and the objects are probably small as well, it is really important to have good anchor boxes. Without them, the object detector performs poorly in my experience. Let me know if anything is unclear to you.
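
To make the FPN idea concrete, here is a rough sketch built directly with torchvision (the model builder in this repo wraps things differently, and argument names can vary between torchvision versions). The smallest anchors should end up on the highest-resolution feature map, so the sizes are listed in ascending order in the sketch:

from torchvision.models.detection.anchor_utils import AnchorGenerator
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone
from torchvision.models.detection.faster_rcnn import FasterRCNN

# resnet_fpn_backbone returns 5 feature maps (4 FPN levels plus a pooled level),
# so both sizes and aspect_ratios need one inner tuple per level.
anchor_sizes = ((15,), (30,), (60,), (120,), (240,))
aspect_ratios = ((0.5, 1.0, 2.0),) * len(anchor_sizes)

backbone = resnet_fpn_backbone("resnet34", pretrained=False)  # may be a 'weights' kwarg in newer torchvision
model = FasterRCNN(
    backbone,
    num_classes=37,  # 36 classes + background
    rpn_anchor_generator=AnchorGenerator(sizes=anchor_sizes, aspect_ratios=aspect_ratios),
    min_size=264,
    max_size=480,
)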

johschmidt42 commented 2 years ago

Closing this issue as it was resolved.