ultralytics / yolov3

YOLOv3 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

detection result is weird #1648

Closed ardeal closed 3 years ago

ardeal commented 3 years ago

Hi,

I customized the YAML to train on my 4 classes. I first recorded a long video (the camera is mounted on the ceiling to monitor a cubicle in the workplace) and then saved each frame as an image. The saved images were annotated for training. The background of the images is very simple; in total I annotated more than 4000 images and trained the network for 120 epochs.

With the trained weights file, I run detect.py. For some images the detection results are very good (every object that should be detected is detected), but for other images no objects are detected at all. For example, on the 331st frame the detection result is very good, but on the 332nd frame nothing is detected.

The 331st and 332nd frames are consecutive and extremely similar, yet the detection results are completely different.

Thanks and Best Regards, Ardeal

glenn-jocher commented 3 years ago

A picture is worth a thousand words...

ardeal commented 3 years ago

@glenn-jocher 👍 ,

Could I send the picture to your email?

glenn-jocher commented 3 years ago

@ardeal oh interesting. I don't have any specific advice, but the general advice applies in all cases here: train longer (300-1000 epochs for smaller datasets), use a larger model if smaller models are underperforming, use a larger --img-size, evolve hyps, etc.

ardeal commented 3 years ago

@glenn-jocher , Thank you!

I did 2 experiments: 1) train for 20 epochs: mAP during training reached 0.995 and 0.6; 2) train for 120 epochs: mAP during training reached 0.995 and 0.87.

Interestingly, detection with the weights from experiment 2) missed more images than with the weights from experiment 1).

From my images you can see that my environment is very simple, so the number of training images and epochs needed should be much smaller than what COCO requires.

Furthermore, all of my training images are 1920x1080. All images are captured directly through the camera API, which means they are well aligned: lighting, resolution, camera height, etc. are all unchanged.

During detection I noticed that, when an object is detected, its confidence in the pred tensor is greater than 0.9, which is very good. However, on another very similar image the confidence is very small. That is really weird.

One more question: the images in my dataset are 1080p, and --img-size is left at its defaults:

    # training (train.py)
    parser.add_argument('--img-size', nargs='+', type=int, default=[640, 640], help='[train, test] image sizes')
    # detection (detect.py)
    parser.add_argument('--img-size', type=int, default=640, help='inference size (pixels)')

Will these settings affect the detection result?
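For reference, my understanding of roughly what happens to a 1920x1080 frame at --img-size 640 (a minimal sketch assuming the usual letterbox behaviour of scaling the long side to --img-size and padding the short side to a multiple of the model stride; this is not the repo's exact letterbox() function):

    def letterboxed_shape(h, w, img_size=640, stride=32):
        # Scale so the longer side equals img_size, then pad the short side
        # up to the next multiple of the stride (minimal letterbox sketch).
        r = img_size / max(h, w)
        new_h, new_w = round(h * r), round(w * r)
        pad_h = (stride - new_h % stride) % stride
        pad_w = (stride - new_w % stride) % stride
        return new_h + pad_h, new_w + pad_w

    print(letterboxed_shape(1080, 1920))  # (384, 640): each 1080p frame is downscaled roughly 3x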

glenn-jocher commented 3 years ago

@ardeal you should train and detect at identical --img-size for best results. You probably want a validation set that's more representative of the real world use case so your mAP correlates better with deployed results.

[Screenshot attached: Screen Shot 2021-01-07 at 9 18 07 PM]
ardeal commented 3 years ago

@glenn-jocher ,

During training --img-size is [640, 640] and during detection it is 640, so they should be identical.

The images for training, validation, and detection all come from the same use case and the same camera; everything is the same, so it is essentially the same environment.

ardeal commented 3 years ago

@glenn-jocher ,

For the issue in this post, I found a very interesting problem: we captured many consecutive frames from the video, and we annotated some of them while ignoring other consecutive frames.

During test/detection, the images that were annotated are detected well, but the images that were not annotated are not detected well (none of the objects on them are detected).

Since the background of those images is very simple, training overfits easily.

For the code in this repo, is there any method to address this overfitting issue?

glenn-jocher commented 3 years ago

@ardeal it's up to you to research solutions for your dataset.

urbansound8K commented 3 years ago

@ardeal hey there,

I am going through your problem. Were you able to figure it out?

ardeal commented 3 years ago

@glenn-jocher , Thanks for your help! I have figured out the problem. The root cause is that I fed the network some training images that contain objects but were not annotated. Once I removed those unannotated images and re-trained the network, everything was fine.

glenn-jocher commented 3 years ago

@ardeal ah got it! Yes, it's important to make sure all objects are annotated in your dataset, otherwise your results may be unreliable as the model will not learn properly.

ardeal commented 3 years ago

@glenn-jocher , One thing I cannot understand is how negative and positive samples are handled during training. According to my understanding, for each anchor box YOLOv3 outputs 85 values (4 for xywh, 1 for objectness, 80 class confidences). How does the YOLOv3 network handle negative and positive samples?

glenn-jocher commented 3 years ago

@ardeal negative samples incur objectness loss only. Positive samples incur all losses.
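For reference, a minimal sketch of the per-anchor output layout this refers to (pred here is a hypothetical raw prediction vector for one grid cell and one anchor, not code from this repo):

    import torch

    pred = torch.randn(85)  # hypothetical raw prediction for one cell/anchor
    box = pred[0:4]         # x, y, w, h (raw values, before sigmoid/scaling)
    obj = pred[4]           # objectness logit: the only part trained for negative samples
    cls = pred[5:85]        # 80 class logits: trained (together with the box) only for positive samples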

ardeal commented 3 years ago

@glenn-jocher , Hi,

For the following function, if the targets tensor is empty, I have questions: 1) the offsets tensor is not really defined (it is just the scalar 0), so might the line gij = (gxy - offsets).long() be wrong? 2) if the targets tensor is empty, everything in indices, tbox, and tcls is empty, so how is the loss calculated in compute_loss? 3) how does the code handle negative samples? I feel like the code does not handle negative samples, but I don't know whether my understanding is correct.

def build_targets(p, targets, model):
    # Build targets for compute_loss(), input targets(image,class,x,y,w,h)
    det = model.module.model[-1] if is_parallel(model) else model.model[-1]  # Detect() module
    na, nt = det.na, targets.shape[0]  # number of anchors, targets
    tcls, tbox, indices, anch = [], [], [], []
    gain = torch.ones(7, device=targets.device)  # normalized to gridspace gain
    ai = torch.arange(na, device=targets.device).float().view(na, 1).repeat(1, nt)  # same as .repeat_interleave(nt)
    targets = torch.cat((targets.repeat(na, 1, 1), ai[:, :, None]), 2)  # append anchor indices

    g = 0.5  # bias
    off = torch.tensor([[0, 0],
                        # [1, 0], [0, 1], [-1, 0], [0, -1],  # j,k,l,m
                        # [1, 1], [1, -1], [-1, 1], [-1, -1],  # jk,jm,lk,lm
                        ], device=targets.device).float() * g  # offsets

    for i in range(det.nl):
        anchors = det.anchors[i]
        gain[2:6] = torch.tensor(p[i].shape)[[3, 2, 3, 2]]  # xyxy gain

        # Match targets to anchors
        t = targets * gain
        if nt:
            # Matches
            r = t[:, :, 4:6] / anchors[:, None]  # wh ratio
            j = torch.max(r, 1. / r).max(2)[0] < model.hyp['anchor_t']  # compare
            # j = wh_iou(anchors, t[:, 4:6]) > model.hyp['iou_t']  # iou(3,n)=wh_iou(anchors(3,2), gwh(n,2))
            t = t[j]  # filter

            # Offsets
            gxy = t[:, 2:4]  # grid xy
            gxi = gain[[2, 3]] - gxy  # inverse
            j, k = ((gxy % 1. < g) & (gxy > 1.)).T
            l, m = ((gxi % 1. < g) & (gxi > 1.)).T
            j = torch.stack((torch.ones_like(j),))
            t = t.repeat((off.shape[0], 1, 1))[j]
            offsets = (torch.zeros_like(gxy)[None] + off[:, None])[j]
        else:
            t = targets[0]
            offsets = 0

        # Define
        b, c = t[:, :2].long().T  # image, class
        gxy = t[:, 2:4]  # grid xy
        gwh = t[:, 4:6]  # grid wh
        gij = (gxy - offsets).long()
        gi, gj = gij.T  # grid xy indices

        # Append
        a = t[:, 6].long()  # anchor indices
        indices.append((b, a, gj.clamp_(0, gain[3] - 1), gi.clamp_(0, gain[2] - 1)))  # image, anchor, grid indices
        tbox.append(torch.cat((gxy - gij, gwh), 1))  # box
        anch.append(anchors[a])  # anchors
        tcls.append(c)  # class

    return tcls, tbox, indices, anch
glenn-jocher commented 3 years ago

@ardeal target-building (the code you pasted) works correctly. Targets incur all losses (obj, cls, box), empty areas (no targets) incur only obj loss.

Background/empty images are handled correctly.

ardeal commented 3 years ago

@glenn-jocher , Thanks for your reply!

"empty areas (no targets) incur only obj loss" — how and where does the code do this? Could you please point it out?

In the build_targets function, if nt == 0 then t = targets[0], which means t is empty, since targets is empty.

For example, if there are 5 boxes labelled as targets in one image, how is the rest of the image handled as negative samples, and where is that code?

For example, if there are no boxes labelled as targets in one image, how is the image handled as a negative sample, and where is that code? In build_targets, targets is empty, so nothing will be done, right?

My doubt is: how are background/empty images handled?

ardeal commented 3 years ago

The following labels are copied from a COCO label file:

Does class label 0 mean a negative sample/box?

27 0.530258 0.813260 0.102234 0.373479   
0 0.110391 0.766292 0.220781 0.462917   
0 0.806555 0.804365 0.152672 0.391271   
0 0.214297 0.697792 0.143281 0.355458   
0 0.942922 0.693979 0.114156 0.260167   
0 0.075586 0.405979 0.021984 0.092333   
0 0.304430 0.643573 0.074984 0.098646   
0 0.705258 0.586000 0.067703 0.100375   
0 0.641781 0.532667 0.050875 0.055583   
0 0.939250 0.887135 0.120687 0.225062   
24 0.663227 0.777552 0.120516 0.444896   
0 0.507852 0.580938 0.575109 0.811667   
0 0.815102 0.572406 0.013141 0.020063   
0 0.865328 0.714500 0.115312 0.304125   
ardeal commented 3 years ago

@glenn-jocher

I reviewed the cache_labels function. The else branch means that if there is nothing in the label file, the image is treated as negative and it is labelled as 0.

I have 2 doubts: 1) there are 80 classes in the COCO dataset and the class labels run from 0 to 79, which means 0 is a positive class, not negative; 2) in the else branch of cache_labels, an empty label file is treated as negative, but the unlabelled regions of labelled images are not treated as negative. Is my understanding correct?

    def cache_labels(self, path=Path('./labels.cache')):
        # Cache dataset labels, check images and read shapes
        x = {}  # dict
        nm, nf, ne, nc = 0, 0, 0, 0  # number missing, found, empty, duplicate
        pbar = tqdm(zip(self.img_files, self.label_files), desc='Scanning images', total=len(self.img_files))
        for i, (im_file, lb_file) in enumerate(pbar):
            try:
                # verify images
                im = Image.open(im_file)
                im.verify()  # PIL verify
                shape = exif_size(im)  # image size
                assert (shape[0] > 9) & (shape[1] > 9), 'image size <10 pixels'

                # verify labels
                if os.path.isfile(lb_file):
                    nf += 1  # label found
                    with open(lb_file, 'r') as f:
                        l = np.array([x.split() for x in f.read().strip().splitlines()], dtype=np.float32)  # labels
                    if len(l):
                        assert l.shape[1] == 5, 'labels require 5 columns each'
                        assert (l >= 0).all(), 'negative labels'
                        assert (l[:, 1:] <= 1).all(), 'non-normalized or out of bounds coordinate labels'
                        # assert np.unique(l, axis=0).shape[0] == l.shape[0], 'duplicate labels'
                    else:
                        ne += 1  # label empty
                        l = np.zeros((0, 5), dtype=np.float32)
                else:
                    nm += 1  # label missing
                    l = np.zeros((0, 5), dtype=np.float32)
                x[im_file] = [l, shape]
            except Exception as e:
                nc += 1
                print('WARNING: Ignoring corrupted image and/or label %s: %s' % (im_file, e))

            pbar.desc = f"Scanning '{path.parent / path.stem}' for images and labels... " \
                        f"{nf} found, {nm} missing, {ne} empty, {nc} corrupted"

        if nf == 0:
            print(f'WARNING: No labels found in {path}. See {help_url}')

        x['hash'] = get_hash(self.label_files + self.img_files)
        x['results'] = [nf, nm, ne, nc, i + 1]
        torch.save(x, path)  # save for next time
        logging.info(f"New cache created: {path}")
        return x
glenn-jocher commented 3 years ago

@ardeal all aspects of caching, target building, and loss computation are handled correctly, irrespective of image contents. All relevant code is in loss.py.

ardeal commented 3 years ago

@glenn-jocher ,

Thank you for your reply!

I really cannot understand two things. 1) In the cache_labels function:

                    if len(l):
                        assert l.shape[1] == 5, 'labels require 5 columns each'
                        assert (l >= 0).all(), 'negative labels'
                        assert (l[:, 1:] <= 1).all(), 'non-normalized or out of bounds coordinate labels'
                        # assert np.unique(l, axis=0).shape[0] == l.shape[0], 'duplicate labels'
                    else:
                        ne += 1  # label empty
                        l = np.zeros((0, 5), dtype=np.float32)

Does this mean that empty images are cached as 0? However, in the label file there is also class label 0. Will they conflict?

2) In the build_targets function, if nt == 0 then t = targets[0], which means t is empty, since targets is empty. Is my understanding correct?

        if nt:
            # Matches
            r = t[:, :, 4:6] / anchors[:, None]  # wh ratio
            j = torch.max(r, 1. / r).max(2)[0] < model.hyp['anchor_t']  # compare
            # j = wh_iou(anchors, t[:, 4:6]) > model.hyp['iou_t']  # iou(3,n)=wh_iou(anchors(3,2), gwh(n,2))
            t = t[j]  # filter

            # Offsets
            gxy = t[:, 2:4]  # grid xy
            gxi = gain[[2, 3]] - gxy  # inverse
            j, k = ((gxy % 1. < g) & (gxy > 1.)).T
            l, m = ((gxi % 1. < g) & (gxi > 1.)).T
            j = torch.stack((torch.ones_like(j),))
            t = t.repeat((off.shape[0], 1, 1))[j]
            offsets = (torch.zeros_like(gxy)[None] + off[:, None])[j]
        else:
            t = targets[0]
            offsets = 0
glenn-jocher commented 3 years ago

@ardeal there are no conflicts; the code works correctly for caching and target building. I don't have time to walk you through the code, I'm sorry.
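That said, here is a quick sketch of why there is no conflict (hypothetical label-file contents, mirroring the parsing in the cache_labels code you pasted above): an empty label file produces a 0-row array, while a class-0 object produces a normal 1-row label whose first column just happens to be 0.

    import numpy as np

    empty_file = ""                      # background image: label file has no lines
    person_file = "0 0.5 0.5 0.2 0.3"    # one object of class 0 ('person' in COCO)

    def parse(txt):
        # Same parsing idea as cache_labels: split lines into a float array,
        # and fall back to a (0, 5) array when the file is empty.
        l = np.array([x.split() for x in txt.strip().splitlines()], dtype=np.float32)
        return l if len(l) else np.zeros((0, 5), dtype=np.float32)

    print(parse(empty_file).shape)   # (0, 5): zero rows, so it contributes no targets
    print(parse(person_file).shape)  # (1, 5): one target whose class id happens to be 0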

ardeal commented 3 years ago

@glenn-jocher ,

Sorry to take up your time with my questions! Let me lay out my understanding so that it takes you much less time to answer.

The following code is copied from the compute_loss function. It computes the loss in 2 steps: step 1: ps = pi[b, a, gj, gi] selects, from the output layer's map, the cells/boxes that correspond to labelled boxes; step 2: it computes the losses on that subset.

Obviously, the code only computes a loss where a box is labelled; for boxes that are not labelled, it takes no action. During training only the labelled regions/boxes are trained, and the network's weights converge on objectness there, i.e. for labelled regions/boxes the network converges to a smaller loss.

However, how does the code handle negative samples/regions/cells/boxes? During inference/forward, for each cell and anchor box the code outputs a vector of length 85, and the objectness component of that vector is used to judge whether there is an object: if the objectness is greater than the threshold there is an object, otherwise there is not.

    for i, pi in enumerate(p):  # layer index, layer predictions
        b, a, gj, gi = indices[i]  # image, anchor, gridy, gridx
        tobj = torch.zeros_like(pi[..., 0], device=device)  # target obj
        n = b.shape[0]  # number of targets
        if n:
            nt += n  # cumulative targets
            ps = pi[b, a, gj, gi]  # prediction subset corresponding to targets

            # Regression
            pxy = ps[:, :2].sigmoid() * 2. - 0.5
            pwh = (ps[:, 2:4].sigmoid() * 2) ** 2 * anchors[i]
            pbox = torch.cat((pxy, pwh), 1).to(device)  # predicted box
            iou = bbox_iou(pbox.T, tbox[i], x1y1x2y2=False, CIoU=True)  # iou(prediction, target)
            lbox += (1.0 - iou).mean()  # iou loss

            # Objectness
            tobj[b, a, gj, gi] = (1.0 - model.gr) + model.gr * iou.detach().clamp(0).type(tobj.dtype)  # iou ratio

            # Classification
            if model.nc > 1:  # cls loss (only if multiple classes)
                t = torch.full_like(ps[:, 5:], cn, device=device)  # targets
                t[range(n), tcls[i]] = cp
                lcls += BCEcls(ps[:, 5:], t)  # BCE

            # Append targets to text file
            # with open('targets.txt', 'a') as file:
            #     [file.write('%11.5g ' * 4 % tuple(x) + '\n') for x in torch.cat((txy[i], twh[i]), 1)]

        lobj += BCEobj(pi[..., 4], tobj) * balance[i]  # obj loss
glenn-jocher commented 3 years ago

@ardeal I've already answered your question in my previous post.

@ardeal target-building (the code you pasted) works correctly. Targets incur all losses (obj, cls, box), empty areas (no targets) incur only obj loss.

Background/empty images are handled correctly.
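To spell this out with a minimal sketch (not the repo's compute_loss; the shapes and the single matched cell below are made up for illustration): the objectness target map starts at zero for every cell and anchor, only matched cells get a positive target, and the obj BCE runs over the whole map, so unmatched (background) cells are pushed toward zero objectness while only matched cells also receive box/cls losses.

    import torch

    bce = torch.nn.BCEWithLogitsLoss()

    pi = torch.randn(1, 3, 13, 13, 85)   # hypothetical layer predictions: image, anchor, grid y, grid x, outputs
    tobj = torch.zeros_like(pi[..., 0])  # objectness target: zero for every cell/anchor by default

    # Suppose build_targets matched one labelled box to image 0, anchor 1, cell (7, 5):
    b, a, gj, gi = 0, 1, 7, 5
    tobj[b, a, gj, gi] = 1.0             # positive cell (in the real code this is an IoU-based value)
    ps = pi[b, a, gj, gi]                # only this prediction subset would get box/cls losses

    lobj = bce(pi[..., 4], tobj)         # obj loss covers ALL cells, so background cells are trained too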

ardeal commented 3 years ago

For this issue, I eventually figured out the reason. For example, YOLO divides the final layer's map into 13x13 cells; some cells correspond to positive labels/boxes, and the others correspond to unlabelled boxes, which are treated as negative labels/boxes.

This means that, within one image, labelled boxes are treated as positive boxes and unlabelled boxes are treated as negative boxes.

In other words, the unlabelled regions of an image must not contain positive targets, because those regions are trained as negatives.

So, as long as positive targets exist in an image, they must all be labelled correctly. A rough count of what that means is sketched below.
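(A toy calculation assuming a single 13x13 detection layer, 3 anchors per cell, and 5 labelled boxes each matched to exactly one cell/anchor; the real code uses 3 layers and can match a box to several anchors.)

    cells = 13 * 13               # 169 grid cells in this layer
    slots = cells * 3             # 507 anchor slots
    positives = 5                 # matched, labelled boxes
    negatives = slots - positives
    print(slots, negatives)       # 507 502 -> the overwhelming majority of slots train as background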

glenn-jocher commented 3 years ago

@ardeal yes, all instances must be labelled in your training data. If you label 1 person but not another in your images, then training will not work very well.

ardeal commented 3 years ago

:grinning::grinning::grinning: 👍👍👍

glenn-jocher commented 11 months ago

@ardeal 🙂👍 Always happy to help! If you have any more questions or need further assistance, feel free to ask!