Maybe I found a bug in the source code

julingers commented 2 years ago

Hi, @joshi-bharat. it looks like that I found a bug in the test code.

For test_images_list.py, when you defined compute graph, you called pose_loss.predict() function (see code) to get filted x,y,conf, selected values. It looks fine. However, you used if-else structure (see code) to control data flow in the pose_loss.predict() function. In fact, compute graph is static in tensorflow1.0 . So the program always executes only the else branch (see code), regardless of whether bbox_mask is None. So bbox_masks don't work in fact.

By the way, the conf here does not use the sigmoid fuction. So the conf isn't in 0~1, that results validmask to always be True.

joshi-bharat commented 2 years ago

Actually, the bug was at https://github.com/joshi-bharat/deep_underwater_localization/blob/faa65474cb9aa71dfd806bd624831b2f97b2b794/utils/misc_utils.py#L166. I might have added that typo later during refactor because I used the predict bounding box mask to plot images in the paper. Also, added sigmoid in conf.

Can you check if the issue is fixed or not? I actually do not have a computer set up for this.

julingers commented 2 years ago

@joshi-bharat , I debugged the code, and the bug is not at 'if box is None'.

I changed the get_box_mask() function as follows. As you say, now the return value bbox_masks should not be alway None theoretically. Actually, it is None at first. For the first time it executes, it chooses the else branch. And after that every time it executes the else branch when feeds data even bbox_masks is not None. For you have defined compute graph, it cann't be changed even you feed different data. Cause that tensorflow1.0 is static graph, it cann't be updated.

At last, I haven't found a solution yet. I tried tf.cond() to control data flow, but met some errors. Maybe need to change framework to dynamic graph like pytorch or tensorflow2.0, but it's a lot of work.

def get_bbox_mask(bbox, img_size=(416, 416)):

    if bbox is None:
        return None

    # 新建与13，26，52对应特征图的维度的全0张量。
    y_true_13 = np.zeros((img_size[1] // 32, img_size[0] // 32), np.float32)  # (13, 13)
    y_true_26 = np.zeros((img_size[1] // 16, img_size[0] // 16), np.float32)  # (26, 26)
    y_true_52 = np.zeros((img_size[1] // 8, img_size[0] // 8), np.float32)    # (52, 52)

    try:
        for box in bbox:
            x1 = box[0]
            y1 = box[1]
            x2 = box[2]
            y2 = box[3]

            # 对应于原图的坐标进行缩小32倍，即为相对于特征图(13， 13)的坐标。
            scale_13_x1 = int(x1 // 32)
            scale_13_y1 = int(y1 // 32)
            scale_13_x2 = int(x2 // 32)
            scale_13_y2 = int(y2 // 32)
            # 在之前新建的(13， 13)全0张量上，将2d检测框区域的值都设置为1。
            y_true_13[int(scale_13_y1):int(scale_13_y2) + 1, int(scale_13_x1):int(scale_13_x2) + 1] = 1

            # 对应于原图的坐标进行缩小16倍，即为相对于特征图(26， 26)的坐标。
            scale_26_x1 = int(x1 // 16)
            scale_26_y1 = int(y1 // 16)
            scale_26_x2 = int(x2 // 16)
            scale_26_y2 = int(y2 // 16)
            # 在之前新建的(26， 26)全0张量上，将2d检测框区域的值都设置为1。
            y_true_26[int(scale_26_y1):int(scale_26_y2) + 1, int(scale_26_x1):int(scale_26_x2) + 1] = 1

            # 对应于原图的坐标进行缩小8倍，即为相对于特征图(52， 52)的坐标。
            scale_52_x1 = int(x1 // 8)
            scale_52_y1 = int(y1 // 8)
            scale_52_x2 = int(x2 // 8)
            scale_52_y2 = int(y2 // 8)
            # 在之前新建的(52， 52)全0张量上，将2d检测框区域的值都设置为1。
            y_true_52[int(scale_52_y1):int(scale_52_y2) + 1, int(scale_52_x1):int(scale_52_x2) + 1] = 1

        y_true_13 = tf.convert_to_tensor(y_true_13)
        y_true_26 = tf.convert_to_tensor(y_true_26)
        y_true_52 = tf.convert_to_tensor(y_true_52)

        return y_true_13, y_true_26, y_true_52

    except:
        return None

joshi-bharat commented 2 years ago

I am not sure about TensorFlow static because I remember getting the correct bounding box. I will need to verify this issue. I do not currently have the hardware set up for this. I will try to replicate it as soon as possible. One solution could be to not do that calculation inside pose loss and just return values; then doing the calculation in NumPy later the graph execution is finished.

julingers commented 2 years ago

@joshi-bharat, I have fixed it by deprecating if-else and using try except method instead. And I test it，the accuracy is about the same as before, why hasn't the accuracy improved, even a little bit? What do you think？

joshi-bharat commented 2 years ago

If the issue is a static graph, then there should not be any if conditions. I will check this maybe on the weekend. I was getting bounding box predictions during my experiments as I have displayed in results on the paper.

How did you find this issue meaning which test set were you running?

julingers commented 2 years ago

You need to remove the initial if bbox is None: return None if you want to make this work.

yes, I changed. And I also changed the if-else code. And now it works correctly. Tomorrow I will push my code to communicate with you.

joshi-bharat commented 2 years ago

Does it changes the accuracy at all?

julingers commented 2 years ago

How did you find this issue meaning which test set were you running?

I debugged the code for my own synthetics dataset, and suddenly found that the confs are negative values or greater than 1 values. so I found your conf didn't add sigmoid function in the else branch, but in the if branch you have added sigmoid function. Finally, I found the program only executes else branch.

julingers commented 2 years ago

testing in your Aquar pool dataset, using my reproduced checkpoint after fix

before fix

Looks like the two are about the same.

joshi-bharat commented 2 years ago

Yeah, they are almost the same. Did you use cyclegan or any new GAN algorithms?

julingers commented 2 years ago

I just used cyclegan to synthesize my own dataset.

julingers commented 2 years ago

@joshi-bharat , I reforked your latest code and fixed the issue. I tested it before creating pull request. Now it can work correctly.

joshi-bharat / deep_underwater_localization

Maybe I found a bug in the source code #10