j96w / DenseFusion

"DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion" code repository
https://sites.google.com/view/densefusion
MIT License

Own dataset training results are not accurate #207

Open fbas-est opened 2 years ago

fbas-est commented 2 years ago

Hi,

I'm trying to train the network on my own dataset, but the results are not good enough even though the model converges. I have a dataset of 3000 annotated images in total. My camera is a RealSense D415 depth camera with the following parameters:

"fx": 607.3137817382812
"fy": 606.8499145507812
"ppx": 330.49334716796875
"ppy": 239.25704956054688
"height": 480
"width": 640
"depth_scale": 0.0010000000474974513

I created my own dataset.py based on LineMOD's dataset.py, but I changed the following lines:

cam_scale = 1.0
pt2 = depth_masked / cam_scale
pt0 = (ymap_masked - self.cam_cx) * pt2 / self.cam_fx
pt1 = (xmap_masked - self.cam_cy) * pt2 / self.cam_fy
cloud = np.concatenate((pt0, pt1, pt2), axis=1)
cloud = cloud / 1000.0

to:

cam_scale = self.cam_scale  # 0.0010000000474974513
pt2 = depth_masked * cam_scale
pt0 = (ymap_masked - self.cam_cx) * pt2 / self.cam_fx
pt1 = (xmap_masked - self.cam_cy) * pt2 / self.cam_fy
cloud = np.concatenate((pt0, pt1, pt2), axis=1)
cloud = cloud

I also removed every division by 1000 in the code because my mesh values are already in meters.
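For context, a minimal sketch of the full back-projection under these assumptions (the D415 intrinsics above, a raw uint16 depth image in RealSense units, and a boolean object mask; the function name is illustrative):

import numpy as np

# Back-project masked depth pixels into a metric point cloud (sketch).
def depth_to_cloud(depth, mask,
                   cam_fx=607.3137817382812, cam_fy=606.8499145507812,
                   cam_cx=330.49334716796875, cam_cy=239.25704956054688,
                   depth_scale=0.0010000000474974513):
    rows, cols = np.nonzero(mask & (depth > 0))             # object pixels with valid depth
    z = depth[rows, cols].astype(np.float32) * depth_scale  # raw units -> meters
    x = (cols.astype(np.float32) - cam_cx) * z / cam_fx     # pixel column -> X (m)
    y = (rows.astype(np.float32) - cam_cy) * z / cam_fy     # pixel row    -> Y (m)
    return np.stack((x, y, z), axis=1)                      # (N, 3) cloud in meters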

The object's diameter is 0.324. The estimator's loss is 0.0146578 and the refiner's loss is 0.01338558.

Any idea what is wrong with my implementation? Thanks.

jc0725 commented 2 years ago

@fbas-est Hello. This is unrelated to your question, but I am also trying to use DenseFusion on my own dataset. May I ask what your environment settings are (CUDA version, etc.), and the steps for how you successfully managed to build using your own dataset? Thank you in advance.

Xushuangyin commented 2 years ago

Hello, I'm also building my own dataset for training and using a RealSense camera to estimate object poses. I've also run into some problems. Would it be convenient to exchange contact information? My WeChat is 18845107925

fbas-est commented 2 years ago

@jc0725 Hello. I use CUDA 10.1 and PyTorch 1.6. To build my dataset I used ObjectDatasetTools; you can find the source code on GitHub: https://github.com/F2Wang/ObjectDatasetTools. To make it work, I changed the dataset format to comply with the format of DenseFusion's LineMOD dataset.
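In case it helps, a small sketch of the kind of layout check I mean, using the folder names that the LineMOD-style loader posted later in this thread opens (the root path and object id are placeholders; adjust names, extensions and ids to your own conversion):

import os

# Check that one converted object folder has the pieces a LineMOD-style
# DenseFusion loader expects (names follow the dataset.py shown further down).
def check_object_folder(root, obj_id):
    obj = '%d' % obj_id
    required = [
        os.path.join(root, 'data', obj, 'train.txt'),
        os.path.join(root, 'data', obj, 'test.txt'),
        os.path.join(root, 'data', obj, 'gt.yml'),
        os.path.join(root, 'data', obj, 'rgb'),
        os.path.join(root, 'data', obj, 'depth'),
        os.path.join(root, 'data', obj, 'mask'),
    ]
    for path in required:
        print('OK  ' if os.path.exists(path) else 'MISS', path)

check_object_folder('./my_dataset', 0)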

jc0725 commented 2 years ago

@fbas-est Thank you for your response. May I ask how you trained the SegNet for LINEMOD? Did you change the "--dataset_root" directory to LINEMOD instead of YCB in ./vanilla_segmentation/train.py ?

Also, after training, what script did you run to get the 6DoF results?

I apologize if my questions are quite elementary.

fbas-est commented 2 years ago

@jc0725 Yes. I also changed dataset.py a bit so it works with my dataset, and I used a slightly different version of eval_linemod.py with some functions for visualizing the 3D bounding box.
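For anyone trying to reproduce that visualization, here is a generic sketch of drawing a 3D bounding box with OpenCV, given an estimated rotation r (3x3), translation t (in meters) and camera matrix cam_K; this is not the exact code in visualize.txt, just the usual projection approach:

import cv2
import numpy as np

def draw_3d_bbox(img, model_points, r, t, cam_K):
    # Axis-aligned box around the object model (model frame, meters)
    mins, maxs = model_points.min(0), model_points.max(0)
    corners = np.array([[x, y, z] for x in (mins[0], maxs[0])
                                  for y in (mins[1], maxs[1])
                                  for z in (mins[2], maxs[2])], dtype=np.float32)
    # Project the corners with the estimated pose
    rvec, _ = cv2.Rodrigues(r.astype(np.float64))
    pts, _ = cv2.projectPoints(corners, rvec, t.reshape(3, 1).astype(np.float64),
                               cam_K, None)
    pts = pts.reshape(-1, 2).astype(int)
    edges = [(0, 1), (0, 2), (1, 3), (2, 3), (4, 5), (4, 6),
             (5, 7), (6, 7), (0, 4), (1, 5), (2, 6), (3, 7)]
    for i, j in edges:
        cv2.line(img, tuple(pts[i]), tuple(pts[j]), (0, 255, 0), 2)  # green box
    return img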

jc0725 commented 2 years ago

@fbas-est Would it be possible for you to upload your working code to your repository so that I can clone it?

Xushuangyin commented 2 years ago

Thank you very much for your reply. I also used ObjectDatasetTools to make my own dataset. I captured 10000 images of a single object, but after training for 20 epochs, the predicted pose varies a lot when I run the model on the object. I wanted to ask how many epochs you trained for, and how you got the green bounding box in your video? Thank you. @fbas-est

Xushuangyin commented 2 years ago

https://user-images.githubusercontent.com/80498463/163365135-f3d2ad4b-bb98-4961-abaa-a2f62fa86837.mp4

fbas-est commented 2 years ago

Here is the code for visualizing: visualize.txt

@Xushuangyin
Did you produce the 10000 images from one video or from different videos? In my case I used different videos due to RAM limitations. The problem was that every video produces point clouds with different rotation and translation matrices, so the model could not use the same mesh for the whole combined dataset.

Xushuangyin commented 2 years ago

I made the 10000 images from different videos. If there are too many images, the program reports an error. I made my own object mesh. How can I solve the problem you mentioned? @fbas-est

Xushuangyin commented 2 years ago

Thank you very much for your code! @fbas-est

jc0725 commented 2 years ago

@fbas-est Thank you very much. I will let you know if I am able to make any improvements or if I come up with any suggestions for improved accuracy on your project.

fbas-est commented 2 years ago

@Xushuangyin I suggest starting by finding a way to render the point cloud onto the labeled dataset's color images (a 3D bounding box won't work for this). If the target point cloud (the point cloud used as the label) is not accurate, the network won't work. If that's the problem, then for every collected video you need to change the transforms in transforms.npy so that they all use one mesh as reference, and then label the frames with that mesh.
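To make the last point concrete, a small sketch of what "re-referencing" could look like, assuming transforms.npy holds 4x4 camera-from-mesh poses for each frame of one video and T_align is the 4x4 transform that maps points from the chosen reference mesh into that video's own mesh frame (e.g. obtained by aligning the two meshes with ICP); the file names are illustrative:

import numpy as np

# transforms.npy: (N, 4, 4) poses, camera <- this video's mesh
# T_align:        4x4,       this video's mesh <- reference mesh
transforms = np.load('transforms.npy')
T_align = np.load('align_ref_to_video.npy')

# Compose so every frame's pose is expressed w.r.t. the single reference mesh:
# camera <- reference mesh
new_transforms = np.array([T_old @ T_align for T_old in transforms])
np.save('transforms_ref.npy', new_transforms)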

Xushuangyin commented 2 years ago

My dataset is composed of different videos, but the mesh used for each video was generated at capture time, so each mesh is different. When I train the network, however, the mesh that gets loaded is one I made myself. Will this affect the accuracy of the model?


orangeRobot990 commented 2 years ago

Do you guys resize images during inference? I get weird convolution errors:

RuntimeError: Calculated padded input size per channel: (6 x 320). Kernel size: (7 x 7). Kernel size can't be greater than actual input size

RuntimeError: Calculated padded input size per channel: (6 x 287). Kernel size: (7 x 7). Kernel size can't be greater than actual input size

It's different each time, so I guess it's the image or mask size? Where should I resize?

@Xushuangyin @fbas-est thank you

Xushuangyin commented 2 years ago

Can I see your specific error code?


an99990 commented 2 years ago

Hi @Xushuangyin, thank you for responding. I actually found the source of the error: I was transposing the array incorrectly.
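For anyone who hits the same padded-input-size error: one likely cause is handing the network a crop that is still in HWC order instead of the CHW order it expects, so a 7x7 convolution ends up larger than what it thinks is the image. A minimal sketch with dummy data (the bounding box values are placeholders):

import numpy as np

img = np.zeros((480, 640, 4), dtype=np.uint8)   # dummy RGBA frame
rmin, rmax, cmin, cmax = 100, 220, 300, 380     # dummy bounding box

img = img[:, :, :3]                   # drop alpha channel -> (H, W, 3)
img = np.transpose(img, (2, 0, 1))    # HWC -> CHW         -> (3, H, W)
crop = img[:, rmin:rmax, cmin:cmax]   # crop on the spatial axes only
print(crop.shape)                     # (3, 120, 80), matching the shapes noted in dataset.py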

an99990 commented 2 years ago

Right now @Xushuangyin I am having issues with NaN values in my training after I removed the /1000, since my depth and other quantities are in meters.

I also reduced the learning rate but I still get NaNs.

an99990 commented 2 years ago

@Xushuangyin So now I just get giant results. I confirmed that my meshes are in meters, so I removed the /1000.

image

Full code here


from importlib.abc import Loader
import torch.utils.data as data
from PIL import Image
import os
import os.path
import errno
import torch
import json
import codecs
import numpy as np
import sys
import torchvision.transforms as transforms
import argparse
import json
import time
import random
import numpy.ma as ma
import copy
import scipy.misc
import scipy.io as scio
import yaml
import cv2

class PoseDataset(data.Dataset):
    def __init__(self, mode, num, add_noise, root, noise_trans, refine):
        self.objlist = [0, 1]
        self.mode = mode

        self.list_rgb = []
        self.list_depth = []
        self.list_label = []
        self.list_obj = []
        self.list_rank = []
        self.meta = {}
        self.pt = {}
        self.root = root
        self.noise_trans = noise_trans
        self.refine = refine
        min = 1000

        item_count = 0
        for item in self.objlist:
            if self.mode == 'train':
                input_file = open('{0}/data/{1}/train.txt'.format(self.root, '%d' % item))
            else:
                input_file = open('{0}/data/{1}/test.txt'.format(self.root, '%d' % item))
            while 1:
                item_count += 1
                input_line = input_file.readline()
                if self.mode == 'test' and item_count % 10 != 0:
                    continue
                if not input_line:
                    break
                if input_line[-1:] == '\n':
                    input_line = input_line[:-1]
                self.list_rgb.append('{0}/data/{1}/rgb/{2}.jpg'.format(self.root, '%d' % item, input_line))
                self.list_depth.append('{0}/data/{1}/depth/{2}.png'.format(self.root, '%d' % item, input_line))
                if self.mode == 'eval':
                    self.list_label.append('{0}/segnet_results/{1}_label/{2}_label.png'.format(self.root, '%d' % item, input_line))
                else:
                    self.list_label.append('{0}/data/{1}/mask/{2}.png'.format(self.root, '%d' % item, input_line))

                self.list_obj.append(item)
                self.list_rank.append(int(input_line))

            meta_file = open('{0}/data/{1}/gt.yml'.format(self.root, '%d' % item), 'r')
            self.meta[item] = yaml.safe_load(meta_file)
            self.pt[item] = npy_vtx('{0}/models/{1}.npy'.format(self.root, '%d' % item))

            if len(self.pt[item]) < min:
                min = len(self.pt[item])

            print("Object {0} buffer loaded".format(item))

        self.length = len(self.list_rgb)
        self.num_pt_mesh_small = min

        # retrieved from /usr/local/zed/settings according to 
        # https://support.stereolabs.com/hc/en-us/articles/360007497173-What-is-the-calibration-file-
        self.cam_cx = 1080.47
        self.cam_cy = 613.322
        self.cam_fx = 1057.8
        self.cam_fy = 1056.61

        self.num = num
        self.add_noise = add_noise
        self.trancolor = transforms.ColorJitter(0.2, 0.2, 0.2, 0.05)
        self.norm = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
        self.border_list = [-1, 40, 80, 120, 160, 200, 240, 280, 320, 360, 400, 440, 480, 520, 560, 600, 640, 680]
        self.num_pt_mesh_large = 500
        # self.num_pt_mesh_small = 100
        self.symmetry_obj_idx = []

    def __getitem__(self, index):
        img = Image.open(self.list_rgb[index])
        ori_img = np.array(img)
        depth = np.array(Image.open(self.list_depth[index]))
        label = np.array(Image.open(self.list_label[index]))

        self.height, self.width, _ = np.shape(img)

        self.xmap = np.array([[j for i in range(self.width)] for j in range(self.height)])  # row index of each pixel
        self.ymap = np.array([[i for i in range(self.width)] for j in range(self.height)])  # column index of each pixel (as in the original DenseFusion loader)

        # # removing alpha channel
        if np.shape(label)[-1] == 4 :
            label = label[:,:,:-1] 

        obj = self.list_obj[index]
        rank = self.list_rank[index]        

        if obj == 2:
            for i in range(0, len(self.meta[obj][rank])):
                if self.meta[obj][rank][i]['obj_id'] == 2:
                    meta = self.meta[obj][rank][i]
                    break
        else:
            meta = self.meta[obj][rank][0]
        #return array of bools
        mask_depth = ma.getmaskarray(ma.masked_not_equal(depth, 0))
        if self.mode == 'eval':
            mask_label = ma.getmaskarray(ma.masked_equal(label, np.array(255)))
        else:
            mask_label = ma.getmaskarray(ma.masked_equal(label, np.array([255, 255, 255])))[:, :, 0]

        mask = mask_label * mask_depth

        if self.add_noise:
            img = self.trancolor(img)

        # remove alpha channel
        img = np.array(img)[:, :, :3]
        img = np.transpose(img, (2, 0, 1))
        img_masked = img

        if self.mode == 'eval':
            rmin, rmax, cmin, cmax = get_bbox(mask_to_bbox(mask_label))
        else: #obj_bb: [minX, minY, widhtOfBbx, heigthOfBbx]
            rmin, rmax, cmin, cmax = get_bbox(meta['obj_bb'])

        img_masked = img_masked[:, rmin:rmax, cmin:cmax]
        # p_img = np.transpose(img_masked, (1, 2, 0))
        # cv2.imwrite('{0}_input.png'.format(index), p_img)

        choose = mask[rmin:rmax, cmin:cmax].flatten().nonzero()[0]
        if len(choose) == 0:
            cc = torch.LongTensor([0])
            return(cc, cc, cc, cc, cc, cc)

        if len(choose) > self.num:
            c_mask = np.zeros(len(choose), dtype=int)
            c_mask[:self.num] = 1
            np.random.shuffle(c_mask)
            choose = choose[c_mask.nonzero()]
        else:
            choose = np.pad(choose, (0, self.num - len(choose)), 'wrap')

        depth_masked = depth[rmin:rmax, cmin:cmax].flatten()[choose][:, np.newaxis].astype(np.float32)
        xmap_masked = self.xmap[rmin:rmax, cmin:cmax].flatten()[choose][:, np.newaxis].astype(np.float32)
        ymap_masked = self.ymap[rmin:rmax, cmin:cmax].flatten()[choose][:, np.newaxis].astype(np.float32)
        choose = np.array([choose])

        cam_scale = 1.0
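        # With cam_scale = 1.0 the depth values are used as-is, so they must already be
        # in the same unit (meters) as the mesh points for the pose targets to line up.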
        pt2 = depth_masked / cam_scale
        pt0 = (ymap_masked - self.cam_cx) * pt2 / self.cam_fx
        pt1 = (xmap_masked - self.cam_cy) * pt2 / self.cam_fy
        cloud = np.concatenate((pt0, pt1, pt2), axis=1)
        # cloud = cloud / 1000.0
        cloud = cloud 

        #fw = open('evaluation_result/{0}_cld.xyz'.format(index), 'w')
        #for it in cloud:
        #    fw.write('{0} {1} {2}\n'.format(it[0], it[1], it[2]))
        #fw.close()

        # model_points = self.pt[obj] / 1000.0
        model_points = self.pt[obj]
        dellist = [j for j in range(0, len(model_points))]
        dellist = random.sample(dellist, len(model_points) - self.num_pt_mesh_small)
        model_points = np.delete(model_points, dellist, axis=0)

        target_r = np.resize(np.array(meta['cam_R_m2c']), (3, 3))
        target_t = np.array(meta['cam_t_m2c'])
        add_t = np.array([random.uniform(-self.noise_trans, self.noise_trans) for i in range(3)])

        if self.add_noise:
            cloud = np.add(cloud, add_t)

        #fw = open('evaluation_result/{0}_model_points.xyz'.format(index), 'w')
        #for it in model_points:
        #    fw.write('{0} {1} {2}\n'.format(it[0], it[1], it[2]))
        #fw.close()

        target = np.dot(model_points, target_r.T)
        # if self.add_noise:
        #     target = np.add(target, target_t / 1000.0 + add_t)
        #     out_t = target_t / 1000.0 + add_t
        # else:
        #     target = np.add(target, target_t / 1000.0)
        #     out_t = target_t / 1000.0

        if self.add_noise:
            target = np.add(target, target_t + add_t)
            out_t = target_t + add_t
        else:
            target = np.add(target, target_t)
            out_t = target_t 
        #fw = open('evaluation_result/{0}_tar.xyz'.format(index), 'w')
        #for it in target:
        #    fw.write('{0} {1} {2}\n'.format(it[0], it[1], it[2]))
        #fw.close()

        # np.shape(cloud) (500, 3)
        # np.shape(choose) (1, 500)
        # np.shape(img_masked) (3, 120, 80)
        # np.shape(target) (24, 3)
        # np.shape(model_points) (24, 3)

        return torch.from_numpy(cloud.astype(np.float32)), \
               torch.LongTensor(choose.astype(np.int32)), \
               self.norm(torch.from_numpy(img_masked.astype(np.float32))), \
               torch.from_numpy(target.astype(np.float32)), \
               torch.from_numpy(model_points.astype(np.float32)), \
               torch.LongTensor([self.objlist.index(obj)])

    def __len__(self):
        return self.length

    def get_sym_list(self):
        return self.symmetry_obj_idx

    def get_num_points_mesh(self):
        if self.refine:
            return self.num_pt_mesh_large
        else:
            return self.num_pt_mesh_small

border_list = [-1, 40, 80, 120, 160, 200, 240, 280, 320, 360, 400, 440, 480, 520, 560, 600, 640, 680]

def mask_to_bbox(mask):
    mask = mask.astype(np.uint8)
    contours, _ = cv2.findContours(mask, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)

    x = 0
    y = 0
    w = 0
    h = 0
    for contour in contours:
        tmp_x, tmp_y, tmp_w, tmp_h = cv2.boundingRect(contour)
        if tmp_w * tmp_h > w * h:
            x = tmp_x
            y = tmp_y
            w = tmp_w
            h = tmp_h
    return [x, y, w, h]

def get_bbox(bbox):
    bbx = [bbox[1], bbox[1] + bbox[3], bbox[0], bbox[0] + bbox[2]]
    if bbx[0] < 0:
        bbx[0] = 0
    if bbx[1] >= 540:
        bbx[1] = 539
    if bbx[2] < 0:
        bbx[2] = 0
    if bbx[3] >= 960:
        bbx[3] = 959                
    rmin, rmax, cmin, cmax = bbx[0], bbx[1], bbx[2], bbx[3]
    r_b = rmax - rmin
    for tt in range(len(border_list)):
        if r_b > border_list[tt] and r_b < border_list[tt + 1]:
            r_b = border_list[tt + 1]
            break
    c_b = cmax - cmin
    for tt in range(len(border_list)):
        if c_b > border_list[tt] and c_b < border_list[tt + 1]:
            c_b = border_list[tt + 1]
            break
    center = [int((rmin + rmax) / 2), int((cmin + cmax) / 2)]
    rmin = center[0] - int(r_b / 2)
    rmax = center[0] + int(r_b / 2)
    cmin = center[1] - int(c_b / 2)
    cmax = center[1] + int(c_b / 2)
    if rmin < 0:
        delt = -rmin
        rmin = 0
        rmax += delt
    if cmin < 0:
        delt = -cmin
        cmin = 0
        cmax += delt
    if rmax > 540:
        delt = rmax - 540
        rmax = 540
        rmin -= delt
    if cmax > 960:
        delt = cmax - 960
        cmax = 960
        cmin -= delt
    return rmin, rmax, cmin, cmax

def ply_vtx(path):
    f = open(path)
    assert f.readline().strip() == "ply"
    f.readline()
    f.readline()
    N = int(f.readline().split()[-1])
    while f.readline().strip() != "end_header":
        continue
    pts = []
    for _ in range(N):
        pts.append(np.float32(f.readline().split()[:3]))
    return np.array(pts)

def npy_vtx(path):
    return np.load(path,allow_pickle=True)

Thank you for your help @Xushuangyin

orangeRobot990 commented 2 years ago

Hey @fbas-est , I'm having issues with my training as well. Did you notice anything weird in your avg distance when you removed /1000 ? Did you remove it anywhere else than dataset.py ?

Thank you @Xushuangyin and @an99990 i solve it with the array. Now i have issues with training and gettingd nans too because my stuff are in meters .. Thanks for any help

Xushuangyin commented 2 years ago

You should change these two lines of code like this:

cam_scale = 0.001
pt2 = depth_masked * cam_scale

Xushuangyin commented 2 years ago

Because my cam_scale = 0.001, the code I modified looks like this @an99990 @orangeRobot990 [screenshot]

an99990 commented 2 years ago

Thank you so much @Xushuangyin, I was finally able to get results using a cam_scale of 0.001 and without dividing by 1000 in __getitem__. I will start another training run with the correct values. Thank you so much!
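In case it is useful to others, a small sanity check one can run on a single sample from the PoseDataset above (assuming a dataset instance has been constructed): if the units are consistent, the spread of the returned points should be on the same order as the object's physical diameter in meters.

# Illustrative check: compare the extent of one sample's point sets (in meters).
cloud, choose, img, target, model_points, idx = dataset[0]

for name, pts in [('cloud', cloud), ('target', target), ('model', model_points)]:
    extent = pts.max(dim=0).values - pts.min(dim=0).values
    print(name, 'extent (m):', extent.numpy())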

jc0725 commented 2 years ago

Hello. May I ask how any of you were able to train your custom dataset on SegNet? It seems like the provided code is for YCB format and not Linemod format.

My guess was that I would have to run the SegNet train.py for each of the individual objects for Linemod.

Xushuangyin commented 2 years ago

I used labelme to label the objects in the custom dataset.
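In case the conversion step is the missing piece for anyone, a rough sketch of turning one labelme JSON annotation into the white-on-black mask PNG that the loader above compares against 255 (the file paths and the label name 'object' are placeholders):

import json
import numpy as np
import cv2

with open('0000.json') as f:               # placeholder path to a labelme annotation
    ann = json.load(f)

mask = np.zeros((ann['imageHeight'], ann['imageWidth']), dtype=np.uint8)
for shape in ann['shapes']:
    if shape['label'] == 'object':         # placeholder label name
        poly = np.array(shape['points'], dtype=np.int32)
        cv2.fillPoly(mask, [poly], 255)    # rasterize the polygon as white

cv2.imwrite('0000.png', cv2.merge([mask, mask, mask]))   # save as a 3-channel mask PNG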


jc0725 commented 2 years ago

Thank you for your response. Do you mean that you didn't train SegNet?

Xushuangyin commented 2 years ago

I trained SegNet on 300 images of a single object. @jc0725

jc0725 commented 2 years ago

@Xushuangyin Thank you for clarifying! Also, were you able to successfully visualize the bounding box using the visualize.py code provided by @fbas-est ?

fbas-est commented 2 years ago

@an99990 Hello, I saw that you are using a ZED camera, and from the intrinsics I assume you didn't train the model on 480p images. Did you successfully train the model at a higher resolution?

an99990 commented 2 years ago

@fbas-est I generated the images from Unity. The images are 560 x 940, if I remember correctly. My poses do not seem to be quite correct though. Here's an image during inference. I might create a dataset with images from the ZED camera. The camera in Unity didn't have the same intrinsics as the ZED, so that might be why my results aren't precise. I also never reached the refinement step during training.

image

fbas-est commented 2 years ago

@an99990 Yes, that is probably the issue. The ZED camera comes with four built-in calibrations, the smallest being for 672x376 images. If you train the network with synthetic data, I guess you have to replicate the images that your camera actually captures.

May I ask how you created the synthetic dataset?

an99990 commented 2 years ago

I have a Unity project to create datasets in LineMOD format. I can't share it though, since it belongs to the company :/

jc0725 commented 2 years ago

May I ask how any of you were able to output and save the vanilla_segmentation label png files?
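(For reference, a rough sketch of one way to do that, under the assumption that the trained segmentation network returns per-pixel class scores of shape [1, C, H, W] for a normalized [1, 3, H, W] RGB tensor; seg_model, rgb and the output path are placeholders, and the file layout follows the segnet_results folder read by the eval loader above.)

import numpy as np
import torch
from PIL import Image

with torch.no_grad():
    scores = seg_model(rgb)                           # assumed shape [1, C, H, W]
pred = torch.argmax(scores, dim=1)[0].cpu().numpy()   # per-pixel class id

obj_id = 1                                            # the class you trained for
label = (pred == obj_id).astype(np.uint8) * 255       # white object, black background
Image.fromarray(label).save('segnet_results/0_label/0000_label.png')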

XLXIAOLONG commented 2 years ago

@an99990 Hello. I made a LineMOD dataset with ObjectDatasetTools. In eval_linemod.py its success rate is 0.9285, but when I visualize it, the points seem to be in the wrong place. Can you give me some advice? Thank you in advance! [screenshot]

an99990 commented 2 years ago

Have you played with the cam_scale? I had to change it to 1000. Try different values; it seems the point cloud is bigger than your object.

XLXIAOLONG commented 2 years ago


@an99990 Thanks for your reply. I made the dataset with a RealSense camera. I changed cam_scale to the camera's own value, like this:

cam_scale = 0.0002500000118743628
pt2 = depth_masked * cam_scale
pt0 = (ymap_masked - self.cam_cx) * pt2 / self.cam_fx
pt1 = (xmap_masked - self.cam_cy) * pt2 / self.cam_fy
cloud = np.concatenate((pt0, pt1, pt2), axis=1)
# cloud = cloud / 1000.0
# print(cloud.max())
cloud = cloud

0.0002500000118743628 is the depth scale of the real camera.
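If it helps, the depth scale can also be read directly from the camera with pyrealsense2 (a small sketch; it needs a connected device), so that the cam_scale used in dataset.py matches what the sensor actually reports:

import pyrealsense2 as rs

pipeline = rs.pipeline()
profile = pipeline.start(rs.config())
depth_scale = profile.get_device().first_depth_sensor().get_depth_scale()
print('depth_scale:', depth_scale)   # e.g. 0.001 or 0.00025 depending on the preset
pipeline.stop()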

Windson9 commented 1 year ago

Hi @Xushuangyin and @an99990. I hope you are doing well. I am trying to train this model on my custom dataset. Can you please share if you were able to successfully train the model? Can you share the results if possible? Thanks.

nanxiaoyixuan commented 2 months ago

@jc0725 Hi, I also built my own LineMOD-style dataset. When I debug, I found that 'input_file = open('{0}/data/{1}/train.txt'.format(self.root, '%02d' % item))' raises the error "No such file or directory: 'datasets/linemod/linemod_preprocessed/data/01/train.txt'", so I cannot step through the subsequent code in the debugger. But when training through the command 'bash ./experiments/scripts/train_linemod.sh' this error does not appear. Have you run into this situation? Is there any solution? Thank you very much for your reply.

nanxiaoyixuan commented 2 months ago

@fbas-est Hi, I also built my own LineMOD-style dataset. When I debug, I found that 'input_file = open('{0}/data/{1}/train.txt'.format(self.root, '%02d' % item))' raises the error "No such file or directory: 'datasets/linemod/linemod_preprocessed/data/01/train.txt'", so I cannot step through the subsequent code in the debugger. But when training through the command 'bash ./experiments/scripts/train_linemod.sh' this error does not appear. Have you run into this situation? Is there any solution? Thank you very much for your reply.