tianweiy / CenterPoint


Help with inference/training for own test data #56

Open xavidzo opened 3 years ago

xavidzo commented 3 years ago

Hi, I think the work you've done is really amazing, congrats!! For my thesis project, I would like to use your network to analyze point cloud data. Right now, all I have is a series of .pcd files that are recordings of vehicles on a highway. The lidar sensor was mounted on a bridge, so the recordings are from a static point of view. Could you please give some hints on how to prepare this data as input to your network to perform 3D object detection? Do I have to prepare database tables in .json format according to the nuScenes specifications? Is that a must? What other steps are required?

Thank you in advance

tianweiy commented 3 years ago

Hi, thanks for your interest.

I will have a large update to the codebase in the coming weeks, and I will make sure to add some code for inference on your own data in that update.

huster-wugj commented 3 years ago

Hello, there is a Python file, tools/single_inference.py, but I do not understand this part of its code:

sub_lidar_topic = ["/velodyne_points",
                   "/top/rslidar_points",
                   "/points_raw",
                   "/lidar_protector/merged_cloud",
                   "/merged_cloud",
                   "/lidar_top",
                   "/roi_pclouds"]

sub_ = rospy.Subscriber(sub_lidar_topic[5], PointCloud2, rslidar_callback, queue_size=1, buff_size=2**24) 

What does "/roi_pclouds" mean? Is there any tool for transferring the topic "/velodyne_points" to "/roi_pclouds", or can I use "/velodyne_points" directly?

tianweiy commented 3 years ago

@huster-wugj this is a community contribution; you can find the details here: https://github.com/tianweiy/CenterPoint/pull/11. You can't directly use /velodyne_points currently. I will add a script to support this later.

YoushaaMurhij commented 3 years ago

This file represents a ROS (Robot Operating System) node for inference on point clouds in ROS. "/roi_pclouds" is the name of the topic that you need to subscribe to in order to receive a point cloud in ROS. Please note that this code is not intended for inference from .bin or .pcd files!

YoushaaMurhij commented 3 years ago

You can use something like this for .bin files:

    @staticmethod
    def load_cloud_from_nuscenes_file(pc_f):
        logging.info('loading cloud from: {}'.format(pc_f))
        num_features = 5
        cloud = np.fromfile(pc_f, dtype=np.float32, count=-1).reshape([-1, num_features])
        # last dimension should be the timestamp.
        cloud[:, 4] = 0
        return cloud

    @staticmethod
    def load_cloud_from_deecamp_file(pc_f):
        logging.info('loading cloud from: {}'.format(pc_f))
        num_features = 4
        cloud = np.fromfile(pc_f, dtype=np.float32, count=-1).reshape([-1, num_features])
        # last dimension should be the timestamp.
        cloud = np.hstack((cloud, np.zeros([cloud.shape[0], 1])))
        return cloud

    def predict_on_local_file(self, cloud_file, i):

        # load sample from file
        #self.points = self.load_cloud_from_nuscenes_file(cloud_file)
        #self.points = self.load_cloud_from_deecamp_file(cloud_file)
        self.points = self.load_cloud_from_my_file(cloud_file)

        # prepare input
        voxels, coords, num_points = self.voxel_generator.generate(self.points)
        num_voxels = np.array([voxels.shape[0]], dtype=np.int64)
        grid_size = self.voxel_generator.grid_size
        coords = np.pad(coords, ((0, 0), (1, 0)), mode='constant', constant_values=0)

        voxels = torch.tensor(voxels, dtype=torch.float32, device=self.device)
        coords = torch.tensor(coords, dtype=torch.int32, device=self.device)
        num_points = torch.tensor(num_points, dtype=torch.int32, device=self.device)
        num_voxels = torch.tensor(num_voxels, dtype=torch.int32, device=self.device)

        self.inputs = dict(
            voxels=voxels,
            num_points=num_points,
            num_voxels=num_voxels,
            coordinates=coords,
            shape=[grid_size]
        )

        # predict
        torch.cuda.synchronize()
        tic = time.time()
        with torch.no_grad():
            outputs = self.net(self.inputs, return_loss=False)[0]

        torch.cuda.synchronize()
        logging.info("Prediction time: {:.3f} s".format(time.time() - tic))

        for k, v in outputs.items():
            if k not in [
                "metadata",
            ]:
                outputs[k] = v.to('cpu')

        # visualization
        #print(outputs)
        visual_detection(i, np.transpose(self.points), outputs, conf_th=0.5, show_plot=False, show_3D=False)
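
Since the original question is about .pcd rather than .bin files, here is a minimal loader sketch in the same spirit. It assumes Open3D is available (Open3D is not part of this repo, and it only exposes xyz, so the intensity/timestamp columns are zero-filled; load_cloud_from_pcd_file is a hypothetical name):

    import logging
    import numpy as np
    import open3d as o3d  # assumption: Open3D is installed for reading .pcd files

    def load_cloud_from_pcd_file(pc_f, num_features=5):
        logging.info('loading cloud from: {}'.format(pc_f))
        # Open3D only exposes xyz (plus colors/normals), not the intensity field,
        # so the intensity and timestamp columns are left as zeros here.
        pcd = o3d.io.read_point_cloud(pc_f)
        xyz = np.asarray(pcd.points, dtype=np.float32)
        cloud = np.zeros((xyz.shape[0], num_features), dtype=np.float32)
        cloud[:, :3] = xyz
        return cloud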
muzi2045 commented 3 years ago

It's just a ROS topic name; you can change it if you have your own rosbag data.
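
For example, subscribing to /velodyne_points instead of the default topic is a one-line change in tools/single_inference.py (index 0 in the sub_lidar_topic list quoted above):

    sub_ = rospy.Subscriber(sub_lidar_topic[0], PointCloud2, rslidar_callback, queue_size=1, buff_size=2**24)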

tianweiy commented 3 years ago

I guess the problem is solved. Reopen if you still have some issues

xavidzo commented 3 years ago

Nevertheless, for me it is still unclear how I can use CenterPoint for training on my own custom dataset. Can you guys provide more detailed steps, please? For training, do I have to label data in the same format as nuScenes or Waymo? Should I adapt the "def create_groundtruth_database" function from det3d.datasets.utils.create_gt_database and define a new function inside "tools/create_data.py" similar to "def nuscenes_data_prep"?

Why do you generate pickle files?

In https://github.com/tianweiy/CenterPoint/blob/master/docs/NUSC.md you mentioned for training the "tools/dist_test.py" should be used... Then what's the purpose of the "tools/train.py" script?

Excuse me, perhaps my Python skills are not that advanced, which is why I cannot see how to apply CenterPoint to my own lidar .pcd files.

(This issue was about inference, now I mean training. If you consider I should open a new issue, just let me know)

tianweiy commented 3 years ago

For training do I have to label data in the same format as nuscenes or waymo?

I think both of these are difficult. The easiest way is to look at my Waymo dataset class and write a similar wrapper (please refer to the waymo.py and loading.py files). You want to have the lidar data and then also label the objects in the lidar frame with 7-dim boxes and an object class. If you also want to do temporal stuff (a multi-sweep model), this will be harder and needs some sensor calibration which I am not familiar with.

Should I adapt the "def create_groundtruth_database" function from det3d.datasets.utils.create_gt_database and define a new function inside "tools/create_data.py" similar to "def nuscenes_data_prep"?

Yes..

Why do you generate pickle files?

To save the calibrations, file paths, and labels all in a single file for efficient access.
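
As a rough illustration, one entry of such an info pickle for a simple custom dataset could look like the dict below (field names follow the Waymo-style infos shown later in this thread, but treat them as indicative rather than the exact det3d schema):

    import pickle
    import numpy as np

    # one dict per training frame
    infos = [
        {
            "path": "pointclouds/000000.pcd",                # lidar file for this frame
            "token": "000000",
            "timestamp": 0.0,
            "sweeps": [],                                    # empty for single-sweep data
            "gt_boxes": np.zeros((0, 7), dtype=np.float32),  # x, y, z, dims, yaw per box
            "gt_names": np.array([], dtype="<U16"),          # one class string per box
        },
    ]

    with open("infos_train.pkl", "wb") as f:
        pickle.dump(infos, f)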

In https://github.com/tianweiy/CenterPoint/blob/master/docs/NUSC.md you mentioned for training the "tools/dist_test.py" should be used... Then what's the purpose of the "tools/train.py" script?

dist_test is for testing; train is for training. Where did I mention that dist_test is for training?

Let me know if you have further questions.

Best

muzi2045 commented 3 years ago
  1. Find a lidar labeling tool and define your own label format (JSON or txt).
  2. Collect the raw sensor data (lidar, camera, IMU, GPS, etc.); you must do the timestamp synchronization in hardware for later use.
  3. Preprocess the raw sensor data and split the lidar data frame by frame into .bin (KITTI), .pcd, or another point cloud format.
  4. Annotate the data and get your own dataset.
  5. Write the correct dataset API wrapper for your own dataset for whatever training codebase you use (a bare-bones sketch follows below).
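
A bare-bones sketch of step 5 for this thread's setup (a plain PyTorch-style skeleton for illustration only; a real det3d wrapper should mirror the get_sensor_data interface of waymo.py, and load_cloud_from_pcd_file refers to the .pcd loader sketched earlier in this thread):

    import pickle
    from torch.utils.data import Dataset

    class MyLidarDataset(Dataset):
        """Minimal custom-dataset skeleton: points + 7-dim boxes + class names."""

        def __init__(self, info_path):
            with open(info_path, "rb") as f:
                self.infos = pickle.load(f)

        def __len__(self):
            return len(self.infos)

        def __getitem__(self, idx):
            info = self.infos[idx]
            points = load_cloud_from_pcd_file(info["path"])  # (N, num_features)
            return {
                "points": points,
                "gt_boxes": info["gt_boxes"],   # (M, 7): x, y, z, dims, yaw
                "gt_names": info["gt_names"],   # (M,) class strings
                "token": info["token"],
            }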
xavidzo commented 3 years ago

Ok thanks a lot for your helpful indications! I have two more questions:

  1. Since my training set is rather small (approx. 1200 lidar frames or less), I guess it makes sense to begin training by loading weights from one of the already available pretrained networks. Do you maybe have any recommendations on hyperparameter settings? Any suggestions for how many epochs I should train given such a small dataset size?
  2. I wish to apply a PointPainting technique for sequential fusion of camera images and lidar, as described here: https://arxiv.org/abs/1911.10150

So basically, after training an image segmentation network and projecting the point cloud onto the images in order to find a correspondence between lidar points and pixels, I can figure out the semantic segmentation scores assigned to each lidar point. Then the point cloud dimensionality must be augmented with these scores (4 values: x, y, z, intensity, plus C class scores). I want to do this with CenterPoint. Can you please give me a hint where in the codebase I can change the input size of the PointPillars or VoxelNet backbones to realize this? Are there maybe other adaptations to the codebase I should consider for this PointPainting approach?

muzi2045 commented 3 years ago
  1. It depends on your scene and the classes you want to detect. 1200 frames is enough to train a toy model for testing, but it's really not enough for a production deployment (100k+ frames).
  2. About PointPainting: you can try to modify the preprocessing to attach a segmentation score to every point and change the VFE layer's input channels. CenterPoint trained with the nuScenes config takes 5-channel point input (x, y, z, intensity, timestamp), and after the decoration step there are 10 channels going into the PFE layer. The easiest way is to attach the segmentation score to the input (x, y, z, intensity, timestamp, class_score), giving 11 channels into the PFE layer; the output will still be MAX_PILLARS_NUM × 64 × MAX_POINTS_PER_PILLAR. See the sketch below.
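
A minimal sketch of that painting step (assuming per-point class scores from your segmentation network are already available; names are illustrative):

    import numpy as np

    def paint_points(points, seg_scores):
        """points: (N, 5) = x, y, z, intensity, timestamp; seg_scores: (N, C) per-point class scores."""
        painted = np.concatenate([points, seg_scores.astype(points.dtype)], axis=1)
        return painted  # (N, 5 + C); raise the reader's input channel count in the config to match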
tianweiy commented 3 years ago

Thanks a lot @muzi2045

For 1, I don't have much experience.

For 2, actually, https://github.com/tianweiy/CenterPoint/blob/fc9c9eef5534486c8a3b9e35c4682822bddc3498/det3d/datasets/pipelines/loading.py#L23 now has a painted option. I guess you would first paint the points with a segmentation mask, and then you can directly use these painted point clouds (you only need to change num_input_features to the corresponding channel count in the configs).

I implemented PointPainting for nuScenes in the past. You will need the calibration matrices for the cameras/lidar and do some good synchronization, which I feel is not trivial. You can get some ideas from the following code, which uses an mmdetection3d nuImages model. It is quite slow though, so you will need to do some optimization.

import argparse
from nuscenes.utils.geometry_utils import view_points
import numpy as np 
import numba 
import pickle 
from tqdm import tqdm 
from mmdet.apis import inference_detector, init_detector
import torch 
import time 
import mmcv 

import platform
from functools import partial
import os 
from det3d.torchie.parallel import collate, collate_kitti
from det3d.torchie.trainer import get_dist_info
from torch.utils.data import DataLoader

from det3d.datasets.loader.sampler import (
    DistributedGroupSampler,
    DistributedSampler,
    DistributedSamplerV2,
    GroupSampler,
)
from collections import defaultdict 
from mmdet.datasets.pipelines import LoadImageFromFile
from det3d import torchie

if platform.system() != "Windows":
    # https://github.com/pytorch/pytorch/issues/973
    import resource

    rlimit = resource.getrlimit(resource.RLIMIT_NOFILE)
    resource.setrlimit(resource.RLIMIT_NOFILE, (4096, rlimit[1]))

FILE_CLIENT = mmcv.FileClient(backend='disk')

def load_image(filename):
    global FILE_CLIENT
    img_bytes = FILE_CLIENT.get(filename)
    img = mmcv.imfrombytes(img_bytes, flag='color')
    img = img.astype(np.float32)

    return img 

def simple_collate(batch_list):
    ret_dict = defaultdict(list)
    for i in range(len(batch_list)):
        ret_dict['lidars'].append(batch_list[i]['lidars'])
        ret_dict['imgs'].append(batch_list[i]['imgs'])
        ret_dict['sweeps'].append(batch_list[i]['sweeps'])

    return ret_dict 

def build_dataloader(
    dataset, batch_size, workers_per_gpu, num_gpus=1, dist=True, **kwargs
):

    data_loader = DataLoader(
        dataset,
        batch_size=1,
        sampler=None,
        shuffle=False,
        num_workers=4,
        collate_fn=simple_collate,
        pin_memory=False,
    )

    return data_loader 

def get_time():
    torch.cuda.synchronize()
    return time.time()

def read_file(path, tries=2, num_point_feature=5):
    points = None
    try_cnt = 0
    points = np.fromfile(path, dtype=np.float32).reshape(-1, 5)[:, :num_point_feature]

    return points

def get_obj(path):
    with open(path, 'rb') as f:
        obj = pickle.load(f)
    return obj 

def to_tensor(x, device='cuda:0'):
    return torch.tensor(x, dtype=torch.float32).to(device)

def view_points(points, view, normalize, device='cuda:0'):
    """
    This is a helper class that maps 3d points to a 2d plane. It can be used to implement both perspective and
    orthographic projections. It first applies the dot product between the points and the view. By convention,
    the view should be such that the data is projected onto the first 2 axis. It then optionally applies a
    normalization along the third dimension.

    For a perspective projection the view should be a 3x3 camera matrix, and normalize=True
    For an orthographic projection with translation the view is a 3x4 matrix and normalize=False
    For an orthographic projection without translation the view is a 3x3 matrix (optionally 3x4 with last columns
     all zeros) and normalize=False

    :param points: <np.float32: 3, n> Matrix of points, where each point (x, y, z) is along each column.
    :param view: <np.float32: n, n>. Defines an arbitrary projection (n <= 4).
        The projection should be such that the corners are projected onto the first 2 axis.
    :param normalize: Whether to normalize the remaining coordinate (along the third axis).
    :return: <np.float32: 3, n>. Mapped point. If normalize=False, the third coordinate is the height.
    """

    assert view.shape[0] <= 4
    assert view.shape[1] <= 4
    assert points.shape[0] == 3

    viewpad = torch.eye(4).to(device)
    viewpad[:view.shape[0], :view.shape[1]] = view

    nbr_points = points.shape[1]

    # Do operation in homogenous coordinates.
    points = torch.cat((points, torch.ones([1, nbr_points]).to(device)), dim=0)

    points = torch.matmul(viewpad, points)

    points = points[:3, :]

    if normalize:
        points = points / points[2:3, :].repeat(3, 1).reshape(3, nbr_points)

    return points

@torch.no_grad()
def paint(points, segmentations, all_cams_from_lidar, all_cams_intrinsic, num_cls=10, device='cuda:0'):
    num_lidar_point = points.shape[0]
    num_camera = len(all_cams_from_lidar)
    H, W, num_cls = segmentations[0].shape 

    # we keep a final mask to keep track all points that can project into an image
    # we set a point's rgb value to (0, 0, 0) if no projection is possible 
    # not quite sure how to deal with points that appear in multiple cameras. Now we include all of them and take the average 

    final_mask = torch.zeros([num_lidar_point], dtype=torch.bool).to(device)
    segmentation_map = torch.zeros([num_lidar_point, num_cls], dtype=torch.float32).to(device) 
    num_occurence = torch.zeros([num_lidar_point], dtype=torch.float32).to(device)  # keep track of how many pixels a point is projected into 
    points = to_tensor(points)

    for i in range(num_camera):
        tm = to_tensor(all_cams_from_lidar[i])
        intrinsic = to_tensor(all_cams_intrinsic[i])
        segmentation = to_tensor(segmentations[i])

        # transfer to camera frame 
        transform_points  = points.clone().transpose(1, 0)
        transform_points[:3, :] = torch.matmul(tm, torch.cat([
                transform_points[:3, :], 
                torch.ones(1, num_lidar_point, dtype=torch.float32, device=device)
            ], dim=0)
        )[:3, :]

        depths = transform_points[2, :]

        # take 2d image of the lidar point 
        points_2d = view_points(transform_points[:3, :], intrinsic, normalize=True).transpose(1, 0)[:, :2]

        points_seg = torch.zeros([num_lidar_point, num_cls], dtype=points.dtype, device=device)

        # N x 2 
        points_2d = torch.floor(points_2d)
        assert H == 900 and W == 1600  # nuScenes camera image size

        points_x, points_y = points_2d[:, 0].long(), points_2d[:, 1].long()

        valid_mask = (points_x > 1) & (points_x < W-1) & (points_y >1) & (points_y < H-1)
        points_seg[valid_mask] = segmentation[points_y[valid_mask], points_x[valid_mask]]

        # ensure the points are in front of the camera
        mask = valid_mask & (depths > 0)

        segmentation_map[mask] += points_seg[mask]
        final_mask = mask | final_mask
        num_occurence += mask.float() 

    # for points that have at least one projection, we compute the average segmentation score
    projection_points = points[final_mask]
    segmentation_map = segmentation_map[final_mask]  / num_occurence[final_mask].unsqueeze(-1)  
    projection_points = torch.cat([projection_points, segmentation_map], dim=1)    

    # for points that are not in any images, we assign background class to the point 
    no_projection_points = points[torch.logical_not(final_mask)]
    fake_seg = torch.zeros([no_projection_points.shape[0], num_cls], dtype=torch.float32, device=device) 
    no_projection_points = torch.cat([no_projection_points, fake_seg], dim=1)

    points = torch.cat([projection_points, no_projection_points], dim=0)

    return points

def process_seg_result(result):
    # 10 x NUM_BOX x H x W 
    segs = [] 
    for i in range(len(result)):
        per_cls_seg = result[i]
        if len(per_cls_seg) == 0:
            per_cls_seg = np.zeros((900, 1600)) 
        else:
            per_cls_seg = np.stack(per_cls_seg, axis=0)
            # merge all box's instance mask 
            # if one pixel is in any of the boxes, we will set this pixel's value to 1
            # otherwise set it to zero 

            # N H W 
            per_cls_seg = np.mean(per_cls_seg, axis=0)
            per_cls_seg = (per_cls_seg>0).astype(np.float32)

        segs.append(per_cls_seg)

    # 10 x H x W 
    seg_mask = np.stack(segs, axis=0)

    seg_mask = np.transpose(seg_mask, (1, 2, 0)) # H x W x 10 
    return seg_mask 

from torch.utils.data import Dataset 
class PaintData(Dataset):
    def __init__(
        self,
        info_path,
        start=-1,
        end=-1
    ):
        if end == -1:
            self.infos = get_obj(info_path)[start:]
        elif start == -1:
            self.infos = get_obj(info_path)[:end+1]
        else:
            self.infos = get_obj(info_path)[start:end+1]
        self._set_group_flag()

    def __getitem__(self, index):
        info = self.infos[index]
        sweeps = [info] + info['sweeps']

        lidars = [] 
        imgs = [] 
        assert len(sweeps) == 10 
        for i in range(10):
            sweep = sweeps[i]
            lidar_path = sweep['lidar_path']
            lidar = read_file(lidar_path)          
            lidars.append(lidar)

            all_cams_path = sweep['all_cams_path'] 
            img = [] 
            for path in all_cams_path: 
                img.append(load_image(path))

            # img = np.stack(img, axis=0)
            imgs.append(img)        

        ret_dict = {
            'lidars': lidars, 
            'sweeps': sweeps,
            'imgs': imgs 
        }

        return ret_dict  

    def __len__(self):
        return len(self.infos)

    def _set_group_flag(self):
        """Set flag according to image aspect ratio.
        Images with aspect ratio greater than 1 will be set as group 1,
        otherwise group 0.
        """
        self.flag = np.ones(len(self), dtype=np.uint8)
        # self.flag = np.zeros(len(self), dtype=np.uint8)
        # for i in range(len(self)):
        #     img_info = self.img_infos[i]
        #     if img_info['width'] / img_info['height'] > 1:
        #         self.flag[i] = 1

@torch.no_grad()
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--info_path", type=str)
    parser.add_argument('--config', help='Config file')
    parser.add_argument('--checkpoint', help='Checkpoint file')
    parser.add_argument('--device', default='cuda:0', type=str)
    parser.add_argument('--start', type=int, default=0, )
    parser.add_argument('--end', type=int)
    args = parser.parse_args()

    dataset = PaintData(args.info_path, args.start, args.end)
    model = init_detector(args.config, args.checkpoint, device=args.device)

    loader = build_dataloader(dataset, 1, 4, dist=False)
    prog_bar = torchie.ProgressBar(len(dataset))

    for _, data_batch in enumerate(loader):
        lidars, sweeps, imgs = data_batch['lidars'], data_batch['sweeps'], data_batch['imgs']

        lidars = lidars[0]
        sweeps = sweeps[0]
        imgs = imgs[0]

        for i in range(10):
            sweep = sweeps[i]
            lidar_path = sweep['lidar_path']
            lidar = lidars[i]
            img = imgs[i]

            all_cams_from_lidar = sweep['all_cams_from_lidar']
            all_cams_intrinsic = sweep['all_cams_intrinsic']
            all_cams_path = sweep['all_cams_path'] 
            instance_seg_results = [] 

            for j in range(6):
                result = inference_detector(model, img[j])[1]
                seg_mask = process_seg_result(result)
                instance_seg_results.append(seg_mask)

            # run painting 
            painted_points = paint(lidar, instance_seg_results, all_cams_from_lidar, all_cams_intrinsic)

            dir_path = os.path.join(*lidar_path.split('/')[:-2], 'stronger_painted_'+lidar_path.split('/')[-2])

            if not os.path.isdir(dir_path):
                os.mkdir(dir_path)

            painted_path = os.path.join(dir_path, lidar_path.split('/')[-1])      

            np.save(painted_path, painted_points.cpu().numpy())

        prog_bar.update()

if __name__ == '__main__':
    main()
tianweiy commented 3 years ago

https://github.com/tianweiy/CenterPoint/blob/84fde67ac35a85b0364e10e75bfdf789d578024f/det3d/datasets/pipelines/loading.py#L1

xavidzo commented 3 years ago

Please correct me if I am wrong: a file like "infos_train_01sweeps_filter_zero_gt.pkl" has saved the aggregated ground-truth information for all training frames at once, right? What's the meaning of the "filter_zero" in the filename? What's the difference between "dbinfos_train_1sweeps_withvelo.pkl" and "infos_train_01sweeps_filter_zero_gt.pkl"? Is it mandatory to create these files for a simple custom dataset?

In my case, I have all pcd files in one folder "pointclouds" and in another folder "annotations" I have the corresponding .json files with the labeled vehicles for each pcd / frame. One such .json file looks for example as follows:

{"name":"000000","timestamp":0,"index":0,"labels":[{"id":0,"category":"car","box3d":{"dimension":{"width":3.52,"length":1.9000000000000001,"height":1.58},"location":{"x":13.990062100364485,"y":17.122656041496878,"z":-6.110000000000003},"orientation":{"rotationYaw":0,"rotationPitch":0,"rotationRoll":0}}},{"id":1,"category":"car","box3d":{"dimension":{"width":3.41,"length":1.68,"height":1.58},"location":{"x":27.212441928732442,"y":-2.479418392114875,"z":-6.110000000000001},"orientation":{"rotationYaw":0,"rotationPitch":0,"rotationRoll":0}}}

So my guess is that I should create one "info_train.pkl" that contains at least the aggregated data of all .json files, plus the file paths to all .pcd files. Do I need to store more information in this pkl? I think I do not need to worry about frame transformations of any kind; my scenario is different from the ego-vehicle case, since the lidar sensor was mounted on a bridge and remains static, pointing at the highway. The global frame is in my case also the lidar frame. The next image shows a recording sequence (Ouster OS1-64).

I checked nuScenes and Waymo; their label structure is very complex, and they include far more attributes than I have in the .json labels for my custom data. I see you include "num_points_in_gt" in the info.pkl (I assume that means the number of points inside each ground-truth bounding box). I don't have such an attribute in my labels.

Based on my .json example above, will the object data attributes I collected be enough to perform training, or am I missing something else?

tianweiy commented 3 years ago

What's the meaning of the "filter_zero" in the filename?

filter out boxes with no lidar points

What's the difference between "dbinfos_train_1sweeps_withvelo.pkl" and "infos_train_01sweeps_filter_zero_gt.pkl"? Is it mandatory to create these files for a simple custom dataset?

infos are for the annotations, dbinfos are for the GT database. If you want to use GT augmentation, you need to generate dbinfos like we do in create_data.py.

I think the collected data are enough. 'num_points_in_gt' is not needed.

xavidzo commented 3 years ago

Hi, I have a couple more questions, if you could answer them please:

  1. What is the sweeps attribute useful for? Why is it 10 on nuscenes and 1 or 2 on waymo? For using sweeps, each lidar frame must have a timestamp, otherwise it cannot be implemented, right?
  2. I assume both nuScenes and Waymo use 5 lidar features, which in the code translates to num_point_features = 5, i.e. x, y, z, intensity, and t (timestamp), correct? But here in the code it seems that, at least for nuScenes, you only load the first 4, so x, y, z, and intensity:

https://github.com/tianweiy/CenterPoint/blob/84fde67ac35a85b0364e10e75bfdf789d578024f/det3d/datasets/pipelines/loading.py#L23

  3. Looking at the next piece of code, you transform from the Waymo to the KITTI reference-frame convention because the network backbone expects the input to be in KITTI format? https://github.com/tianweiy/CenterPoint/blob/84fde67ac35a85b0364e10e75bfdf789d578024f/det3d/datasets/waymo/waymo_common.py#L265

do you make the same transform for nuscenes as well? I don't see the same operation here: https://github.com/tianweiy/CenterPoint/blob/84fde67ac35a85b0364e10e75bfdf789d578024f/det3d/datasets/nuscenes/nusc_common.py#L403

  4. I read that in Waymo the bounding boxes' orientation is labeled only with the yaw angle, but when I looked at the info data there are still 3 different values for orientation in the gt_boxes field. Why? To be more specific, in my custom dataset objects are labeled with x, y, z, width (corresponds to the x dimension), length (corresponds to the y dimension), and height (= z, obviously). Do I need to transform something?
tianweiy commented 3 years ago

What is the sweeps attribute useful for? Why is it 10 on nuscenes and 1 or 2 on waymo? For using sweeps, each lidar frame must have a timestamp, otherwise it cannot be implemented, right?

They are precomputed paths to the lidar frames before the current reference frame, used to aggregate multi-frame lidar data. The nuScenes lidar has 32 beams while Waymo's has 64; since nuScenes produces a really sparse point cloud, we use more frames to aggregate data. For Waymo we could also use 5 or 10, but it is slower. And yes, each lidar frame needs a timestamp.
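
For reference, a minimal sketch of what multi-sweep aggregation does (not the exact det3d code path; it assumes you have a 4x4 transform that maps each previous sweep into the current reference lidar frame):

    import numpy as np

    def add_sweep(ref_points, sweep_points, ref_from_sweep, time_lag):
        """ref_points/sweep_points: (N, 4) x, y, z, intensity; ref_from_sweep: (4, 4); time_lag in seconds."""
        ones = np.ones((sweep_points.shape[0], 1), dtype=sweep_points.dtype)
        xyz = (ref_from_sweep @ np.hstack([sweep_points[:, :3], ones]).T).T[:, :3]
        sweep = np.hstack([xyz, sweep_points[:, 3:4], np.full((len(xyz), 1), time_lag)])
        ref = np.hstack([ref_points, np.zeros((len(ref_points), 1))])  # time lag 0 for the reference frame
        return np.vstack([ref, sweep])  # (N_ref + N_sweep, 5); the last column is the timestamp channel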

I assume both nuscenes and waymo use 5 lidar features, which in the code translates to num_point_features = 5 = x, y, z, intensity and t (timestamp), correct ? But here in the code it seems that at least for nuscenes you only load the first 4, so x, y, z, and intensity:

For nuScenes, we use x, y, z, intensity, and timestamp. For Waymo, we use x, y, z, intensity, and elongation, with an optional timestamp for the two-frame model.

do you make the same transform for nuscenes as well? I don't see the same operation here:

This is already done in the construction of nuScenes info file.

I read in waymo they labeled the bounding boxes' orientation only with the yaw angle, but when I looked at the info_data there are still 3 different values for orientation in the gt_boxes field, why?

Not quite sure about 'I looked at the info_data there are still 3 different values for orientation in the gt_boxes field, why' I think there is only one value?

To be more specific, in my custom dataset objects are labeled like x, y, z, width (corresponds to x dimension), length (corresponds to y dimension), and height ( = z obvious). Do I need to transform something?

Please use x, y, z, length (corresponds to y), width (corresponds to x), height (corresponds to z), rotation.

xavidzo commented 3 years ago

Thanks a lot tianweiy, as always, for taking the time to reply. I still have some questions:

a) "For nuScenes, we use x,y,z,intensity and timestamp. For Waymo, we use x, y, z, intensity, elongation. with optional timestamp for two frame model." However, here in the code, it seems like you only select the x, y, and intensity for nuScenes when you read the pcd file in the function read_file() from pipelines/loading.py. because num_point_feature=4 ??:

def read_file(path, tries=2, num_point_feature=4, painted=False):
    if painted:
        ...  # (painted branch omitted)
    else:
        points = np.fromfile(path, dtype=np.float32).reshape(-1, 5)[:, :num_point_feature]

    return points

b) "Not quite sure about 'I looked at the info_data there are still 3 different values for orientation in the gt_boxes field, why' I think there is only one value?". For instance, this a sample of Waymo's infos_train_01sweeps_filter_zero_gt.pkl:

[{'path': 'data/Waymo/train/lidar/seq_0_frame_0.pkl', 'anno_path': 'data/Waymo/train/annos/seq_0_frame_0.pkl', 'token': 'seq_0_frame_0.pkl', 'timestamp': 1553629304.2740898, 'sweeps': [], 'gt_boxes': array([[-7.75765467e+00,  9.75379848e+00,  2.81035209e+00,
         5.00253379e-01,  1.01573773e-01,  4.49999988e-01,
        -3.18337395e-03,  3.55915539e-03, -1.53220856e+00],
       [ 2.29385357e+01, -6.12172985e+00,  2.50068998e+00,
         4.72529471e-01,  6.68555871e-02,  5.29999971e-01,
        -1.27970241e-03, -1.58406922e-03, -4.62591171e+00]
        ......................................................
        ......................................................

In the above sample, in the field 'gt_boxes' there are 9 values for each box, not just 7... And here is a sample of Waymo's dbinfos_train_1sweeps_withvelo.pkl:

{'VEHICLE': [{'name': 'VEHICLE', 'path': 'gt_database_1sweeps_withvelo/VEHICLE/0_VEHICLE_2.bin', 'image_idx': 0, 'gt_idx': 2, 'box3d_lidar': array([-5.8496891e+01,  4.4255123e+00,  2.4241050e-01,  1.7933178e+00,
        3.9103422e+00,  1.4900000e+00, -1.5062687e-01, -1.1980739e-02,
        1.4906577e+00], dtype=float32), 'num_points_in_gt': 51, 'difficulty': 0, 'group_id': 0}, {'name': 'VEHICLE', 'path': 'gt_database_1sweeps_withvelo/VEHICLE/0_VEHICLE_3.bin', 'image_idx': 0, 'gt_idx': 3, 'box3d_lidar': array([ 3.7829960e+01, -5.1944214e-01,  1.1195552e+00,  1.8915721e+00,
        4.2592230e+00,  1.8200001e+00,  2.0176919e-02,  3.6506136e-03,
       -1.6147256e+00], dtype=float32), 'num_points_in_gt': 5, 'difficulty': 0, 'group_id': 1}],

In the field 'box3d_lidar' there are also 9 values. I assume the last three values correspond to the angles roll, pitch, and yaw. But in theory Waymo only labeled the yaw angle, so there should be only one value for orientation... or at least the other two should be zero, but they are not. Hence my question: why?

c) Also, why is your PointPillars-based model on nuScenes faster than the one on Waymo (31 fps vs 19 fps)?

d) Now I am about to start training on my own dataset. Since I have very few frames to train on (~150 frames), I want to do transfer learning, meaning I want to reuse the weights of the model trained on nuScenes and only replace and train the last layer of CenterPoint, i.e. the detection head. I am looking at the file "nusc_centerpoint_pp_02voxel_circle_nms.py":

tasks = [
    dict(num_class=1, class_names=["car"]),
    dict(num_class=2, class_names=["truck", "construction_vehicle"]),
    dict(num_class=2, class_names=["bus", "trailer"]),
    dict(num_class=1, class_names=["barrier"]),
    dict(num_class=2, class_names=["motorcycle", "bicycle"]),
    dict(num_class=2, class_names=["pedestrian", "traffic_cone"]),
]

Can you please explain why you put classes in different dictionaries? I have the intuition that the classes in the same dictionary are similar... In my dataset I have the classes car, truck, van, bus, and pedestrian. Should I put them in different dictionaries as well, or can I make just one dictionary, like tasks = [dict(num_class=5, class_names=["car", "truck", "van", "bus", "pedestrian"])]?

Do I need to make more changes in this portion of the config file?

    bbox_head=dict(
        # type='RPNHead',
        type="CenterHead",
        mode="3d",
        in_channels=sum([128, 128, 128]),
        norm_cfg=norm_cfg,
        tasks=tasks,
        dataset='nuscenes',
        weight=0.25,
        code_weights=[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.2, 0.2, 1.0, 1.0],
        common_heads={'reg': (2, 2), 'height': (1, 2), 'dim':(3, 2), 'rot':(2, 2), 'vel': (2, 2)}, # (output_channel, num_conv)
        encode_rad_error_by_sin=False,
        direction_offset=0.0,
        bn=True
    )

what do these "code_weights" values represent? there are 10 values because in nuscenes there are 10 classes(?)....so in my case , 5 classes, I should specify only five "code_weights"?

tianweiy commented 3 years ago

a) "For nuScenes, we use x,y,z,intensity and timestamp. For Waymo, we use x, y, z, intensity, elongation. with optional timestamp for two frame model." However, here in the code, it seems like you only select the x, y, and intensity for nuScenes when you read the pcd file in the function read_file() from pipelines/loading.py. because num_point_feature=4 ??:

Yeah, we concatenate the timestamp later https://github.com/tianweiy/CenterPoint/blob/890e538afc95ee741f4c29b4c30b37251027e31c/det3d/datasets/pipelines/loading.py#L138

In the field 'box3d_lidar' there are also 9 values. I assume the last three values correspond to angles roll, pitch, yaw. But in theory in waymo they only labeled the yaw angle, so there should be only one value for orientation? ... or at least the other two should be zero, but they are not... hence my question, why?

Sorry for the confusion, they are not three angles. They are x,y,z,dx,dy,dz,velocity_x, velocity_y, yaw

c) Also, why is your pointpillars-based model on nuscenes faster than the one on waymo (31 fps vs 19 fps) ?

Two reasons: 1. Waymo has a larger scene (150x150 m vs. 100x100 m). 2. The Waymo model's output stride is 1, compared to 4 for nuScenes (Waymo with output stride 4 doesn't perform well).

d) can you please explain why you put classes in different dictionaries? I have the intuition the classes in the same dictionary are similar... In my dataset I have classes car, truck, van, bus, pedestrian.... should I put them in different dictionaries as well, or can I make just one dictionary like tasks = [ dict(num_class=4, class_names=["car", "truck", "van", "bus", "pedestrian"]]

Basically, we assign different objects to different detection heads to address the class-imbalance problem in nuScenes. If your dataset is not class-imbalanced, you can just use a single group like you do above; otherwise, try to follow our group definition, which has proven to work well on nuScenes.
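
For example, if cars dominate your data and the other classes are rarer, one grouping in the same spirit could be the following (purely illustrative, not a tested config):

    tasks = [
        dict(num_class=1, class_names=["car"]),
        dict(num_class=3, class_names=["truck", "van", "bus"]),
        dict(num_class=1, class_names=["pedestrian"]),
    ]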

Do I need to make more changes in this portion of the config file?

No need. code_weights is the weight for the regression loss of 'reg', 'height', 'dim', etc.; it is not related to the classes.

xavidzo commented 3 years ago

Probably later on I would like to fuse camera images with the results from CenterPoint, following the state-of-the-art method CLOCs. The code released so far fuses the outputs of the 2D detector Cascade R-CNN with the 3D detector SECOND: https://arxiv.org/abs/2009.00784, https://github.com/pangsu0613/CLOCs. In the README at the above GitHub link, in the section "Fusion of other 3D and 2D detectors", the author says the following:

Step 3: Since the number of detection candidates are different for different 2D/3D detectors, you need to modify the corresponding parameters in the CLOCs code. Then train the CLOCs fusion. For example, there are 70400 (200x176x2) detection candidates in each frame from SECOND with batch size equals to 1. It is a very large number because SECOND is a one-stage detector, for other multi-stage detectors, you could just take the detection candidates before the final NMS function, that would reduce the number of detection candidates to hundreds or thousands.

a) How would this statement apply to CenterPoint, can you explain please? I mean, what's the number of detection candidates in each frame? Is this number always fixed / the same for all frames? and how can I take the number of detection candidates in CenterPoint before NMS is applied?

b) Update: I've been training on my custom data for more than 100 epochs. My dataset is quite small, only 150 frames. This is the performance so far:

    2021-03-07 15:14:11,158 - INFO - Epoch [122/125][35/38] lr: 0.00000, eta: 0:02:21, time: 0.408, data_time: 0.070, transfer_time: 0.004, forward_time: 0.078, loss_parse_time: 0.000 memory: 4080,
    2021-03-07 15:14:11,158 - INFO - task : ['car'], loss: 3.0604, hm_loss: 2.5405, loc_loss: 2.0795, loc_loss_elem: ['0.2258', '0.2351', '0.2916', '0.0908', '0.0972', '0.1771', '1.5755', '0.3259', '0.4193', '0.1623'], num_positive: 11.2000
    2021-03-07 15:14:11,158 - INFO - task : ['trailer'], loss: 0.8950, hm_loss: 0.6023, loc_loss: 1.1709, loc_loss_elem: ['0.1410', '0.1478', '0.1197', '0.1316', '0.0734', '0.0514', '0.6290', '0.2809', '0.2147', '0.1092'], num_positive: 7.6000
    2021-03-07 15:14:11,158 - INFO - task : ['bus'], loss: 1.3275, hm_loss: 1.1275, loc_loss: 0.8001, loc_loss_elem: ['0.1548', '0.1824', '0.0938', '0.0199', '0.0174', '0.0179', '0.0063', '0.0058', '0.2177', '0.0938'], num_positive: 12.8000
    2021-03-07 15:14:11,158 - INFO - task : ['van'], loss: 0.7695, hm_loss: 0.5451, loc_loss: 0.8977, loc_loss_elem: ['0.1176', '0.1482', '0.1664', '0.0785', '0.0616', '0.0487', '0.2156', '0.3060', '0.0561', '0.1162'], num_positive: 5.8000
    2021-03-07 15:14:11,158 - INFO - task : ['pedestrian'], loss: 0.3599, hm_loss: 0.2008, loc_loss: 0.6362, loc_loss_elem: ['0.2162', '0.1056', '0.1254', '0.0568', '0.0615', '0.0198', '0.0024', '0.0028', '0.0198', '0.0301'], num_positive: 8.0000

Do you have any idea why the loss for the car class does not get lower? I've seen it mostly stay around 3 and 4, but for the other classes it started high and converged rapidly to values around 1. In my dataset the car class is the most represented, followed by the trailer class, whose loss does decrease easily... I wonder why this is not the case for the car class.

tianweiy commented 3 years ago

a) How would this statement apply to CenterPoint, can you explain please? I mean, what's the number of detection candidates in each frame? Is this number always fixed / the same for all frames? and how can I take the number of detection candidates in CenterPoint before NMS is applied?

I have never read this, and sorry that I don't have time to read it at the moment.

b) It seems that one value is off a lot ['1.5755', '0.3259', '0.4193', '0.1623']; unfortunately I don't have many clues.

xavidzo commented 3 years ago

I am aware that you are quite busy right now, but if you have some free time after ICCV, please try to answer part a) of my previous comment. You don't need to read the whole CLOCs paper, only the README at the GitHub link I posted, which would take you no more than 5 minutes. Thank you in advance, tianweiy.

tianweiy commented 3 years ago

For nuScenes, you will use box here https://github.com/tianweiy/CenterPoint/blob/1ecebf980f75cfe7f53cc52032b184192891c9b9/det3d/models/bbox_heads/center_head.py#L481

For Waymo (or if you want to train two stage model), you will use this https://github.com/tianweiy/CenterPoint/blob/1ecebf980f75cfe7f53cc52032b184192891c9b9/det3d/models/detectors/two_stage.py#L149

I am working on adding tutorials and code for easier training on custom datasets. I am also working on accelerating the inference a bit. I will try to merge these two asap but it may take me one week or so.

tianweiy commented 3 years ago

Check my reply here: https://github.com/tianweiy/CenterPoint/issues/78#issuecomment-812951128, though I believe you have already finished something.

xavidzo commented 3 years ago

Yes, I can show you in this GIF how CenterPoint works after training on my own data recorded from the highway. The learned weights from nuScenes didn't help much, so I trained from scratch. Basically, you can see cars and trucks being detected and visualized in RViz. Thanks a lot for all your support! (GIF: centerpoint-highway-prov)

Two questions, the first one maybe off-topic: Is there any reason why you decided to use det3d for your work? I mean, I've seen that many researchers use OpenPCDet for their implementations instead.

About the voxelization: for nuScenes you used 30000 as the maximum number of voxels and 20 as the maximum number of points per voxel. In other projects for KITTI I saw 12000 as max_voxel_num and 100 as max_points_in_voxel... Can you please explain how these values affect the performance of the network? For my own data I chose the same config as nuScenes, but I don't know if I should change it, given that my point cloud frames are quite sparse...

tianweiy commented 3 years ago

topic: Is there any reason why you decided to use det3d for your work? I mean, I've seen many researchers used instead openPCDet for their implementations

I finished most of the work for CenterPoint before OpenPCDet got Waymo/nuScenes support. I also tried OpenPCDet for a while last summer, but it trains a little slower and the numbers are a little lower than my implementation here, so I just stuck with the current codebase.

tianweiy commented 3 years ago

About the voxelization: for nuscenes you used 30000 as the maximum number of voxels and 20 as the maximum number of points per voxel. In other projects for kitti I saw 12000 as max_voxel_num and 100 as max_points_in_voxel.... Can you please explain how these values affect the performance of the network? For my own data I chose the same config as nuscenes, but I don't know if I should change it given that my point cloud frames are quite sparse....

I think 100 is probably too large. 30000/12000 are set to cover most of the voxels/pillars. 20/100 are not that important, and a smaller number will be faster.
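
A quick way to sanity-check these numbers on your own clouds is to count how many pillars are occupied and how full they get; a standalone sketch (not the det3d voxelizer), using the 0.2 m pillar size and ±70.4 m range mentioned in this thread:

    import numpy as np

    def pillar_stats(points, voxel_size=(0.2, 0.2), pc_range=(-70.4, -70.4, 70.4, 70.4)):
        """Return (#occupied pillars, max points in a pillar, 99th-percentile points per pillar)."""
        in_range = ((points[:, 0] >= pc_range[0]) & (points[:, 0] < pc_range[2]) &
                    (points[:, 1] >= pc_range[1]) & (points[:, 1] < pc_range[3]))
        pts = points[in_range]
        ix = ((pts[:, 0] - pc_range[0]) / voxel_size[0]).astype(np.int64)
        iy = ((pts[:, 1] - pc_range[1]) / voxel_size[1]).astype(np.int64)
        _, counts = np.unique(np.stack([ix, iy], axis=1), axis=0, return_counts=True)
        # compare len(counts) against max_voxel_num and counts against max_points_in_voxel
        return len(counts), int(counts.max()), float(np.percentile(counts, 99))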

xavidzo commented 3 years ago

Hello, I have a question about how the DataLoader works in the training script. I want to try dynamic voxelization, for which I need the points as input to the network instead of the voxels. The number of features I use for each point is 4. I checked that the points are also stored in the example dictionary, but somehow the dimension of the points changes after the data is passed to the DataLoader; specifically, the second dim goes from 4 to 5. Please look at this:

# print statement in datasets/pipelines/formatting.py, the last step in the pipeline for loading the data
# @PIPELINES.register_module
# class Reformat(object):
#    def __init__(self, **kwargs):
#    -----------------------------------
#    def __call__(self, res, info):
#       meta = res["metadata"]
#       points = res["lidar"]["points"]
#       voxels = res["lidar"]["voxels"]      
#       print("poins shape in Reformat: ", points.shape)

points shape in Reformat: (132250, 4)

# print statement in datasets/providentia.py, my own dataset wrapper, similar to waymo.py
#    def get_sensor_data(self, idx): 
#    .....................................................
#         data, _ = self.pipeline(res, info)
#         print("points shape in get_sensor_data: ",data["points"].shape)

points shape in get_sensor_data: (132250, 4)

# print statement in torchie/trainer/trainer.py
#    def train(self, data_loader, epoch, **kwargs):
#     ......................................................
#            for i, data_batch in enumerate(data_loader):
#                   print("points shape in data_batch: ", data_batch["points"].shape)

points shape in data_batch: torch.Size([132087, 5]) # here the dimensions change, from 4 to 5

# print statement in models/detectors/point_pillars.py
#    def forward(self, example, return_loss=True, **kwargs):
#    ..........................................................
#                 print("example points shape: ", example["points"].shape)

example points shape: torch.Size([132087, 5])

Do you have any idea why the dimensions change?

tianweiy commented 3 years ago

It is padded with an index to indicate its location in a batch, see this

https://github.com/tianweiy/CenterPoint/blob/9fdf572e854d20d58f4b0b8ef9f5785a5d39147a/det3d/torchie/parallel/collate.py#L141

you can remove this index
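
So, assuming the batch index is the prepended first column (which the shape change from (N, 4) to (N, 5) above suggests), stripping it back out is just:

    batch_idx = example["points"][:, 0]    # which sample in the batch each point belongs to
    raw_points = example["points"][:, 1:]  # back to the original 4 point features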

YoushaaMurhij commented 3 years ago

@tianweiy, I am facing a problem when trying to prepare the Waymo dataset for training. I followed your instructions, but when invoking waymo_converter.py it throws warnings and errors related to allocation on the CUDA device (I use the same conda env + waymo-open-dataset-tf-1-15-0==1.2.0 on a V100). Do you have any idea how to deal with this? And how large will the generated train folder be? Thanks!

xavidzo commented 3 years ago

I have a question about how you decide when to stop training: how do you choose the number of epochs for nuScenes and Waymo? Do you stop when the loss values decay below 0.5?

tianweiy commented 3 years ago

@YoushaaMurhij remember to disable the GPU with CUDA_VISIBLE_DEVICES=-1 during conversion so that it works with the multiprocessing. I don't remember exactly, but it is about 650 GB.

tianweiy commented 3 years ago

@xavidzo I never tune this. I always use 20 epochs for nuScenes. For Waymo, I often use 12 epochs to verify ideas, as the dataset is too large...

xavidzo commented 3 years ago

And regarding the point cloud range parameters, voxel size, and grid size: what's the rule for appropriate pc_range values? I noticed that with voxel_size = (0.2, 0.2, 8), when I change the point cloud range from pc_range = (-70.4, -70.4, -8.0, 70.4, 70.4, 0.0) to pc_range = (-70.4, -60.4, -8.0, 70.4, 60.4, 0.0), I get an error of mismatching dimensions in the pillar encoder... What's the reason? Does the grid have to be square, with the same size in width and length?

tianweiy commented 3 years ago

The grid size (range / voxel_size) needs to be divisible by 4 or 8, I don't remember exactly.
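
Concretely, with the 0.2 m pillars from the question above (assuming the divisor is 8, which matches the error described):

    # (-70.4, 70.4) works: 140.8 / 0.2 = 704 pillars per axis, and 704 / 8 = 88 exactly.
    # (-60.4, 60.4) fails: 120.8 / 0.2 = 604 pillars, and 604 / 8 = 75.5, so the downsampled
    # and upsampled feature maps in the RPN no longer line up, hence the dimension mismatch.
    nx = round((70.4 - (-70.4)) / 0.2)   # 704
    ny = round((60.4 - (-60.4)) / 0.2)   # 604
    print(nx % 8, ny % 8)                # 0 4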

xavidzo commented 3 years ago

Ok thanks. Do you maybe have some advice on how to remove redundant boxes? I notice my detector sometimes predicts bounding boxes of the same class/label with a high score (> 0.6), but these boxes overlap a lot on the same object... there is only a tiny difference in the predicted orientation. Please look at this example, where two boxes overlap on the same truck (image: boxes_overlap).

tianweiy commented 3 years ago

Have you used any NMS? It should be less than 0.2 overlap: https://github.com/tianweiy/CenterPoint/blob/9fdf572e854d20d58f4b0b8ef9f5785a5d39147a/configs/nusc/voxelnet/nusc_centerpoint_voxelnet_01voxel.py#L72

xavidzo commented 3 years ago

Yes, in test_cfg I set the nms_iou_threshold to 0.2... so the smaller, the better for removing overlapping boxes? I thought a higher threshold removes more false positives, as in 2D detection? Regarding the voxel size: does a smaller size make the detections more precise while making inference slower, or is it only a matter of fine-tuning to a particular dataset? Also, https://github.com/tianweiy/CenterPoint/blob/9fdf572e854d20d58f4b0b8ef9f5785a5d39147a/configs/nusc/pp/nusc_centerpoint_pp_02voxel_two_pfn_10sweep.py#L41 For the us_layer_strides parameter, how do you find a good setting, just trying what works well? Does it depend on the point cloud density or the number of layers in the LiDAR?

tianweiy commented 3 years ago

Yes, in test_cfg I set the nms_iou_threshold to 0.2... so the smaller, the better for removing overlapping boxes? I thought a higher threshold removes more false positives, as in 2D detection?

These are two different thresholds. A smaller IoU threshold will remove more boxes, and then a higher score threshold will remove more false positives.

Regarding the voxel size, smaller size makes the detections more precise, but the inference becomes slower, or is it only a matter of fine-tuning to a particular dataset?

Both

For the us_layer_strides parameter, how do you find a good setting, just trying what works well? does it depend on the point cloud density, number of layers in LiDAR?

I never tuned this.

xavidzo commented 3 years ago

I have more questions: a) You said you don't tune us_layer_strides, but it is different for nuScenes and Waymo, so I assume they are not equal for a reason(?).

nuscenes: https://github.com/tianweiy/CenterPoint/blob/9fdf572e854d20d58f4b0b8ef9f5785a5d39147a/configs/nusc/pp/nusc_centerpoint_pp_02voxel_two_pfn_10sweep.py#L41

waymo: https://github.com/tianweiy/CenterPoint/blob/9fdf572e854d20d58f4b0b8ef9f5785a5d39147a/configs/waymo/pp/waymo_centerpoint_pp_two_cls_two_pfn_stride1_3x.py#L34

b) Some time ago you mentioned "The Waymo model's output stride is 1 compared to 4 in nuScenes. (Waymo with output stride 4 doesn't perform well)", why? Is this reflected in the lines I cited in a) or where is this difference in the code?

c) For nuScenes you use the "db_sampler" and for Waymo you don't; may I know why, please?

https://github.com/tianweiy/CenterPoint/blob/9fdf572e854d20d58f4b0b8ef9f5785a5d39147a/configs/nusc/pp/nusc_centerpoint_pp_02voxel_two_pfn_10sweep.py#L88

Also it's written type="GT-AUG", but below is "enable=False", so you don't use GT-AUG in the end. Shouldn't it be instead "enable=True"?

d) I see also a difference in the parameter "global_rot_noise":

nuscenes https://github.com/tianweiy/CenterPoint/blob/9fdf572e854d20d58f4b0b8ef9f5785a5d39147a/configs/nusc/pp/nusc_centerpoint_pp_02voxel_two_pfn_10sweep.py#L127

waymo https://github.com/tianweiy/CenterPoint/blob/9fdf572e854d20d58f4b0b8ef9f5785a5d39147a/configs/waymo/pp/waymo_centerpoint_pp_two_cls_two_pfn_stride1_3x.py#L85

why are these values different and how does this "global_rot_noise" affect the training process?

e) I don't quite understand how db_sampler works:

    sample_groups=[
        dict(car=2),
        dict(truck=3),
        dict(construction_vehicle=7),
        dict(bus=4),
        dict(trailer=6),
        dict(barrier=2),
        dict(motorcycle=6),
        dict(bicycle=6),
        dict(pedestrian=2),
        dict(traffic_cone=2),
    ],

Does it mean that for every frame you add more ground-truth samples of each class according to this distribution? So you add 2 more cars, 3 more trucks, and so on... to every frame? What's a good strategy? Use higher numbers for classes that are less represented in the dataset, or how do you set those numbers?

tianweiy commented 3 years ago

a) b) Some time ago you mentioned "The Waymo model's output stride is 1 compared to 4 in nuScenes. (Waymo with output stride 4 doesn't perform well)", why? Is this reflected in the lines I cited in a) or where is this difference in the code?

Yes, it is defined by the code above. Waymo uses 1, following a few papers in the past. nuScenes is just 4 (inherited from det3d). I tried 4 for Waymo; it doesn't work very well.

c) Waymo has millions of cars and pedestrians, so for the two-class model we don't use the db_sampler. We do use the db_sampler for the three-class model, as the additional cyclists are rare.

d) Still following the original parameters. I remember pi/4 for nuScenes also works and is slightly better (<0.5).

e) Yes, copied from det3d

xavidzo commented 3 years ago

I am trying to train a two-stage model, but somehow I get some error of mismatching dimensions:

    Traceback (most recent call last):
      File "./tools/train.py", line 137, in <module>
        main()
      File "./tools/train.py", line 132, in main
        logger=logger,
      File "/home/CenterPoint/det3d/torchie/apis/train.py", line 327, in train_detector
        trainer.run(data_loaders, cfg.workflow, cfg.total_epochs, local_rank=cfg.local_rank)
      File "/home/CenterPoint/det3d/torchie/trainer/trainer.py", line 543, in run
        epoch_runner(data_loaders[i], self.epoch, **kwargs)
      File "/home/CenterPoint/det3d/torchie/trainer/trainer.py", line 410, in train
        self.model, data_batch, train_mode=True, **kwargs
      File "/home/CenterPoint/det3d/torchie/trainer/trainer.py", line 368, in batch_processor_inline
        losses = model(example, return_loss=True)
      File "/home/miniconda/envs/centerpoint/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/home/CenterPoint/det3d/models/detectors/two_stage.py", line 186, in forward
        batch_dict = self.roi_head(example, training=return_loss)
      File "/home/miniconda/envs/centerpoint/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/home/CenterPoint/det3d/models/roi_heads/roi_head.py", line 89, in forward
        shared_features = self.shared_fc_layer(pooled_features.view(batch_size_rcnn, -1, 1))
      File "/home/miniconda/envs/centerpoint/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/home/miniconda/envs/centerpoint/lib/python3.6/site-packages/torch/nn/modules/container.py", line 119, in forward
        input = module(input)
      File "/home/miniconda/envs/centerpoint/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/home/miniconda/envs/centerpoint/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 263, in forward
        return self._conv_forward(input, self.weight, self.bias)
      File "/home/miniconda/envs/centerpoint/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 260, in _conv_forward
        self.padding, self.dilation, self.groups)
    RuntimeError: Given groups=1, weight of size [256, 1920, 1], expected input[1024, 320, 1] to have 1920 channels, but got 320 channels instead
    Killing subprocess 316172

This is my model:

# model settings
model = dict(
    type='TwoStageDetector',
    first_stage_cfg=dict(
        type="PointPillars",
        pretrained='work_dirs/nusc_centerpoint_pp_02voxel_two_pfn_10sweep_one_head_no_vel_040vox/epoch_600.pth',
        reader=dict(
            type="PillarFeatureNet",
            num_filters=[64, 64],
            num_input_features=4,
            with_distance=False,
            voxel_size=(0.40, 0.40, 4),
            pc_range=(-70.4, -70.4, -8.0, 70.4, 70.4, -4.0),
        ),
        backbone=dict(type="PointPillarsScatter", ds_factor=1),
        neck=dict(
            type="RPN",
            layer_nums=[3, 5, 5],
            ds_layer_strides=[2, 2, 2],
            ds_num_filters=[64, 128, 256],
            us_layer_strides=[0.5, 1, 2],
            us_num_filters=[128, 128, 128],
            num_input_features=64,
            logger=logging.getLogger("RPN"),
        ),
        bbox_head=dict(
            type="CenterHead",
            in_channels=128*3,
            tasks=tasks,
            dataset='providentia',
            weight=0.25,
            code_weights=[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
            common_heads={'reg': (2, 2), 'height': (1, 2), 'dim':(3, 2), 'rot':(2, 2)}, # (output_channel, num_conv)
        ),
    ),
    second_stage_modules=[
        dict(
            type="BEVFeatureExtractor",
            pc_start=[-70.4, -70.4],
            voxel_size=[0.40, 0.40],
            out_stride=1
        )
    ],
    roi_head=dict(
        type="RoIHead",
        input_channels=128*3*5,
        model_cfg=dict(
            CLASS_AGNOSTIC=True,
            SHARED_FC=[256, 256],
            CLS_FC=[256, 256],
            REG_FC=[256, 256],
            DP_RATIO=0.3,

            TARGET_CONFIG=dict(
                ROI_PER_IMAGE=128,
                FG_RATIO=0.5,
                SAMPLE_ROI_BY_EACH_CLASS=True,
                CLS_SCORE_TYPE='roi_iou',
                CLS_FG_THRESH=0.75,
                CLS_BG_THRESH=0.25,
                CLS_BG_THRESH_LO=0.1,
                HARD_BG_RATIO=0.8,
                REG_FG_THRESH=0.55
            ),
            LOSS_CONFIG=dict(
                CLS_LOSS='BinaryCrossEntropy',
                REG_LOSS='L1',
                LOSS_WEIGHTS={
                'rcnn_cls_weight': 1.0,
                'rcnn_reg_weight': 1.0,
                'code_weights': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
                }
            )
        ),
        code_size=7
    ),
    NMS_POST_MAXSIZE=500,
    num_point=5,
    freeze=True
)

Could you please spot what might be wrong?

tianweiy commented 3 years ago

change

        input_channels=128*3*5,

to 320

xavidzo commented 3 years ago

Thank you very much, it worked! But why is it 128*3*5 in the Waymo model and 320 in my case?

tianweiy commented 3 years ago

uhm, I just saw from your log that it needs to be 320

tianweiy commented 3 years ago

RuntimeError: Given groups=1, weight of size [256, 1920, 1], expected input[1024, 320, 1] to have 1920 channels, but got 320 channels instead

xavidzo commented 3 years ago

I tried the second stage in my model, but it did not improve; in fact, the performance was worse than with the single stage...

Regarding the loss in this line https://github.com/tianweiy/CenterPoint/blob/cb25e870b271fe8259e91c5d17dcd429d74abc91/det3d/models/bbox_heads/center_head.py#L278

I see that for nuScenes the weight you give to the loc_loss is 0.25, whereas for Waymo you give a weight of 2. Does it mean that in Waymo you care about the localization of objects more than class regression, and in nuScenes it's the other way around? What's the effect of changing the value of this weight?

Do you maybe have a tip to improve the prediction of the heading angle (aka orientation)? I can only get decent results after training for many epochs (> 200)... I guess there must be a simpler way, without overfitting the training data...

tianweiy commented 3 years ago

Does it mean in waymo you care more about the localization of objects more than class regression and in nuscenes is the other way around?

The nuScenes mAP is less strict than Waymo's (it measures 2D center distance instead of 3D IoU), so we use a smaller weight for regression.

What's the effect of changing the value of this weight?

I did a line search for the nuScenes weight, 2 -> 1 -> 0.5 -> 0.25; the current one works the best and is maybe a few percent higher than the original one, but I don't have numbers for the current codebase.

Maybe do you have a tip to improve the prediction of the heading angle (aka orientation). I can only get decent results after training for many epochs (> 200)...I guess there must be a more simple way without overfitting the training data...

I don't know. I would be interested if you find anything

xavidzo commented 3 years ago

Hello @tianweiy, I wanted to mention that I got much better orientation predictions for my dataset after I changed the parameter global_rot_noise from [-0.3925, 0.3925] to [0.0, 0.0].

Now I have another problem; maybe you know what the reason could be. When I do inference with the original .pcd files of my dataset, the network predicts very good bounding box detections. However, when I transform my .pcd files (I only translate the z coordinate of each point in the point cloud to change the range from [-8, 0] to [0, +8]) and adjust the pc_range and voxel_range accordingly in the config file to match the change in the z coordinate, I get zero predictions, no results at all, on these transformed .pcd files. Do you have an idea of what I am missing?