xavidzo opened this issue 3 years ago
Hi, thanks for your interest.
I will have a large update to the codebase in the coming weeks. And I will make sure to add some code for own data inference in the update.
Hello, there is a Python file in tools/single_inference.py, but I do not understand the code in it:
sub_lidar_topic = ["/velodyne_points",
                   "/top/rslidar_points",
                   "/points_raw",
                   "/lidar_protector/merged_cloud",
                   "/merged_cloud",
                   "/lidar_top",
                   "/roi_pclouds"]

sub_ = rospy.Subscriber(sub_lidar_topic[5], PointCloud2, rslidar_callback, queue_size=1, buff_size=2**24)
What does "/roi_pclouds" mean? Is there any tool for converting the topic "/velodyne_points" to "/roi_pclouds", or can I use "/velodyne_points" directly?
@huster-wugj this is a community contribution; you can find details in https://github.com/tianweiy/CenterPoint/pull/11. You can't use /velodyne_points directly at the moment. I will add a script to support this later.
This file implements a ROS (Robot Operating System) node for inference on point clouds streamed through ROS. "/roi_pclouds" is the name of the topic you need to subscribe to in order to receive a point cloud. Please note that this code is not intended for inference from .bin or .pcd files!
You can use something like this for .bin files:
def load_cloud_from_nuscenes_file(pc_f):
    logging.info('loading cloud from: {}'.format(pc_f))
    num_features = 5
    cloud = np.fromfile(pc_f, dtype=np.float32, count=-1).reshape([-1, num_features])
    # last dimension should be the timestamp.
    cloud[:, 4] = 0
    return cloud

@staticmethod
def load_cloud_from_deecamp_file(pc_f):
    logging.info('loading cloud from: {}'.format(pc_f))
    num_features = 4
    cloud = np.fromfile(pc_f, dtype=np.float32, count=-1).reshape([-1, num_features])
    # last dimension should be the timestamp.
    cloud = np.hstack((cloud, np.zeros([cloud.shape[0], 1])))
    return cloud
def predict_on_local_file(self, cloud_file, i):
    # load sample from file
    # self.points = self.load_cloud_from_nuscenes_file(cloud_file)
    # self.points = self.load_cloud_from_deecamp_file(cloud_file)
    self.points = self.load_cloud_from_my_file(cloud_file)

    # prepare input
    voxels, coords, num_points = self.voxel_generator.generate(self.points)
    num_voxels = np.array([voxels.shape[0]], dtype=np.int64)
    grid_size = self.voxel_generator.grid_size
    coords = np.pad(coords, ((0, 0), (1, 0)), mode='constant', constant_values=0)

    voxels = torch.tensor(voxels, dtype=torch.float32, device=self.device)
    coords = torch.tensor(coords, dtype=torch.int32, device=self.device)
    num_points = torch.tensor(num_points, dtype=torch.int32, device=self.device)
    num_voxels = torch.tensor(num_voxels, dtype=torch.int32, device=self.device)

    self.inputs = dict(
        voxels=voxels,
        num_points=num_points,
        num_voxels=num_voxels,
        coordinates=coords,
        shape=[grid_size]
    )

    # predict
    torch.cuda.synchronize()
    tic = time.time()
    with torch.no_grad():
        outputs = self.net(self.inputs, return_loss=False)[0]
    torch.cuda.synchronize()
    logging.info("Prediction time: {:.3f} s".format(time.time() - tic))

    for k, v in outputs.items():
        if k not in ["metadata"]:
            outputs[k] = v.to('cpu')

    # visualization
    # print(outputs)
    visual_detection(i, np.transpose(self.points), outputs, conf_th=0.5, show_plot=False, show_3D=False)
It's just a ROS topic name; you can change it if you have your own rosbag data.
I guess the problem is solved. Reopen if you still have some issues
Nevertheless, for me it is still unclear how I can use CenterPoint for training on my own custom dataset. Can you guys provide more detailed steps, please? For training, do I have to label data in the same format as nuScenes or Waymo? Should I adapt the "def create_groundtruth_database" function from det3d.datasets.utils.create_gt_database and define a new function inside "tools/create_data.py" similar to "def nuscenes_data_prep"?
Why do you generate pickle files?
In https://github.com/tianweiy/CenterPoint/blob/master/docs/NUSC.md you mentioned for training the "tools/dist_test.py" should be used... Then what's the purpose of the "tools/train.py" script?
Excuse me, perhaps my python skills are not advanced, that's why I cannot see how I can apply centerpoint to my own lidar pcd files.
(This issue was about inference, now I mean training. If you consider I should open a new issue, just let me know)
For training do I have to label data in the same format as nuscenes or waymo?
I think both of these are difficult. I think the easiest way is to look at my Waymo dataset class and write a wrapper like that (please refer to the waymo.py and loading.py files). You want to have the lidar data and then label the objects in the lidar frame with 7-dim boxes and an object class. If you also want to do temporal stuff (the multi-sweep model), this will be hard and needs some sensor calibration which I am not familiar with.
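For illustration, a minimal sketch of what such a wrapper could boil down to (the class name, file layout, and dict keys below are made up for this example and are not the det3d API; the real wrapper should follow the interface used in waymo.py / loading.py):

import pickle
import numpy as np
from torch.utils.data import Dataset

class MyLidarDataset(Dataset):
    """Hypothetical custom dataset: one lidar file plus one label entry per frame."""

    def __init__(self, info_path, num_point_features=4):
        # info_path: pickle file holding a list of per-frame dicts (see the info pkl discussion below)
        with open(info_path, "rb") as f:
            self.infos = pickle.load(f)
        self.num_point_features = num_point_features

    def __len__(self):
        return len(self.infos)

    def __getitem__(self, idx):
        info = self.infos[idx]
        # load points however your frames are stored; a flat float32 .bin file is shown here,
        # for .pcd files you would use a pcd reader instead
        points = np.fromfile(info["lidar_path"], dtype=np.float32)
        points = points.reshape(-1, self.num_point_features)
        return {
            "points": points,                                 # (N, 4): x, y, z, intensity
            "gt_boxes": info["gt_boxes"].astype(np.float32),  # (M, 7): x, y, z, l, w, h, yaw in the lidar frame
            "gt_names": info["gt_names"],                     # (M,) class names
        }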
Should I adapt the "def create_groundtruth_database" function from det3d.datasets.utils.create_gt_database and define a new function inside "tools/create_data.py" similar to "def nuscenes_data_prep"?
Yes..
Why do you generate pickle files?
To save the calibration/file paths/labels all into a single file for efficient accessing.
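For a custom setup, such an info file could be as simple as this (the key names here are illustrative, not the exact det3d schema; gt_boxes and gt_names would come from your own labels):

import pickle
import numpy as np

infos = []
for frame_id in ["000000", "000001"]:   # one entry per training frame
    infos.append({
        "lidar_path": f"pointclouds/{frame_id}.bin",
        "token": frame_id,
        "timestamp": 0.0,
        "gt_boxes": np.zeros((0, 7), dtype=np.float32),  # (M, 7): x, y, z, l, w, h, yaw
        "gt_names": np.array([], dtype="<U16"),          # (M,) class names
        "sweeps": [],                                    # empty if you don't aggregate sweeps
    })

with open("infos_train.pkl", "wb") as f:
    pickle.dump(infos, f)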
In https://github.com/tianweiy/CenterPoint/blob/master/docs/NUSC.md you mentioned for training the "tools/dist_test.py" should be used... Then what's the purpose of the "tools/train.py" script?
dist_test is for testing, train.py is for training. Where did I mention that dist_test is for training?
Let me know if you have further questions.
Best
Ok, thanks a lot for your helpful indications! I have two more questions:
So basically, after training an image segmentation network and projecting the point cloud onto the images to find a correspondence between lidar points and pixels, I could figure out the semantic segmentation scores assigned to each lidar point. Then the point cloud dimensionality must be augmented with these scores (4 values: x, y, z, intensity, plus C class scores). I want to do this with CenterPoint. Can you please give me a hint where in the codebase I can change the input size of the PointPillars or VoxelNet backbones to realize this? Are there other adaptations to the codebase I should consider for this PointPainting approach?
Thanks a lot @muzi2045
For 1, I don't have much experience.
For 2, actually, https://github.com/tianweiy/CenterPoint/blob/fc9c9eef5534486c8a3b9e35c4682822bddc3498/det3d/datasets/pipelines/loading.py#L23 now has a painted option. I guess you will first paint the points with a segmentation mask, and then you can directly use these painted point clouds (you only need to change the input_num_features to the corresponding channel number in the configs).
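As a sketch of the config change meant here (assuming 5 raw point features plus 10 segmentation scores; the field is called num_input_features in the reader block of the pillar configs, and the voxel_size/pc_range shown are just the usual nuScenes pillar values, so double-check against your own config):

# painted points carry the original features plus one score per segmentation class
num_painted_features = 5 + 10

reader = dict(
    type="PillarFeatureNet",
    num_filters=[64, 64],
    num_input_features=num_painted_features,  # was 5 for plain x, y, z, intensity, t
    with_distance=False,
    voxel_size=(0.2, 0.2, 8),
    pc_range=(-51.2, -51.2, -5.0, 51.2, 51.2, 3.0),
)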
I implemented pointpainting for nuScenes in the past. You will need the calibration matrices for camera / lidar and do some careful synchronization, which I feel is not trivial. You can get some ideas from the following code that uses an mmdetection3d nuImages model. It is quite slow though, so you need to do some optimization.
import argparse
from nuscenes.utils.geometry_utils import view_points
import numpy as np
import numba
import pickle
from tqdm import tqdm
from mmdet.apis import inference_detector, init_detector
import torch
import time
import mmcv
import platform
from functools import partial
import os
from det3d.torchie.parallel import collate, collate_kitti
from det3d.torchie.trainer import get_dist_info
from torch.utils.data import DataLoader
from det3d.datasets.loader.sampler import (
    DistributedGroupSampler,
    DistributedSampler,
    DistributedSamplerV2,
    GroupSampler,
)
from collections import defaultdict
from mmdet.datasets.pipelines import LoadImageFromFile
from det3d import torchie

if platform.system() != "Windows":
    # https://github.com/pytorch/pytorch/issues/973
    import resource
    rlimit = resource.getrlimit(resource.RLIMIT_NOFILE)
    resource.setrlimit(resource.RLIMIT_NOFILE, (4096, rlimit[1]))

FILE_CLIENT = mmcv.FileClient(backend='disk')

def load_image(filename):
    global FILE_CLIENT
    img_bytes = FILE_CLIENT.get(filename)
    img = mmcv.imfrombytes(img_bytes, flag='color')
    img = img.astype(np.float32)
    return img

def simple_collate(batch_list):
    ret_dict = defaultdict(list)
    for i in range(len(batch_list)):
        ret_dict['lidars'].append(batch_list[i]['lidars'])
        ret_dict['imgs'].append(batch_list[i]['imgs'])
        ret_dict['sweeps'].append(batch_list[i]['sweeps'])
    return ret_dict

def build_dataloader(
    dataset, batch_size, workers_per_gpu, num_gpus=1, dist=True, **kwargs
):
    data_loader = DataLoader(
        dataset,
        batch_size=1,
        sampler=None,
        shuffle=False,
        num_workers=4,
        collate_fn=simple_collate,
        pin_memory=False,
    )
    return data_loader

def get_time():
    torch.cuda.synchronize()
    return time.time()

def read_file(path, tries=2, num_point_feature=5):
    points = None
    try_cnt = 0
    points = np.fromfile(path, dtype=np.float32).reshape(-1, 5)[:, :num_point_feature]
    return points

def get_obj(path):
    with open(path, 'rb') as f:
        obj = pickle.load(f)
    return obj

def to_tensor(x, device='cuda:0'):
    return torch.tensor(x, dtype=torch.float32).to(device)

def view_points(points, view, normalize, device='cuda:0'):
    """
    This is a helper class that maps 3d points to a 2d plane. It can be used to implement both perspective and
    orthographic projections. It first applies the dot product between the points and the view. By convention,
    the view should be such that the data is projected onto the first 2 axis. It then optionally applies a
    normalization along the third dimension.
    For a perspective projection the view should be a 3x3 camera matrix, and normalize=True
    For an orthographic projection with translation the view is a 3x4 matrix and normalize=False
    For an orthographic projection without translation the view is a 3x3 matrix (optionally 3x4 with last columns
    all zeros) and normalize=False
    :param points: <np.float32: 3, n> Matrix of points, where each point (x, y, z) is along each column.
    :param view: <np.float32: n, n>. Defines an arbitrary projection (n <= 4).
        The projection should be such that the corners are projected onto the first 2 axis.
    :param normalize: Whether to normalize the remaining coordinate (along the third axis).
    :return: <np.float32: 3, n>. Mapped point. If normalize=False, the third coordinate is the height.
    """
    assert view.shape[0] <= 4
    assert view.shape[1] <= 4
    assert points.shape[0] == 3

    viewpad = torch.eye(4).to(device)
    viewpad[:view.shape[0], :view.shape[1]] = view

    nbr_points = points.shape[1]

    # Do operation in homogenous coordinates.
    points = torch.cat((points, torch.ones([1, nbr_points]).to(device)), dim=0)
    points = torch.matmul(viewpad, points)
    points = points[:3, :]

    if normalize:
        points = points / points[2:3, :].repeat(3, 1).reshape(3, nbr_points)

    return points

@torch.no_grad()
def paint(points, segmentations, all_cams_from_lidar, all_cams_intrinsic, num_cls=10, device='cuda:0'):
    num_lidar_point = points.shape[0]
    num_camera = len(all_cams_from_lidar)
    H, W, num_cls = segmentations[0].shape

    # we keep a final mask to keep track all points that can project into an image
    # we set a point's rgb value to (0, 0, 0) if no projection is possible
    # not quite sure how to deal with points that appear in multiple cameras. Now we include all of them and take the average
    final_mask = torch.zeros([num_lidar_point], dtype=torch.bool).to(device)
    segmentation_map = torch.zeros([num_lidar_point, num_cls], dtype=torch.float32).to(device)
    num_occurence = torch.zeros([num_lidar_point], dtype=torch.float32).to(device)  # keep track of how many pixels a point is projected into

    points = to_tensor(points)

    for i in range(num_camera):
        tm = to_tensor(all_cams_from_lidar[i])
        intrinsic = to_tensor(all_cams_intrinsic[i])
        segmentation = to_tensor(segmentations[i])

        # transfer to camera frame
        transform_points = points.clone().transpose(1, 0)
        transform_points[:3, :] = torch.matmul(tm, torch.cat([
            transform_points[:3, :],
            torch.ones(1, num_lidar_point, dtype=torch.float32, device=device)
        ], dim=0))[:3, :]
        depths = transform_points[2, :]

        # take 2d image of the lidar point
        points_2d = view_points(transform_points[:3, :], intrinsic, normalize=True).transpose(1, 0)[:, :2]
        points_seg = torch.zeros([num_lidar_point, num_cls], dtype=points.dtype, device=device)

        # N x 2
        points_2d = torch.floor(points_2d)
        assert H == 900, W == 1600
        points_x, points_y = points_2d[:, 0].long(), points_2d[:, 1].long()

        valid_mask = (points_x > 1) & (points_x < W-1) & (points_y > 1) & (points_y < H-1)
        points_seg[valid_mask] = segmentation[points_y[valid_mask], points_x[valid_mask]]

        # ensure the points are in front of the camera
        mask = valid_mask & (depths > 0)

        segmentation_map[mask] += points_seg[mask]
        final_mask = mask | final_mask
        num_occurence += mask.float()

    # for points that have at least one projection, we compute the average segmentation score
    projection_points = points[final_mask]
    segmentation_map = segmentation_map[final_mask] / num_occurence[final_mask].unsqueeze(-1)
    projection_points = torch.cat([projection_points, segmentation_map], dim=1)

    # for points that are not in any images, we assign background class to the point
    no_projection_points = points[torch.logical_not(final_mask)]
    fake_seg = torch.zeros([no_projection_points.shape[0], num_cls], dtype=torch.float32, device=device)
    no_projection_points = torch.cat([no_projection_points, fake_seg], dim=1)

    points = torch.cat([projection_points, no_projection_points], dim=0)
    return points

def process_seg_result(result):
    # 10 x NUM_BOX x H x W
    segs = []
    for i in range(len(result)):
        per_cls_seg = result[i]
        if len(per_cls_seg) == 0:
            per_cls_seg = np.zeros((900, 1600))
        else:
            per_cls_seg = np.stack(per_cls_seg, axis=0)
            # merge all box's instance mask
            # if one pixel is in any of the boxes, we will set this pixel's value to 1
            # otherwise set it to zero
            # N H W
            per_cls_seg = np.mean(per_cls_seg, axis=0)
            per_cls_seg = (per_cls_seg > 0).astype(np.float32)
        segs.append(per_cls_seg)

    # 10 x H x W
    seg_mask = np.stack(segs, axis=0)
    seg_mask = np.transpose(seg_mask, (1, 2, 0))  # H x W x 10
    return seg_mask

from torch.utils.data import Dataset

class PaintData(Dataset):
    def __init__(
        self,
        info_path,
        start=-1,
        end=-1
    ):
        if end == -1:
            self.infos = get_obj(info_path)[start:]
        elif start == -1:
            self.infos = get_obj(info_path)[:end+1]
        else:
            self.infos = get_obj(info_path)[start:end+1]
        self._set_group_flag()

    def __getitem__(self, index):
        info = self.infos[index]
        sweeps = [info] + info['sweeps']
        lidars = []
        imgs = []

        assert len(sweeps) == 10

        for i in range(10):
            sweep = sweeps[i]
            lidar_path = sweep['lidar_path']
            lidar = read_file(lidar_path)
            lidars.append(lidar)

            all_cams_path = sweep['all_cams_path']
            img = []
            for path in all_cams_path:
                img.append(load_image(path))
            # img = np.stack(img, axis=0)
            imgs.append(img)

        ret_dict = {
            'lidars': lidars,
            'sweeps': sweeps,
            'imgs': imgs
        }

        return ret_dict

    def __len__(self):
        return len(self.infos)

    def _set_group_flag(self):
        """Set flag according to image aspect ratio.
        Images with aspect ratio greater than 1 will be set as group 1,
        otherwise group 0.
        """
        self.flag = np.ones(len(self), dtype=np.uint8)
        # self.flag = np.zeros(len(self), dtype=np.uint8)
        # for i in range(len(self)):
        #     img_info = self.img_infos[i]
        #     if img_info['width'] / img_info['height'] > 1:
        #         self.flag[i] = 1

@torch.no_grad()
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--info_path", type=str)
    parser.add_argument('--config', help='Config file')
    parser.add_argument('--checkpoint', help='Checkpoint file')
    parser.add_argument('--device', default='cuda:0', type=str)
    parser.add_argument('--start', type=int, default=0, )
    parser.add_argument('--end', type=int)
    args = parser.parse_args()

    dataset = PaintData(args.info_path, args.start, args.end)
    model = init_detector(args.config, args.checkpoint, device=args.device)
    loader = build_dataloader(dataset, 1, 4, dist=False)
    prog_bar = torchie.ProgressBar(len(dataset))

    for _, data_batch in enumerate(loader):
        lidars, sweeps, imgs = data_batch['lidars'], data_batch['sweeps'], data_batch['imgs']
        lidars = lidars[0]
        sweeps = sweeps[0]
        imgs = imgs[0]

        for i in range(10):
            sweep = sweeps[i]
            lidar_path = sweep['lidar_path']
            lidar = lidars[i]
            img = imgs[i]

            all_cams_from_lidar = sweep['all_cams_from_lidar']
            all_cams_intrinsic = sweep['all_cams_intrinsic']
            all_cams_path = sweep['all_cams_path']

            instance_seg_results = []
            for j in range(6):
                result = inference_detector(model, img[j])[1]
                seg_mask = process_seg_result(result)
                instance_seg_results.append(seg_mask)

            # run painting
            painted_points = paint(lidar, instance_seg_results, all_cams_from_lidar, all_cams_intrinsic)

            dir_path = os.path.join(*lidar_path.split('/')[:-2], 'stronger_painted_'+lidar_path.split('/')[-2])
            if not os.path.isdir(dir_path):
                os.mkdir(dir_path)
            painted_path = os.path.join(dir_path, lidar_path.split('/')[-1])
            np.save(painted_path, painted_points.cpu().numpy())

        prog_bar.update()

if __name__ == '__main__':
    main()
Please correct me if I am wrong: a file like "infos_train_01sweeps_filter_zero_gt.pkl" stores the aggregated ground-truth information for all training frames at once, right? What's the meaning of "filter_zero" in the filename? What's the difference between "dbinfos_train_1sweeps_withvelo.pkl" and "infos_train_01sweeps_filter_zero_gt.pkl"? Is it mandatory to create these files for a simple custom dataset?
In my case, I have all pcd files in one folder "pointclouds" and in another folder "annotations" I have the corresponding .json files with the labeled vehicles for each pcd / frame. One such .json file looks for example as follows:
{"name":"000000","timestamp":0,"index":0,"labels":[{"id":0,"category":"car","box3d":{"dimension":{"width":3.52,"length":1.9000000000000001,"height":1.58},"location":{"x":13.990062100364485,"y":17.122656041496878,"z":-6.110000000000003},"orientation":{"rotationYaw":0,"rotationPitch":0,"rotationRoll":0}}},{"id":1,"category":"car","box3d":{"dimension":{"width":3.41,"length":1.68,"height":1.58},"location":{"x":27.212441928732442,"y":-2.479418392114875,"z":-6.110000000000001},"orientation":{"rotationYaw":0,"rotationPitch":0,"rotationRoll":0}}}
So my guess is I should create one "info_train.pkl" that contains at least the aggregated data of all .json files, plus the filepath directory to all pcd files. Do I need to store more information in this pkl? I think I do not need to worry about frame transformations of any kind. My scenario is different from the ego-vehicle case since the lidar sensor was mounted on a bridge and remained static pointing to the highway. The global frame is in my case also the lidar frame. In the next image, a recording sequence is shown:
I checked nuScenes and Waymo; their label structure is very complex, and they include far more attributes than I have in the .json labels for my custom data. I see you include "num_points_in_gt" in the info.pkl (I assume that means the number of points inside each ground-truth bounding box). I don't have such an attribute in my labels.
Based on my .json example above, will the object data attributes I collected be enough to perform training, or am I missing something else?
What's the meaning of the "filter_zero" in the filename?
filter out boxes with no lidar points
What's the difference between "dbinfos_train_1sweeps_withvelo.pkl" and "infos_train_01sweeps_filter_zero_gt.pkl"? Is it mandatory to create these files for a simple custom dataset?
infos for annotations, dbinfos for gt database. If you want to use gt aug, you need to generate dbinfos like we do in the create_data.py
I think the collected data are enough. 'num_points_in_gt' is not needed.
Hi, I have a couple more questions, if you could answer them please:
do you make the same transform for nuscenes as well? I don't see the same operation here: https://github.com/tianweiy/CenterPoint/blob/84fde67ac35a85b0364e10e75bfdf789d578024f/det3d/datasets/nuscenes/nusc_common.py#L403
What is the sweeps attribute useful for? Why is it 10 on nuscenes and 1 or 2 on waymo? For using sweeps, each lidar frame must have a timestamp, otherwise it cannot be implemented, right?
It is the precomputed paths of the lidar frames preceding the current reference frame. It is used to aggregate multi-frame lidar data. The nuScenes lidar has 32 beams and Waymo's has 64. As nuScenes produces really sparse point clouds, we use more frames to aggregate data. For Waymo, we could also do 5 or 10, but it is slower. Yes, each frame needs a timestamp.
I assume both nuScenes and Waymo use 5 lidar features, which in the code translates to num_point_features = 5 = x, y, z, intensity and t (timestamp), correct? But here in the code it seems that, at least for nuScenes, you only load the first 4, i.e. x, y, z, and intensity:
For nuScenes, we use x,y,z,intensity and timestamp. For Waymo, we use x, y, z, intensity, elongation. with optional timestamp for two frame model.
do you make the same transform for nuscenes as well? I don't see the same operation here:
This is already done in the construction of nuScenes info file.
I read that in Waymo the bounding boxes' orientation is labeled only with the yaw angle, but when I looked at the info data there are still 3 different values for the orientation in the gt_boxes field. Why?
Not quite sure about 'I looked at the info_data there are still 3 different values for orientation in the gt_boxes field, why' I think there is only one value?
To be more specific, in my custom dataset objects are labeled like x, y, z, width (corresponds to x dimension), length (corresponds to y dimension), and height ( = z obvious). Do I need to transform something?
Please use x, y, z, length (corresponds to y), width (corresponds to x), height (z), rotation
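Based on the JSON sample above, the per-label conversion could look roughly like this (a sketch; label is one entry of the "labels" array, and whether your tool's "width"/"length" is the x or the y extent has to be verified against your data and swapped if needed):

import numpy as np

def label_to_box(label):
    """Convert one JSON label into the 7-dim box described above:
    x, y, z, y-extent, x-extent, height, yaw."""
    loc = label["box3d"]["location"]
    dim = label["box3d"]["dimension"]
    yaw = label["box3d"]["orientation"]["rotationYaw"]
    return np.array(
        [loc["x"], loc["y"], loc["z"],
         dim["length"], dim["width"], dim["height"], yaw],
        dtype=np.float32,
    )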
thanks a lot tianweiy as always for taking your time to reply, still have some questions:
a) "For nuScenes, we use x,y,z,intensity and timestamp. For Waymo, we use x, y, z, intensity, elongation. with optional timestamp for two frame model." However, here in the code, it seems like you only select the x, y, and intensity for nuScenes when you read the pcd file in the function read_file() from pipelines/loading.py. because num_point_feature=4 ??:
def read_file(path, tries=2, num_point_feature=4, painted=False):
    if painted:
        //////////////
    else:
        points = np.fromfile(path, dtype=np.float32).reshape(-1, 5)[:, :num_point_feature]
    return points
b) "Not quite sure about 'I looked at the info_data there are still 3 different values for orientation in the gt_boxes field, why' I think there is only one value?". For instance, this a sample of Waymo's infos_train_01sweeps_filter_zero_gt.pkl:
[{'path': 'data/Waymo/train/lidar/seq_0_frame_0.pkl', 'anno_path': 'data/Waymo/train/annos/seq_0_frame_0.pkl', 'token': 'seq_0_frame_0.pkl', 'timestamp': 1553629304.2740898, 'sweeps': [], 'gt_boxes': array([[-7.75765467e+00, 9.75379848e+00, 2.81035209e+00,
5.00253379e-01, 1.01573773e-01, 4.49999988e-01,
-3.18337395e-03, 3.55915539e-03, -1.53220856e+00],
[ 2.29385357e+01, -6.12172985e+00, 2.50068998e+00,
4.72529471e-01, 6.68555871e-02, 5.29999971e-01,
-1.27970241e-03, -1.58406922e-03, -4.62591171e+00]
......................................................
......................................................
In the above sample, in the field 'gt_boxes' there are 9 values for each box, not just 7... And here is a sample of Waymo's dbinfos_train_1sweeps_withvelo.pkl:
{'VEHICLE': [{'name': 'VEHICLE', 'path': 'gt_database_1sweeps_withvelo/VEHICLE/0_VEHICLE_2.bin', 'image_idx': 0, 'gt_idx': 2, 'box3d_lidar': array([-5.8496891e+01, 4.4255123e+00, 2.4241050e-01, 1.7933178e+00,
3.9103422e+00, 1.4900000e+00, -1.5062687e-01, -1.1980739e-02,
1.4906577e+00], dtype=float32), 'num_points_in_gt': 51, 'difficulty': 0, 'group_id': 0}, {'name': 'VEHICLE', 'path': 'gt_database_1sweeps_withvelo/VEHICLE/0_VEHICLE_3.bin', 'image_idx': 0, 'gt_idx': 3, 'box3d_lidar': array([ 3.7829960e+01, -5.1944214e-01, 1.1195552e+00, 1.8915721e+00,
4.2592230e+00, 1.8200001e+00, 2.0176919e-02, 3.6506136e-03,
-1.6147256e+00], dtype=float32), 'num_points_in_gt': 5, 'difficulty': 0, 'group_id': 1}],
In the field 'box3d_lidar' there are also 9 values. I assume the last three values correspond to angles roll, pitch, yaw. But in theory in waymo they only labeled the yaw angle, so there should be only one value for orientation? ... or at least the other two should be zero, but they are not... hence my question, why?
c) Also, why is your pointpillars-based model on nuscenes faster than the one on waymo (31 fps vs 19 fps) ?
d) Now I am about to start training on my own dataset. Since I have very few frames to train ~ 150 frames, I want to do transfer learning, meaning I want to reuse the weights of the trained model on nuscenes, so I only want to replace and train the last layer of centerpoint, i.e. the detection head. I am looking at the file "nusc_centerpoint_pp_02voxel_circle_nms.py":
tasks = [
    dict(num_class=1, class_names=["car"]),
    dict(num_class=2, class_names=["truck", "construction_vehicle"]),
    dict(num_class=2, class_names=["bus", "trailer"]),
    dict(num_class=1, class_names=["barrier"]),
    dict(num_class=2, class_names=["motorcycle", "bicycle"]),
    dict(num_class=2, class_names=["pedestrian", "traffic_cone"]),
]
can you please explain why you put classes in different dictionaries? I have the intuition that the classes in the same dictionary are similar... In my dataset I have the classes car, truck, van, bus, and pedestrian. Should I put them in different dictionaries as well, or can I make just one dictionary like tasks = [dict(num_class=5, class_names=["car", "truck", "van", "bus", "pedestrian"])]?
Do I need to make more changes in this portion of the config file?
bbox_head=dict(
    # type='RPNHead',
    type="CenterHead",
    mode="3d",
    in_channels=sum([128, 128, 128]),
    norm_cfg=norm_cfg,
    tasks=tasks,
    dataset='nuscenes',
    weight=0.25,
    code_weights=[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.2, 0.2, 1.0, 1.0],
    common_heads={'reg': (2, 2), 'height': (1, 2), 'dim': (3, 2), 'rot': (2, 2), 'vel': (2, 2)},  # (output_channel, num_conv)
    encode_rad_error_by_sin=False,
    direction_offset=0.0,
    bn=True
)
what do these "code_weights" values represent? Are there 10 values because nuScenes has 10 classes? So in my case, with 5 classes, should I specify only five "code_weights"?
a) "For nuScenes, we use x,y,z,intensity and timestamp. For Waymo, we use x, y, z, intensity, elongation. with optional timestamp for two frame model." However, here in the code, it seems like you only select the x, y, and intensity for nuScenes when you read the pcd file in the function read_file() from pipelines/loading.py. because num_point_feature=4 ??:
Yeah, we concatenate the timestamp later https://github.com/tianweiy/CenterPoint/blob/890e538afc95ee741f4c29b4c30b37251027e31c/det3d/datasets/pipelines/loading.py#L138
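Conceptually, the 5th channel is the per-point time lag that gets stacked on when the sweeps are merged, along these lines (an illustration of the idea, not the exact code behind the link):

import numpy as np

def merge_sweeps(ref_points, sweep_points_list, sweep_time_lags):
    """ref_points: (N, 4) x, y, z, intensity of the key frame;
    sweep_points_list: earlier sweeps already transformed into the key frame;
    sweep_time_lags: their time offsets relative to the key frame."""
    clouds = [np.hstack([ref_points, np.zeros((ref_points.shape[0], 1), np.float32)])]
    for pts, dt in zip(sweep_points_list, sweep_time_lags):
        times = np.full((pts.shape[0], 1), dt, dtype=np.float32)
        clouds.append(np.hstack([pts, times]))
    return np.concatenate(clouds, axis=0)   # (N_total, 5): x, y, z, intensity, t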
In the field 'box3d_lidar' there are also 9 values. I assume the last three values correspond to angles roll, pitch, yaw. But in theory in waymo they only labeled the yaw angle, so there should be only one value for orientation? ... or at least the other two should be zero, but they are not... hence my question, why?
Sorry for the confusion, they are not three angles. They are x,y,z,dx,dy,dz,velocity_x, velocity_y, yaw
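So if you only need the 7-dim geometric box, you would just drop the two velocity columns, e.g.:

import numpy as np

gt_boxes = np.zeros((2, 9), dtype=np.float32)  # placeholder; in practice loaded from the info pkl
# columns: x, y, z, dx, dy, dz, velocity_x, velocity_y, yaw
boxes_7d = np.concatenate([gt_boxes[:, :6], gt_boxes[:, 8:9]], axis=1)  # (N, 7)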
c) Also, why is your pointpillars-based model on nuscenes faster than the one on waymo (31 fps vs 19 fps) ?
Two reasons. 1. Waymo has a larger scene (150x150 vs. 100x100 meters). 2. The Waymo model's output stride is 1 compared to 4 in nuScenes. (Waymo with output stride 4 doesn't perform well.)
d) can you please explain why you put classes in different dictionaries? I have the intuition that the classes in the same dictionary are similar... In my dataset I have the classes car, truck, van, bus, and pedestrian. Should I put them in different dictionaries as well, or can I make just one dictionary like tasks = [dict(num_class=5, class_names=["car", "truck", "van", "bus", "pedestrian"])]?
Basically, we assign different objects to different detection heads to solve the class imbalance problem in nuScenes. If your dataset is not class imbalanced, you can just use a single group like what you do above, otherwise, try to follow our group definition which proves to work well on nuScenes
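For the five classes mentioned above, the single-group variant would simply be (class names taken from the question):

tasks = [
    dict(num_class=5, class_names=["car", "truck", "van", "bus", "pedestrian"]),
]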
Do I need to make more changes in this portion of the config file?
no need. code_weights is the weight for regression loss of 'reg', 'height', 'dim', etc.. it is not related to classes
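In other words, the length of code_weights tracks the regression channels defined in common_heads, not the number of classes:

# nuScenes common_heads: 'reg' (2) + 'height' (1) + 'dim' (3) + 'rot' (2) + 'vel' (2)
num_regression_channels = 2 + 1 + 3 + 2 + 2       # = 10, hence 10 code_weights
# a head without the velocity branch ('reg' + 'height' + 'dim' + 'rot')
num_regression_channels_no_vel = 2 + 1 + 3 + 2    # = 8 code_weights, regardless of class count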
Probably later on I would like to fuse camera images with the results from CenterPoint, following the state-of-the-art method in CLOCs. The available code released for now fuses the outputs of the 2D detector Cascade-RCNN with the 3D detector SECOND. https://arxiv.org/abs/2009.00784, https://github.com/pangsu0613/CLOCs In the README of the above github link, in the section "Fusion of other 3D and 2D detectors", the author says the following:
Step 3: Since the number of detection candidates are different for different 2D/3D detectors, you need to modify the corresponding parameters in the CLOCs code. Then train the CLOCs fusion. For example, there are 70400 (200x176x2) detection candidates in each frame from SECOND with batch size equals to 1. It is a very large number because SECOND is a one-stage detector, for other multi-stage detectors, you could just take the detection candidates before the final NMS function, that would reduce the number of detection candidates to hundreds or thousands.
a) How would this statement apply to CenterPoint, can you explain please? I mean, what's the number of detection candidates in each frame? Is this number always fixed / the same for all frames? and how can I take the number of detection candidates in CenterPoint before NMS is applied?
b) Update: I've been training on my custom data for more than 100 epochs. My dataset is quite small, only 150 frames. This is the performance so far:
2021-03-07 15:14:11,158 - INFO - Epoch [122/125][35/38] lr: 0.00000, eta: 0:02:21, time: 0.408, data_time: 0.070, transfer_time: 0.004, forward_time: 0.078, loss_parse_time: 0.000 memory: 4080,
2021-03-07 15:14:11,158 - INFO - task : ['car'], loss: 3.0604, hm_loss: 2.5405, loc_loss: 2.0795, loc_loss_elem: ['0.2258', '0.2351', '0.2916', '0.0908', '0.0972', '0.1771', '1.5755', '0.3259', '0.4193', '0.1623'], num_positive: 11.2000
2021-03-07 15:14:11,158 - INFO - task : ['trailer'], loss: 0.8950, hm_loss: 0.6023, loc_loss: 1.1709, loc_loss_elem: ['0.1410', '0.1478', '0.1197', '0.1316', '0.0734', '0.0514', '0.6290', '0.2809', '0.2147', '0.1092'], num_positive: 7.6000
2021-03-07 15:14:11,158 - INFO - task : ['bus'], loss: 1.3275, hm_loss: 1.1275, loc_loss: 0.8001, loc_loss_elem: ['0.1548', '0.1824', '0.0938', '0.0199', '0.0174', '0.0179', '0.0063', '0.0058', '0.2177', '0.0938'], num_positive: 12.8000
2021-03-07 15:14:11,158 - INFO - task : ['van'], loss: 0.7695, hm_loss: 0.5451, loc_loss: 0.8977, loc_loss_elem: ['0.1176', '0.1482', '0.1664', '0.0785', '0.0616', '0.0487', '0.2156', '0.3060', '0.0561', '0.1162'], num_positive: 5.8000
2021-03-07 15:14:11,158 - INFO - task : ['pedestrian'], loss: 0.3599, hm_loss: 0.2008, loc_loss: 0.6362, loc_loss_elem: ['0.2162', '0.1056', '0.1254', '0.0568', '0.0615', '0.0198', '0.0024', '0.0028', '0.0198', '0.0301'], num_positive: 8.0000
Do you have any idea why the loss for the car class does not get lower? I've seen it's mostly around 3 and 4, but for the other classes it started high and has converged rapidly to values around 1. In my dataset, the car class is the most represented, followed by the trailer class, whose loss does decrease easily... I wonder why this is not the case for the car class.
a) How would this statement apply to CenterPoint, can you explain please? I mean, what's the number of detection candidates in each frame? Is this number always fixed / the same for all frames? and how can I take the number of detection candidates in CenterPoint before NMS is applied?
Never read this and sorry that I don't have time to read this recently.
b) It seems that one value is off a lot ['1.5755', '0.3259', '0.4193', '0.1623']; unfortunately I don't have many clues.
I am aware that you are quite busy right now, but if you have some free time after ICCV, please try to answer a) of my previous comment. You don't need to read the whole CLOCs paper, only the README in the github link I posted, which would take you no more than 5 minutes. Thank you in advance tianweiy
For nuScenes, you will use box here https://github.com/tianweiy/CenterPoint/blob/1ecebf980f75cfe7f53cc52032b184192891c9b9/det3d/models/bbox_heads/center_head.py#L481
For Waymo (or if you want to train two stage model), you will use this https://github.com/tianweiy/CenterPoint/blob/1ecebf980f75cfe7f53cc52032b184192891c9b9/det3d/models/detectors/two_stage.py#L149
I am working on adding tutorials and code for easier training on custom datasets. I am also working on accelerating the inference a bit. I will try to merge these two asap but it may take me one week or so.
check my reply here https://github.com/tianweiy/CenterPoint/issues/78#issuecomment-812951128 though I believe you already finished something.
Yes, I can show you in this gif how CenterPoint works after training on my own data recorded from the highway. The learned weights from nuscenes didn't help much, so I trained from scratch. Basically, you can see cars and trucks being detected and visualized in Rviz. Thanks a lot for all your support!
Two questions, this one maybe off-topic: Is there any reason why you decided to use det3d for your work? I mean, I've seen many researchers use openPCDet instead for their implementations.
About the voxelization: for nuscenes you used 30000 as the maximum number of voxels and 20 as the maximum number of points per voxel. In other projects for kitti I saw 12000 as max_voxel_num and 100 as max_points_in_voxel.... Can you please explain how these values affect the performance of the network? For my own data I chose the same config as nuscenes, but I don't know if I should change it given that my point cloud frames are quite sparse....
Is there any reason why you decided to use det3d for your work? I mean, I've seen many researchers use openPCDet instead for their implementations.
I finished most of the work for CenterPoint before OpenPCDet got Waymo / nuScenes support. I also tried OpenPCDet for a while last summer, but it trains a little slower and the numbers are a little lower than my implementation here, so I just stuck with the current codebase.
About the voxelization: for nuscenes you used 30000 as the maximum number of voxels and 20 as the maximum number of points per voxel. In other projects for kitti I saw 12000 as max_voxel_num and 100 as max_points_in_voxel.... Can you please explain how these values affect the performance of the network? For my own data I chose the same config as nuscenes, but I don't know if I should change it given that my point cloud frames are quite sparse....
I think 100 is too large probably. 30000/12000 are set to cover most of the voxels / pillars. 20/100 are not that important and smaller number will be faster.
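For reference, the knobs being discussed sit in the voxel generator part of the config, roughly like this (values follow the nuScenes-style pillar setup mentioned above; key names may differ slightly in your config, so check it):

voxel_generator = dict(
    range=(-51.2, -51.2, -5.0, 51.2, 51.2, 3.0),  # point cloud range in meters
    voxel_size=(0.2, 0.2, 8),                     # pillar footprint in x/y, full extent in z
    max_points_in_voxel=20,                       # points kept per pillar; extras are dropped
    max_voxel_num=30000,                          # pillars kept per frame; sized to cover most frames
)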
Hello, I have one question about how the DataLoader works in the training script. I want to try dynamic voxelization, for which I need the points as input to the network instead of the voxels. The number of features I use for each point is 4. I checked that the points are also stored in the example dictionary, but somehow the dimension of the points changes after the data is passed to the DataLoader, specifically the second dim changes from 4 to 5. Please look at this:
# print statement in datasets/pipelines/formatting.py, the last step in the pipeline for loading the data
# @PIPELINES.register_module
# class Reformat(object):
# def __init__(self, **kwargs):
# -----------------------------------
# def __call__(self, res, info):
# meta = res["metadata"]
# points = res["lidar"]["points"]
# voxels = res["lidar"]["voxels"]
# print("poins shape in Reformat: ", points.shape)
points shape in Reformat: (132250, 4)
# print statement in datasets/providentia.py, my own dataset wrapper, similar to waymo.py
# def get_sensor_data(self, idx):
# .....................................................
# data, _ = self.pipeline(res, info)
# print("points shape in get_sensor_data: ",data["points"].shape)
points shape in get_sensor_data: (132250, 4)
# print statement in torchie/trainer/trainer.py
# def train(self, data_loader, epoch, **kwargs):
# ......................................................
# for i, data_batch in enumerate(data_loader):
# print("points shape in data_batch: ", data_batch["points"].shape)
points shape in data_batch: torch.Size([132087, 5]) # here the dimensions change, from 4 to 5
# print statement in models/detectors/point_pillars.py
# def forward(self, example, return_loss=True, **kwargs):
# ..........................................................
# print("example points shape: ", example["points"].shape)
example points shape: torch.Size([132087, 5])
Do you have any idea why the dimensions change?
It is padded with an index to indicate its location in a batch, see this
you can remove this index
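A quick illustration of undoing that padding inside the model (this assumes the batch index is the prepended first column, like the voxel coordinates; check one batch in your version to confirm which column it is):

import torch

def split_batched_points(points):
    """points: (N_total, 5) tensor from example["points"] after collation;
    returns a list of per-sample (N_i, 4) tensors, e.g. for dynamic voxelization."""
    batch_idx = points[:, 0].long()   # assumed: batch index prepended as the first column
    feats = points[:, 1:]             # original x, y, z, intensity
    return [feats[batch_idx == b] for b in torch.unique(batch_idx)]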
@tianweiy, I am facing a problem when trying to prepare the Waymo dataset for training. I followed your instructions, but when invoking waymo_converter.py it throws warnings and errors related to allocation on the CUDA device (I use the same conda env + waymo-open-dataset-tf-1-15-0==1.2.0 on a V100). Do you have any idea how to deal with this? And how large will the generated train folder be?
Thanks!
I have a question about how you decide when to stop training: how do you choose the number of epochs for nuScenes and Waymo? Do you stop when the loss values decay below 0.5?
@YoushaaMurhij remember to disable the GPU with CUDA_VISIBLE_DEVICES=-1 during conversion so it works with the multiprocessing. I don't remember exactly, but it is about 650G.
@xavidzo I never tune this. I always use 20 epochs for nuScenes. For Waymo, I often use 12 epochs to verify ideas as it is too large ...
and regarding the point cloud range parameters, voxel size and grid_size:
what's the rule for the appropriate pc_range values?
I noticed that with voxel_size = (0.2, 0.2, 8), when I change the point cloud range from:
pc_range = (-70.4, -70.4, -8.0, 70.4, 70.4, 0.0)
to:
pc_range = (-70.4, -60.4, -8.0, 70.4, 60.4, 0.0)
I get an error of mismatching dimensions in the pillar encoder... what's the reason?
Does it have to be a square grid size with same size in width and length?
The grid size (range / voxel_size) needs to be divisible by 4 / 8, I don't remember exactly
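That matches the error above: with voxel_size 0.2 the original square range divides evenly, while the shrunk y-range does not. A quick sanity check (plain arithmetic for the rule stated above; whether the divisor is 4 or 8 depends on the downsample strides):

voxel_size = 0.2
for axis, lo, hi in [("x", -70.4, 70.4), ("y", -60.4, 60.4)]:
    cells = round((hi - lo) / voxel_size)   # number of voxels along this axis
    print(axis, cells, cells % 8 == 0)
# x: 704 voxels, 704 / 8 = 88   -> fine
# y: 604 voxels, 604 / 8 = 75.5 -> not divisible, which is what the pillar encoder complains about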
Ok thanks, do you maybe have some advice on how to remove redundant boxes? I notice my detector sometimes predicts bounding boxes of the same class/label with a high score (> 0.6), but these boxes overlap a lot for the same object... there is only a tiny difference in the predicted orientation. Please look at this example where two boxes are overlapping for the same truck.
have you used any nms? It should be less than 0.2 overlapping? https://github.com/tianweiy/CenterPoint/blob/9fdf572e854d20d58f4b0b8ef9f5785a5d39147a/configs/nusc/voxelnet/nusc_centerpoint_voxelnet_01voxel.py#L72
Yes, in test_cfg I set the nms_iou_threshold to 0.2... so the smaller, the better to remove overlapping boxes? I thought a higher threshold removes more false positives, as in 2D detection? Regarding the voxel size, smaller size makes the detections more precise, but the inference becomes slower, or is it only a matter of fine-tuning to a particular dataset? Also https://github.com/tianweiy/CenterPoint/blob/9fdf572e854d20d58f4b0b8ef9f5785a5d39147a/configs/nusc/pp/nusc_centerpoint_pp_02voxel_two_pfn_10sweep.py#L41 For the us_layer_strides parameter, how do you find a good setting, just trying what works well? does it depend on the point cloud density, number of layers in LiDAR?
Yes, in test_cfg I set the nms_iou_threshold to 0.2... so the smaller, the better to remove overlapping boxes? I thought a higher threshold removes more false positives, as in 2D detection?
These are two different thresholds. A smaller iou threshold will remove more boxes, and then a higher score threshold will remove more false positives.
Regarding the voxel size, smaller size makes the detections more precise, but the inference becomes slower, or is it only a matter of fine-tuning to a particular dataset?
Both
For the us_layer_strides parameter, how do you find a good setting, just trying what works well? does it depend on the point cloud density, number of layers in LiDAR?
I never tuned this.
I have more questions: a) you said you don't tune the us_layer_strides, but this is different for nuscenes and waymo, so I assume they are not equal for a reason(?).
b) Some time ago you mentioned "The Waymo model's output stride is 1 compared to 4 in nuScenes. (Waymo with output stride 4 doesn't perform well)", why? Is this reflected in the lines I cited in a) or where is this difference in the code?
c) for nuscenes you use the "db_sampler" and for waymo you don't, may I know why please?
Also it's written type="GT-AUG", but below is "enable=False", so you don't use GT-AUG in the end. Shouldn't it be instead "enable=True"?
d) I see also a difference in the parameter "global_rot_noise":
why are these values different and how does this "global_rot_noise" affect the training process?
e) I don't quite understand how db_sampler works:
sample_groups=[
    dict(car=2),
    dict(truck=3),
    dict(construction_vehicle=7),
    dict(bus=4),
    dict(trailer=6),
    dict(barrier=2),
    dict(motorcycle=6),
    dict(bicycle=6),
    dict(pedestrian=2),
    dict(traffic_cone=2),
],
Does it mean that to every frame you add more ground-truth samples according to this distribution? So you add 2 more cars, 3 more trucks, and so on, to every frame? What's a good strategy? Have higher numbers for classes that are less represented in the dataset, or how do you set those numbers?
a) b) Some time ago you mentioned "The Waymo model's output stride is 1 compared to 4 in nuScenes. (Waymo with output stride 4 doesn't perform well)", why? Is this reflected in the lines I cited in a) or where is this difference in the code?
Yes, it is defined by the code above. Waymo is 1, following a few papers in the past. nuScenes is just 4 (inherited from det3d). I tried 4 for Waymo; it doesn't work very well.
c) Waymo gets millions of cars and pedestrians so for the two class model, we don't use db_sampler. We do use db_sampler for three class model as the additional cyclist are rare.
d) still following the original parameters. I remember pi/4 for nuScenes also works and is slightly better (< 0.5)
e) Yes, copied from det3d
I am trying to train a two-stage model, but somehow I get some error of mismatching dimensions:
Traceback (most recent call last):
File "./tools/train.py", line 137, in
This is my model:
# model settings
model = dict(
    type='TwoStageDetector',
    first_stage_cfg=dict(
        type="PointPillars",
        pretrained='work_dirs/nusc_centerpoint_pp_02voxel_two_pfn_10sweep_one_head_no_vel_040vox/epoch_600.pth',
        reader=dict(
            type="PillarFeatureNet",
            num_filters=[64, 64],
            num_input_features=4,
            with_distance=False,
            voxel_size=(0.40, 0.40, 4),
            pc_range=(-70.4, -70.4, -8.0, 70.4, 70.4, -4.0),
        ),
        backbone=dict(type="PointPillarsScatter", ds_factor=1),
        neck=dict(
            type="RPN",
            layer_nums=[3, 5, 5],
            ds_layer_strides=[2, 2, 2],
            ds_num_filters=[64, 128, 256],
            us_layer_strides=[0.5, 1, 2],
            us_num_filters=[128, 128, 128],
            num_input_features=64,
            logger=logging.getLogger("RPN"),
        ),
        bbox_head=dict(
            type="CenterHead",
            in_channels=128*3,
            tasks=tasks,
            dataset='providentia',
            weight=0.25,
            code_weights=[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
            common_heads={'reg': (2, 2), 'height': (1, 2), 'dim': (3, 2), 'rot': (2, 2)},  # (output_channel, num_conv)
        ),
    ),
    second_stage_modules=[
        dict(
            type="BEVFeatureExtractor",
            pc_start=[-70.4, -70.4],
            voxel_size=[0.40, 0.40],
            out_stride=1
        )
    ],
    roi_head=dict(
        type="RoIHead",
        input_channels=128*3*5,
        model_cfg=dict(
            CLASS_AGNOSTIC=True,
            SHARED_FC=[256, 256],
            CLS_FC=[256, 256],
            REG_FC=[256, 256],
            DP_RATIO=0.3,
            TARGET_CONFIG=dict(
                ROI_PER_IMAGE=128,
                FG_RATIO=0.5,
                SAMPLE_ROI_BY_EACH_CLASS=True,
                CLS_SCORE_TYPE='roi_iou',
                CLS_FG_THRESH=0.75,
                CLS_BG_THRESH=0.25,
                CLS_BG_THRESH_LO=0.1,
                HARD_BG_RATIO=0.8,
                REG_FG_THRESH=0.55
            ),
            LOSS_CONFIG=dict(
                CLS_LOSS='BinaryCrossEntropy',
                REG_LOSS='L1',
                LOSS_WEIGHTS={
                    'rcnn_cls_weight': 1.0,
                    'rcnn_reg_weight': 1.0,
                    'code_weights': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
                }
            )
        ),
        code_size=7
    ),
    NMS_POST_MAXSIZE=500,
    num_point=5,
    freeze=True
)
Could you please spot what might be wrong?
change
input_channels=128*3*5,
to 320
thank you very much, it worked, but why is it 128*3*5 in the waymo model and 320 in my case?
uhm, I just saw from your log that it needs to be 320
RuntimeError: Given groups=1, weight of size [256, 1920, 1], expected input[1024, 320, 1] to have 1920 channels, but got 320 channels instead
I tried the second_stage in my model, but it did not improve, in fact, the performance was worse than single_stage...
Regarding the loss in this line https://github.com/tianweiy/CenterPoint/blob/cb25e870b271fe8259e91c5d17dcd429d74abc91/det3d/models/bbox_heads/center_head.py#L278
I see for nuScenes the weight you give to the loc_loss is 0.25, whereas for Waymo you give a weight of 2. Does it mean that in Waymo you care more about the localization of objects than about class regression, and in nuScenes it is the other way around? What's the effect of changing the value of this weight?
Maybe do you have a tip to improve the prediction of the heading angle (aka orientation). I can only get decent results after training for many epochs (> 200)...I guess there must be a more simple way without overfitting the training data...
Does it mean that in Waymo you care more about the localization of objects than about class regression, and in nuScenes it is the other way around?
nuScenes map is less strict than Waymo (measure 2d distance instead of 3d iou) so we use a smaller weight for regression.
What's the effect of changing the value of this weight?
I did a line search for the nuScenes weight, 2 -> 1 -> 0.5 -> 0.25; the current one works the best and is maybe a few percent higher than the original one, but I don't have numbers for the current codebase.
Maybe do you have a tip to improve the prediction of the heading angle (aka orientation). I can only get decent results after training for many epochs (> 200)...I guess there must be a more simple way without overfitting the training data...
I don't know. I would be interested if you find anything
Hello @tianweiy, I wanted to mention that I got much better orientation prediction results for my dataset after I changed the parameter global_rot_noise from [-0.3925, 0.3925] to [0.0, 0.0].
Now I have another problem, maybe you know what the reason could be: when I do inference on the original pcd files of my dataset, the network predicts very good bounding box detections. However, when I transform my pcd files (I only translate the z coordinate of each point in the point cloud to change the range from [-8, 0] to [0, +8]) and adjust the pc_range and voxel_range accordingly in the config file of the network to match the change in the z coordinate, I get zero predictions, no results at all on these transformed pcd files. Do you have an idea of what I am missing?
Hi, I think the work you've done is really amazing, congrats!! For my thesis project, I would like to use your network to analyze point cloud data. Right now what I have is only a series of .pcd files that are recordings of vehicles on the highway. A lidar sensor was mounted on a bridge, so the recordings are from a static point of view. Could you please give me some hints on how to prepare these data as input to your network to perform 3D object detection? Do I have to prepare database tables in .json format according to the nuScenes specifications? Is that a must? What other steps are required?
Thank you in advance