haomo-ai / MotionSeg3D

[IROS 2022] Efficient Spatial-Temporal Information Fusion for LiDAR-Based 3D Moving Object Segmentation
https://npucvr.github.io/MotionSeg3D/
GNU General Public License v3.0

Question about PointHead? #4

Closed · huixiancheng closed this issue 2 years ago

huixiancheng commented 2 years ago

Hi! Thanks for open-sourcing this! I'm interested in the 3D segmentation head. However, it looks like this part of the code is not friendly to larger batch sizes: https://github.com/haomo-ai/MotionSeg3D/blob/dc4c95fcdba2f0819d2bbc4a419f231c55e9c6f3/modules/trainer_refine.py#L165-L195

In my test case, one epoch of the refinement stage takes the same time with bs == 1 as with bs == 6.

Is this the reason you set bs == 1 in the second phase of training? https://github.com/haomo-ai/MotionSeg3D/blob/dc4c95fcdba2f0819d2bbc4a419f231c55e9c6f3/train_yaml/mos_pointrefine_stage.yml#L12-L20

It seems that the training speed in the refinement phase is limited by the small batch size. Is there a better solution?

MaxChanger commented 2 years ago

Thanks for your attention to our project. The current training pipeline is manually divided into two stages (this still needs optimizing); as you said, in the refine stage (which uses PointHead), batch_size is set to 1.

This is because if the PointHead is trained with larger batches, the point clouds need to be downsampled or upsampled to the same number of points (just like the preprocessing in some common point cloud segmentation networks). The image backbone can take multi-batch input, but when it comes to the PointHead we still have to manually split the batch into single samples and sample each one to the same number of points. However, sampling points and forming batches in the forward pass is itself time-consuming if this processing is not done in the dataloader with multiple threads.
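
For illustration, a rough sketch of this kind of equal-size sampling (a hypothetical helper, not the code in this repo):

    import numpy as np

    def sample_to_fixed_size(points, num_points=130000):
        """Randomly down- or up-sample an (N, C) cloud to exactly num_points rows."""
        n = points.shape[0]
        # sample without replacement when downsampling, with replacement when upsampling
        idx = np.random.choice(n, num_points, replace=(n < num_points))
        return points[idx]

    # batch = np.stack([sample_to_fixed_size(pc) for pc in clouds])  # (B, num_points, C)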

The upside is that, since the image backbone has already been trained well in the coarse stage, the PointHead refinement converges well within 2~5 epochs.

And if there is an operation that can quickly sample points in the forward pass and form a larger batch, please let me know how to optimize this part.

huixiancheng commented 2 years ago

Thanks for your kind reply. Have you seen KPRNet and its code? I personally think the idea is very similar. I'm not sure if you have considered this approach: in fact, if we don't split the data here, all samples keep the length max_points set in the config. https://github.com/haomo-ai/MotionSeg3D/blob/dc4c95fcdba2f0819d2bbc4a419f231c55e9c6f3/modules/trainer_refine.py#L167-L175 Also, 150000 is too large; 130000 is totally enough. If we change the padding value in unproj_labels and unproj_xyz from -1 to 0, the padded part should in theory be ignored when computing the loss.
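
For reference, a minimal sketch of that loss-masking idea, assuming a standard nn.CrossEntropyLoss with ignore_index (shapes are made up):

    import torch
    import torch.nn as nn

    IGNORE = 0                                   # padded entries use this label
    criterion = nn.CrossEntropyLoss(ignore_index=IGNORE)

    logits = torch.randn(6, 3, 130000)           # (batch, classes, max_points)
    labels = torch.randint(1, 3, (6, 130000))    # real labels are 1 or 2
    labels[:, 120000:] = IGNORE                  # padding region, skipped by the loss
    loss = criterion(logits, labels)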

But it looks like this will not work, since the sparse_quantize op changes the length again... 😂😂

Maybe the only way to build same-size batches is to do it like the SPVNAS code. https://github.com/mit-han-lab/spvnas/blob/69750e900d8687ac9fcc8e042b171cd1f6beffa1/core/datasets/semantic_kitti.py#L216-L226 I really hate this.
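
From memory, the gist of that snippet is roughly the following (paraphrased, not the exact SPVNAS code; pc, feat, and labels_ are hypothetical names):

    # inside __getitem__, after sparse_quantize has returned the kept indices `inds`
    if len(inds) > self.num_points:              # e.g. self.num_points = 80000
        inds = np.random.choice(inds, self.num_points, replace=False)
    coords, feats, labels = pc[inds], feat[inds], labels_[inds]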

MaxChanger commented 2 years ago

Thanks for the reminder. I have checked KPRNet; its scheme is similar to our PointHead, but with a different structure. By the way, the PointHead here can actually use any structure: MLP, PointNet++, RandLA-Net... It depends on the balance between accuracy and speed. We did not choose KPConv/RandLA-Net in the end because computing KNN in the forward pass is very time-consuming if the KNN is not moved into __getitem__.

In kprnet/deeplab.py#L65-L75, it looks like the input is split along the batch dimension before being fed to KPConv, and then the results are concatenated. [It would be great if you could help confirm this.]

    def forward(self, x, px, py, pxyz, pknn):
        x = resample_grid(x, py, px)
        res = []
        for i in range(x.shape[0]):          # loop over the batch, one sample at a time
            points = pxyz[i, ...]
            feats = x[i, ...].transpose(0, 2).squeeze()
            feats = self.kpconv(points, points, pknn[i, ...], feats)
            res.append(feats.unsqueeze(2).transpose(0, 2).unsqueeze(2))
        res = torch.cat(res, axis=0)         # re-assemble the batch
        res = self.relu(self.bn(res))
        return res

As you mentioned, one possible solution is to move #L216-L226 of __getitem__ in SPVNAS into the forward function. But I vaguely remember that this does not speed training up significantly, because the for loop is still serial, much like batch_size = 1; the difference may only be the size of the mini-batch used for computing the loss and backpropagating.

As for max_points, I previously counted the maximum and minimum numbers of LiDAR points over all sequences (computing a histogram might be more reasonable).
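
For reference, a quick way to reproduce such a count, assuming SemanticKITTI-style velodyne/*.bin scans stored as float32 (x, y, z, intensity); the path is hypothetical:

    import glob
    import numpy as np

    for seq in ['00', '01', '02']:  # extend to all train/val/test sequences
        counts = []
        for f in sorted(glob.glob(f'sequences/{seq}/velodyne/*.bin')):
            scan = np.fromfile(f, dtype=np.float32).reshape(-1, 4)
            counts.append(scan.shape[0])
        print(f'Seq {seq} | min: {min(counts)} / max: {max(counts)}')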

== train ==
Seq 00 | min: 101387 / max: 128146
Seq 01 | min: 82602 / max: 125269
Seq 02 | min: 106946 / max: 129261
Seq 03 | min: 116566 / max: 127239
Seq 04 | min: 121965 / max: 127321                                                                                                
Seq 05 | min: 107038 / max: 128267
Seq 06 | min: 111982 / max: 125148       
Seq 07 | min: 109417 / max: 128169   
Seq 09 | min: 102813 / max: 127526     
Seq 10 | min: 109813 / max: 129392

Seq 30 | min: 42122 / max: 124080
Seq 31 | min: 87224 / max: 125140
Seq 32 | min: 93905 / max: 125054
Seq 33 | min: 104566 / max: 125572
Seq 34 | min: 98717 / max: 123876
Seq 40 | min: 82602 / max: 125269

== validation ==
Seq 08 | min: 92476 / max: 128443
Seq 35 | min: 69719 / max: 114278
Seq 36 | min: 104443 / max: 127104
Seq 37 | min: 99051 / max: 122808
Seq 38 | min: 113621 / max: 124909
Seq 39 | min: 121965 / max: 127321
Seq 41 | min: 89106 / max: 127147

== test ==
Seq 11 | min: 113918 / max: 128706
Seq 12 | min: 100507 / max: 127289
Seq 13 | min: 96271 / max: 125787
Seq 14 | min: 120066 / max: 128726
Seq 15 | min: 92478 / max: 128491
Seq 16 | min: 107913 / max: 127543
Seq 17 | min: 105921 / max: 124015
Seq 18 | min: 101549 / max: 127099
Seq 19 | min: 99373 / max: 127097
Seq 20 | min: 89106 / max: 127147
Seq 21 | min: 107672 / max: 127594

MaxChanger commented 2 years ago

This is also an option, though I think it is imperfect and worth optimizing. Serial processing, although it brings performance gains, is not elegant, especially for inference speed.

BTW, the other day I saw a paper that fuses the range view, BEV, and raw point cloud at the feature level (unfortunately I cannot recall the title), similar to how Cylinder3D and SPVCNN fuse point and voxel representations. Fusing multiple point cloud representations in parallel like this may be more elegant.

huixiancheng commented 2 years ago
  1. In my opinion, KPRNet uses padding to make sure all samples in a batch have the same length, done in the dataset preparation stage (and here).

  2. Yep, you are totally right. It looks like the for loop is unavoidable. But I'm not sure whether it would be faster to slice each sample to its valid length and then feed them into the refine_module as one batch. Like the pseudocode below:

    tmp_inputs = []
    tmp_labels = []
    for j in range(len(n_points)):
        _npoints = n_points[j]
        _px = p_x[j, :_npoints]
        _py = p_y[j, :_npoints]
        _unproj_labels = unproj_labels[j, :_npoints]
        _points_xyz = unproj_xyz[j, :_npoints]

        # gather per-point features from the 2D feature map via the projection indices
        _points_feature = last_feature[j, :, _py, _px]

        # voxelize and filter out duplicate points
        coords = np.round(_points_xyz[:, :3].cpu().numpy() / 0.05)
        coords -= coords.min(0, keepdims=1)
        # `inverse` maps voxels back to the original points
        coords, indices, inverse = sparse_quantize(coords, return_index=True, return_inverse=True)
        coords = torch.tensor(coords, dtype=torch.int, device='cuda')

        feats = _points_feature.permute(1, 0)[indices]
        labels = _unproj_labels[indices]

        # cap the number of voxels per sample
        if len(coords) > 80000:
            keep = np.random.choice(len(coords), 80000, replace=False)
            coords, feats, labels = coords[keep], feats[keep], labels[keep]

        tmp_inputs.append(SparseTensor(coords=coords, feats=feats))
        tmp_labels.append(labels)

    # collate the per-sample sparse tensors into one batched SparseTensor
    inputs = sparse_collate(tmp_inputs).cuda()
    labels = torch.cat(tmp_labels)      # per-voxel labels for the loss
    predict = self.refine_module(inputs)

    I'm not sure about the speed difference between forwarding the net N times and forwarding it once with a batch size of N (see the timing sketch after this list).

  3. Maybe you mean CPGNet? Multi-branch or multi-representation structures may be a better way to reduce information loss and improve performance, but they inevitably bring more computational cost and some optimization problems. 😂 😂 😂
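
Regarding point 2, a tiny timing sketch that could be used to compare the two regimes (a stand-in linear layer, not the actual refine module; sizes are made up):

    import time
    import torch

    net = torch.nn.Linear(64, 20).cuda()           # stand-in for the refine module
    x = torch.randn(6, 100000, 64, device='cuda')  # batch of 6 "clouds"
    net(x)                                         # warm-up before timing

    def timed(fn):
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        fn()
        torch.cuda.synchronize()
        return time.perf_counter() - t0

    t_serial = timed(lambda: [net(x[j]) for j in range(x.shape[0])])  # N forwards, bs = 1
    t_batched = timed(lambda: net(x))                                 # 1 forward, bs = N
    print(f'serial: {t_serial:.4f} s | batched: {t_batched:.4f} s')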

MaxChanger commented 2 years ago
  1. I've noticed that too, but I think the better upsampling scheme I've seen is to randomly repeat points from the original cloud, which seems more reasonable than adding zero points? (See the sketch after this list.)

  2. Yeah, I get your point. This may require debugging to confirm; I will try it when I have time, or you could try it. Another possible solution would be to run this annoying for loop across multiple threads, but I'm not sure whether that would cause other problems.

  3. Thanks for sharing. A related work is "RPVNet: A Deep and Efficient Range-Point-Voxel Fusion Network for LiDAR Point Cloud Segmentation", which combines range, point, and voxel features (without BEV).
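
Back to point 1, a minimal sketch of repetition-based upsampling (a hypothetical helper): it keeps all original points and only repeats randomly chosen ones to fill the deficit, instead of appending zero points.

    import numpy as np

    def upsample_by_repetition(points, num_points):
        """Pad an (N, C) cloud to num_points by repeating random existing points."""
        n = points.shape[0]
        if n >= num_points:
            return points[:num_points]
        extra = np.random.choice(n, num_points - n, replace=True)
        return np.concatenate([points, points[extra]], axis=0)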

huixiancheng commented 2 years ago

Thank you for all your kind replies.