nickgkan / butd_detr

Code for the ECCV22 paper "Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds"

Why 'end_points['fp2_inds'] = end_points['sa1_inds'][:, 0:num_seed]'? #33

Closed: linhaojia13 closed this issue 1 year ago

linhaojia13 commented 1 year ago

In models/backbone_module.py, you select the first 1024 of the 2048 sa1_inds as fp2_inds. I understand that the intention is to obtain the indices of these 1024 seed points within the entire point cloud, so that they can participate in the loss calculation in compute_points_obj_cls_loss_hard_topk (in models/losses.py).

However, directly taking the first 1024 of the 2048 sa1_inds does not put them in one-to-one correspondence with fp2_xyz. This mismatch would cause the euclidean_dist1 and object_assignment_one_hot variables in compute_points_obj_cls_loss_hard_topk to be misaligned. Doesn't this introduce an error into the supervision signal for KPS?

models/backbone_module.py:

        # --------- 2 FEATURE UPSAMPLING LAYERS --------
        features = self.fp1(end_points['sa3_xyz'], end_points['sa4_xyz'], end_points['sa3_features'],
                            end_points['sa4_features'])
        features = self.fp2(end_points['sa2_xyz'], end_points['sa3_xyz'], end_points['sa2_features'], features)
        end_points['fp2_features'] = features
        end_points['fp2_xyz'] = end_points['sa2_xyz']
        num_seed = end_points['fp2_xyz'].shape[1]
        end_points['fp2_inds'] = end_points['sa1_inds'][:, 0:num_seed]  # indices into the entire input point cloud

        return end_points

models/losses.py:

def compute_points_obj_cls_loss_hard_topk(end_points, topk):
    box_label_mask = end_points['box_label_mask']
    seed_inds = end_points['seed_inds'].long()  # B, K
    seed_xyz = end_points['seed_xyz']  # B, K, 3
    seeds_obj_cls_logits = end_points['seeds_obj_cls_logits']  # B, 1, K
    gt_center = end_points['center_label'][:, :, 0:3]  # B, K2, 3
    gt_size = end_points['size_gts'][:, :, 0:3]  # B, K2, 3
    B = gt_center.shape[0]
    K = seed_xyz.shape[1]
    K2 = gt_center.shape[1]

    point_instance_label = end_points['point_instance_label']  # B, num_points
    object_assignment = torch.gather(point_instance_label, 1, seed_inds)  # B, num_seed
    object_assignment[object_assignment < 0] = K2 - 1  # set background points to the last gt bbox
    object_assignment_one_hot = torch.zeros((B, K, K2)).to(seed_xyz.device)
    object_assignment_one_hot.scatter_(2, object_assignment.unsqueeze(-1), 1)  # (B, K, K2)
    delta_xyz = seed_xyz.unsqueeze(2) - gt_center.unsqueeze(1)  # (B, K, K2, 3)
    delta_xyz = delta_xyz / (gt_size.unsqueeze(1) + 1e-6)  # (B, K, K2, 3)
    new_dist = torch.sum(delta_xyz ** 2, dim=-1)
    euclidean_dist1 = torch.sqrt(new_dist + 1e-6)  # BxKxK2
    euclidean_dist1 = euclidean_dist1 * object_assignment_one_hot + 100 * (1 - object_assignment_one_hot)  # BxKxK2
nickgkan commented 1 year ago

These indices are the result of farthest point sampling (FPS). In short, given N anchor points and K free points, each step of this algorithm searches the K free points for the one whose minimum distance to the anchor set is largest, and promotes it to an anchor. Think of it as max_k(min_n(d(k, n))).
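
For illustration, a minimal PyTorch sketch of this greedy rule; it only assumes the deterministic, start-at-index-0 behaviour of the PointNet++ kernel and is not the actual CUDA implementation:

import torch

def fps(xyz: torch.Tensor, n_samples: int) -> torch.Tensor:
    # Greedy FPS over an (N, 3) point set; returns n_samples indices into xyz.
    num_points = xyz.shape[0]
    inds = torch.zeros(n_samples, dtype=torch.long)
    min_dist = torch.full((num_points,), float('inf'))  # min_n(d(k, n)) so far
    farthest = 0  # deterministic start, as in the PointNet++ kernel
    for i in range(n_samples):
        inds[i] = farthest
        dist = torch.sum((xyz - xyz[farthest]) ** 2, dim=-1)
        min_dist = torch.minimum(min_dist, dist)  # each point's distance to the anchor set
        farthest = int(torch.argmax(min_dist))  # next pick: max_k(min_n(d(k, n)))
    return inds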

In a single run, the internal PointNet++ function that computes FPS always starts from the same point (index 0), so its solution is deterministic. It returns the indices in selection order, starting from the first point and appending the next farthest point at each step, as described above. As a result, if its solution for N1 samples is X, then its solution for N2 < N1 samples is X[:N2]. We exploit this to compute the seed indices cheaply, without calling the FPS function again (a second call would just return an identity range, since the points already form a greedy FPS solution).
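
A quick sanity check of both properties, reusing the fps sketch above (illustrative code, not the repo's API; exact floating-point ties aside, both asserts hold):

torch.manual_seed(0)
cloud = torch.rand(50000, 3)       # stand-in for the raw input point cloud

sa1_inds = fps(cloud, 2048)        # what SA1 does: 2048 indices into the cloud
sa1_xyz = cloud[sa1_inds]

# Prefix property: a shorter run is a prefix of a longer run on the same set.
assert torch.equal(fps(cloud, 1024), sa1_inds[:1024])

# Identity property: FPS on already-FPS-ordered points returns 0..1023,
# so SA2's 1024 seeds are exactly the points at sa1_inds[:, 0:1024].
sa2_inds = fps(sa1_xyz, 1024)      # what SA2 does, on SA1's output
assert torch.equal(sa2_inds, torch.arange(1024))
assert torch.equal(sa1_inds[sa2_inds], sa1_inds[:1024])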

linhaojia13 commented 1 year ago

Oh, I see what you mean! Thank you very much!