ttaoREtw / ImGeoNet

[ICCV 2023] ImGeoNet: Image-induced Geometry-aware Voxel Representation for Multi-view 3D Object Detection
https://ttaoretw.github.io/imgeonet/

Questions about supervision of geometry shaping module #2

Closed: Cindy0725 closed this issue 1 month ago

Cindy0725 commented 9 months ago

Hi, it's great work!

I am very interested in the supervision of the geometry shaping module. If I am not wrong, the input of the geometry shaping module is V (HxWxDxC). It goes through several Conv3D and transposed Conv3D layers and outputs the geometry shaping weights, which have the same size as V. I am wondering about the detailed steps for obtaining the ground-truth surface voxels.
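For concreteness, here is roughly how I picture that module; the layer widths, the encoder-decoder depth, and the single-channel output are only my guesses, not the exact architecture:

import torch
import torch.nn as nn

class GeometryShaping(nn.Module):
    """Hypothetical sketch of the geometry shaping module (my reading, not the released code).

    Takes the voxel volume V with shape (B, C, X, Y, Z) and predicts a per-voxel
    weight of the same spatial size, squashed to (0, 1) so it can be supervised
    as a surface probability.
    """

    def __init__(self, c_in=128, c_mid=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(c_in, c_mid, 3, stride=2, padding=1),            # downsample
            nn.ReLU(inplace=True),
            nn.Conv3d(c_mid, c_mid, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose3d(c_mid, c_mid, 4, stride=2, padding=1),  # upsample back
            nn.ReLU(inplace=True),
            nn.Conv3d(c_mid, 1, 1),                                    # one weight per voxel
        )

    def forward(self, v):
        w = torch.sigmoid(self.net(v))   # (B, 1, X, Y, Z), in (0, 1)
        return v * w, w                  # re-weighted features + weights for the loss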

[Screenshot of the paper's description of how the ground-truth surface voxels are generated from RGB-D frames]

Assume we have 20 images per scene. Does "RGB-D frames" here mean converting the 20 depth images into 20 sparse point clouds, so that a voxel containing no point from the 20 point clouds gets a negative (0) ground-truth value, and a voxel containing at least one point gets a positive (1) ground-truth value?
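A minimal sketch of the back-projection I have in mind, assuming a pinhole intrinsic K and a camera-to-world pose (all names here are hypothetical, just to illustrate the labeling):

import torch

def depth_to_occupancy(depth, K, cam2world, origin, voxel_size, n_voxels):
    # depth: (H, W) ground-truth depths; K: (3, 3); cam2world: (4, 4).
    # origin: (3,) minimum corner of the grid; voxel_size: (3,); n_voxels: (nx, ny, nz) ints.
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
    z = depth.reshape(-1)
    valid = z > 0
    uv1 = torch.stack([u.reshape(-1).float(), v.reshape(-1).float(), torch.ones_like(z)], dim=0)
    pts_cam = torch.linalg.inv(K) @ (uv1 * z)                        # rays scaled by depth
    pts_cam = torch.cat([pts_cam, torch.ones_like(z)[None]], dim=0)  # homogeneous coords
    pts_world = (cam2world @ pts_cam)[:3, valid]                     # (3, N) world points
    idx = ((pts_world - origin[:, None]) / voxel_size[:, None]).long()
    occ = torch.zeros(n_voxels, dtype=torch.bool)
    inside = ((idx >= 0) & (idx < torch.tensor(n_voxels)[:, None])).all(dim=0)
    occ[idx[0, inside], idx[1, inside], idx[2, inside]] = True       # voxel holds >= 1 point
    return occ  # taking the union over the 20 views would give the positive voxels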

Another question is about "for each camera ray, we also consider locations neighboring surface voxels within margin as positive". What does "for each camera ray" mean here? In this geometry shaping step the multi-view features are already fused, so I am confused about how the neighboring voxels are selected.

The third question is about the size of the ground-truth surface voxel tensor. Is it an HxWxDxC tensor with values 0 and 1? Do you then apply a focal loss between it and the predicted weights to supervise the geometry shaping module?
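For reference, a standard binary focal loss that could be applied between such a 0/1 target and the predicted weights looks like this (alpha and gamma are the usual defaults from the focal-loss paper, not necessarily the values used here):

import torch

def binary_focal_loss(pred_prob, target, alpha=0.25, gamma=2.0):
    # pred_prob: predicted per-voxel probability in (0, 1); target: 0/1 labels of the same shape.
    target = target.float()
    pt = torch.where(target > 0, pred_prob, 1.0 - pred_prob)          # prob. of the true class
    alpha_t = torch.where(target > 0, torch.full_like(pt, alpha), torch.full_like(pt, 1.0 - alpha))
    loss = -alpha_t * (1.0 - pt) ** gamma * torch.log(pt.clamp_min(1e-6))
    return loss.mean()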

Really looking forward to your kind reply. Thank you very much!

ttaoREtw commented 9 months ago

Hey Cindy, thanks for your interest. Here's the script I used for creating the surface voxel labels. It's a bit messy right now since I haven't had time to clean it up. I believe it will answer all your questions. I'll release the code as soon as possible, probably in Feb.

import numpy as np
import torch

def compute_target(points, img_meta, depth_maps, depth_masks, voxel_size, depth_cast_margin):
    device = points.device
    n_images = len(img_meta['lidar2img']['extrinsic'])
    H, W = depth_maps.shape[1], depth_maps.shape[2]
    n_x_voxels, n_y_voxels, n_z_voxels = points.shape[-3:]
    points = points.view(1, 3, -1).expand(n_images, 3, -1)
    # (num_images, 3+1, num_voxels)
    points = torch.cat((points, torch.ones_like(points[:, :1])), dim=1)
    # (num_images, 3, num_voxels)
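    # compute_projection (defined elsewhere in the repo) is assumed to build a per-view
    # projection matrix from the intrinsics / extrinsics in img_meta.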
    points_2d = torch.bmm(compute_projection(img_meta).to(device), points)
    # (num_images, num_voxels)
    z = points_2d[:, 2]
    x = (points_2d[:, 0] / z).round().long()
    y = (points_2d[:, 1] / z).round().long()

    valid = (x >= 0) & (y >= 0) & (x < W) & (y < H) & (z > 0)

    n_voxels = points.shape[-1]
    gt_depth = torch.zeros((n_images, n_voxels), device=device)
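    # For each view, keep only voxels whose projection lands on a valid depth pixel,
    # and record the ground-truth depth observed at that pixel.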
    for i in range(n_images):
        valid[i, valid[i]] = valid[i, valid[i]] & depth_masks[i, y[i, valid[i]], x[i, valid[i]]]
        gt_depth[i, valid[i]] = depth_maps[i, y[i, valid[i]], x[i, valid[i]]]

    extrinsic = torch.tensor(np.stack(img_meta['lidar2img']['extrinsic'])).to(device)
    # Shape: (num_images, 3+1, num_voxels)
    points_cam = torch.bmm(extrinsic, points)
    # Shape: (num_images, num_voxels)
    vx_depth = points_cam[:, 2] / points_cam[:, 3]
    margin = voxel_size[2] * (depth_cast_margin * 0.5)
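    # A voxel is labeled as surface for a view if its own depth along that camera
    # ray falls within +/- margin of the ground-truth depth observed at its pixel.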
    for i in range(n_images):
        gt_dep = gt_depth[i, valid[i]]
        vx_dep = vx_depth[i, valid[i]]
        valid[i, valid[i]] = valid[i, valid[i]] & ((gt_dep <= vx_dep + margin) & \
                                                   (vx_dep - margin <= gt_dep))
    valid = valid.view(n_images, 1, n_x_voxels, n_y_voxels, n_z_voxels)
    target_occ = (valid.sum(dim=0) > 0)
    return target_occ

Cindy0725 commented 9 months ago

Hi @ttaoREtw, thank you very much for your kind reply.

I have some questions about the arguments of this function:

  1. Are points the ground-truth point cloud of the whole scene (not the point clouds generated from each RGB-D frame)?
  2. Are depth_maps and depth_masks the ground-truth depth information for each frame in a scene? E.g. for ScanNet, can I extract these two variables with the script in the ScanNet repo?
  3. What is the value of depth_cast_margin?
  4. What is the relationship between (n_x_voxels, n_y_voxels, n_z_voxels) and voxel_size in this function? Are they the same?

Looking forward to your kind reply. Have a nice day!

ttaoREtw commented 8 months ago

  1. The variable points can be generated by this script:
    @torch.no_grad()
    def get_points(n_voxels, voxel_size, origin):
        points = torch.stack(torch.meshgrid([
            torch.arange(n_voxels[0]),
            torch.arange(n_voxels[1]),
            torch.arange(n_voxels[2])
        ]))
        new_origin = origin - n_voxels / 2. * voxel_size
        points = points * voxel_size.view(3, 1, 1, 1) + new_origin.view(3, 1, 1, 1)
        return points
  2. Yes, they can be extracted by the original ScanNet script. In each batch, we sample 20 views per scene for training.
  3. depth_cast_margin=4
  4. voxel_size=(.16, .16, .16), n_voxels=(40, 40, 16); see the wiring sketch below.

I hope this answers your questions.
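To make the numbers above concrete, here is a minimal wiring sketch of the snippet (the origin value is a placeholder; img_meta, depth_maps, and depth_masks for compute_target come from your ScanNet loader):

import torch

n_voxels = torch.tensor([40, 40, 16])
voxel_size = torch.tensor([0.16, 0.16, 0.16])
origin = torch.tensor([0.0, 0.0, 0.0])   # hypothetical scene origin

points = get_points(n_voxels, voxel_size, origin)   # (3, 40, 40, 16) voxel-centre coordinates
# These points, the img_meta of the 20 sampled views, and their depth_maps / depth_masks
# then go into compute_target(..., voxel_size, depth_cast_margin=4) shown earlier.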

ttaoREtw commented 1 month ago

This file will answer your question.