google-research / nerf-from-image

Shape, Pose, and Appearance from a Single Image via Bootstrapped Radiance Field Inversion
Apache License 2.0

Recommended Workflow for Custom Dataset #11

Closed · kdmayer closed this 1 year ago

kdmayer commented 1 year ago

Hi Dario,

Just read your paper - super cool work!

I was wondering whether you would be willing to share your recommended workflow for using your model on a custom dataset with a different target class and data source than the ones used in the paper (buildings from Google Street View, in my case).

I am thinking along these lines:

  1. Segment GSV images with PointRend to obtain segmentation masks --> Which output format do you expect for the masks?

  2. Obtain an approximate pose distribution --> I'm not sure how to obtain this, to be honest. Any tips? What exactly needs to be provided to the model?

  3. Train the model --> Would you recommend starting the training from scratch, or can your pre-trained checkpoint work with out-of-distribution classes such as buildings?

  4. Run inference --> Approximately how long does inference on a single image take?

Thank you for your support and see you at CVPR,

Kevin

dariopavllo commented 1 year ago

Hi Kevin,

Sorry for the late reply! For some reason I thought I had already replied. Perhaps I drafted an answer but forgot to send it! :)

  1. The format is not really that important, since you can write a custom data loader. In the end, the model expects binary segmentation masks in the [0, 1] range, but in our datasets the masks are stored in RLE-compressed format using pycocotools, in order to save space (see the RLE sketch after this list).
  2. This is the tricky part. Ideally, you need a rough pose annotation for each image in your dataset, defined with respect to some canonical reference frame (e.g. the roof points towards +Y and the front door towards +X, or something like that). You might need to find or build a pose estimator for buildings. Alternatively, since you are using Google Street View images, you can exploit multi-view information, i.e. correspondences between different images of the same building: run COLMAP or a similar method, then align the recovered poses to a canonical frame (see the pose sketch after this list).
  3. I recommend starting from scratch, since the distribution is very different.
  4. This depends heavily on the GPU and hyperparameters, but I think that with proper tuning you should be able to get roughly 1 image/second.
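
A quick sketch of the mask round-trip for point 1, using pycocotools (the mask here is a random placeholder, and the shapes are illustrative; your own data loader can store things however you like):

```python
import numpy as np
from pycocotools import mask as mask_utils

# Binary mask from your segmentation model (e.g. PointRend): uint8, values {0, 1}.
binary_mask = (np.random.rand(480, 640) > 0.5).astype(np.uint8)

# Encode to compressed RLE; pycocotools expects a Fortran-contiguous uint8 array.
rle = mask_utils.encode(np.asfortranarray(binary_mask))
rle['counts'] = rle['counts'].decode('ascii')  # make it JSON-serializable

# Later, in the data loader: decode back and cast to float so values lie in [0, 1].
rle['counts'] = rle['counts'].encode('ascii')
decoded = mask_utils.decode(rle).astype(np.float32)
assert np.array_equal(decoded.astype(np.uint8), binary_mask)
```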
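
For point 2, once you have rough viewpoints (from COLMAP or manual annotation), you still need to express them in a shared canonical frame. Below is a minimal, hypothetical sketch that turns an azimuth/elevation/distance annotation into a 4x4 camera-to-world pose; the axis conventions are an assumption of this sketch, not something the repo prescribes, so adapt them to whatever canonical frame you define:

```python
import numpy as np

def lookat_pose(azimuth_deg, elevation_deg, distance):
    """Camera-to-world matrix for a camera orbiting the origin and looking at it.

    Assumed canonical frame: +Y is up (roof direction), +X is the building front.
    Degenerate at elevation = +/-90 degrees (up vector parallel to view direction).
    """
    az, el = np.deg2rad(azimuth_deg), np.deg2rad(elevation_deg)
    # Camera position on a sphere of radius `distance` around the object.
    position = distance * np.array([np.cos(el) * np.cos(az),
                                    np.sin(el),
                                    np.cos(el) * np.sin(az)])
    forward = -position / np.linalg.norm(position)       # camera looks at the origin
    right = np.cross(forward, np.array([0.0, 1.0, 0.0]))
    right /= np.linalg.norm(right)
    up = np.cross(right, forward)
    pose = np.eye(4)
    # Columns: camera axes in world coordinates (camera looks down -Z).
    pose[:3, 0], pose[:3, 1], pose[:3, 2] = right, up, -forward
    pose[:3, 3] = position
    return pose
```

If you take poses from COLMAP instead, the idea is the same: pick one rigid transform that aligns the reconstruction with your canonical frame (e.g. rotate it so roofs point towards +Y) and apply it to every recovered camera.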
kdmayer commented 1 year ago

Thanks for sharing your advice!