I was wondering whether you would be willing to share your recommended workflow for using your model on a custom dataset with a different target class and data source than those used in the paper (in my case, buildings from Google Street View).
I am thinking along these lines:
Segment the GSV images with PointRend to obtain segmentation masks
--> Which output format do you expect for the masks?
Obtain approximate pose distribution
--> I'm not sure how to obtain this, to be honest. Any tips? What exactly needs to be provided to the model?
Train model
--> Would you recommend training from scratch, or can your pre-trained checkpoint handle out-of-distribution classes such as buildings?
Run inference
--> How long does inference on a single image take approximately?
Sorry for the late reply! For some reason I thought I had already replied. Perhaps I drafted an answer but forgot to send it! :)
The format is not really that important, since you can write a custom data loader. In the end, the model expects binary segmentation masks with values in the [0, 1] range; in our datasets the masks are stored in RLE-compressed format using pycocotools in order to save space.
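For illustration, here is a minimal sketch of a COCO-style RLE round trip in plain NumPy (run lengths over a column-major flattening, always starting with a zero-run). Our datasets actually use pycocotools' compressed variant, and the function names here are just illustrative:

```python
import numpy as np

def rle_encode(mask: np.ndarray) -> dict:
    """COCO-style uncompressed RLE: run lengths of alternating 0s and 1s,
    column-major order, always starting with the count of zeros."""
    flat = mask.astype(np.uint8).flatten(order="F")
    change = np.flatnonzero(np.diff(flat)) + 1          # run boundaries
    counts = np.diff(np.concatenate([[0], change, [flat.size]])).tolist()
    if flat[0] == 1:                                    # must start with a zero-run
        counts = [0] + counts
    return {"size": list(mask.shape), "counts": counts}

def rle_decode(rle: dict) -> np.ndarray:
    """Recover the binary mask (values in [0, 1]) that the model expects."""
    h, w = rle["size"]
    flat = np.zeros(h * w, dtype=np.float32)
    pos, val = 0, 0
    for run in rle["counts"]:
        flat[pos:pos + run] = val
        pos += run
        val = 1 - val
    return flat.reshape((h, w), order="F")

# Round trip on a dummy 4x4 mask with a 2x2 block of ones
mask = np.zeros((4, 4), dtype=np.uint8)
mask[1:3, 1:3] = 1
rle = rle_encode(mask)
restored = rle_decode(rle)
```

In practice you would just call pycocotools' `mask.encode`/`mask.decode`, which implement the compressed version of the same idea.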
This is the tricky part. Ideally, you would need a rough pose annotation for each image in your dataset, according to some canonical reference frame (e.g. the roof should point towards +Y, the front door towards +X, or something like that). You might need to find or build a pose estimator for buildings. Alternatively, since you are using Google Street View images, you could exploit multi-view information, i.e. correspondences between different images. Then you can run COLMAP or similar methods, and align the poses to a canonical frame.
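As a concrete sketch of the alignment step: assuming COLMAP gives you world-to-camera rotations in an arbitrary reference frame, you could pick one reference image, decide its pose in the canonical frame by hand, and propagate that choice to all other images. Function and variable names below are hypothetical:

```python
import numpy as np

def align_to_canonical(R_cams, R_ref_cam, R_ref_canon):
    """Re-express COLMAP camera rotations in a canonical object frame.

    R_cams:       list of 3x3 world-to-camera rotations from COLMAP.
    R_ref_cam:    COLMAP rotation of one manually inspected reference image.
    R_ref_canon:  pose assigned by hand to that reference image in the
                  canonical frame (e.g. roof towards +Y, door towards +X).
    """
    # A maps canonical-frame coordinates to COLMAP world coordinates,
    # chosen so the reference image gets exactly its hand-picked pose:
    # R_ref_cam @ A == R_ref_canon.
    A = R_ref_cam.T @ R_ref_canon
    return [R @ A for R in R_cams]

# Example: a camera rotated 90 degrees about Z in the COLMAP frame is
# declared to be the canonical identity pose.
Rz90 = np.array([[0.0, -1.0, 0.0],
                 [1.0,  0.0, 0.0],
                 [0.0,  0.0, 1.0]])
aligned = align_to_canonical([Rz90, np.eye(3)], Rz90, np.eye(3))
```

The same idea extends to translations and scale if you need full similarity alignment rather than just orientations.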
I recommend starting from scratch, since the distribution is very different.
This depends heavily on the GPU and hyperparameters, but with proper tuning you should be able to get roughly 1 image/second.
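If you want to measure this yourself, a simple throughput harness along these lines works (just a sketch; on GPU, the callable should synchronize, e.g. via torch.cuda.synchronize(), so timings aren't skewed by asynchronous kernel launches):

```python
import time

def benchmark(run_inference, n_warmup=3, n_iters=20):
    """Rough images/second estimate for a single-image inference call.

    run_inference: any callable that processes one image end-to-end.
    """
    for _ in range(n_warmup):            # warm-up: caches, autotuning, JIT
        run_inference()
    start = time.perf_counter()
    for _ in range(n_iters):
        run_inference()
    elapsed = time.perf_counter() - start
    return n_iters / elapsed             # images per second

# Usage with a stand-in workload (replace with your model's forward pass)
rate = benchmark(lambda: time.sleep(0.001), n_warmup=1, n_iters=10)
```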
Hi Dario,
Just read your paper - super cool work!
Thank you for your support and see you at CVPR,
Kevin