jasonyzhang / ners

Code for "NeRS: Neural Reflectance Surfaces for Sparse-View 3D Reconstruction in the Wild," in NeurIPS 2021
https://jasonyzhang.com/ners
BSD 3-Clause "New" or "Revised" License

Evaluation set used in paper? #3

Closed wangjksjtu closed 2 years ago

wangjksjtu commented 2 years ago

@jasonyzhang Thanks for the awesome work and for releasing the code!! In the paper, 20 actors from the MVMC dataset are used for quantitative evaluation. Could you please share the actor IDs in the evaluation set to make comparisons easier?

Thanks for your great help!

wangjksjtu commented 2 years ago

@jasonyzhang Sorry for the follow-up message! Could you also share more details about the quantitative evaluation (e.g., which view is held out)? That would be super helpful for reproduction and comparison, thanks!

jasonyzhang commented 2 years ago

Hi, I'm working on releasing it ASAP! I will post the code, models, and splits to recreate the numbers, hopefully by the end of the week.

wangjksjtu commented 2 years ago

Thanks so much for your help! Really appreciate it ;)

jasonyzhang commented 2 years ago

Hi,

Sorry for the delay! I've now posted all the data for evaluation, which includes the off-the-shelf cameras (pre-processed to minimize the re-projection error between the template car mesh and the mask) and the optimized cameras (which have also been processed with some manual input).
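
For context, the pre-processing amounts to fitting each camera so that the rendered silhouette of the template car mesh lines up with the annotated mask. A minimal sketch of that idea (refine_camera and render_silhouette are illustrative placeholders, not functions from this repo; the actual pipeline renders with PyTorch3D):

```python
import torch

def refine_camera(camera_params, template_mesh, mask, render_silhouette, n_iters=200):
    """Adjust camera parameters to minimize mask re-projection error.

    `render_silhouette` is assumed to be a differentiable silhouette renderer
    that maps (mesh, camera parameters) to an (H, W) image with values in [0, 1].
    """
    params = camera_params.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([params], lr=1e-2)
    for _ in range(n_iters):
        optimizer.zero_grad()
        silhouette = render_silhouette(template_mesh, params)
        # Re-projection error between the rendered template silhouette and the mask.
        loss = torch.nn.functional.mse_loss(silhouette, mask)
        loss.backward()
        optimizer.step()
    return params.detach()
```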

The data also includes the rendered views from NeRS in the NVS evaluation protocol. I show how to replicate the numbers using the rendered views as well.

Please let me know if you encounter any issues!

wangjksjtu commented 2 years ago

Hi @jasonyzhang,

Thanks for the update!! Really appreciate it! I was able to reproduce the numbers using the provided evaluation protocol. However, I noticed that if we use clean-fid to compute the FID scores, the numbers are inconsistent with the paper:

Name            MSE   PSNR   SSIM  LPIPS   clean-FID
ners_fixed   0.0254   16.5  0.720  0.172   113.

I guess you are using pytorch-fid to compute the FID scores in the paper. Would you mind sharing the clean-FID scores for all the baseline models? Thanks a lot!
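
In case it helps, this is roughly how I computed the clean-FID number above (the directory names are just placeholders for the rendered and ground-truth views):

```python
# Compute FID with the clean-fid package (https://github.com/GaParmar/clean-fid).
# "renders/ners_fixed" and "gt_views" are placeholder directories of images.
from cleanfid import fid

score = fid.compute_fid("renders/ners_fixed", "gt_views")
print(f"clean-FID: {score:.1f}")
```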

wangjksjtu commented 2 years ago

Sorry, another question: from the eval code, it seems that the evaluation is done on all views (both the training views and the held-out view). Is that the correct setting? I thought we should only evaluate on the novel views.

wangjksjtu commented 2 years ago

Hi @jasonyzhang, another question: the results obtained by re-running train_evaluation_model.py are blurrier than the dumped results in data/evaluation. Here is one example:

re-trained model: render_00

dumped results: ners_00_fixed

Is it due to different hyperparameters? Thanks a lot for your great help in advance!

jasonyzhang commented 2 years ago

Hi,

Re: FID. I computed FID over all of the generated outputs (i.e., every image generated for every instance) rather than averaging the FID per instance, as is done for the other metrics. I've posted the code for this now, and here are the numbers I get:

Name            MSE   PSNR   SSIM  LPIPS    FID
ners_fixed   0.0254   16.5  0.720  0.172   60.4

The FID for ners_fixed in the paper was 60.9, so only slightly off.
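
To make the aggregation difference concrete, here is a rough sketch (the directory layout is illustrative, not the one in the released data):

```python
import os
import numpy as np
from cleanfid import fid

instances = sorted(os.listdir("renders"))  # e.g. renders/<instance_id>/*.png

# Per-instance aggregation, as used for MSE/PSNR/SSIM/LPIPS:
# compute the metric for each instance, then average over instances.
per_instance = [fid.compute_fid(f"renders/{i}", f"gt/{i}") for i in instances]
print("per-instance averaged FID:", np.mean(per_instance))

# Reported FID: computed once over all renders pooled across instances,
# against all real views pooled the same way.
print("pooled FID:", fid.compute_fid("renders_all", "gt_all"))
```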

Re: evaluation protocol. In the evaluation training code, each image/camera pair is independently treated as a target image/camera. For example, if an instance has 10 images, we train 10 models, each holding out a different target view. For each model, we render from its held-out view for evaluation. Thus, we end up evaluating on all of the input images, even though each is a held-out view for the model being evaluated.
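
In pseudocode (train_ners and render_view are placeholders for the steps in train_evaluation_model.py, not its actual function names):

```python
def evaluate_instance(images, cameras, train_ners, render_view):
    """Leave-one-out evaluation: train one model per view, holding that view out."""
    renders = []
    for held_out in range(len(images)):
        train_imgs = [im for i, im in enumerate(images) if i != held_out]
        train_cams = [cam for i, cam in enumerate(cameras) if i != held_out]
        model = train_ners(train_imgs, train_cams)
        # Render from the held-out camera and compare against images[held_out].
        renders.append(render_view(model, cameras[held_out]))
    return renders
```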

Re: blurry results. I was trying to train a smaller model to save time, but it looks like the performance is much worse. The evaluation code was training an 8-layer texnet for 1000 iterations, whereas the demo code trains a 12-layer texnet for 3000 iterations. I've switched back to the latter set of hyperparameters and am currently re-running the evaluation.

wangjksjtu commented 2 years ago

I see, thanks a lot for the detailed reply! Really appreciate it ;)

jasonyzhang commented 2 years ago

Ahh, actually the blurry results are because the number of Fourier bases in the default config is too low. The default is 6, but 10 seems to work much better. For comparison:

Rendering used for evaluation in the main paper: render_submission

8-layer texnet, 1k training iterations, L=6: render_8_layer

12-layer texnet, 3k training iterations, L=6: render_12_layer

12-layer texnet, 3k training iterations, L=10: render_12_layer_L10

I have updated the code so that evaluation defaults to L=10. This was already the default for the demo script.
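
For reference, L here is the number of frequency bands in a standard NeRF-style Fourier positional encoding, roughly as sketched below (the exact form used in NeRS may differ in details):

```python
import math
import torch

def fourier_encode(x, num_bases=10):
    """Map coordinates x of shape (..., D) to [sin(2^k * pi * x), cos(2^k * pi * x)] for k < num_bases."""
    freqs = (2.0 ** torch.arange(num_bases).float()) * math.pi   # (L,)
    angles = x[..., None] * freqs                                 # (..., D, L)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return enc.flatten(start_dim=-2)                              # (..., 2 * D * L)

# With L=6 the highest frequency is 2^5 * pi; L=10 adds higher-frequency bands,
# which lets the texture network represent sharper detail.
```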