krrish94 / nerf-pytorch

A PyTorch re-implementation of Neural Radiance Fields

Config is not fully applied #6

Closed kwea123 closed 3 years ago

kwea123 commented 4 years ago

For example here

https://github.com/krrish94/nerf-pytorch/blob/a14357da6cada433d28bf11a45c7bcaace76c06e/train_nerf.py#L117-L123

The hidden_size in the config is never applied, which means it is always set to the default value of 128, no matter what you put in the config file. I suspect there are still many places where the config is not fully applied, which means your implementation might differ a lot from the original repo.
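For illustration, a minimal sketch of the kind of change needed, i.e. reading hidden_size from the parsed config and passing it to the model constructor; the config attribute names and the hidden_size keyword used here are assumptions, not the repo's verified API:

```python
# Hypothetical sketch: thread hidden_size from the YAML config into the model,
# instead of relying on the constructor's default of 128.
import torch
from nerf import models  # assumes the repo's nerf.models module is on the path

def build_coarse_model(cfg):
    # cfg is assumed to be the parsed config object used by train_nerf.py;
    # the attribute names below are illustrative only.
    model_cls = getattr(models, cfg.models.coarse.type)  # e.g. "FlexibleNeRFModel"
    model = model_cls(hidden_size=cfg.models.coarse.hidden_size)
    return model.to("cuda" if torch.cuda.is_available() else "cpu")
```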

I was at first astonished by the speed you mention in the readme, but now it seems that your model is not the same as the original repo's: yours is smaller (hidden_size=128, in contrast to the original 256), so no wonder your speed is higher!

Could you please carefully check again, or mention in the readme that you use a smaller model? Otherwise it is misleading.

The code is thoroughly tested (to the best of my abilities) to match the original implementation (and be much faster)! In particular, I have ensured that:

- Every individual module exactly (numerically) matches that of the TensorFlow implementation. This Colab notebook has all the tests, matching op for op (but is very scratchy to look at)!
- Training works as expected (for Lego and LLFF scenes).
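For readers who want to reproduce such checks outside the notebook, a hedged sketch of a module-level comparison; the helper name, reference files, and tolerance are illustrative, not the repo's actual test code:

```python
import numpy as np
import torch

def check_against_tf(pytorch_module, input_path, reference_path, atol=1e-6):
    """Compare one PyTorch module's output against a dump from the TF implementation.

    `pytorch_module` stands in for any module under test (e.g. the positional
    encoding or a single MLP layer); the .npy files are hypothetical dumps
    produced by running the TensorFlow code on identical inputs.
    """
    x = torch.from_numpy(np.load(input_path))
    expected = np.load(reference_path)
    out = pytorch_module(x).detach().cpu().numpy()
    assert np.allclose(out, expected, atol=atol), "Output diverges from the TF reference."
```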

krrish94 commented 4 years ago

Hi @kwea123,

Great catch! What you say is true for the FlexibleNeRFModel (whose default hidden_size is 128). You could replace it with PaperNeRFModel, which initializes the NeRF model with a default hidden_size of 256.

On my tests with the PaperNeRFModel (see this), I get an average speed of 14.2 it/s (on an RTX2080Ti, PyTorch 1.4.0, CUDA 9.2).

In comparison, under the same setup, this (awesome!) PyTorch port runs at 5.7 it/s, and the TensorFlow release (in a different environment) runs at 3.44 it/s, both again with three fewer layers than my experiment.

As for not applying hidden_size in the config, I'm unsure when this changed. As you can probably see, I applied quite a few changes over the last 72 hours, which might've broken a few things along the way :)

But yes, many thanks for flagging this! I know benchmarking can be quite arbitrary and depends on several underlying factors (software, hardware, and more). I'll leave this issue open for that reason.

(Also, the test cases that compare the TF impl to mine function-by-function are all here :) )

kwea123 commented 4 years ago

Ok, I modified the config to use PaperNeRFModel, but currently there is a bug: https://github.com/krrish94/nerf-pytorch/blob/a14357da6cada433d28bf11a45c7bcaace76c06e/nerf/models.py#L164-L165

You forgot to add x = xyz between these lines, so the x in the following for loop is not the correct input. After adding that line I can run the code successfully, but in my setup (also an RTX 2080 Ti, PyTorch 1.4.0, CUDA 10.2) I only get 8.5 it/s (6.8 it/s for the other PyTorch impl and 5.64 it/s for the original repo).

Anyway, thanks for the contribution, and I hope you can fix these bugs here and there in the future.

krrish94 commented 4 years ago

@kwea123 I thought I fixed this, my bad :)

It should in fact be

x = self.relu(self.layers_xyz[0](xyz))
for i in range(1, 8):
    if i == 4:
        # Skip connection: re-concatenate the input encoding at layer 4.
        x = self.layers_xyz[i](torch.cat((xyz, x), -1))
    else:
        x = self.layers_xyz[i](x)

I'll merge this in over the next couple of days, but if it's urgent, feel free to open a PR.

krrish94 commented 4 years ago

Also, I'd be curious to help debug further. Did you run cache_dataset.py to precompute and cache rays?

kwea123 commented 4 years ago

I didn't run that. I just wanted to see the speed.

krrish94 commented 4 years ago

Caching, in my experiments, contributed to a speedup of about +4 it/s.

kwea123 commented 4 years ago

In my understanding, the caching is only useful for ray_bundle, is that correct? The other values are just scalars and don't need much computation, so I think caching them gives only a small speedup.

Moreover, do we even need to store them on disk? Why not just store them in RAM before training starts? For example, the Lego half-resolution data takes only 100 × 400 × 400 × 8 × 4 bytes ≈ 0.47 GB, which is feasible.
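As a back-of-the-envelope check of that estimate, a small sketch; the 8 float32 values per ray (origin, direction, near/far or similar) and the batch-sampling helper are assumptions for illustration, not the repo's actual cache layout:

```python
import torch

# 100 training views at 400x400, 8 float32 values per ray (assumed layout).
num_images, H, W, vals_per_ray = 100, 400, 400, 8
print(num_images * H * W * vals_per_ray * 4 / 1024**3)  # ~0.48 GiB

# Keeping the whole ray bundle in RAM and sampling batches on the fly:
all_rays = torch.zeros(num_images * H * W, vals_per_ray)  # filled once before training

def sample_batch(batch_size=1024):
    idx = torch.randint(0, all_rays.shape[0], (batch_size,))
    return all_rays[idx]
```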

krrish94 commented 4 years ago

I had the same presumption, so I refrained from implementing caching when I initially put this repo out :)

You can load it all into RAM, but random sampling at each iteration seems to take longer compared to "sampling beforehand and caching to disk". In my case, I sample about 500 random variants of 1024 samples each per view, and this becomes hard to store in RAM, more so for LLFF data.
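For contrast, a rough sketch of the "sample beforehand and cache to disk" approach described above; the directory layout and filenames are hypothetical and not necessarily what cache_dataset.py produces:

```python
import os
import torch

def cache_ray_batches(view_rays, cache_dir, variants_per_view=500, batch_size=1024):
    # view_rays: (num_rays, D) tensor of rays for a single view.
    # Pre-sample `variants_per_view` random batches and write each to its own file,
    # so training can simply load one file per iteration.
    os.makedirs(cache_dir, exist_ok=True)
    for v in range(variants_per_view):
        idx = torch.randint(0, view_rays.shape[0], (batch_size,))
        torch.save(view_rays[idx], os.path.join(cache_dir, f"batch_{v:06d}.pt"))
```

This trades disk space (and a one-off preprocessing pass) for less per-iteration sampling work, which is consistent with the +4 it/s observation above.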

kwea123 commented 4 years ago

Do you have a timing profile? For example, how much time do data loading, the model's forward pass, the backward pass, etc. take?
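For anyone reproducing such a profile, a minimal sketch of per-stage timing; sample_batch and forward_pass are placeholder names, not functions from this repo:

```python
import time
import torch

def timed(fn, *args, **kwargs):
    # Synchronize before and after so GPU work is actually included in the measurement.
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = fn(*args, **kwargs)
    torch.cuda.synchronize()
    return out, time.perf_counter() - start

# Hypothetical usage inside a training loop:
# batch, t_data = timed(sample_batch)
# loss, t_fwd = timed(forward_pass, batch)
# _, t_bwd = timed(loss.backward)
# print(f"data {t_data:.4f}s  forward {t_fwd:.4f}s  backward {t_bwd:.4f}s")
```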

krrish94 commented 4 years ago

I did, when I profiled. But I must admit I didn't religiously log most of it. However, I jotted down a few timing details in my journal, noting how much time particular modules took before vs. after optimization.

Before I began profiling, I did a few obvious optimization steps (like ensuring all tensors are on CUDA, vectorizing ops, etc.). After that, one iteration took 0.03 seconds.

(The notion of an iteration is the same as in the original NeRF code release, i.e., everything up to the point where the learning rate is updated for the next iteration. Also, note that the time reported by the tqdm loader includes logging time to TensorBoard, etc.)

It was very hard to decipher running times from most of my notes (sorry!), but I've been able to recover a few.

Time to run one iteration (before opt): 0.06 seconds

Fetch one batch of data (randomly sample, no caching): 0.010634 sec
Fetch one batch of data (randomly sample, load from cache): 0.00413 sec

Positional encoding (with a for loop): 0.00157 sec
Positional encoding (with list comprehension): 0.00150 sec (no improvement)

Run network (with a for loop): 0.002 sec
Run network (with list comprehension): 0.002 sec (no improvement)

Predict volume density and render (initial): 0.05 sec
Predict volume density and render (optimized impl): 0.035 sec

Ensure tensors don't move across CPU and GPU:
sample_pdf (before optimizing): 0.09 sec
sample_pdf (after optimizing): 0.0002 sec!

run_one_iter (after optimizing): 0.036 sec

One big caveat: I haven't been able to recall the exact config params that I used.

Posting here, so it helps people by hinting where further optimization is possible :)
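The large sample_pdf gain above came from keeping every tensor on one device; a generic sketch of the kind of change involved (this is not the repo's actual sample_pdf, just an illustration of avoiding implicit CPU-to-GPU copies):

```python
import torch

def make_sample_grid(weights, num_samples):
    # Slow pattern: creating the grid on CPU and moving it forces a host-to-device
    # copy (and a sync) on every call when `weights` lives on the GPU:
    #   u = torch.linspace(0.0, 1.0, num_samples).to(weights.device)

    # Faster pattern: allocate directly on the same device and dtype so the whole
    # sampling path stays on the GPU.
    u = torch.linspace(0.0, 1.0, num_samples, dtype=weights.dtype, device=weights.device)
    return u.expand(*weights.shape[:-1], num_samples)
```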

yenchenlin commented 4 years ago

@krrish94 num_fine: 64 is also wrong; it should be num_fine: 128. I get very similar benchmark results to @kwea123 on my machine. There might be other differences in hyperparameters that cause the runtime difference, but I don't have time to check.

To compare runtime fairly, what I did was make sure the forward/backward results in the training loop exactly match those of the TF implementation for a certain number of iterations. I think that's probably the only way to prevent hyperparameter mismatches.
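A rough sketch of that kind of check, assuming per-iteration losses have been dumped from both implementations with fixed seeds and identical ray batches; the filenames are placeholders:

```python
import numpy as np

tf_losses = np.load("tf_losses.npy")        # hypothetical dump from the TF code
pt_losses = np.load("pytorch_losses.npy")   # hypothetical dump from this repo

# If the models and hyperparameters truly match, the per-iteration losses should
# agree to within floating-point noise for the first few hundred iterations.
assert np.allclose(tf_losses, pt_losses, rtol=1e-4, atol=1e-6), \
    "Training trajectories diverge -- check hyperparameters and model definitions."
```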

Thanks for the work though!

kwea123 commented 3 years ago

Bug not fixed, speed is questionable. Closing.