Closed dogyoonlee closed 1 year ago
I have the same question, waiting for reply.
I tried running it on TPUs and it worked stably with a batch size of 16384, but it was too slow: MXU utilization only reached about 20%. What should I modify to make the TPUs work more efficiently?
Yes, if you reduce the batch size you should increase the number of iterations and decrease the learning rate. This is usually referred to as the "linear scaling rule", there's a lot of information available about this online.
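A minimal sketch of the linear scaling rule, assuming the defaults mentioned elsewhere in this thread (batch size 16384, 500,000 iterations, lr_init = 1e-3); the helper name is mine, not from the repo:

```python
def scale_hyperparams(base_lr, base_steps, base_batch, new_batch):
    """Scale learning rate and step count when changing batch size.

    Under the linear scaling rule, scaling the batch size by a factor k
    scales the learning rate by k and the number of steps by 1/k, so the
    total number of rays seen during training stays roughly constant.
    """
    k = new_batch / base_batch
    return base_lr * k, int(base_steps / k)

# Example: dropping from 16384 to 2048 (k = 1/8) gives
# lr = 1.25e-4 and 4,000,000 steps.
lr, steps = scale_hyperparams(base_lr=1e-3, base_steps=500_000,
                              base_batch=16384, new_batch=2048)
```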
@jonbarron Great! Thanks a lot :).
In addition, I have one more question.
I trained raw-nerf using train_raw.sh on the morningkitchen scene from the raw-nerf dataset.
However, PSNR increased to around 52 during training, yet evaluation fails with a PSNR of around 13.
The only parameters I modified were batch_size and render_chunk_size in the llff_raw.gin file, plus the scene name (morningkitchen) in train_raw.sh.
Is there any problem with how I'm training raw-nerf?
You changed the batch size but not the learning rate and the number of iterations? If so, you should change the learning rate and number of iterations as per the linear scaling rule.
I forgot to mention the detailed parameters I used in training.
I set the training hyperparameters as follows, since I use 2 GPUs (RTX 3090).
As I understand the linear scaling rule, what matters is how many samples are computed per GPU.
Hence I set the learning rate to 1/8 of the original value, since raw-nerf uses 16 TPUs in the original setting, as far as I know.
In addition, I set the learning-rate delay steps and the maximum number of iterations to 8 times the original values so the model is fully optimized.
But it doesn't work: evaluation performance is still poor (around 15 PSNR) despite high training performance (around 52 PSNR).
I stopped training at 740,000 steps since it was still performing poorly, as follows.
Modified hyperparameters in llff_raw.gin:
Config.batch_size = 2048
Config.render_chunk_size = 2048
Config.lr_init = 0.000125
Config.lr_final = 0.00000125
Config.max_steps = 4000000
Config.checkpoint_every = 25000
Config.lr_delay_steps = 20000
Config.lr_delay_mult = 0.01
Config.grad_max_norm = 0.1
Config.grad_max_val = 0.1
Config.adam_eps = 1e-8
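For reference, as far as I can tell these parameters feed a schedule that decays log-linearly from lr_init to lr_final over max_steps, scaled by a sinusoidal warmup controlled by lr_delay_steps and lr_delay_mult. This is a sketch from my reading of the code, not the exact implementation; check the repo's training utilities for the real one:

```python
import numpy as np

def learning_rate(step, lr_init=1.25e-4, lr_final=1.25e-6,
                  max_steps=4_000_000, lr_delay_steps=20_000,
                  lr_delay_mult=0.01):
    # Log-linear interpolation from lr_init to lr_final over max_steps.
    t = np.clip(step / max_steps, 0.0, 1.0)
    log_lerp = np.exp((1 - t) * np.log(lr_init) + t * np.log(lr_final))
    # Warmup multiplier: ramps from lr_delay_mult at step 0 to 1.0
    # once step reaches lr_delay_steps.
    delay = lr_delay_mult + (1 - lr_delay_mult) * np.sin(
        0.5 * np.pi * np.clip(step / lr_delay_steps, 0.0, 1.0))
    return delay * log_lerp
```

With the values above, the learning rate starts at lr_delay_mult * lr_init = 1.25e-6, reaches roughly lr_init by step 20,000, and decays to lr_final at step 4,000,000.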
Is there anything wrong here?
Now I'm training with Config.lr_init = 0.0000625 and Config.lr_final = 0.000000625.
Thank you for your help!
I don't think the number of GPUs is relevant here.
@jonbarron
I'm sorry to bother you again.
I tried many values for Config.batch_size, Config.render_chunk_size, Config.lr_init, Config.lr_final, Config.max_steps, and Config.lr_delay_steps on 2 GPUs (RTX 3090), but none of them worked.
In particular, once training reaches a certain step (which varies with the hyperparameter setting), the training PSNR drops drastically.
When I set lr_init and lr_final to 1.5625e-5 and 1.5625e-7, which are really small compared to the original settings, the training PSNR increases to around 17, but it falls again after iteration 6400 (with lr_delay_steps = 160000).
I suspect there is also a problem with the warmup iteration setting (lr_delay_steps), independent of the learning rate.
I will try other hyperparameters according to the linear scaling rule, as you suggested.
Again, thank you for your help and awesome work!!
I'm using two RTX 3090s to run multinerf (specifically raw-nerf).
Due to an OOM issue, I set batch_size and render_chunk_size to 2048 each. It runs safely, but I wonder what extra modifications are needed to reproduce the results of the paper.
For example, if I use a batch_size and render_chunk_size of 1/N of 16384 (the original setting of this repo), should the number of training iterations be N*500000?
In addition, is it necessary to modify the learning rate, decay steps, and other parameters?
Thank you for your awesome work!