tobias-kirschstein / nersemble

[Siggraph '23] NeRSemble: Neural Radiance Field Reconstruction of Human Heads
https://tobias-kirschstein.github.io/nersemble/

Training Requirement #5

Closed by LeeHanmin 11 months ago

LeeHanmin commented 1 year ago

Hello, this is really fascinating work. I have a question: I understand that training requires at least an A6000 with 48 GB, but I currently only have a few 24 GB RTX 3090 cards. Can I use distributed training to work around this? If so, could you give me some guidance?

tobias-kirschstein commented 1 year ago

Hi Lee,

I did some experiments to see what model configurations you can fit on an RTX 3090.
The most promising configuration I could find uses only 8 instead of 32 hash encodings and slightly restricts the number of samples that are processed simultaneously:

--n_hash_encodings 8 --latent_dim_time 8 --max_n_samples_per_batch 19

This should still give you reasonable quality, but it will be noticeably worse than the full model when the observed movements are very complex. In the paper, we already experimented with using 16 hash encodings, which only marginally impaired the results; going further down to 8 will have a similar effect. The extreme case would be to use only a single hash encoding, which is equivalent to the NGP + Def. ablation in Table 3 of the paper. Quality suffers there, but it was still on par with DyNeRF in our experiments. So, playing around with the number of hash encodings is a good way to address GPU memory concerns while still getting reasonable results.

So far, I haven't tried running the full model in a distributed manner. But the first thing I would try is to distribute the hash encodings across GPUs. The starting point would be the hash ensemble implementation: https://github.com/tobias-kirschstein/nersemble/blob/2424f47022f21dd064a1779f84fec2a0e7cce701/src/nersemble/nerfstudio/field_components/hash_ensemble.py#L102 where we loop over the hash encodings and collect the spatial features. I guess it shouldn't be too hard to have the hash grids reside on separate GPUs and communicate the 3D positions as well as the queried spatial features with a dedicated main GPU, roughly as in the sketch below.
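
To make the idea concrete, here is a rough, untested sketch. It does not use the actual HashEnsemble interface from hash_ensemble.py; the class name, argument names, and the per-encoding feature stacking are assumptions, so treat it as an illustration of the data flow rather than a drop-in patch:

```python
import torch
import torch.nn as nn


class DistributedHashEnsemble(nn.Module):
    """Illustrative only: spreads the hash encodings of an ensemble across GPUs."""

    def __init__(self, hash_encodings, main_device: str = "cuda:0"):
        super().__init__()
        self.main_device = torch.device(main_device)
        n_gpus = torch.cuda.device_count()
        # Round-robin assignment of hash grids to the available GPUs.
        self.devices = [torch.device(f"cuda:{i % n_gpus}") for i in range(len(hash_encodings))]
        self.hash_encodings = nn.ModuleList(
            [enc.to(dev) for enc, dev in zip(hash_encodings, self.devices)]
        )

    def forward(self, positions: torch.Tensor) -> torch.Tensor:
        # positions: [N, 3] ray sample positions living on the main GPU.
        features = []
        for enc, dev in zip(self.hash_encodings, self.devices):
            # Send the 3D positions to the GPU holding this hash grid, query it,
            # and move the resulting spatial features back to the main GPU.
            feats = enc(positions.to(dev, non_blocking=True))
            features.append(feats.to(self.main_device, non_blocking=True))
        # Stack per-encoding features into [N, n_hash_encodings, feature_dim],
        # analogous to what the sequential loop in hash_ensemble.py collects.
        return torch.stack(features, dim=1)
```

Note that this naive loop still transfers positions and features synchronously, one GPU at a time, so it mainly spreads the memory across cards; for an actual speedup you would want overlapping transfers (CUDA streams) or torch.distributed, and the blending of the per-encoding features would still happen on the main GPU.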

Hope this helps

LeeHanmin commented 1 year ago

Thanks a lot!

LeeHanmin commented 1 year ago

Hi, I have successfully trained NeRSemble and it is awesome. Can I get the video taken by each camera with one of the IDs?

tobias-kirschstein commented 1 year ago

Glad you like it! I'm not exactly sure what you mean by "get the video taken by each camera with one of the IDs". I assume you are talking about rendering the trained model from each camera? You can get the predictions from the evaluation cameras by running the evaluation script (see section 3.2 in the README). Use the flags --skip_timesteps 3 --max_eval_timesteps -1 to tell the evaluation script that you want to render every 3rd timestep (= 24.3 fps). The rendered images will be put into a subfolder of ${NERSEMBLE_MODELS_PATH}/NERS-XXX-${name}/evaluation. From there, it should be straightforward to pack the rendered images into a video, e.g., along the lines of the snippet below.
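
For example, something like the following could stitch the rendered frames of one camera into an mp4. This is just a sketch: the exact subfolder and file naming under evaluation are assumptions (adapt the path and glob to what the evaluation script actually writes), and it needs imageio with the ffmpeg backend (pip install imageio imageio-ffmpeg):

```python
from pathlib import Path

import imageio

# Hypothetical path: adjust to the actual subfolder created by the evaluation script.
frames_dir = Path("NERS-XXX-my_run/evaluation/cam_222200037")
frame_paths = sorted(frames_dir.glob("*.png"))

frames = [imageio.imread(p) for p in frame_paths]
# 24 fps roughly matches the every-3rd-timestep rendering (~24.3 fps) mentioned above.
imageio.mimwrite(frames_dir / "rendering.mp4", frames, fps=24)
```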

LeeHanmin commented 1 year ago

Sorry, I wasn't clear enough. What I mean is that I want the videos from the 16 monocular cameras for a certain ID, such as 124, from the first frame to the last frame. Could you please provide them to me?

tobias-kirschstein commented 1 year ago

Sorry, I still don't quite understand your request. What exactly do you need? Do you need the 16 videos of a person from the dataset to train NeRSemble? In that case, section 2 of the README describes how to get them. But since you wrote "I have successfully trained NeRSemble" above, I assumed you just want to render a trained model from the 12 training and 4 evaluation viewpoints; my previous comment describes how to get those renderings. I'm not sure what other "video of 16 monocular cameras" you are referring to. Do you maybe mean the circular renderings as in the teaser image in the README?