skhu101 / SHERF

Code for our ICCV'2023 paper "SHERF: Generalizable Human NeRF from a Single Image"

Error in training #15

Closed. yejr0229 closed this issue 10 months ago.

yejr0229 commented 1 year ago

Thank you for your great work. I followed your install instructions and used one RTX 3090 for training, but I encounter this error:

```
In file included from /home/yejr/.cache/torch_extensions/py38_cu113/bias_act_plugin/b46266ff65f9fa53c32108953a1c6f16-nvidia-geforce-rtx-3090/bias_act.cpp:14:0:
/home/user/miniconda3/envs/sherf/lib/python3.8/site-packages/torch/include/ATen/cuda/CUDAContext.h:5:10: fatal error: cuda_runtime_api.h: No such file or directory
 #include <cuda_runtime_api.h>
          ^~~~~~~~~~~~~~~~~~~~
compilation terminated.
ninja: build stopped: subcommand failed.
```

Can you help me deal with it?

yejr0229 commented 1 year ago

And when I switch to four 1080 Ti GPUs for training, a similar error happens:

```
ImportError: /home/yejr/.cache/torch_extensions/py38_cu113/bias_act_plugin/b46266ff65f9fa53c32108953a1c6f16-nvidia-geforce-gtx-1080-ti/bias_act_plugin.so: cannot open shared object file: No such file or directory
```

skhu101 commented 1 year ago

Hi, thanks for reporting the issue. I think it is due to the compilation of the custom CUDA ops from the StyleGAN series. I find the code works with CUDA 11.3 and V100 GPUs. I found a related issue in the stylegan3 repo (https://github.com/NVlabs/stylegan3/issues/165). Hope this helps you solve the problem.
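
As a quick sanity check (a minimal diagnostic sketch using standard PyTorch APIs, not code from this repo), you can verify that PyTorch can locate a full CUDA toolkit; a missing cuda_runtime_api.h usually means only the CUDA runtime, not the headers, is visible to the JIT build:

```python
# Diagnostic sketch: check that PyTorch can see a full CUDA toolkit.
# CUDA_HOME should point to an installation that contains
# include/cuda_runtime_api.h; if it is None or lacks the headers, the JIT
# build of bias_act_plugin cannot find them and fails as above.
import os
import torch
from torch.utils.cpp_extension import CUDA_HOME

print("torch:", torch.__version__, "built for CUDA:", torch.version.cuda)
print("CUDA_HOME:", CUDA_HOME)
if CUDA_HOME is not None:
    header = os.path.join(CUDA_HOME, "include", "cuda_runtime_api.h")
    print("cuda_runtime_api.h present:", os.path.exists(header))
```

If CUDA_HOME is None or the header is missing, installing the full CUDA 11.3 toolkit (or pointing CUDA_HOME at an existing install) before the first run usually lets the bias_act_plugin build succeed.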

yejr0229 commented 1 year ago

Thank you so much, but the issue in stylegan3 did not help me; the ninja build still fails...

skhu101 commented 1 year ago

Hi, I am curious which CUDA version you use in your experiments. If you follow the same setting as ours, i.e., Python 3.8 + CUDA 11.3 + PyTorch 1.11.0, do you still get the same error as above?

yejr0229 commented 1 year ago

```
>>> import torch
>>> print(torch.__version__)
1.11.0
>>> print(torch.version.cuda)
11.3
>>> exit()
(sherf) python -V
Python 3.8.18
```

I follow the same setting as yours.

skhu101 commented 1 year ago

Hi, it is quite strange. I reinstalled the conda environment on our cluster and found that it works well. By the way, we use V100 GPUs and gcc 5.4.0. I suspect it is due to the different GPU model, but I am not 100% sure. Could you retry installing the stylegan3 or EG3D environment and see whether you can run experiments successfully?

yejr0229 commented 1 year ago

The above problem has been solved by removing the lock file from torch_extensions, thank you so much. But now another problem comes up during training; everything is fine and the network trains well when I debug. Here is the detailed error:

```
Traceback (most recent call last):
  File "train.py", line 449, in <module>
    main()  # pylint: disable=no-value-for-parameter
  File "/home/yejr/miniconda3/envs/sherf/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/yejr/miniconda3/envs/sherf/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/yejr/miniconda3/envs/sherf/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/yejr/miniconda3/envs/sherf/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "train.py", line 444, in main
    launch_training(c=c, desc=desc, outdir=opts.outdir, dry_run=opts.dry_run)
  File "train.py", line 101, in launch_training
    subprocess_fn(rank=0, c=c, temp_dir=temp_dir)
  File "train.py", line 52, in subprocess_fn
    training_loop.training_loop(rank=rank, **c)
  File "/home/yejr/Digital_Avater/SHERF-main/sherf/training/training_loop.py", line 365, in training_loop
    loss_Gmain_train, img_loss_raw_train, acc_loss_raw_train, ssim_raw_train, lpips_raw_train, loss_Gmain_Dgen = loss.accumulate_gradients(phase=phase.name, input_data=input_data, real_img=real_img, real_c=real_c, gen_z=gen_z, gen_c=gen_c, gain=phase.interval, cur_nimg=cur_nimg, use_sr_module=use_sr_module, recons_loss=recons_loss, rank=rank)
  File "/home/yejr/Digital_Avater/SHERF-main/sherf/training/loss.py", line 149, in accumulate_gradients
    gen_img, _gen_ws = self.run_G(input_data, gen_z, gen_c, swapping_prob=swapping_prob, neural_rendering_resolution=neural_rendering_resolution, use_sr_module=use_sr_module)
  File "/home/yejr/Digital_Avater/SHERF-main/sherf/training/loss.py", line 82, in run_G
    gen_output = self.G.synthesis(ws, input_data, c, neural_rendering_resolution=neural_rendering_resolution,
  File "/home/yejr/Digital_Avater/SHERF-main/sherf/training/triplane.py", line 168, in synthesis
    sr_image = self.superresolution(rgb_image, feature_image, ws,
  File "/home/yejr/miniconda3/envs/sherf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yejr/Digital_Avater/SHERF-main/sherf/training/superresolution.py", line 288, in forward
    x, rgb = self.block0(x, rgb, ws, **block_kwargs)
  File "/home/yejr/miniconda3/envs/sherf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yejr/Digital_Avater/SHERF-main/sherf/training/networks_stylegan2.py", line 439, in forward
    misc.assert_shape(x, [None, self.in_channels, self.resolution // 2, self.resolution // 2])
  File "/home/yejr/Digital_Avater/SHERF-main/sherf/torch_utils/misc.py", line 97, in assert_shape
    raise AssertionError(f'Wrong size for dimension {idx}: got {size}, expected {ref_size}')
AssertionError: Wrong size for dimension 1: got 3, expected 32
```

The error comes out when x.shape = torch.Size([1, 3, 128, 128]).

skhu101 commented 1 year ago

Hi, we disable the superresolution module in our training script by using the argument --use_sr_module False.

yejr0229 commented 1 year ago

The problem has been solved, thank you so much. I also find that GPU memory fluctuates between about 7000 MB and 23000 MB. It's the first time I have seen such drastic fluctuations in GPU memory during training; it should generally stay stable at a certain value. Is this normal in SHERF training? The total memory of a 3090 is about 24000 MB, so I am afraid this will cause an OOM problem during training. Should I decrease the batch size or workers? Right now I'm using two 3090s with batch=2 and workers=2.
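
(For reference, a minimal sketch using standard PyTorch APIs, not part of SHERF: nvidia-smi also counts memory held by PyTorch's caching allocator, so logging the allocator's own counters gives a clearer picture of the true peak.)

```python
# Sketch: log the peak GPU memory seen by PyTorch's caching allocator.
import torch

def log_gpu_memory(tag=""):
    allocated = torch.cuda.memory_allocated() / 2**20    # MiB currently held by tensors
    peak = torch.cuda.max_memory_allocated() / 2**20     # MiB peak since the last reset
    print(f"[{tag}] allocated={allocated:.0f} MiB  peak={peak:.0f} MiB")

# Call log_gpu_memory("iter") inside the training loop, and
# torch.cuda.reset_peak_memory_stats() at the start of each logging interval.
```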

yejr0229 commented 12 months ago

But when I train the model on renderpeople dataset,I encountered CUDA out of memory. I use one 3090,and set batch=1,workers=0

skhu101 commented 11 months ago

Hi, thanks for reporting the issue. We did not observe the CUDA out of memory issue on V100. Possible solutions are to decrease the number of samples per ray or to further decrease the chunk size at line 355 of sherf/training/volumetric_rendering/renderer.py.
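
To illustrate the idea (a minimal sketch, not the repo's renderer; `render_rays`, the tensor shapes, and the default chunk value are placeholder assumptions): evaluating rays in smaller chunks bounds the peak activation memory at the cost of more iterations.

```python
# Sketch: evaluate the volume renderer on ray chunks to bound peak GPU memory.
# `render_rays` is a placeholder for whatever function evaluates a batch of
# rays; shrinking `chunk` lowers the memory peak but makes rendering slower.
import torch

def render_in_chunks(render_rays, rays, chunk=65536):
    outputs = []
    for i in range(0, rays.shape[0], chunk):
        outputs.append(render_rays(rays[i:i + chunk]))
    return torch.cat(outputs, dim=0)
```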

yejr0229 commented 11 months ago

Thanks for replying. When I validate the model on ZJU-MoCap, I find that the images in CoreView_313 and CoreView_315 cannot be read. The image names are like CoreView_313Camera(1)_1441_2019-08-23_16-09-37.378.jpg, but the network cannot read them and raises: FileNotFoundError: No such file: '/media/data4/zju_mocap/CoreView_313/Camera (1)/0001.jpg'

skhu101 commented 11 months ago

Hi, I have checked the zju_mocap dataset and found that this file does exist in the directory. Could you check further? If you have further questions, please let me know.
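
(For completeness: the prepared data should already contain indexed frames such as 0001.jpg. If a copy of CoreView_313/CoreView_315 instead holds only the raw timestamped captures, one possible workaround, purely as a hedged sketch and not part of the released pipeline, is to derive indexed copies from the sorted filenames; the naming convention assumed below should be verified against the dataloader first.)

```python
# Workaround sketch (assumption, not the repo's preprocessing): create
# zero-padded indexed copies of raw timestamped ZJU-MoCap frames. Assumes the
# sorted timestamped names give the correct temporal order and that the loader
# expects names like 0001.jpg; check whether indices start at 0 or 1 first.
import os
import shutil

def index_frames(camera_dir, start=1):
    frames = sorted(f for f in os.listdir(camera_dir) if f.endswith(".jpg"))
    for i, name in enumerate(frames, start=start):
        src = os.path.join(camera_dir, name)
        dst = os.path.join(camera_dir, f"{i:04d}.jpg")
        if not os.path.exists(dst):
            shutil.copy2(src, dst)  # copy rather than rename, keeping the originals
```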

greatbaozi001 commented 11 months ago

> But when I train the model on the RenderPeople dataset, I encounter CUDA out of memory. I use one 3090 and set batch=1, workers=0.

I have the same problem, and I have tried decreasing the chunk size even to 50000, but it does not seem to help.

skhu101 commented 11 months ago

Hi, thanks for letting me know. One possible reason is that the V100 has larger GPU memory than the 3090. In our implementation, we render the whole image. One possible solution is to calculate the loss on patches, as sketched below. I will provide a training recipe with patch-level loss soon. Or you can implement it yourself by modifying the sampling function.
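
For illustration only (a minimal sketch, not the promised recipe; `patch_size`, the resolution arguments, and the indexing scheme are assumptions): sampling only the rays that fall inside a random image patch keeps both the renderer and the loss at patch_size**2 rays per image, which is what actually bounds the memory.

```python
# Sketch: sample only the rays of a random patch instead of all H*W rays, so
# the renderer and the reconstruction loss see patch_size**2 rays per image.
# `patch_size` and the (h, w) resolution are placeholders.
import torch

def sample_patch_indices(h, w, patch_size=64, device="cpu"):
    top = torch.randint(0, h - patch_size + 1, (1,), device=device).item()
    left = torch.randint(0, w - patch_size + 1, (1,), device=device).item()
    ys = torch.arange(top, top + patch_size, device=device)
    xs = torch.arange(left, left + patch_size, device=device)
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")
    return (yy * w + xx).reshape(-1)  # flat pixel indices of the sampled patch
```

The returned flat indices can be used to select the corresponding rays and ground-truth pixels before volume rendering, so the crop is applied at sampling time rather than after rendering the full image.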

TonNew5418 commented 9 months ago

@yejr0229 Hi, I met the same problem as you: fatal error: cuda_runtime_api.h: No such file or directory. Could you please tell me how to solve it?