Closed yejr0229 closed 10 months ago
I then switched to four 1080 Ti GPUs for training, and a similar error occurs:
ImportError: /home/yejr/.cache/torch_extensions/py38_cu113/bias_act_plugin/b46266ff65f9fa53c32108953a1c6f16-nvidia-geforce-gtx-1080-ti/bias_act_plugin.so: cannot open shared object file: No such file or directory
Hi, thanks for reporting the issue. I think it is caused by the compilation of the custom CUDA operations used in the StyleGAN series. The code works for me with CUDA 11.3 and a V100 GPU. I found a related issue in the stylegan3 repo (https://github.com/NVlabs/stylegan3/issues/165). I hope it helps you solve the problem.
Thank you so much, but the issue in the stylegan3 repo did not help me; the ninja build still fails...
Hi, which CUDA version do you use in your experiments? If you follow the same setting as ours, i.e., Python 3.8 + CUDA 11.3 + PyTorch 1.11.0, do you still get the error above?
>>> import torch
>>> print(torch.__version__)
1.11.0
>>> print(torch.version.cuda)
11.3
>>> exit()
(sherf) python -V
Python 3.8.18
I follow the same setting as yours.
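A version check like the one above can also be scripted, so the same report can be pasted into an issue. The following is a minimal sketch, not part of the SHERF repository; the helper name `collect_versions` is hypothetical, and it assumes only that `torch.__version__` and `torch.version.cuda` are available when PyTorch is installed.

```python
import importlib
import platform


def collect_versions():
    """Return Python, PyTorch, and CUDA versions; None where unavailable."""
    info = {"python": platform.python_version(), "torch": None, "cuda": None}
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # PyTorch is not installed in this environment.
        return info
    info["torch"] = torch.__version__
    # torch.version.cuda is None for CPU-only builds.
    info["cuda"] = torch.version.cuda
    return info


if __name__ == "__main__":
    for name, value in collect_versions().items():
        print(f"{name}: {value}")
```

The values to compare against are the ones named in this thread: Python 3.8, CUDA 11.3, and PyTorch 1.11.0.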
Hi, it is quite strange. I reinstalled the conda environment on our cluster and found that it works well. By the way, we use V100 GPUs and gcc-5.4.0. I suspect the difference is due to the GPU model, but I am not 100% sure. Could you try reinstalling the stylegan3 or EG3D environment and see whether you can run experiments successfully?
The above problem has been solved by removing the lock file from torch_extensions, thank you so much. But now another problem comes up during training, even though everything is fine and the network trains well when I run it in the debugger. Here is the detailed error:
Traceback (most recent call last):
  File "train.py", line 449, in <module>
    main() # pylint: disable=no-value-for-parameter
  File "/home/yejr/miniconda3/envs/sherf/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/yejr/miniconda3/envs/sherf/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/yejr/miniconda3/envs/sherf/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/yejr/miniconda3/envs/sherf/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "train.py", line 444, in main
    launch_training(c=c, desc=desc, outdir=opts.outdir, dry_run=opts.dry_run)
  File "train.py", line 101, in launch_training
    subprocess_fn(rank=0, c=c, temp_dir=temp_dir)
  File "train.py", line 52, in subprocess_fn
    training_loop.training_loop(rank=rank, **c)
  File "/home/yejr/Digital_Avater/SHERF-main/sherf/training/training_loop.py", line 365, in training_loop
    loss_Gmain_train, img_loss_raw_train, acc_loss_raw_train, ssim_raw_train, lpips_raw_train, loss_Gmain_Dgen = loss.accumulate_gradients(phase=phase.name, input_data=input_data, real_img=real_img, real_c=real_c, gen_z=gen_z, gen_c=gen_c, gain=phase.interval, cur_nimg=cur_nimg, use_sr_module=use_sr_module, recons_loss=recons_loss, rank=rank)
  File "/home/yejr/Digital_Avater/SHERF-main/sherf/training/loss.py", line 149, in accumulate_gradients
    gen_img, _gen_ws = self.run_G(input_data, gen_z, gen_c, swapping_prob=swapping_prob, neural_rendering_resolution=neural_rendering_resolution, use_sr_module=use_sr_module)
  File "/home/yejr/Digital_Avater/SHERF-main/sherf/training/loss.py", line 82, in run_G
    gen_output = self.G.synthesis(ws, input_data, c, neural_rendering_resolution=neural_rendering_resolution,
  File "/home/yejr/Digital_Avater/SHERF-main/sherf/training/triplane.py", line 168, in synthesis
    sr_image = self.superresolution(rgb_image, feature_image, ws,
  File "/home/yejr/miniconda3/envs/sherf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yejr/Digital_Avater/SHERF-main/sherf/training/superresolution.py", line 288, in forward
    x, rgb = self.block0(x, rgb, ws, **block_kwargs)
  File "/home/yejr/miniconda3/envs/sherf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yejr/Digital_Avater/SHERF-main/sherf/training/networks_stylegan2.py", line 439, in forward
    misc.assert_shape(x, [None, self.in_channels, self.resolution // 2, self.resolution // 2])
  File "/home/yejr/Digital_Avater/SHERF-main/sherf/torch_utils/misc.py", line 97, in assert_shape
    raise AssertionError(f'Wrong size for dimension {idx}: got {size}, expected {ref_size}')
AssertionError: Wrong size for dimension 1: got 3, expected 32
The error occurs when x.shape = torch.Size([1, 3, 128, 128]).
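For readers who hit the earlier "cannot open shared object file" error: the fix reported above was to remove the lock file from the torch_extensions cache. A minimal sketch of that cleanup follows. It assumes the lock is a regular file literally named `lock` inside a build directory, and that the cache lives in `TORCH_EXTENSIONS_DIR` or, as in the paths in this thread, `~/.cache/torch_extensions`. The helper names are hypothetical. Run it only when no build is in progress, since a live build also holds such a lock.

```python
import os
from pathlib import Path


def default_extensions_root():
    """Cache root: TORCH_EXTENSIONS_DIR if set, else ~/.cache/torch_extensions."""
    fallback = Path.home() / ".cache" / "torch_extensions"
    return Path(os.environ.get("TORCH_EXTENSIONS_DIR", fallback))


def remove_stale_locks(root, dry_run=True):
    """Find files named 'lock' under root; delete them unless dry_run is set.

    Returns the list of lock files found (and removed when dry_run is False).
    """
    root = Path(root)
    found = []
    if not root.is_dir():
        return found
    for lock in sorted(root.rglob("lock")):
        if lock.is_file():
            found.append(lock)
            if not dry_run:
                lock.unlink()
    return found


if __name__ == "__main__":
    # Dry run first: only report what would be removed.
    for path in remove_stale_locks(default_extensions_root(), dry_run=True):
        print("stale lock candidate:", path)
```

Defaulting to a dry run is deliberate: it lists candidates first, so nothing is deleted until `dry_run=False` is passed explicitly.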
Hi, we disable the superresolution module in our training script by using the argument --use_sr_module False.
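To make the assertion in the traceback easier to read, here is a minimal sketch of a shape check of the same kind, working on plain tuples instead of tensors. It is an illustrative reimplementation, not the code in torch_utils/misc.py; only the error message format is taken from the traceback, and the reference shape below leaves the spatial sizes unchecked because the expected resolution is not stated in this thread.

```python
def assert_shape(shape, ref_shape):
    """Compare a shape against a reference; None in ref_shape means 'any size'."""
    if len(shape) != len(ref_shape):
        raise AssertionError(
            f"Wrong number of dimensions: got {len(shape)}, expected {len(ref_shape)}"
        )
    for idx, (size, ref_size) in enumerate(zip(shape, ref_shape)):
        if ref_size is None:
            continue  # this dimension is unconstrained
        if size != ref_size:
            raise AssertionError(
                f"Wrong size for dimension {idx}: got {size}, expected {ref_size}"
            )


if __name__ == "__main__":
    # A 3-channel image passed where 32 feature channels are expected.
    try:
        assert_shape((1, 3, 128, 128), [None, 32, None, None])
    except AssertionError as err:
        print(err)
```

Dimension 1 is the channel axis, so the message says that a 3-channel tensor reached a block that expects 32 input channels.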
The problem has been solved, thank you so much. I also find that GPU memory usage fluctuates between 7000 MB and 23000 MB. This is the first time I have seen such drastic fluctuations during training; usage is generally stable at a certain value. Is this normal in SHERF training? The total memory of a 3090 is 24000 MB, so I am afraid this will cause an OOM error during training. Should I decrease the batch size or the number of workers? I am currently using two 3090s with batch=2, workers=2.
But when I train the model on the RenderPeople dataset, I encounter CUDA out of memory. I use one 3090 and set batch=1, workers=0.
Hi, thanks for reporting the issue. We did not observe the CUDA out-of-memory issue on V100. Possible solutions are to decrease the number of samples per ray or to further decrease the chunk size at line 355 of sherf/training/volumetric_rendering/renderer.py.
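The idea behind a chunk size can be sketched in a few lines: instead of evaluating all rays at once, evaluate them in slices so that only one slice's intermediate data is alive at a time. This is a generic illustration on Python lists, not the SHERF renderer; `render_in_chunks` and `render_fn` are hypothetical names.

```python
def render_in_chunks(rays, render_fn, chunk_size):
    """Apply render_fn to rays in slices of at most chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    outputs = []
    for start in range(0, len(rays), chunk_size):
        chunk = rays[start:start + chunk_size]
        # Only this slice is handed to render_fn at a time.
        outputs.extend(render_fn(chunk))
    return outputs


if __name__ == "__main__":
    seen_sizes = []

    def fake_render(chunk):
        seen_sizes.append(len(chunk))
        return [2 * r for r in chunk]

    print(render_in_chunks(list(range(10)), fake_render, chunk_size=4))
    print(seen_sizes)
```

One caveat: during training with autograd, the activations of every chunk are normally kept until the backward pass, so a smaller chunk size may reduce peak memory much less than it does at inference time.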
Thanks for replying. When validating the model on ZJU-MoCap, I found that the images in CoreView_313 and CoreView_315 cannot be read. The image names look like CoreView_313Camera(1)_1441_2019-08-23_16-09-37.378.jpg, but the network cannot read them and raises this error: FileNotFoundError: No such file: '/media/data4/zju_mocap/CoreView_313/Camera (1)/0001.jpg'
Hi, I have checked the zju_mocap dataset and found that this file does exist in the directory. Could you check again? If you have further questions, please let me know.
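When a loader reports a missing file that the maintainer can see on their side, it helps to check which part of the path actually exists locally and what the directory contains. The following is a minimal diagnostic sketch using only the standard library; `describe_missing_file` is a hypothetical helper and makes no assumption about how the ZJU-MoCap files are named.

```python
from pathlib import Path


def describe_missing_file(path, max_listing=5):
    """Report whether path exists, its nearest existing ancestor, and a sample listing."""
    path = Path(path)
    ancestor = path
    # Walk upwards until something exists (or the filesystem root is reached).
    while not ancestor.exists() and ancestor != ancestor.parent:
        ancestor = ancestor.parent
    listing = []
    if ancestor.is_dir():
        listing = sorted(p.name for p in ancestor.iterdir())[:max_listing]
    return {
        "exists": path.exists(),
        "nearest_existing": ancestor,
        "listing": listing,
    }


if __name__ == "__main__":
    report = describe_missing_file(
        "/media/data4/zju_mocap/CoreView_313/Camera (1)/0001.jpg"
    )
    for key, value in report.items():
        print(f"{key}: {value}")
```

If `nearest_existing` is the camera directory itself, the listing shows the file names that are really present, which can then be compared with the name the loader asks for.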
> But when I train the model on the RenderPeople dataset, I encounter CUDA out of memory. I use one 3090 and set batch=1, workers=0.
I have the same problem. I have tried decreasing the chunk size, even to 50000, but it does not seem to help.
Hi, thanks for letting me know. One possible reason is that the V100 has more GPU memory than the 3090. In our implementation, we render the whole image. One possible solution is to compute the loss on patches. I will provide a training recipe with a patch-level loss soon. Alternatively, you can implement it by modifying the sampling function.
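Until such a recipe is available, the index bookkeeping of a patch-level loss can be sketched as follows. This is a rough illustration on nested Python lists, not the authors' method: all function names are hypothetical, and a real implementation would render only the rays that fall inside the patch rather than cropping a fully rendered image.

```python
import random


def sample_patch(height, width, patch_size, rng=random):
    """Pick the top-left corner of a random patch_size x patch_size window."""
    if patch_size <= 0 or patch_size > min(height, width):
        raise ValueError("patch_size must be in [1, min(height, width)]")
    top = rng.randint(0, height - patch_size)
    left = rng.randint(0, width - patch_size)
    return top, left


def crop(image, top, left, patch_size):
    """Crop a patch from an image stored as a list of rows."""
    return [row[left:left + patch_size] for row in image[top:top + patch_size]]


def patch_mse(pred, target, top, left, patch_size):
    """Mean squared error restricted to the selected patch."""
    p = crop(pred, top, left, patch_size)
    t = crop(target, top, left, patch_size)
    diffs = [(a - b) ** 2 for pr, tr in zip(p, t) for a, b in zip(pr, tr)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    pred = [[0.0] * 8 for _ in range(8)]
    target = [[1.0] * 8 for _ in range(8)]
    top, left = sample_patch(8, 8, patch_size=4)
    print(top, left, patch_mse(pred, target, top, left, 4))
```

The memory saving comes from the number of rays per step: a 4x4 patch of an 8x8 image needs a quarter of the rays, and the same ratio applies at full resolution.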
@yejr0229 Hi, I ran into the same problem as you: fatal error: cuda_runtime_api.h: No such file or directory. Could you please tell me how you solved it?
Thank you for your great work. I followed your install instructions and use one RTX 3090 for training, but I encounter this error:
In file included from /home/yejr/.cache/torch_extensions/py38_cu113/bias_act_plugin/b46266ff65f9fa53c32108953a1c6f16-nvidia-geforce-rtx-3090/bias_act.cpp:14:0:
/home/user/miniconda3/envs/sherf/lib/python3.8/site-packages/torch/include/ATen/cuda/CUDAContext.h:5:10: fatal error: cuda_runtime_api.h: No such file or directory
 #include <cuda_runtime_api.h>
compilation terminated.
ninja: build stopped: subcommand failed.
Can you help me deal with it?
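This compiler error means the CUDA toolkit headers were not on the include path when the plugin was built. A quick way to narrow it down is to check whether `cuda_runtime_api.h` exists under the toolkit locations that are usually consulted. The sketch below assumes the common lookup order of the `CUDA_HOME` and `CUDA_PATH` environment variables followed by `/usr/local/cuda`; the helper names are hypothetical, and toolkits installed elsewhere (for example under a conda environment) would need their own entry in the list.

```python
import os
from pathlib import Path


def candidate_cuda_homes(env=None):
    """List likely CUDA toolkit roots: CUDA_HOME, CUDA_PATH, then /usr/local/cuda."""
    env = os.environ if env is None else env
    candidates = [env.get("CUDA_HOME"), env.get("CUDA_PATH"), "/usr/local/cuda"]
    return [Path(c) for c in candidates if c]


def find_cuda_header(homes, header="cuda_runtime_api.h"):
    """Return the first <home>/include/<header> that exists, or None."""
    for home in homes:
        candidate = Path(home) / "include" / header
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    found = find_cuda_header(candidate_cuda_homes())
    if found is None:
        print("cuda_runtime_api.h not found; a full CUDA toolkit may be missing")
    else:
        print("found header at", found)
```

If the header is not found, a likely cause is that only the CUDA runtime libraries are installed (as is often the case with a conda cudatoolkit package), in which case installing a full CUDA 11.3 toolkit and pointing `CUDA_HOME` at it is the usual remedy.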