RuntimeError: CUDA error: device-side assert triggered

anilesec commented 3 years ago

Dear Author,

Thank you for the cool implementation. I installed it successfully and tried to run "python train_nerf.py --config config/lego.yml", but I am getting RuntimeError: CUDA error: device-side assert triggered.

Traceback (most recent call last):
  File "train_nerf.py", line 404, in <module>
    main()
  File "train_nerf.py", line 240, in main
    encode_direction_fn=encode_direction_fn,
  File "/home/aswamy/github_repos/NeRF/nerf-pytorch-krish/nerf-pytorch/nerf/train_utils.py", line 180, in run_one_iter_of_nerf
    for batch in batches
  File "/home/aswamy/github_repos/NeRF/nerf-pytorch-krish/nerf-pytorch/nerf/train_utils.py", line 180, in <listcomp>
    for batch in batches
  File "/home/aswamy/github_repos/NeRF/nerf-pytorch-krish/nerf-pytorch/nerf/train_utils.py", line 115, in predict_and_render_radiance
    encode_direction_fn,
  File "/home/aswamy/github_repos/NeRF/nerf-pytorch-krish/nerf-pytorch/nerf/train_utils.py", line 11, in run_network
    embedded = embed_fn(pts_flat)
  File "/home/aswamy/github_repos/NeRF/nerf-pytorch-krish/nerf-pytorch/nerf/nerf_helpers.py", line 166, in <lambda>
    x, num_encoding_functions, include_input, log_sampling
  File "/home/aswamy/github_repos/NeRF/nerf-pytorch-krish/nerf-pytorch/nerf/nerf_helpers.py", line 138, in positional_encoding
    device=tensor.device,
  File "/home/aswamy/tools/anaconda3/envs/nerf-pytorch-krish/lib/python3.7/site-packages/torch/tensor.py", line 27, in wrapped
    return f(*args, **kwargs)
  File "/home/aswamy/tools/anaconda3/envs/nerf-pytorch-krish/lib/python3.7/site-packages/torch/tensor.py", line 547, in __rpow__
    return torch.tensor(other, dtype=dtype, device=self.device) ** self
RuntimeError: CUDA error: device-side assert triggered

Any suggestions to solve this?

Thank you!

krrish94 commented 3 years ago

It's hard to tell without knowing the exact config, but does this issue seem to help you? https://github.com/krrish94/nerf-pytorch/issues/9

anilesec commented 3 years ago

I tried reducing the chunk size and the number of layers, but the error persists. Besides, I do not have any issues with GPU memory. If you mean the model config file, I just used the default config file provided in the GitHub repository.

pgmsuper commented 3 years ago

I think the number of labels is wrong, which causes an error when calculating the loss value.

anilesec commented 3 years ago

But there are no labels involved here, as far as I understand.

anilesec commented 3 years ago

@krrish94 Here is more information about the error. It looks like the error is raised in /nerf-pytorch/nerf/nerf_helpers.py, line 301: cdf_g = torch.gather(cdf.unsqueeze(1).expand(matched_shape), 2, inds_g)

/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [47,0,0], thread: [47,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [47,0,0], thread: [48,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [47,0,0], thread: [50,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [47,0,0], thread: [53,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [47,0,0], thread: [54,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [47,0,0], thread: [56,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [47,0,0], thread: [59,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [47,0,0], thread: [60,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [47,0,0], thread: [62,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
  0%|          | 0/200000 [00:08<?, ?it/s]
Traceback (most recent call last):
  File "train_nerf.py", line 406, in <module>
    main()
  File "train_nerf.py", line 242, in main
    encode_direction_fn=encode_direction_fn,
  File "/home/aswamy/github_repos/NeRF/nerf-pytorch-krish/nerf-pytorch/nerf/train_utils.py", line 180, in run_one_iter_of_nerf
    for batch in batches
  File "/home/aswamy/github_repos/NeRF/nerf-pytorch-krish/nerf-pytorch/nerf/train_utils.py", line 180, in <listcomp>
    for batch in batches
  File "/home/aswamy/github_repos/NeRF/nerf-pytorch-krish/nerf-pytorch/nerf/train_utils.py", line 101, in predict_and_render_radiance
    det=(getattr(options.nerf, mode).perturb == 0.0),
  File "/home/aswamy/github_repos/NeRF/nerf-pytorch-krish/nerf-pytorch/nerf/nerf_helpers.py", line 301, in sample_pdf_2
    cdf_g = torch.gather(cdf.unsqueeze(1).expand(matched_shape), 2, inds_g)
RuntimeError: CUDA error: device-side assert triggered
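A note on reading these traces: CUDA kernels are launched asynchronously, so the Python frame shown for a device-side assert is not always the call that actually failed. That may be why the first traceback ends in positional_encoding while this one ends in torch.gather. A minimal sketch, assuming the environment variable is set before torch makes its first CUDA call; the helper name enable_blocking_cuda_launches is hypothetical and not part of nerf-pytorch:

```python
import os


def enable_blocking_cuda_launches(env=os.environ):
    """Request synchronous CUDA kernel launches so errors surface at the failing call.

    Assumption: this runs before the first CUDA call (simplest: before `import torch`).
    """
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env["CUDA_LAUNCH_BLOCKING"]


enable_blocking_cuda_launches()
# import torch  # import only after the variable has been set
```

The same effect can be had from the shell with CUDA_LAUNCH_BLOCKING=1 python train_nerf.py --config config/lego.yml. Training runs slower in this mode, so it is only meant for debugging.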

anilesec commented 3 years ago

@krrish94 Following up on my previous comment: I printed the tensors and found that inds_g contains values far outside the valid index range, both extremely large and extremely negative, which causes the out-of-bounds error:

inds_g.min() = tensor(-4993021444723710459)
inds_g.max() = tensor(4575432887736600530)
inds_g = tensor([[[ 4255818524050935954, 62], [ 4256250760978027750, 62], [ 4237722774569238629, 62], ...,]]

This tensor is used in the file "run_nerf_helpers.py": cdf_g = torch.gather(cdf.unsqueeze(1).expand(matched_shape), 2, inds_g)
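A check like this can also be done in Python before the indices reach the CUDA kernel, which gives a readable error instead of a device-side assert. A minimal sketch, assuming the indices are used along a single dimension as in the torch.gather call above; the helper name assert_gather_indices_in_bounds and the toy tensors are hypothetical and only stand in for cdf and inds_g:

```python
import torch


def assert_gather_indices_in_bounds(src, dim, index):
    """Raise a readable error if `index` is out of range for torch.gather(src, dim, index)."""
    size = src.shape[dim]
    lo, hi = int(index.min()), int(index.max())
    if lo < 0 or hi >= size:
        raise IndexError(
            f"gather index range [{lo}, {hi}] is outside [0, {size - 1}] for dim {dim}"
        )


# Toy stand-ins for the expanded cdf and inds_g (shapes chosen for illustration only).
cdf = torch.linspace(0.0, 1.0, 5).view(1, 1, 5).expand(1, 3, 5)
good = torch.tensor([[[0, 1], [2, 3], [3, 4]]])
bad = torch.tensor([[[0, 1], [2, 3], [3, 4255818524050935954]]])

assert_gather_indices_in_bounds(cdf, 2, good)  # in range, so nothing is raised
cdf_g = torch.gather(cdf, 2, good)
```

Calling the helper with the bad tensor raises IndexError on the host, so the failing line shows up directly in the Python traceback.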

anilesec commented 3 years ago

Another update: these large index values actually come from torchsearchsorted.searchsorted(). After the line inds = torchsearchsorted.searchsorted(cdf, u, side="right"), the values in inds are already extreme (out of bounds).

anilesec commented 3 years ago

It may be related to the issue you created for searchsorted() @krrish94

anilesec commented 3 years ago

I replaced torchsearchsorted.searchsorted() with the official torch.searchsorted(); the error is gone and training now runs successfully, though I am not sure how this change affects performance. I think it may be worth mentioning somewhere, because it took me some time to track this down :)
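For anyone making the same change, here is a minimal sketch of the swap, assuming a PyTorch version that ships torch.searchsorted (1.6 or later). The official function takes right=True where the external package took side="right". The clamping step and the toy cdf and u tensors are illustrative assumptions, not the repository's exact code:

```python
import torch

# Toy CDF per ray (each row sorted along the last dimension) and uniform samples.
cdf = torch.tensor([[0.0, 0.2, 0.5, 1.0],
                    [0.0, 0.6, 0.9, 1.0]])
u = torch.tensor([[0.1, 0.5, 0.99],
                  [0.0, 0.7, 1.0]])

# Before: inds = torchsearchsorted.searchsorted(cdf, u, side="right")
inds = torch.searchsorted(cdf, u, right=True)

# A right-sided search can return cdf.shape[-1], so clamp before gathering.
below = torch.clamp(inds - 1, min=0)
above = torch.clamp(inds, max=cdf.shape[-1] - 1)
inds_g = torch.stack((below, above), dim=-1)

# Same gather pattern as in the traceback, with the expanded shape written out.
cdf_g = torch.gather(cdf.unsqueeze(1).expand(-1, u.shape[-1], -1), 2, inds_g)
```

After the swap, inds_g.min() and inds_g.max() should stay within [0, cdf.shape[-1] - 1]; printing them once is a quick way to confirm the fix.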

Thank you!

krrish94 commented 3 years ago

Thanks so much for digging into this! From a skim this appears to be due to a weird config that's potentially leading to indexing errors. I'd trust the newer torch searchsorted function as opposed to the external package.

pgmsuper commented 3 years ago

Can you update your code? I changed my code, but it still does not work.

pgmsuper commented 3 years ago

I think you can try upgrading your Python libraries, such as numpy. I did that and it ran successfully.

krrish94 / nerf-pytorch

RuntimeError: CUDA error: device-side assert triggered #22
