nerfstudio-project / nerfacc

A General NeRF Acceleration Toolbox in PyTorch.
https://www.nerfacc.com/

Core Dumped after a few iterations #130

Closed: alvaro-budria closed this issue 1 year ago

alvaro-budria commented 1 year ago

Thanks for the good work. I am trying out the CUDA-accelerated rendering provided in this repo, specifically the alpha-based rendering.

So far, training proceeds for a few iterations (~1-300, usually not more) and then an error pops up.

I tried executing the script both with CUDA_LAUNCH_BLOCKING=1 and without it. When this flag is set, the error is

Aborted (core dumped)

with no further details, and it always happens when running the following line:

self.grad_scaler.scale(loss).backward()

Without this flag, the error message is

Traceback (most recent call last):
  File "/home/abudria/VQAD_I-ngp-enhanced-NeuS/exp_runner.py", line 527, in <module>
    runner.train()
  File "/home/abudria/VQAD_I-ngp-enhanced-NeuS/exp_runner.py", line 193, in train
    self.grad_scaler.scale(loss).backward()
  File "/home/abudria/miniconda3/lib/python3.9/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/abudria/miniconda3/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

As the traceback shows, the crash comes from the same line:

self.grad_scaler.scale(loss).backward()

Reducing the number of samples per ray and the batch size to extremely small values does make the error go away. However, I strongly doubt this is an OOM problem, because:

  1. Before using nerfacc's CUDA rendering, I was using my own pure PyTorch pipeline with a large batch size and a large number of samples per ray.
  2. I am monitoring the GPU memory consumption with watch -d -n 0.01 nvidia-smi (see also the sketch after this list for cross-checking from inside PyTorch). At no point does the memory shoot up to the 23028MiB limit of the server's NVIDIA A10.
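
For reference, here is a minimal sketch of cross-checking the peak memory reported by PyTorch's own allocator instead of sampling nvidia-smi (the elided training iterations stand for my loop; nothing here is nerfacc-specific):

import torch

torch.cuda.reset_peak_memory_stats()
# ... run a few training iterations here ...
peak_mib = torch.cuda.max_memory_allocated() / 2**20
reserved_mib = torch.cuda.memory_reserved() / 2**20
print(f"peak allocated: {peak_mib:.0f} MiB, reserved by the caching allocator: {reserved_mib:.0f} MiB")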

The rendering code I have is

        sdf_grad_samples = []

        def alpha_fn(t_starts, t_ends, ray_indices):
            ray_indices = ray_indices.long()
            t_origins = rays_o[ray_indices]
            t_dirs = rays_d[ray_indices]
            midpoints = (t_starts + t_ends) / 2.
            positions = t_origins + t_dirs * midpoints
            sdf = self.geometry.sdf(positions,)
            sdf_grad = self.geometry.gradient(positions,).squeeze()
            normal = F.normalize(sdf_grad, p=2, dim=-1)
            dists = t_ends - t_starts
            alpha = self.get_alpha(sdf, normal, t_dirs, dists)
            return alpha[...,None]

        def rgb_alpha_fn(t_starts, t_ends, ray_indices):
            ray_indices = ray_indices.long()
            t_origins = rays_o[ray_indices]
            t_dirs = rays_d[ray_indices]
            midpoints = (t_starts + t_ends) / 2.
            positions = t_origins + t_dirs * midpoints
            geometry = self.geometry(positions,)
            sdf, feature = geometry[:, :1], geometry[:, 1:]
            sdf_grad = self.geometry.gradient(positions,)
            sdf_grad_samples.append(sdf_grad)
            dists = t_ends - t_starts
            normal = F.normalize(sdf_grad, p=2, dim=-1)
            alpha = self.get_alpha(sdf, normal, t_dirs, dists)
            rgb = self.texture(positions, normal, t_dirs, feature)
            return rgb, alpha[..., None]

        with torch.no_grad():
            packed_info, t_starts, t_ends = ray_marching(
                rays_o, rays_d,
                scene_aabb=self.scene_aabb,
                grid=None,  # self.occupancy_grid if self.grid_prune else None,
                alpha_fn=alpha_fn,
                near_plane=near.squeeze(), far_plane=far.squeeze(),
                render_step_size=self.render_step_size,
                stratified=self.randomized,
                cone_angle=0.0,
                alpha_thre=0.0,
            )

        rgb, opacity, depth = rendering(
            ray_indices=packed_info,
            t_starts=t_starts,
            t_ends=t_ends,
            n_rays=len(rays_o),
            rgb_alpha_fn=rgb_alpha_fn,
            # render_bkgd=self.background_color,
        )

        sdf_grad_samples = torch.cat(sdf_grad_samples, dim=0)
        opacity, depth = opacity.squeeze(-1), depth.squeeze(-1)
        num_samples = torch.as_tensor([len(t_starts)], dtype=torch.int32, device=rays_o.device)

Any ideas?

alvaro-budria commented 1 year ago

I found the problem and a solution for it.

It turns out that at some point I was calling the following method in my geometry network:

def gradient(self, x,):
        with torch.set_grad_enabled(True):
            x.requires_grad_(True)
            y = self.sdf(x)
            gradients = torch.autograd.grad(
                outputs=y,
                inputs=x,
                grad_outputs=torch.ones_like(y),
                create_graph=True,
                retain_graph=True,
                only_inputs=True,
            )[0]
        return gradients.reshape(-1, 3)

The line

x.requires_grad_(True)

was leaving the autograd graph on x empty, which then led to the illegal memory access, or at least that's the best explanation I could come up with. At any rate, replacing the line above with

x = x + 0 * self.__hidden__(x)

solved the problem! I also needed to add this

self.__hidden__ = torch.nn.Linear(3, 1, bias=False)

to the __init__ of my class.
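
For completeness, this is roughly what the geometry class looks like after the change. It is only a sketch: the SDF network below is a made-up placeholder so the snippet runs on its own, while __hidden__, sdf and gradient mirror my actual class.

import torch
import torch.nn as nn

class Geometry(nn.Module):
    def __init__(self):
        super().__init__()
        # placeholder SDF network, just so the sketch is self-contained
        self.net = nn.Sequential(nn.Linear(3, 64), nn.Softplus(beta=100), nn.Linear(64, 1))
        # zero-contribution layer whose only job is to attach x to an autograd graph
        self.__hidden__ = nn.Linear(3, 1, bias=False)

    def sdf(self, x):
        return self.net(x)

    def gradient(self, x):
        with torch.set_grad_enabled(True):
            # instead of x.requires_grad_(True): route x through the dummy layer,
            # so x becomes a non-leaf tensor that already lives on a graph
            x = x + 0 * self.__hidden__(x)
            y = self.sdf(x)
            gradients = torch.autograd.grad(
                outputs=y,
                inputs=x,
                grad_outputs=torch.ones_like(y),
                create_graph=True,
                retain_graph=True,
                only_inputs=True,
            )[0]
        return gradients.reshape(-1, 3)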

CCamouflage-Hvv commented 1 year ago

Thanks for sharing! Can I ask how to use "with torch.no_grad():" around the ray_marching(...) call quoted above? To be specific, the gradient calculation in alpha_fn has to go through PyTorch's autograd (grad_fn). Is it possible to use "with torch.no_grad():" there? Thanks again!

alvaro-budria commented 1 year ago

Yes, you can wrap the ray_marching call in with torch.no_grad(). However, inside the forward of the geometry module you need to re-enable gradients for that forward pass, with something like: with torch.set_grad_enabled(self.training or (with_grad and self.grad_type == 'analytic')):
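
As a rough illustration of how the inner set_grad_enabled overrides the outer no_grad (sdf and hidden below are tiny stand-in modules, not my real network):

import torch
import torch.nn as nn

# stand-ins for the geometry network, only to make the sketch self-contained
sdf = nn.Linear(3, 1)
hidden = nn.Linear(3, 1, bias=False)

x = torch.randn(8, 3)

with torch.no_grad():                      # outer context, e.g. around ray_marching
    with torch.set_grad_enabled(True):     # inside the geometry module's forward
        x_g = x + 0 * hidden(x)            # same trick as above to put x on a graph
        y = sdf(x_g)
        grad = torch.autograd.grad(
            y, x_g, grad_outputs=torch.ones_like(y), create_graph=True
        )[0]

print(grad.shape)  # torch.Size([8, 3]); gradients flow despite the surrounding no_grad()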

You can check the repo I based my code on for the details: https://github.com/bennyguo/instant-nsr-pl/tree/main/models