CUDA Error during validation

gafniguy commented 1 year ago

Hi Mustafa and the team,

I seem to have managed to install it correctly. Following your example code, I downloaded the sample dataset and tried to train a model with the example config. Training seems to go well, but at every validation step I hit a CUDA error in a tinycudann function. Any idea what it could be?

[I am able to resume training from the saved checkpoint, which is why the log shows it at 10k iterations, but it happens from the first validation step, after 2.5k iterations, every time.]

Many thanks for the repo and support,
Guy


(sdfstudio) [user@pc humanrf]$ CUDA_LAUNCH_BLOCKING=1 humanrf/run.py     --config example_humanrf     --workspace /tmp/example_workspace     --dataset.path data/actorshq/
/tmp/example_workspace
Running adaptive temporal partitioning with threshold 1.25: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:01<00:00, 29.59it/s]
Images are loaded in 1.334407091140747 seconds by a pool of 64 processes.
Images are loaded in 0.045470476150512695 seconds by a pool of 1 processes.
Setting up [LPIPS] perceptual loss: trunk [alex], v[0.1], spatial [off]
/home/gafni/.conda/envs/sdfstudio/lib/python3.8/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.
  warnings.warn(
/home/gafni/.conda/envs/sdfstudio/lib/python3.8/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=AlexNet_Weights.IMAGENET1K_V1`. You can also use `weights=AlexNet_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
Loading model from: /home/gafni/.conda/envs/sdfstudio/lib/python3.8/site-packages/lpips/weights/v0.1/alex.pth
[INFO] # parameters: 29.840 million
[INFO] Loading the checkpoint from /tmp/example_workspace/checkpoints/step_00007500.pth ...
[INFO] The full state is loaded at step 7500
  0%|                                                                                                                                                                                          | 0/50001 [00:00<?, ? total training iterations/s][W BinaryOps.cpp:594] Warning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (function operator())
loss=0.00010 (exp avg loss=0.00009), lr=0.008706:  20%|████████████████████████▌                                                                                                  | 9999/50001 [04:08<1:04:34, 10.32 total training iterations/s]─────────────────────────────────────────────────────────────────────────────────────────────────────────── Validation at Step 10000 ────────────────────────────────────────────────────────────────────────────────────────────────────────────
Traceback (most recent call last):
  File "humanrf/run.py", line 118, in <module>
    trainer.train(training_data_loader, validation_data_loader, max_steps=config.training.max_steps)
  File "/home/gafni/projects/humanrf/humanrf/trainer.py", line 196, in train
    self.validate(validation_data_loader)
  File "/home/gafni/.conda/envs/sdfstudio/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/gafni/projects/humanrf/humanrf/trainer.py", line 301, in validate
    partial_render_output = render(
  File "/home/gafni/projects/humanrf/humanrf/volume_rendering.py", line 120, in render
    queried_props = scene_representation(query_input)
  File "/home/gafni/.conda/envs/sdfstudio/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/gafni/projects/humanrf/humanrf/scene_representation/humanrf.py", line 206, in forward
    output.radiance = self.color_net(torch.cat(color_net_input, dim=-1))
  File "/home/gafni/.conda/envs/sdfstudio/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/gafni/.conda/envs/sdfstudio/lib/python3.8/site-packages/tinycudann/modules.py", line 189, in forward
    self.params.to(_torch_precision(self.native_tcnn_module.param_precision())).contiguous(),
RuntimeError: CUDA error: invalid configuration argument
CURRENT: psnr=29.6805 lpips=0.0639 ssim=0.9414  --- AVERAGE: psnr=29.6805 lpips=0.0639 ssim=0.9414 :   5%|█████▏                                                                                                  | 1/20 [00:01<00:24,  1.28s/it]
loss=0.00010 (exp avg loss=0.00009), lr=0.008706:  20%|████████████████████████▊                                                                                                   | 10000/50001 [04:10<16:40, 39.99 total training iterations/s]

isikmustafa commented 1 year ago

Hey Guy, could you try reducing the parameter here?

As for the reason: during training we sample rays randomly, so every batch has a similar number of samples per ray on average. During validation, however, we sample rays uniformly. Some batches therefore hit only the background while others hit the actor, which leads to a high variance in the number of samples per ray. I hope this helps if you run into this issue again.
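To make the variance argument concrete, here is a toy sketch (not HumanRF code). It assumes a hypothetical 64×64 image whose foreground is a centered square and compares batches built from contiguous pixel ranges, as in validation, with batches built from randomly ordered pixels, as in training. The image size, mask, and batch size are made-up values chosen only for illustration.

```python
import random

WIDTH, HEIGHT = 64, 64      # hypothetical image size
BATCH_SIZE = 256            # hypothetical number of rays per batch


def is_foreground(pixel_index: int) -> bool:
    """Toy mask: the actor occupies a centered 32x32 square."""
    row, col = divmod(pixel_index, WIDTH)
    return 16 <= row < 48 and 16 <= col < 48


def foreground_counts(pixel_order):
    """Count foreground rays in each consecutive batch of `pixel_order`."""
    return [
        sum(is_foreground(p) for p in pixel_order[start:start + BATCH_SIZE])
        for start in range(0, len(pixel_order), BATCH_SIZE)
    ]


pixels = list(range(WIDTH * HEIGHT))

# Validation-style batching: contiguous pixel ranges in row-major order.
uniform_counts = foreground_counts(pixels)

# Training-style batching: the same pixels, visited in random order.
rng = random.Random(0)
shuffled = pixels[:]
rng.shuffle(shuffled)
random_counts = foreground_counts(shuffled)

print("contiguous batches: min", min(uniform_counts), "max", max(uniform_counts))
print("random batches:     min", min(random_counts), "max", max(random_counts))
```

In this toy setup every contiguous batch contains either 0 or 128 foreground rays, so some batches are entirely background, while the random batches each contain a mix of foreground and background rays.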

gafniguy commented 1 year ago

Hey Mustafa, thanks for the prompt reply!

What you say makes sense. I tried reducing the number of rays, but it didn't change anything (that probably would have been the fix for a CUDA OOM error). After looking further, I think the problem might be here:

    def forward(self, query_input: QueryInput) -> QueryOutput:
        output = self.density(query_input)

        # query_input.directions has value range [-1, 1]. To query the color_net this is transformed to [0, 1].
        color_net_input = [(query_input.directions + 1) * 0.5, output.geometry_features]
        if self.camera_embedding_dim > 0:
            if query_input.is_training:
                color_net_input.append(self.camera_embeddings(query_input.camera_numbers.squeeze(1).long()))
            else:
                # Using zeros during validation & test time.
                color_net_input.append(
                    torch.zeros(
                        (query_input.directions.shape[0], self.camera_embedding_dim),
                        dtype=query_input.directions.dtype,
                        device=query_input.directions.device,
                    )
                )

        output.radiance = self.color_net(torch.cat(color_net_input, dim=-1))

        return output

What ends up happening is that, during validation, self.color_net is called with an input of shape [0,20] (in training it's roughly [640000,20]). I wonder if in your setup you also get shape [0,20] and it's handled fine?

Edit: attaching a screenshot with prints of the shapes. It seems this forward function is called twice during validation, and the second time query_input.directions.shape[0] = 0.

gafniguy commented 1 year ago

Hi again @isikmustafa, I have adapted my data (so far just with AABB) and am experiencing a similar issue in validation. I have localized the issue to after the CUDA kernel in the next function of the dataloader, in the section that is only for val/test.

In the screenshot (bottom part) you can see the inputs to get_samples_aabb_minmax seem to make sense (I am also able to view the cameras properly with your snippet after invoking the dataloader), and on the left you can see all the outputs (even the pixel colors) for the batch have the first dimension of the shape as 0. This is happening on the last image of the validation set.

Really appreciate any tips you have :) Thanks!

isikmustafa commented 1 year ago

Hey Guy, this is something we have seen as well. After the ray sampler is called, the rays that hit the background are eliminated. So, it is possible to see shapes like (0, 3) etc. If you just run with the default config without changing anything, this shouldn’t really happen.

So, I’m wondering if this could somehow be related to the PyTorch or tiny-cuda-nn version, because they might have changed behavior.

One hotfix could be to add if-checks for these scenarios and skip running the networks.

I’m on a vacation now, so I can’t dig deep into the code but I’ll take a closer look if the problem persists somehow.
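A minimal sketch of such an if-check follows. It is not the repository's code: `SkipEmptyBatch` is a hypothetical wrapper, `nn.Linear(20, 3)` is only a stand-in for the real color network, and the 3 output channels are an assumption. The idea is to return an empty tensor with the expected trailing dimension instead of calling the network on zero rows.

```python
import torch
from torch import nn


class SkipEmptyBatch(nn.Module):
    """Hypothetical wrapper: bypass `net` when the batch has zero rows."""

    def __init__(self, net: nn.Module, out_features: int):
        super().__init__()
        self.net = net
        self.out_features = out_features

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if x.shape[0] == 0:
            # No rays survived background filtering: return an empty result
            # (same device and dtype as the input) without running the network.
            return x.new_zeros((0, self.out_features))
        return self.net(x)


# Stand-in for the real color network (assumed: 20 inputs, 3 outputs).
color_net = SkipEmptyBatch(nn.Linear(20, 3), out_features=3)

empty_out = color_net(torch.zeros(0, 20))    # empty batch, network skipped
regular_out = color_net(torch.zeros(5, 20))  # regular batch, network runs
print(empty_out.shape, regular_out.shape)
```

Note that the empty result here takes the input's dtype. If the wrapped network returns a different precision (for example half precision), the placeholder would need to be cast to match before it is concatenated with other outputs.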

gafniguy commented 1 year ago

Thanks! That makes sense. For now I am working on a workaround to just skip the networks as you suggested, by creating an input batch of zeros with unmasked rays that have RenderOutput of zeros.

After implementing that, I noticed that if I get a batch with just 1 ray to compute, I encounter the same error... (If I change the hyperparameters around a bit, it happens less often.)

Do you remember which version of tinycudann you used? Mine shows 1.7.

Enjoy your time there :)

ZhaoLizz commented 1 year ago

@gafniguy @isikmustafa Hey guys! Thanks for all your comments! I have encountered the same issue, for validation only (training works fine). The input of self.ray_sampler_func is valid, but the output results are empty. My tinycudann shows 1.7 as well. Have you solved this problem yet? Thanks in advance for any tips!

ZhaoLizz commented 1 year ago

I guess I just figured this out. The error occurs when the rays sampled uniformly during validation all hit the background and are all filtered out. My solution is to sample rays randomly during validation as well, while keeping each pixel sampled once and only once. To do so, we can simply define a pixel-level mapping tensor:

# `__init__` of `data_loader.py` (Line 356)
self.random_pixel_indices = torch.randperm(self.num_pixels_per_camera, device=self.device)

and when defining the ray sampling indices in __next__ (Line 582), we can simply do an index mapping like:

# data_loader, __next__, Line 582
ray_indices = torch.arange(ray_index_start, ray_index_end, dtype=torch.int64, device=self.device)
ray_indices = self.random_pixel_indices[ray_indices] # random 1-to-1 projection

This way, the randomly sampled rays in a batch are very unlikely to all hit the background, so validation can run properly.
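As a standalone check of the "once and only once" property, the sketch below replays the same index mapping on a tiny example. The pixel count and batch size are hypothetical, and the loop only imitates what `__next__` does with `ray_index_start` and `ray_index_end`. The final lines show one way to restore image order with the inverse permutation, which is only needed under the assumption that downstream code expects per-pixel results in the original order.

```python
import torch

num_pixels_per_camera = 12   # hypothetical, kept tiny for illustration
batch_size = 5               # hypothetical number of rays per validation batch

# One fixed random permutation of all pixel indices (as in `__init__`).
random_pixel_indices = torch.randperm(num_pixels_per_camera)

visited = []
for ray_index_start in range(0, num_pixels_per_camera, batch_size):
    ray_index_end = min(ray_index_start + batch_size, num_pixels_per_camera)
    ray_indices = torch.arange(ray_index_start, ray_index_end, dtype=torch.int64)
    # Map the contiguous range through the permutation (as in `__next__`).
    visited.append(random_pixel_indices[ray_indices])

visited = torch.cat(visited)

# Every pixel index appears exactly once across all batches.
covers_all = torch.equal(
    torch.sort(visited).values, torch.arange(num_pixels_per_camera)
)
print(covers_all)

# The inverse permutation puts per-ray results back into image order.
inverse = torch.argsort(random_pixel_indices)
restored = visited[inverse]
```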

isikmustafa commented 10 months ago

I just noticed that having an empty batch might cause issues in some versions of PyTorch (e.g., 2.1) but not in others. So I'm guessing there might be other relevant factors as well, but as @ZhaoLizz pointed out (many thanks!), this problem can easily be avoided by using random indices.

synthesiaresearch / humanrf

CUDA Error during validation #15