Hey Guy, could you try reducing the parameter here?
As for the reason: during training we randomly sample rays, so on average each batch ends up with a similar number of samples per ray. During validation, however, we sample rays uniformly, so some batches hit the background completely and some hit the actor, which causes high variance in the number of samples along the rays. I hope this helps if you run into this issue again.
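To illustrate what I mean, here is a toy sketch of the difference (made-up sizes, not the repo's actual sampler):

import torch

num_pixels = 640_000   # hypothetical pixels per camera
batch_size = 4_096     # hypothetical rays per batch

# Training: random pixel indices, so each batch mixes actor and background rays.
train_indices = torch.randint(0, num_pixels, (batch_size,))

# Validation: contiguous pixel chunks in scan order. A chunk lying entirely in
# the background can end up with zero surviving rays after filtering.
chunk_start = 0
val_indices = torch.arange(chunk_start, chunk_start + batch_size)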
Hey Mustafa, thanks for the prompt reply!
What you say makes sense. I tried reducing the number of rays, but it didn't change anything (that would probably have been the fix for a CUDA OOM error). After looking further, I think the problem might be here:
def forward(self, query_input: QueryInput) -> QueryOutput:
    output = self.density(query_input)
    # query_input.directions has value range [-1, 1]. To query the color_net this is transformed to [0, 1].
    color_net_input = [(query_input.directions + 1) * 0.5, output.geometry_features]
    if self.camera_embedding_dim > 0:
        if query_input.is_training:
            color_net_input.append(self.camera_embeddings(query_input.camera_numbers.squeeze(1).long()))
        else:
            # Using zeros during validation & test time.
            color_net_input.append(
                torch.zeros(
                    (query_input.directions.shape[0], self.camera_embedding_dim),
                    dtype=query_input.directions.dtype,
                    device=query_input.directions.device,
                )
            )
    output.radiance = self.color_net(torch.cat(color_net_input, dim=-1))
    return output
What ends up happening is that during validation, self.color_net is called with an input of shape [0, 20] (in training it's roughly [640000, 20]). I wonder if in your setup you also get shape [0, 20] and it's handled fine?
Edit: attaching a screenshot with prints of the shapes. It seems this forward function is called twice during validation; the second time, query_input.directions.shape[0] = 0.
Hi again @isikmustafa, I have adapted my data (so far just the AABB) and experience a similar issue in validation. I have localized the issue to just after the CUDA kernel in the __next__ function of the dataloader, in the section that only runs for val/test.
In the screenshot (bottom part) you can see the inputs to get_samples_aabb_minmax look sensible (I am also able to view the cameras properly with your snippet after invoking the dataloader), while on the left you can see that every output for the batch (even the pixel colors) has 0 as the first dimension of its shape. This happens on the last image of the validation set.
Really appreciate any tips you have :) Thanks!
Hey Guy, this is something we have seen as well. After the ray sampler is called, the rays that hit the background are eliminated, so it is possible to see shapes like (0, 3). If you just run the default config without changing anything, this shouldn't really happen.
So, I'm wondering if this could somehow be related to the PyTorch or tiny-cuda-nn version, since their behavior might have changed.
One hotfix could be to add checks for these scenarios and skip running the networks.
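Something like this (a rough sketch, not tested against the repo; color_net here is just a stand-in for the tiny-cuda-nn MLP and the shapes are made up):

import torch
import torch.nn as nn

color_net = nn.Linear(20, 3)

def query_radiance(color_net_input: torch.Tensor) -> torch.Tensor:
    if color_net_input.shape[0] == 0:
        # No rays survived the background filter: return an empty radiance
        # tensor instead of running the network on zero rows.
        return color_net_input.new_zeros((0, 3))
    return color_net(color_net_input)

print(query_radiance(torch.randn(4096, 20)).shape)  # torch.Size([4096, 3])
print(query_radiance(torch.randn(0, 20)).shape)     # torch.Size([0, 3])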
I’m on a vacation now, so I can’t dig deep into the code but I’ll take a closer look if the problem persists somehow.
Thanks! That makes sense. For now I am working on a workaround to just skip the networks as you suggested, by creating an input batch of zeros for the unmasked rays and giving them a RenderOutput of zeros.
After implementing that, I noticed that if I get a batch with just 1 ray to compute, I encounter the same error... (if I change the hyperparameters around a bit it happens less often).
Do you remember which version of tiny-cuda-nn you used? Mine shows 1.7.
Enjoy your time there :)
@gafniguy @isikmustafa Hey guys! Thanks for all your comments! I have encountered the same issue for validation only (training works fine). The input of self.ray_sampler_func is valid, but the output is empty. My tinycudann also shows 1.7. Have you solved this problem yet? Thanks in advance for any tips!
I think I just figured this out. The error is caused by the rays sampled uniformly during validation all hitting the background and all being filtered out. My solution is to also sample rays randomly during validation, but keep each pixel sampled once and only once. To do so, we can simply define a pixel-level mapping tensor as
# `__init__` of `data_loader.py` (Line 356)
self.random_pixel_indices = torch.randperm(self.num_pixels_per_camera, device=self.device)
and where the ray sampling indices are defined in __next__ (Line 582), we can simply do an index mapping like:
# data_loader, __next__, Line 582
ray_indices = torch.arange(ray_index_start, ray_index_end, dtype=torch.int64, device=self.device)
ray_indices = self.random_pixel_indices[ray_indices] # random 1-to-1 projection
This way, the randomly sampled rays will most likely not all hit the background, and validation can run properly.
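One caveat (my own addition, not verified against the repo): since the pixels now come back in permuted order, the rendered validation image needs to be unscrambled before assembling it, e.g. by inverting the permutation:

import torch

# Toy example (made-up sizes): undo the permutation before building the image.
num_pixels = 16
random_pixel_indices = torch.randperm(num_pixels)

# Pretend `rendered` holds one RGB value per permuted ray, produced batch by
# batch exactly as in the snippet above.
rendered = torch.randn(num_pixels, 3)

# Ray k was traced through pixel random_pixel_indices[k], so scatter it back.
image = torch.empty_like(rendered)
image[random_pixel_indices] = rendered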
I just noticed that an empty batch might cause issues in some versions of PyTorch (e.g., 2.1) while others handle it fine. So I'm guessing there might be other relevant factors as well, but as @ZhaoLizz pointed out (many thanks!), this problem can easily be avoided by using random indices.
Hi Mustafa and the team,
I seem to have managed to install it correctly, and following your example code I got the sample dataset and tried to train a model using the example config. It seems to train well, but at every validation step I hit a CUDA error in a tiny-cuda-nn function. Any idea what it could be?
[I am able to resume training from the saved checkpoint, which is why the log shows it at 10k iterations, but it happens at the first validation step after 2.5k iterations, every time.]
Many thanks for the repo and support!
Guy