Specifying "mask_path" in dataset causes runtime error

emd4600 commented 1 year ago

Describe the bug

It is impossible to train datasets with masks on Nerfacto because of a runtime error.

To Reproduce

Steps to reproduce the behavior:

1. Make a dataset that uses mask_path.
2. Attempt to train it (I tried Nerfacto; I don't know whether it fails with other methods).

Expected behavior

Training should run without errors. Instead, after the dataset loads, the following error appears:

....
  File "/home/eric/nerfstudio/nerfstudio/engine/trainer.py", line 203, in train
    loss, loss_dict, metrics_dict = self.train_iteration(step)
  File "/home/eric/nerfstudio/nerfstudio/utils/profiler.py", line 43, in wrapper
    ret = func(*args, **kwargs)
  File "/home/eric/nerfstudio/nerfstudio/engine/trainer.py", line 371, in train_iteration
    _, loss_dict, metrics_dict = self.pipeline.get_train_loss_dict(step=step)
  File "/home/eric/nerfstudio/nerfstudio/utils/profiler.py", line 43, in wrapper
    ret = func(*args, **kwargs)
  File "/home/eric/nerfstudio/nerfstudio/pipelines/base_pipeline.py", line 255, in get_train_loss_dict
    ray_bundle, batch = self.datamanager.next_train(step)
  File "/home/eric/nerfstudio/nerfstudio/data/datamanagers/base_datamanager.py", line 418, in next_train
    batch = self.train_pixel_sampler.sample(image_batch)
  File "/home/eric/nerfstudio/nerfstudio/data/pixel_samplers.py", line 197, in sample
    pixel_batch = self.collate_image_dataset_batch(
  File "/home/eric/nerfstudio/nerfstudio/data/pixel_samplers.py", line 99, in collate_image_dataset_batch
    collated_batch = {
  File "/home/eric/nerfstudio/nerfstudio/data/pixel_samplers.py", line 100, in <dictcomp>
    key: value[c, y, x] for key, value in batch.items() if key != "image_idx" and value is not None
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

Potential solution

The error happens because the pixel sampler uses the mask image to compute the indices with which it samples the RGB image. The RGB image is on the CPU, whereas the mask, and therefore every index derived from it, is on the GPU. This is caused by the _get_collated_batch() method in CacheDataloader, which moves all batch data (including the mask) to the GPU, except the image. This could be fixed by changing dataloaders.py:110 to:

collated_batch = get_dict_to_torch(collated_batch, device=self.device, exclude=["image", "mask"])

(Line 186 may need the same change if masking is used in evaluation.)
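For readers unfamiliar with the exclude idea, here is a minimal sketch of what such a helper does. move_batch_to_device is a hypothetical stand-in written for this illustration, not nerfstudio's get_dict_to_torch, and the tensor shapes are assumptions. It falls back to the CPU when no GPU is available, so it runs anywhere.

```python
import torch


def move_batch_to_device(batch, device, exclude=()):
    """Hypothetical stand-in: move tensors to `device`, skipping excluded keys."""
    moved = {}
    for key, value in batch.items():
        if isinstance(value, torch.Tensor) and key not in exclude:
            moved[key] = value.to(device)
        else:
            # Excluded keys (and non-tensor values) stay where they are.
            moved[key] = value
    return moved


# Use the GPU when present; otherwise everything simply stays on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Assumed shapes: 2 images of 4x4 pixels, one boolean mask channel per image.
batch = {
    "image": torch.rand(2, 4, 4, 3),
    "mask": torch.ones(2, 4, 4, 1, dtype=torch.bool),
    "image_idx": torch.arange(2),
}

moved = move_batch_to_device(batch, device, exclude=["image", "mask"])
```

With "image" and "mask" both excluded, the mask and the image end up on the same device, so indices derived from the mask can index the image directly.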

nepfaff commented 1 year ago

I'm getting the same error.

Using collated_batch = get_dict_to_torch(collated_batch, device=self.device, exclude=["image", "mask"]) works for me but is quite slow. Using collated_batch = get_dict_to_torch(collated_batch, device=self.device) would be faster if your GPU memory is big enough.

machenmusik commented 1 year ago

Confirmed, thanks for the suggestion @nepfaff - with semantic-nerfw, looks like changes from #1467 are also needed, but having everything on GPU is indeed both working and fast.

machenmusik commented 1 year ago

Does latest https://github.com/nerfstudio-project/nerfstudio/commit/a91b92f89d401115b42e6ad295a8b850e02ee86f solve this for others as well? If so, we can close.

VladMVLX commented 1 year ago

I am having the same issue trying to train nerfacto with masks at version 0.19

nepfaff commented 1 year ago

> Does latest a91b92f solve this for others as well? If so, we can close.

This has been undone here, so the issue is back.

Tao-11-chen commented 1 year ago

Well, if you want to make it work, just add these three lines to nerfstudio/data/pixel_samplers.py after line 98:

c = c.cpu()
y = y.cpu()
x = x.cpu()

Tao-11-chen commented 1 year ago

The author seems to have put the masks on CUDA to accelerate the pixel sampling, so the sampled indices are on CUDA too. Using indices on CUDA to index images on the CPU causes this issue.
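The workaround can be sketched in isolation as follows. This is an illustrative sketch only: sample_masked_pixels is a hypothetical function, not the code in pixel_samplers.py, and the shapes are assumptions. It moves the indices to whatever device the image lives on, which generalizes the .cpu() calls above.

```python
import torch


def sample_masked_pixels(image, mask, num_rays):
    """Hypothetical sketch: sample pixels where mask is True.

    image: (N, H, W, 3) tensor, possibly on the CPU.
    mask:  (N, H, W, 1) bool tensor, possibly on the GPU.
    """
    # Candidate (image, row, col) positions; these live on mask.device.
    nonzero = torch.nonzero(mask[..., 0], as_tuple=False)
    chosen = torch.randint(0, nonzero.shape[0], (num_rays,), device=nonzero.device)
    indices = nonzero[chosen]
    c, y, x = (i.flatten() for i in torch.split(indices, 1, dim=-1))
    # Move the indices to the image's device before indexing, so a CPU image
    # is never indexed with CUDA indices.
    c, y, x = c.to(image.device), y.to(image.device), x.to(image.device)
    return image[c, y, x], indices


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
image = torch.rand(2, 4, 4, 3)  # stays on the CPU
mask = torch.zeros(2, 4, 4, 1, dtype=torch.bool)
mask[:, 1:3, 1:3] = True  # only the central 2x2 region is valid
mask = mask.to(device)  # on the GPU when one is available

pixels, indices = sample_masked_pixels(image, mask, num_rays=8)
```

The trade-off is the one discussed in this thread: sampling on the GPU is fast, but each batch of indices then has to be copied back to the CPU unless the images are moved to the GPU as well.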

nepfaff commented 1 year ago

Should now be solved: https://github.com/nerfstudio-project/nerfstudio/pull/1741/commits/72a5700aa86c75274ca1062c729722f65f3e4bd7

nerfstudio-project / nerfstudio

Specifying "mask_path" in dataset causes runtime error #1465