google-research / nerf-from-image

Shape, Pose, and Appearance from a Single Image via Bootstrapped Radiance Field Inversion
Apache License 2.0

Process error when attempting to do inference or training #8

Open jclarkk opened 1 year ago

jclarkk commented 1 year ago
loading datasets\cub\data\test_cub_cleaned.mat
2874 images
  0%|          | 0/373 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "Python\Python310\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "Python\Python310\lib\multiprocessing\spawn.py", line 125, in _main
    prepare(preparation_data)
  File "Python\Python310\lib\multiprocessing\spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "Python\Python310\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "Python\Python310\lib\runpy.py", line 289, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "Python\Python310\lib\runpy.py", line 96, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "nerf-from-image\run.py", line 157, in <module>
    dataset_config, train_split, train_eval_split, test_split = loaders.load_dataset(
  File "nerf-from-image\data\loaders.py", line 222, in load_dataset
    train_split, train_eval_split, test_split = loader(dataset_config, args,
  File "nerf-from-image\data\loaders.py", line 303, in load_custom
    for i, sample in enumerate(tqdm(loader)):
  File "venv\lib\site-packages\tqdm\std.py", line 1178, in __iter__
    for obj in iterable:
  File "venv\lib\site-packages\torch\utils\data\dataloader.py", line 442, in __iter__
    return self._get_iterator()
  File "venv\lib\site-packages\torch\utils\data\dataloader.py", line 388, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "venv\lib\site-packages\torch\utils\data\dataloader.py", line 1043, in __init__
    w.start()
  File "Python\Python310\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "Python\Python310\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "Python\Python310\lib\multiprocessing\context.py", line 336, in _Popen
    return Popen(process_obj)
  File "Python\Python310\lib\multiprocessing\popen_spawn_win32.py", line 45, in __init__
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "Python\Python310\lib\multiprocessing\spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "Python\Python310\lib\multiprocessing\spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

System: RTX 3080 8GB, Windows 10
Python: 3.10
Dependencies: torch==2.0.0+cu117, torchvision==0.15.1+cu117, imageio==2.28.0, opencv-python-headless==4.7.0.72, tensorboard==2.12.2, numpy==1.23.5, scikit-image==0.20.0, scipy==1.10.1, tqdm==4.65.0, lpips==0.1.4, pycocotools==2.0.6, pytorch-fid==0.3.0
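
For reference, the idiom the RuntimeError message refers to is the standard Windows-safe entry-point guard: on Windows, multiprocessing uses the "spawn" start method, which re-imports the main module in every worker process, so any code that starts workers must live behind an `if __name__ == '__main__':` guard. A minimal sketch (a hypothetical `windows_safe_main.py`, not the repository's run.py):

```python
# windows_safe_main.py -- hypothetical minimal example, not the repository's run.py.
# On Windows, multiprocessing uses the "spawn" start method, which re-imports the
# main module in every worker process. Without the __main__ guard below, the
# DataLoader would be re-created during that import and raise the bootstrapping
# RuntimeError shown above.
import torch
from torch.utils.data import DataLoader, TensorDataset


def main():
    dataset = TensorDataset(torch.randn(64, 3))
    # num_workers > 0 spawns child processes; keeping this call inside main()
    # (guarded below) keeps it out of the module-level import path.
    loader = DataLoader(dataset, batch_size=8, num_workers=2)
    for (batch,) in loader:
        print(batch.shape)


if __name__ == '__main__':
    main()
```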

dariopavllo commented 1 year ago

Hi,

This looks like an issue related to spawning processes in the data loaders (it's Windows-specific). Can you try setting all `num_workers` to 0 in https://github.com/google-research/nerf-from-image/blob/main/data/loaders.py?
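
The change amounts to passing `num_workers=0` to the DataLoader so batches are loaded in the main process and no worker processes are spawned. A hedged sketch of what such a call might look like (the dummy dataset and the actual constructor arguments in data/loaders.py are assumptions, not the repository's code):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset standing in for the repository's image data.
dataset = TensorDataset(torch.randn(16, 3, 64, 64))

# num_workers=0 loads batches in the main process, so no child processes are
# spawned and the Windows bootstrapping error cannot occur (at the cost of
# slower data loading).
loader = DataLoader(dataset, batch_size=4, shuffle=False, num_workers=0)

for (batch,) in loader:
    print(batch.shape)  # torch.Size([4, 3, 64, 64])
```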

jclarkk commented 1 year ago

Thanks. I'm getting another process-related exception so I'll keep it on this topic:

I'm receiving a DataLoader worker exception on Debian:

RuntimeError: DataLoader worker (pid 5798) is killed by signal: Killed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/nerf-from-image/run.py", line 157, in <module>
    dataset_config, train_split, train_eval_split, test_split = loaders.load_dataset(
  File "/nerf-from-image/data/loaders.py", line 222, in load_dataset
    train_split, train_eval_split, test_split = loader(dataset_config, args,
  File "/nerf-from-image/data/loaders.py", line 438, in load_shapenet
    train_split.images, train_split.tform_cam2world, train_split.focal_length = load_shapenet(
  File "/nerf-from-image/data/loaders.py", line 424, in load_shapenet
    for i, sample in enumerate(tqdm(loader)):
  File "/.local/lib/python3.10/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 634, in __next__
    data = self._next_data()
  File "/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1329, in _next_data
    idx, data = self._get_data()
  File "/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1295, in _get_data
    success, data = self._try_get_data()
  File "/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1146, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 5798) exited unexpectedly

It was on Debian GNU/Linux 10 (buster).

Running on a GCP "a2-highgpu-1g" instance with an NVIDIA A100 40GB.

Edit: command used: `python3 run.py --resume_from g_shapenet_chairs_pretrained --inv_export_demo_sample --gpus 1 --batch_size 4`

dariopavllo commented 1 year ago

Most likely, it's an out-of-memory error. Does this happen with the smaller datasets?

The script pre-loads the entire dataset into memory for performance reasons. If that's the issue, you can either increase the VM's memory or extract a sample of the dataset.
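
If trimming memory usage rather than resizing the VM, one way to extract a sample is to wrap the dataset in `torch.utils.data.Subset` before the pre-loading loop. A hedged sketch under the assumption that the loop iterates a standard PyTorch dataset (the dummy dataset and variable names are illustrative; the repository's loaders.py may organize this differently):

```python
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset
from tqdm import tqdm

# Dummy dataset standing in for a full pre-rendered split.
full_dataset = TensorDataset(torch.randn(1000, 3, 64, 64))

# Keep only the first 100 samples so the pre-loading loop fits in RAM.
sample = Subset(full_dataset, range(100))

loader = DataLoader(sample, batch_size=4, num_workers=0)
for i, (batch,) in enumerate(tqdm(loader)):
    pass  # the real script would accumulate images, poses, and focal lengths here
```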