princeton-vl / infinigen

Infinite Photorealistic Worlds using Procedural Generation
https://infinigen.org
BSD 3-Clause "New" or "Revised" License

When using cuda terrain with slurm jobs, encountered OSError: No such file or directory #207

Open zzyunzhi opened 3 months ago

zzyunzhi commented 3 months ago

Describe the bug

In the rendering stage (task = render), when enabling cuda terrain and executing the task with slurm jobs, I encountered the following error:

  File "/viscam/projects/concepts/engine/engine/third_party/infinigen/infinigen/core/execute_tasks.py", line 418, in main
    execute_tasks(
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/config.py", line 1605, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/config.py", line 1582, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/viscam/projects/concepts/engine/engine/third_party/infinigen/infinigen/core/execute_tasks.py", line 340, in execute_tasks
    terrain = Terrain(scene_seed, surface.registry, task=task, on_the_fly_asset_folder=output_folder/"assets")
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/config.py", line 1605, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/config.py", line 1582, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/viscam/projects/concepts/engine/engine/third_party/infinigen/infinigen/terrain/core.py", line 120, in __init__
    self.elements, scene_infos = scene(seed, Path(on_the_fly_asset_folder), asset_path, device)
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/config.py", line 1605, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/config.py", line 1582, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/viscam/projects/concepts/engine/engine/third_party/infinigen/infinigen/terrain/scene.py", line 56, in scene
    elements[ElementNames.LandTiles] = LandTiles(device, caves, on_the_fly_asset_folder, reused_asset_folder)
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/config.py", line 1605, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/config.py", line 1582, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/viscam/projects/concepts/engine/engine/third_party/infinigen/infinigen/terrain/elements/landtiles.py", line 115, in __init__
    Element.__init__(self, "landtiles", material, transparency)
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/config.py", line 1605, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/config.py", line 1582, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/viscam/projects/concepts/engine/engine/third_party/infinigen/infinigen/terrain/elements/core.py", line 28, in __init__
    dll = load_cdll(f"terrain/lib/{self.device}/elements/{lib_name_X}.so")
  File "/viscam/projects/concepts/engine/engine/third_party/infinigen/infinigen/terrain/utils/ctype_util.py", line 29, in load_cdll
    return CDLL(root/path, mode=RTLD_LOCAL)
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /viscam/projects/concepts/engine/engine/third_party/infinigen/infinigen/terrain/lib/cuda/elements/landtiles_2.so: cannot open shared object file: No such file or directory
  In call to configurable 'Element' (<class 'infinigen.terrain.elements.core.Element'>)
  In call to configurable 'LandTiles' (<class 'infinigen.terrain.elements.landtiles.LandTiles'>)
  In call to configurable 'scene' (<function scene at 0x7faeac694f70>)
  In call to configurable 'Terrain' (<class 'infinigen.terrain.core.Terrain'>)
  In call to configurable 'execute_tasks' (<function execute_tasks at 0x7faea1d5bd90>)
keep_placeholder=True placeholder.name='BushFactory(543568399).spawn_placeholder(514)' list(placeholder.children)=[bpy.data.objects['BushFactory(543568399).spawn_asset(514)']] obj.name='BushFactory(543568399).spawn_asset(514)' list(obj.children)=[bpy.data.objects['Tree.656']]
ground already loaded, loading ground_1 instead
landtiles already loaded, loading landtiles_2 instead

The same script runs successfully in a slurm interactive session.

What version of the code were you using?

commit 5132903cd68704367d1c44c841e5163158e0f33d (HEAD -> main, origin/main, origin/HEAD)

What are your FULL output logs?

7348694_0_7348695_default.log

Platform

zzyunzhi commented 3 months ago

Encountered the same error with slurm jobs when disabling cuda terrain. The full log is attached. 7349009_0_7349010_no_cuda_terrain.log

araistrick commented 3 months ago

Could you provide the install logs, captured via pip install -vv -e . > logs.txt 2>&1? It's possible something didn't compile correctly, although it is strange that this would only show up in the render job.

@mazeyu please take a look also.
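
A quick way to check whether the terrain libraries were compiled is to list the shared objects in the directory named in the OSError. This is a minimal sketch: the helper name `list_compiled_libs` is hypothetical, and the `<lib_root>/<device>/elements` layout is assumed from the path in the traceback, not taken from Infinigen's documentation.

```python
from pathlib import Path


def list_compiled_libs(lib_root, device):
    """Return sorted names (without .so) of shared objects in <lib_root>/<device>/elements.

    The directory layout is an assumption based on the path in the OSError.
    """
    elements_dir = Path(lib_root) / device / "elements"
    if not elements_dir.is_dir():
        # Nothing was compiled for this device, or the path is wrong.
        return []
    return sorted(p.stem for p in elements_dir.glob("*.so"))


if __name__ == "__main__":
    # Assumed location relative to the repository root; adjust to your checkout.
    root = Path("infinigen/terrain/lib")
    for device in ("cpu", "cuda"):
        print(device, list_compiled_libs(root, device))
```

Running this both in an interactive session and inside a batch job would show whether the two environments see the same set of files.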

zzyunzhi commented 3 months ago

Thank you Alex for the prompt reply! The installation logs are attached: installation_logs.txt

Platform information:
OS & OS Version: Linux
GPU: A5000
GPU Driver Version: cuda 11.7

mazeyu commented 3 months ago

Hi, we cannot open these logs.txt files. Could you resend them?

zzyunzhi commented 3 months ago

Hi @mazeyu, please see all related logs here: https://drive.google.com/drive/folders/1_TSPAWKIsWuiEBJEh6Y4Qk1VWu8gMxcn?usp=sharing. Thanks!

mazeyu commented 3 months ago

I see. It is tricky. Terrain() gets called several times when multiple tasks are run in one command. We didn't test that case, and it caused this bug. We will fix it; until then, you can try running the tasks separately, at least separating coarse, fineterrain, and render. Separating tasks is also recommended anyway, for better use of resources.
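
The log lines "landtiles already loaded, loading landtiles_2 instead" suggest that each repeated load of an element library asks for a numbered copy of the .so file. The toy model below is an assumption built only from those log lines, not Infinigen's actual loader. The class `ToyLibRegistry` and its behavior are hypothetical. It shows how constructing the terrain more often than expected can request a copy that was never built.

```python
from collections import defaultdict


class ToyLibRegistry:
    """Hypothetical model: the n-th repeated load of a library asks for '<name>_<n>'."""

    def __init__(self, available):
        # Names of the library files assumed to exist on disk (without .so).
        self.available = set(available)
        self.load_counts = defaultdict(int)

    def load(self, lib_name):
        n = self.load_counts[lib_name]
        # First load uses the plain name; later loads use a numbered copy.
        resolved = lib_name if n == 0 else f"{lib_name}_{n}"
        if resolved not in self.available:
            raise OSError(f"{resolved}.so: cannot open shared object file")
        self.load_counts[lib_name] += 1
        return resolved


if __name__ == "__main__":
    registry = ToyLibRegistry({"landtiles", "landtiles_1"})
    print(registry.load("landtiles"))  # plain name
    print(registry.load("landtiles"))  # first numbered copy
    try:
        registry.load("landtiles")     # second numbered copy is missing
    except OSError as err:
        print("failed:", err)
```

In this model the third load fails because only one numbered copy exists, which matches the shape of the reported error without claiming to be its actual cause.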

larrrry1412 commented 3 months ago

Hi, I ran into this problem too, but in my case it fails in the fine_terrain task. How can I solve it? This is my command:

python -m infinigen.datagen.manage_jobs --output_folder outputs/dev2 --num_scenes 1 \
    --pipeline_config local_64GB monocular cuda_terrain \
    --cleanup big_files --warmup_sec 1200 --configs dev --overwrite

zzyunzhi commented 3 months ago

Hi @mazeyu, thanks for the reply. I also encountered the same error when running the tasks separately as slurm jobs. Below is a truncated log.err containing the relevant error information:

  terrain = Terrain(scene_seed, surface.registry, task='coarse', on_the_fly_asset_folder=output_folder / "assets")
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/config.py", line 1605, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/config.py", line 1582, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/viscam/projects/concepts/engine/engine/third_party/infinigen/infinigen/terrain/core.py", line 120, in __init__
    self.elements, scene_infos = scene(seed, Path(on_the_fly_asset_folder), asset_path, device)
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/config.py", line 1605, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/config.py", line 1582, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/viscam/projects/concepts/engine/engine/third_party/infinigen/infinigen/terrain/scene.py", line 87, in scene
    elements[ElementNames.FloatingIce] = FloatingIce(
  File "/viscam/projects/concepts/engine/engine/third_party/infinigen/infinigen/terrain/elements/landtiles.py", line 195, in __init__
    LandTiles.__init__(
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/config.py", line 1605, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/config.py", line 1582, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/viscam/projects/concepts/engine/engine/third_party/infinigen/infinigen/terrain/elements/landtiles.py", line 115, in __init__
    Element.__init__(self, "landtiles", material, transparency)
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/config.py", line 1605, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/config.py", line 1582, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/viscam/projects/concepts/engine/engine/third_party/infinigen/infinigen/terrain/elements/core.py", line 28, in __init__
    dll = load_cdll(f"terrain/lib/{self.device}/elements/{lib_name_X}.so")
  File "/viscam/projects/concepts/engine/engine/third_party/infinigen/infinigen/terrain/utils/ctype_util.py", line 29, in load_cdll
    return CDLL(root/path, mode=RTLD_LOCAL)
  File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /viscam/projects/concepts/engine/engine/third_party/infinigen/infinigen/terrain/lib/cpu/elements/landtiles_2.so: cannot open shared object file: No such file or directory
  In call to configurable 'Element' (<class 'infinigen.terrain.elements.core.Element'>)
  In call to configurable 'LandTiles' (<class 'infinigen.terrain.elements.landtiles.LandTiles'>)
  In call to configurable 'scene' (<function scene at 0x7f0bbdbe9750>)
  In call to configurable 'Terrain' (<class 'infinigen.terrain.core.Terrain'>)
  In call to configurable 'execute_tasks' (<function execute_tasks at 0x7f0bb2f7a170>)

mazeyu commented 3 months ago

Can both of you @larrrry1412 @zzyunzhi provide the full command and the full log? Thanks

larrrry1412 commented 3 months ago

Thanks for the quick reply. My command is:

python -m infinigen.datagen.manage_jobs --output_folder outputs/my_videos2 --num_scenes 500 --pipeline_config local_64GB monocular_video cuda_terrain opengl_gt --cleanup big_files --warmup_sec 60000 --config video high_quality_terrain

The log looks all right, but the coarse/fine stage doesn't generate anything. The coarse stage also sometimes fails. coarse.err file:

[14:50:01.022] [infinigen.core.placement.animation_policy] [INFO] | Failed attempt=6 out of max_full_retries=10 for obj.name='CameraRigs/0'
0%| | 0/191 [00:00<?, ?it/s]
0%| | 0/191 [00:08<?, ?it/s]
[14:50:09.654] [infinigen.core.placement.animation_policy] [INFO] | Failed attempt=7 out of max_full_retries=10 for obj.name='CameraRigs/0'
0%| | 0/191 [00:00<?, ?it/s]
0%| | 0/191 [00:27<?, ?it/s]
[14:50:37.314] [infinigen.core.placement.animation_policy] [INFO] | Failed attempt=8 out of max_full_retries=10 for obj.name='CameraRigs/0'
0%| | 0/191 [00:00<?, ?it/s]
0%| | 0/191 [00:05<?, ?it/s]
[14:50:43.011] [infinigen.core.placement.animation_policy] [INFO] | Failed attempt=9 out of max_full_retries=10 for obj.name='CameraRigs/0'
[14:50:43.011] [infinigen.times] [INFO] | [animate_cameras] failed with <class 'ValueError'>
[14:50:43.011] [infinigen.times] [INFO] | [MAIN TOTAL] failed with <class 'ValueError'>
Traceback (most recent call last):
  File "/home/meta/anaconda3/envs/inf2/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/meta/anaconda3/envs/inf2/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/meta/Downloads/infinigen/infinigen_examples/generate_nature.py", line 438, in <module>
    main(args)
  File "/home/meta/Downloads/infinigen/infinigen_examples/generate_nature.py", line 409, in main
    execute_tasks.main(
  File "/home/meta/Downloads/infinigen/infinigen/core/execute_tasks.py", line 418, in main
    execute_tasks(
  File "/home/meta/anaconda3/envs/inf2/lib/python3.10/site-packages/gin/config.py", line 1605, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/home/meta/anaconda3/envs/inf2/lib/python3.10/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "/home/meta/anaconda3/envs/inf2/lib/python3.10/site-packages/gin/config.py", line 1582, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/home/meta/Downloads/infinigen/infinigen/core/execute_tasks.py", line 328, in execute_tasks
    compose_scene_func(output_folder, scene_seed)
  File "/home/meta/anaconda3/envs/inf2/lib/python3.10/site-packages/gin/config.py", line 1605, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/home/meta/anaconda3/envs/inf2/lib/python3.10/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "/home/meta/anaconda3/envs/inf2/lib/python3.10/site-packages/gin/config.py", line 1582, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/home/meta/Downloads/infinigen/infinigen_examples/generate_nature.py", line 213, in compose_scene
    p.run_stage('animate_cameras', lambda: cam_util.animate_cameras(
  File "/home/meta/Downloads/infinigen/infinigen/core/util/pipeline.py", line 76, in run_stage
    ret = fn(*args, **kwargs)
  File "/home/meta/Downloads/infinigen/infinigen_examples/generate_nature.py", line 213, in <lambda>
    p.run_stage('animate_cameras', lambda: cam_util.animate_cameras(
  File "/home/meta/anaconda3/envs/inf2/lib/python3.10/site-packages/gin/config.py", line 1605, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/home/meta/anaconda3/envs/inf2/lib/python3.10/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "/home/meta/anaconda3/envs/inf2/lib/python3.10/site-packages/gin/config.py", line 1582, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/home/meta/Downloads/infinigen/infinigen/core/placement/camera.py", line 515, in animate_cameras
    animation_policy.animate_trajectory(
  File "/home/meta/anaconda3/envs/inf2/lib/python3.10/site-packages/gin/config.py", line 1605, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/home/meta/anaconda3/envs/inf2/lib/python3.10/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "/home/meta/anaconda3/envs/inf2/lib/python3.10/site-packages/gin/config.py", line 1582, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/home/meta/Downloads/infinigen/infinigen/core/placement/animation_policy.py", line 470, in animate_trajectory
    raise ValueError(err)
ValueError: Animation for obj.name='CameraRigs/0' failed with max_full_retries=10 and max_step_tries=25, quitting
  In call to configurable 'animate_trajectory' (<function animate_trajectory at 0x7f3aa5f5f5b0>)
  In call to configurable 'animate_cameras' (<function animate_cameras at 0x7f3aa5f47760>)
  In call to configurable 'compose_scene' (<function compose_scene at 0x7f3c22186200>)
  In call to configurable 'execute_tasks' (<function execute_tasks at 0x7f3a9cec84c0>)

fine stage

mazeyu commented 3 months ago

@larrrry1412 Your error seems to be a different one. It just means the camera selection failed. How frequently does it happen? I think our current pipeline does allow occasional failure.

larrrry1412 commented 3 months ago

@mazeyu Almost every time. And nothing is ever generated in the fine folder.

zzyunzhi commented 3 months ago

Hi, the command I used is python infinigen_examples/generate_nature.py -- --output_folder ${LOG_DIR}/coarse --task coarse --task_uniqname coarse -g video, but I've modified the source code, so I'm not sure whether this command would reproduce the issue. One thing I modified was changing all run_stage calls, e.g., https://github.com/princeton-vl/infinigen/blob/5132903cd68704367d1c44c841e5163158e0f33d/infinigen_examples/generate_nature.py#L79, into direct calls of the function, i.e., terrain, terrain_mesh = add_coarse_terrain(). Just noting it here in case the information is useful.

mazeyu commented 3 months ago

@larrrry1412 Were you able to run the hello world example with separate commands? (Maybe we should discuss this in a separate issue.)

mazeyu commented 3 months ago

@zzyunzhi Could you look at the code a bit and check how many times Terrain() is instantiated? It is supposed to be instantiated once, which causes no problem. If your change somehow causes it to be instantiated multiple times, please wait for our fix, which is coming very soon (it simply saves a copy of the instance).