zzyunzhi opened 8 months ago
I encountered the same error in Slurm jobs with CUDA terrain disabled. The full log is attached: 7349009_0_7349010_no_cuda_terrain.log
Could you provide the installation logs from pip install -vv -e . > logs.txt 2>&1? It's possible something didn't compile correctly, although it is strange that this would only show up in the render job.
@mazeyu, please take a look as well.
Thank you, Alex, for the prompt reply! The installation logs are attached: installation_logs.txt
Platform information:
- OS & OS Version: Linux
- GPU: A5000
- GPU Driver Version: CUDA 11.7
Hi, we cannot open these logs.txt files. Could you resend them?
Hi @mazeyu, please see all related logs here: https://drive.google.com/drive/folders/1_TSPAWKIsWuiEBJEh6Y4Qk1VWu8gMxcn?usp=sharing. Thanks!
I see. This is tricky. Terrain() gets called several times when multiple tasks run in a single command. We didn't test that case, and it caused this bug. We will fix it; in the meantime, you can try running the tasks separately, at least separating coarse, fine_terrain, and render. Separating tasks is also recommended for better resource usage.
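As a rough sketch of running the stages separately, the helper below builds one generate_nature.py invocation per task, so that Terrain() is constructed only once per process. The --output_folder, --task, --task_uniqname and -g flags are taken from the command quoted later in this thread; chaining stages with --input_folder and the per-stage folder names are assumptions, so check them against the Infinigen docs before relying on them.

```python
import shlex
from pathlib import Path

# Tasks run one per process, in order. The names match those used in this thread.
STAGES = ("coarse", "fine_terrain", "render")


def build_stage_commands(out_root, gin_configs=("video",), stages=STAGES):
    """Return one argv list per task; each task after the first reads the previous output.

    Assumption: generate_nature.py accepts --input_folder to chain stages.
    """
    out_root = Path(out_root)
    commands = []
    prev_folder = None
    for stage in stages:
        stage_folder = out_root / stage
        argv = ["python", "infinigen_examples/generate_nature.py", "--",
                "--output_folder", str(stage_folder),
                "--task", stage,
                "--task_uniqname", stage]
        if prev_folder is not None:
            argv += ["--input_folder", str(prev_folder)]
        argv += ["-g", *gin_configs]
        commands.append(argv)
        prev_folder = stage_folder
    return commands


if __name__ == "__main__":
    for argv in build_stage_commands("outputs/dev"):
        # Printed rather than executed: once the flags are confirmed, pass each argv to
        # subprocess.run(argv, check=True) or submit it as a separate Slurm job.
        print(shlex.join(argv))
```

Submitting each printed command as its own Slurm job also lets each stage request only the resources it needs.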
Hi, I ran into this problem too, but in my case it fails in the fine_terrain task. How can I solve it? My command is "python -m infinigen.datagen.manage_jobs --output_folder outputs/dev2 --num_scenes 1 --pipeline_config local_64GB monocular cuda_terrain --cleanup big_files --warmup_sec 1200 --configs dev --overwrite"
Hi @mazeyu, thanks for the reply. I encountered the same error when running the tasks separately as Slurm jobs. Below is a truncated log.err containing the relevant error information:
terrain = Terrain(scene_seed, surface.registry, task='coarse', on_the_fly_asset_folder=output_folder / "assets")
File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/config.py", line 1605, in gin_wrapper
utils.augment_exception_message_and_reraise(e, err_str)
File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
raise proxy.with_traceback(exception.__traceback__) from None
File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/config.py", line 1582, in gin_wrapper
return fn(*new_args, **new_kwargs)
File "/viscam/projects/concepts/engine/engine/third_party/infinigen/infinigen/terrain/core.py", line 120, in __init__
self.elements, scene_infos = scene(seed, Path(on_the_fly_asset_folder), asset_path, device)
File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/config.py", line 1605, in gin_wrapper
utils.augment_exception_message_and_reraise(e, err_str)
File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
raise proxy.with_traceback(exception.__traceback__) from None
File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/config.py", line 1582, in gin_wrapper
return fn(*new_args, **new_kwargs)
File "/viscam/projects/concepts/engine/engine/third_party/infinigen/infinigen/terrain/scene.py", line 87, in scene
elements[ElementNames.FloatingIce] = FloatingIce(
File "/viscam/projects/concepts/engine/engine/third_party/infinigen/infinigen/terrain/elements/landtiles.py", line 195, in __init__
LandTiles.__init__(
File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/config.py", line 1605, in gin_wrapper
utils.augment_exception_message_and_reraise(e, err_str)
File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
raise proxy.with_traceback(exception.__traceback__) from None
File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/config.py", line 1582, in gin_wrapper
return fn(*new_args, **new_kwargs)
File "/viscam/projects/concepts/engine/engine/third_party/infinigen/infinigen/terrain/elements/landtiles.py", line 115, in __init__
Element.__init__(self, "landtiles", material, transparency)
File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/config.py", line 1605, in gin_wrapper
utils.augment_exception_message_and_reraise(e, err_str)
File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
raise proxy.with_traceback(exception.__traceback__) from None
File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/site-packages/gin/config.py", line 1582, in gin_wrapper
return fn(*new_args, **new_kwargs)
File "/viscam/projects/concepts/engine/engine/third_party/infinigen/infinigen/terrain/elements/core.py", line 28, in __init__
dll = load_cdll(f"terrain/lib/{self.device}/elements/{lib_name_X}.so")
File "/viscam/projects/concepts/engine/engine/third_party/infinigen/infinigen/terrain/utils/ctype_util.py", line 29, in load_cdll
return CDLL(root/path, mode=RTLD_LOCAL)
File "/viscam/u/yzzhang/miniconda3/envs/infinigen/lib/python3.10/ctypes/__init__.py", line 374, in __init__
self._handle = _dlopen(self._name, mode)
OSError: /viscam/projects/concepts/engine/engine/third_party/infinigen/infinigen/terrain/lib/cpu/elements/landtiles_2.so: cannot open shared object file: No such file or directory
In call to configurable 'Element' (<class 'infinigen.terrain.elements.core.Element'>)
In call to configurable 'LandTiles' (<class 'infinigen.terrain.elements.landtiles.LandTiles'>)
In call to configurable 'scene' (<function scene at 0x7f0bbdbe9750>)
In call to configurable 'Terrain' (<class 'infinigen.terrain.core.Terrain'>)
In call to configurable 'execute_tasks' (<function execute_tasks at 0x7f0bb2f7a170>)
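The OSError above means the compiled element library (here terrain/lib/cpu/elements/landtiles_2.so) was not found when the job ran. A quick check like the one below, run from inside the same Slurm job on the compute node, can show which libraries are missing or fail to load. The directory layout mirrors the path in the traceback (terrain/lib/{device}/elements/*.so); the default library name, the sibling cuda directory, and the command-line handling are assumptions for illustration.

```python
import ctypes
import sys
from pathlib import Path


def check_terrain_libs(infinigen_root, device="cpu", names=("landtiles_2",)):
    """Report which compiled terrain element libraries are missing or fail to load.

    Layout assumed from the traceback: <root>/terrain/lib/<device>/elements/<name>.so
    """
    elements_dir = Path(infinigen_root) / "terrain" / "lib" / device / "elements"
    report = {}
    for name in names:
        so_path = elements_dir / f"{name}.so"
        if not so_path.exists():
            report[name] = "missing"
            continue
        try:
            # Same loading mode as in the traceback (RTLD_LOCAL).
            ctypes.CDLL(str(so_path), mode=ctypes.RTLD_LOCAL)
            report[name] = "ok"
        except OSError as exc:
            # The file exists but could not be loaded, e.g. an unresolved dependency.
            report[name] = f"load error: {exc}"
    return report


if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "infinigen"
    for device in ("cpu", "cuda"):  # "cuda" directory name is an assumption
        elements_dir = Path(root) / "terrain" / "lib" / device / "elements"
        present = sorted(p.stem for p in elements_dir.glob("*.so")) if elements_dir.is_dir() else []
        print(device, "present:", present or "none")
        print(device, check_terrain_libs(root, device))
```

If the libraries are present on the login node but reported missing here, the job is probably seeing a different checkout or filesystem than the interactive session.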
Could both of you, @larrrry1412 and @zzyunzhi, provide the full command and the full log? Thanks
Thanks for the quick reply. My command is "python -m infinigen.datagen.manage_jobs --output_folder outputs/my_videos2 --num_scenes 500 --pipeline_config local_64GB monocular_video cuda_terrain opengl_gt --cleanup big_files --warmup_sec 60000 --config video high_quality_terrain". The log looks fine, but the coarse and fine stages don't generate anything, and the coarse stage sometimes crashes as well. coarse.err file:
[14:50:01.022] [infinigen.core.placement.animation_policy] [INFO] | Failed attempt=6 out of max_full_retries=10 for obj.name='CameraRigs/0'
0%| | 0/191 [00:00<?, ?it/s]
0%| | 0/191 [00:08<?, ?it/s]
[14:50:09.654] [infinigen.core.placement.animation_policy] [INFO] | Failed attempt=7 out of max_full_retries=10 for obj.name='CameraRigs/0'
0%| | 0/191 [00:00<?, ?it/s]
0%| | 0/191 [00:27<?, ?it/s]
[14:50:37.314] [infinigen.core.placement.animation_policy] [INFO] | Failed attempt=8 out of max_full_retries=10 for obj.name='CameraRigs/0'
0%| | 0/191 [00:00<?, ?it/s]
0%| | 0/191 [00:05<?, ?it/s]
[14:50:43.011] [infinigen.core.placement.animation_policy] [INFO] | Failed attempt=9 out of max_full_retries=10 for obj.name='CameraRigs/0'
[14:50:43.011] [infinigen.times] [INFO] | [animate_cameras] failed with <class 'ValueError'>
[14:50:43.011] [infinigen.times] [INFO] | [MAIN TOTAL] failed with <class 'ValueError'>
Traceback (most recent call last):
File "/home/meta/anaconda3/envs/inf2/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/meta/anaconda3/envs/inf2/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/meta/Downloads/infinigen/infinigen_examples/generate_nature.py", line 438, in <module>
fine stage
@larrrry1412 Your error seems to be a different one. It just means that camera selection failed. How frequently does it happen? Our current pipeline does allow occasional failures.
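To put a number on how often camera selection fails, one option is to scan the per-scene coarse.err files for the failure line shown in the log above ("[animate_cameras] failed with <class 'ValueError'>"). The message format is copied from that log; the assumption that the logs sit somewhere under the output folder as files named coarse.err, and the default folder name, are illustrative.

```python
import re
import sys
from pathlib import Path

# Message format copied from the coarse.err excerpt in this thread.
FAILED_RE = re.compile(r"\[animate_cameras\] failed with <class '(\w+)'>")


def count_camera_failures(output_folder, log_name="coarse.err"):
    """Return (number of logs scanned, number whose animate_cameras step failed)."""
    root = Path(output_folder)
    if not root.is_dir():
        return 0, 0
    logs = list(root.rglob(log_name))
    failed = sum(1 for log in logs if FAILED_RE.search(log.read_text(errors="replace")))
    return len(logs), failed


if __name__ == "__main__":
    folder = sys.argv[1] if len(sys.argv) > 1 else "outputs/my_videos2"
    total, failed = count_camera_failures(folder)
    rate = failed / total if total else 0.0
    print(f"{failed}/{total} coarse logs failed in animate_cameras ({rate:.0%})")
```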
@mazeyu Almost every time, and the fine folder never contains any output.
Hi, the command I used is python infinigen_examples/generate_nature.py -- --output_folder ${LOG_DIR}/coarse --task coarse --task_uniqname coarse -g video, but I've modified the source code, so I'm not sure whether this command would reproduce the issue. One change I made was replacing every run_stage call (e.g., https://github.com/princeton-vl/infinigen/blob/5132903cd68704367d1c44c841e5163158e0f33d/infinigen_examples/generate_nature.py#L79) with a direct call of the function, i.e., terrain, terrain_mesh = add_coarse_terrain(). Just noting it here in case the information is useful.
@larrrry1412, were you able to run the Hello World example with separate commands? (Maybe we should discuss this in a separate issue.)
@zzyunzhi, could you look at the code and check how many times the Terrain() class is instantiated? It is supposed to be called only once, in which case there is no problem. If your change causes it to be called multiple times, please wait for our fix (which simply saves a copy of the instance); it is coming very soon.
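One way to both count how many times the terrain is constructed and apply the "save a copy of the instance" idea is a small counter plus a cache keyed on the scene seed. The sketch below uses a stand-in class (FakeTerrain) and a hypothetical helper get_terrain(); neither is part of Infinigen. In practice you would construct the real Terrain(...) inside the helper and route every call site through it.

```python
from collections import Counter

# How many times a terrain was actually constructed, per scene seed.
construction_counts = Counter()
_terrain_cache = {}


class FakeTerrain:
    """Stand-in for infinigen.terrain.core.Terrain, used only for this sketch."""

    def __init__(self, seed):
        construction_counts[seed] += 1
        self.seed = seed


def get_terrain(seed, factory=FakeTerrain):
    """Hypothetical helper: build the terrain once per seed, then reuse that instance."""
    if seed not in _terrain_cache:
        _terrain_cache[seed] = factory(seed)
    return _terrain_cache[seed]


if __name__ == "__main__":
    first = get_terrain(0)   # e.g. from the coarse task
    second = get_terrain(0)  # e.g. from a later task in the same process: reused
    print(first is second, construction_counts[0])
```

Running the main block prints True 1. If construction_counts shows more than one construction for a seed without the cache, that matches the multiple-call situation described above.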
I have also encountered the same problem. How can I solve it? Thank you.
Describe the bug
In the rendering stage (task = render), when CUDA terrain is enabled and the task is executed as a Slurm job, I encounter the following error:
The same script runs successfully in a Slurm interactive session.
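Because the same script works in an interactive Slurm session but fails as a batch job, comparing the two environments is a cheap first step. The sketch below saves a snapshot of a few environment variables to JSON and diffs two snapshots. The list of watched variables is a guess at what commonly differs, not an exhaustive or authoritative list, and the script name in the usage comment is hypothetical.

```python
import json
import os
import sys

# Variables that often differ between interactive and batch Slurm sessions.
# This list is a guess; extend it with anything your cluster sets.
WATCHED = ["CUDA_VISIBLE_DEVICES", "LD_LIBRARY_PATH", "PATH", "CONDA_PREFIX",
           "SLURM_JOB_ID", "SLURM_GPUS_ON_NODE", "HOME"]


def snapshot(env=None):
    """Return {name: value or None} for every watched variable."""
    env = os.environ if env is None else env
    return {name: env.get(name) for name in WATCHED}


def diff(a, b):
    """Return {name: (value_in_a, value_in_b)} for every variable that differs."""
    return {k: (a.get(k), b.get(k)) for k in sorted(set(a) | set(b)) if a.get(k) != b.get(k)}


if __name__ == "__main__" and len(sys.argv) > 2:
    # Usage (hypothetical script name env_snap.py):
    #   python env_snap.py save interactive.json   (in the interactive session)
    #   python env_snap.py save batch.json         (inside the sbatch script)
    #   python env_snap.py diff interactive.json batch.json
    if sys.argv[1] == "save":
        with open(sys.argv[2], "w") as f:
            json.dump(snapshot(), f, indent=2)
    elif sys.argv[1] == "diff" and len(sys.argv) > 3:
        with open(sys.argv[2]) as f1, open(sys.argv[3]) as f2:
            for name, (x, y) in diff(json.load(f1), json.load(f2)).items():
                print(f"{name}:\n  {sys.argv[2]}: {x}\n  {sys.argv[3]}: {y}")
```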
What version of the code were you using?
commit 5132903cd68704367d1c44c841e5163158e0f33d (HEAD -> main, origin/main, origin/HEAD)
What are your FULL output logs?
7348694_0_7348695_default.log
Platform