princeton-vl / infinigen

Infinite Photorealistic Worlds using Procedural Generation
https://infinigen.org
BSD 3-Clause "New" or "Revised" License

Performance metrics #173

Open RodenLuo opened 7 months ago

RodenLuo commented 7 months ago

Hi,

I have been running the following command on a decent workstation for over two days, and it has not finished yet. Are there any performance metrics I can refer to? Many thanks!

python -m tools.manage_datagen_jobs --output_folder outputs/my_videos \
--num_scenes 2 --pipeline_config monocular_video cuda_terrain opengl_gt local_128GB \
--cleanup big_files --warmup_sec 60000 --configs under_water.gin high_quality_terrain.gin \
-p compose_scene.fish_school_chance=1.0 -p compose_scene.corals_chance=1.0 \
--pipeline_overrides LocalScheduleHandler.use_gpu=True

Ubuntu 22.04.3 LTS; 125 GB RAM; 48 processors (GenuineIntel, Intel(R) Xeon(R) CPU E5-2687W v4 @ 3.00GHz); NVIDIA GeForce RTX 3090

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                               
26491 luod      39  19   23.7g  13.0g 144756 R 811.5  10.3  28648:27 /home/luod/infinigen/blender/blender --background -y -noaudio --python generate.py --threads 8 -- --input_folder /home/luod/infinigen/worldgen/outpu+ 
$ ls
46f017a5  608f00b5  crashed_seeds.txt  crash_summaries.txt  datagen_command.sh  index.html  scenes_db.csv
$ cat crashed_seeds.txt 
608f00b5
$ cat crash_summaries.txt 
11/20 10:42PM outputs/my_videos/608f00b5/logs/coarse.err reason='Error: Python: Traceback (most recent call last):,Error: Could not find 1 camera views' node=None fatal=True
$ cat scenes_db.csv 
,all_done,seed,configs,num_running,num_done,coarse_job_obj,coarse_output_folder,coarse_submitted,populate_job_obj,populate_output_folder,populate_submitted,coarse_crash_recorded,any_fatal_crash
0,notdone,46f017a5,['mountain'],1,1,"LocalJob(job_id=96834731386, process=None, finalized=True)",/home/luod/infinigen/worldgen/outputs/my_videos/46f017a5/coarse,1,"LocalJob(job_id=31478256278, process=<Process name='my_videos_46f017a5_populate' pid=26490 parent=18447 started>, finalized=False)",/home/luod/infinigen/worldgen/outputs/my_videos/46f017a5/coarse,1.0,,
1,crashed,608f00b5,['cave'],0,1,"LocalJob(job_id=58623475689, process=None, finalized=True)",/home/luod/infinigen/worldgen/outputs/my_videos/608f00b5/coarse,1,,,,True,True
araistrick commented 7 months ago

Hello, you may want to check the outputs/my_videos/SEED/logs/ folders to see what's running; it's possible something has gotten stuck in some fashion or crashed improperly. You can use the status reports printed by manage_datagen_jobs to figure out which step to look at.

It seems like one of the two scenes you ran crashed during camera selection. We use this kind of camera-selection crash to give up on seeds for which we can't find a camera trajectory; however, it's intended that once the scene crashes another will take its place, so I'd recommend running --num_scenes 10 or greater.

We don't have any benchmarks or expected runtimes published, as the code is still in constant flux. We are working on releasing various performance improvements, so I expect the runtime will improve a lot in future versions. For now I would certainly recommend running more seeds, since we have mostly optimized for throughput rather than the latency of any particular seed finishing.
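Concretely, something like this (paths follow from the command in your first post; swap in whichever seed you want to inspect):

$ cat outputs/my_videos/crash_summaries.txt           # one-line reason per crashed seed
$ tail -n 50 outputs/my_videos/608f00b5/logs/*.err    # per-stage logs for one seed

and then re-run your original command with --num_scenes 10 (or greater) in place of --num_scenes 2, so crashed seeds get replaced.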

RodenLuo commented 7 months ago

Thanks, Alex!

> once the scene crashes another will take its place

So, if I set one scene and it crashes, will the whole job scheduler stop, or will it automatically start another scene to make up for it? Right now, it seems to be the former. If possible, I would recommend the latter, plus an rm -rf of the crashed scene's folder.

From the status report below, I guess my job is at rendershort? Is there any doc explaining the meaning of each line starting with control_state? The rest are easy to guess from their names.

outputs/my_videos 11/20 07:49PM -> 11/25 11:08AM

============================================================
control_state/curr_concurrent_max : 8
control_state/disk_usage       : 0.16
control_state/n_in_flight      : 2
control_state/try_to_launch    : 6
control_state/will_launch      : 0
crashed/coarse                 : 1
crashed/rendershort            : 1
crashed/total                  : 2
queued/renderbackup            : 1
queued/total                   : 1
running/rendershort            : 1
running/total                  : 1
succeeded/coarse               : 1
succeeded/fineterrain          : 5
succeeded/opengl               : 2
succeeded/populate             : 1
succeeded/rendershort          : 2
succeeded/savemesh             : 2
succeeded/total                : 13
------------------------------------------------------------
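My guess at some of these, from watching the numbers: n_in_flight looks like queued + running jobs (1 + 1 = 2 above), curr_concurrent_max like the cap on concurrently running jobs, and disk_usage like the used fraction of the disk holding the output folder. If that last guess is right, it should roughly track:

$ df --output=pcent outputs/my_videos | tail -n 1    # e.g. " 16%" would match disk_usage 0.16

but that is just my reading of the report, not the actual implementation.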
RodenLuo commented 7 months ago

So, the previously mentioned command, which should generate two scenes, failed on one and kept running on the other; it now gives me the following status report.

outputs/my_videos 11/20 07:49PM -> 11/26 03:19PM
============================================================
control_state/curr_concurrent_max : 8
control_state/disk_usage       : 1.0
control_state/n_in_flight      : 0
control_state/try_to_launch    : 8
control_state/will_launch      : 1
crashed/coarse                 : 1
crashed/rendershort            : 3
crashed/total                  : 4
succeeded/coarse               : 1
succeeded/fineterrain          : 21
succeeded/opengl               : 17
succeeded/populate             : 1
succeeded/renderbackup         : 3
succeeded/rendershort          : 15
succeeded/savemesh             : 18
succeeded/total                : 76
------------------------------------------------------------
outputs/my_videos is full (100.0%). Sleeping.

I then checked the disk usage and found

$ du -sh *
785G    46f017a5
296K    608f00b5 # this is the failed scene's seed
4.0K    crashed_seeds.txt
4.0K    crash_summaries.txt
4.0K    datagen_command.sh
4.0K    index.html
20K     scenes_db.csv

$ cd 46f017a5/
46f017a5$ du -sh *
6.9G    coarse
542M    fine
509M    fine_0_0_0001_0
508M    fine_0_0_0009_0
507M    fine_0_0_0017_0
507M    fine_0_0_0025_0
506M    fine_0_0_0033_0
506M    fine_0_0_0041_0
508M    fine_0_0_0049_0
508M    fine_0_0_0057_0
507M    fine_0_0_0065_0
507M    fine_0_0_0073_0
510M    fine_0_0_0081_0
514M    fine_0_0_0089_0
519M    fine_0_0_0097_0
524M    fine_0_0_0105_0
529M    fine_0_0_0113_0
532M    fine_0_0_0121_0
534M    fine_0_0_0129_0
535M    fine_0_0_0137_0
535M    fine_0_0_0145_0
537M    fine_0_0_0153_0
28G     frames
542M    logs
44K     run_pipeline.sh
42G     savemesh_0_0_0001_0
42G     savemesh_0_0_0009_0
42G     savemesh_0_0_0017_0
42G     savemesh_0_0_0025_0
41G     savemesh_0_0_0033_0
41G     savemesh_0_0_0041_0
42G     savemesh_0_0_0049_0
42G     savemesh_0_0_0057_0
41G     savemesh_0_0_0065_0
41G     savemesh_0_0_0073_0
42G     savemesh_0_0_0081_0
42G     savemesh_0_0_0089_0
42G     savemesh_0_0_0097_0
42G     savemesh_0_0_0105_0
42G     savemesh_0_0_0113_0
42G     savemesh_0_0_0121_0
42G     savemesh_0_0_0129_0
42G     savemesh_0_0_0137_0
72K     tmp

What's happening here, and why would one scene generate 0.7 TB of data? And what would be left for the job scheduler to finish if my disk were larger? Thanks!
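In case it helps anyone hitting the same wall: the savemesh_0_0_* ground-truth dumps dominate the usage above (18 folders at ~42G each is roughly 0.74 TB of the 785G total). If those meshes turn out not to be needed by the remaining jobs, which I am not sure about, clearing them would free most of the space:

$ rm -rf outputs/my_videos/46f017a5/savemesh_0_0_*    # destructive; assumes the meshes are truly no longer needed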