Closed rakhimovv closed 2 months ago
Hello @rakhimovv , how does it compare against the numbers we report for the same encoding/decoding parameters?
If I understood the question correctly: for libsvtav1,yuv420p,g=2,crf=30,fast_decode=0 I get a ratio of 4.43.
I also tried to reproduce your results by running:
python benchmarks/video/run_video_benchmark.py \
--output-dir outputs/video_benchmark \
--repo-ids \
lerobot/pusht_image \
--vcodec libsvtav1 \
--pix-fmt yuv420p \
--g 2 \
--crf 30 \
--timestamps-modes 1_frame 2_frames 6_frames \
--backends pyav \
--num-samples 50 \
--num-workers 5 \
--save-frames 1
The result is as follows:
Maybe this is a hardware problem; it's not clear.
@rakhimovv Sorry, I can't access the image: "This private-user-images.githubusercontent.com page can’t be found"
On the same dataset (pusht_image for instance), what are the results we report vs. your results for libsvtav1,yuv420p,g=2,crf=30,fast_decode=0?
Thanks!
Apologies, it's not clear why it is not visible.
Here is the full output:
repo_id,resolution,num_pixels,vcodec,pix_fmt,g,crf,timestamps_mode,backend,video_size_bytes,images_size_bytes,video_images_size_ratio,avg_load_time_video_ms,avg_load_time_images_ms,video_images_load_time_ratio,avg_mse,avg_psnr,avg_ssim
lerobot/pusht_image,96 x 96,9216,libsvtav1,yuv420p,2,,1_frame,pyav,30479,186411,0.1635042996389698,106.26960419118404,10.823768712580204,9.818170270736612,0.0001853929452761,37.539657364505366,0.9893503785133362
lerobot/pusht_image,96 x 96,9216,libsvtav1,yuv420p,2,,2_frames,pyav,30479,186411,0.1635042996389698,56.85018114745617,6.004360169172287,9.468149735476825,0.0001750499522579,37.7474555727703,0.989688277244568
lerobot/pusht_image,96 x 96,9216,libsvtav1,yuv420p,2,,6_frames,pyav,30479,186411,0.1635042996389698,16.059761804838978,5.140606140096983,3.124098864445583,0.0001833448846958,37.55468415844686,0.98959481716156
lerobot/pusht_image,96 x 96,9216,libsvtav1,yuv420p,2,30.0,1_frame,pyav,34913,186411,0.1872904495979314,104.07841052860022,7.879405990242958,13.208915831660429,0.0001555320585905,38.18878129792595,0.9902451038360596
lerobot/pusht_image,96 x 96,9216,libsvtav1,yuv420p,2,30.0,2_frames,pyav,34913,186411,0.1872904495979314,49.16207356378436,7.039123009890318,6.984119114655221,0.0001540893986671,38.239272464438216,0.9900858402252196
lerobot/pusht_image,96 x 96,9216,libsvtav1,yuv420p,2,30.0,6_frames,pyav,34913,186411,0.1872904495979314,17.191315609961748,5.277618449181318,3.2574002413206893,0.0001560885518263,38.17960813947649,0.9904525876045228
In particular, for the one-frame setting, the video_images_load_time_ratio is 14.24 (pusht_image).
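For readers following along, here is a minimal sketch of how the video_images_load_time_ratio column relates to the two load-time columns. It assumes the ratio is simply avg_load_time_video_ms divided by avg_load_time_images_ms, which is consistent with the rows pasted above; the helper name load_time_ratio is made up for illustration.

```python
# Values copied from the 1_frame rows of the CSV above (pusht_image, pyav backend).
rows = {
    "crf unset": {"video_ms": 106.26960419118404, "images_ms": 10.823768712580204},
    "crf=30": {"video_ms": 104.07841052860022, "images_ms": 7.879405990242958},
}


def load_time_ratio(video_ms: float, images_ms: float) -> float:
    """Hypothetical helper: a ratio above 1.0 means video loading is slower."""
    return video_ms / images_ms


ratios = {name: load_time_ratio(**row) for name, row in rows.items()}
for name, ratio in ratios.items():
    print(f"{name}: video loading takes {ratio:.2f}x as long as image loading")
```

Both quotients land on the values in the video_images_load_time_ratio column of those two rows (about 9.82 and 13.21).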
Hi, could you paste here basic info about your hardware (CPU model, RAM) as well as your ffmpeg version? I'll add them to this Google Sheet (Different hardware tab). From what I'm seeing, your figures would indicate a hardware issue.
Keep in mind that the smaller the image size (resolution), the higher that ratio is going to get while still being okay when you look at the absolute values. For higher resolutions though, the gains should be significantly better. Pusht is particular in the sense that it has very small resolution (96x96). I'd suggest looking at higher resolutions as well.
Also, note that the primary reason we started doing video encoding is not loading time, although that is important; it's to reduce the size of our datasets, which can get pretty big.
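To put a number on the size argument, the short sketch below computes the storage saving from the byte counts reported for pusht_image earlier in this thread. It only restates those figures; nothing here comes from the benchmark script itself.

```python
# Byte counts copied from the pusht_image rows of the CSV earlier in this thread.
video_size_bytes = 34913     # libsvtav1, yuv420p, g=2, crf=30
images_size_bytes = 186411   # the same frames stored as individual images

size_ratio = video_size_bytes / images_size_bytes
saving = 1.0 - size_ratio

print(f"video/images size ratio: {size_ratio:.4f}")
print(f"storage saved by video encoding: {saving:.1%}")
```

The quotient matches the video_images_size_ratio column for the crf=30 rows, so even at 96 x 96 the video takes well under a fifth of the space of the images.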
Hi, lscpu gives the following:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 96
On-line CPU(s) list: 0-95
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz
CPU family: 6
Model: 85
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 2
Stepping: 7
CPU max MHz: 4000.0000
CPU min MHz: 1200.0000
BogoMIPS: 6000.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke avx512_vnni md_clear flush_l1d arch_capabilities
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 1.5 MiB (48 instances)
L1i: 1.5 MiB (48 instances)
L2: 48 MiB (48 instances)
L3: 71.5 MiB (2 instances)
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0-23,48-71
NUMA node1 CPU(s): 24-47,72-95
Vulnerabilities:
Gather data sampling: Mitigation; Microcode
Itlb multihit: KVM: Mitigation: Split huge pages
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable
Retbleed: Mitigation; Enhanced IBRS
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Srbds: Not affected
Tsx async abort: Mitigation; TSX disabled
ffmpeg info:
ffmpeg version 6.1.1 Copyright (c) 2000-2023 the FFmpeg developers
built with gcc 12.3.0 (conda-forge gcc 12.3.0-5)
libavutil 58. 29.100 / 58. 29.100
libavcodec 60. 31.102 / 60. 31.102
libavformat 60. 16.100 / 60. 16.100
libavdevice 60. 3.100 / 60. 3.100
libavfilter 9. 12.100 / 9. 12.100
libswscale 7. 5.100 / 7. 5.100
libswresample 4. 12.100 / 4. 12.100
libpostproc 57. 3.100 / 57. 3.100
hwinfo --memory gives the following:
01: None 00.0: 10102 Main Memory
[Created at memory.74]
Unique ID: rdCR.CxwsZFjVASF
Hardware Class: memory
Model: "Main Memory"
Memory Range: 0x00000000-0xbc99d78fff (rw)
Memory Size: 768 GB
Config Status: cfg=new, avail=yes, need=no, active=unknown
Regarding the bigger image size, indeed the situation is better:
repo_id,resolution,num_pixels,vcodec,pix_fmt,g,crf,timestamps_mode,backend,video_size_bytes,images_size_bytes,video_images_size_ratio,avg_load_time_video_ms,avg_load_time_images_ms,video_images_load_time_ratio,avg_mse,avg_psnr,avg_ssim
custom_data,480 x 640,307200,libsvtav1,yuv420p,2,,1_frame,pyav,1464198,82449519,0.017758720945358,61.17562886327505,524.97538395226,0.1165304712055569,0.0002039659873175,36.91625817381973,0.909278929233551
custom_data,480 x 640,307200,libsvtav1,yuv420p,2,,2_frames,pyav,1464198,82449519,0.017758720945358,35.76316695660353,28.92831712961197,1.2362684907099273,0.0002014465445714,36.96356132919017,0.9099832773208618
custom_data,480 x 640,307200,libsvtav1,yuv420p,2,,6_frames,pyav,1464198,82449519,0.017758720945358,12.109326478093864,20.366764316956203,0.5945630974878189,0.0002011011616727,36.9718784283328,0.9096388816833496
custom_data,480 x 640,307200,libsvtav1,yuv420p,2,30.0,1_frame,pyav,2117661,82449519,0.0256843341924165,57.204018384218216,29.468527175486088,1.9411902754272832,0.0001782149818392,37.504091172213954,0.9204418659210204
custom_data,480 x 640,307200,libsvtav1,yuv420p,2,30.0,2_frames,pyav,2117661,82449519,0.0256843341924165,37.78688319027424,25.17366273328662,1.5010482817150568,0.000175699656707,37.56065305206734,0.9211216568946838
custom_data,480 x 640,307200,libsvtav1,yuv420p,2,30.0,6_frames,pyav,2117661,82449519,0.0256843341924165,14.621594349543251,20.854213082542017,0.7011338328456821,0.0001753976071578,37.56902574530079,0.9208369851112366
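As a side-by-side illustration of the resolution effect, the sketch below compares the 1_frame, crf=30 rows of the two CSV outputs in this thread, looking at both the ratio and the absolute gap in milliseconds. The field names video_ms, images_ms and extra_ms are shorthand chosen for this example.

```python
# 1_frame, crf=30 rows copied from the two CSV outputs in this thread.
runs = [
    {"dataset": "lerobot/pusht_image", "num_pixels": 9216,
     "video_ms": 104.07841052860022, "images_ms": 7.879405990242958},
    {"dataset": "custom_data", "num_pixels": 307200,
     "video_ms": 57.204018384218216, "images_ms": 29.468527175486088},
]

for run in runs:
    # Relative cost (the benchmark's ratio column) and absolute cost in ms.
    run["ratio"] = run["video_ms"] / run["images_ms"]
    run["extra_ms"] = run["video_ms"] - run["images_ms"]
    print(f'{run["dataset"]} ({run["num_pixels"]} px): '
          f'ratio={run["ratio"]:.2f}, video is {run["extra_ms"]:.1f} ms slower on average')
```

On these two rows the ratio drops from roughly 13 at 96 x 96 to roughly 2 at 480 x 640, mostly because image loading gets more expensive as the frame grows.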
I guess storing as video has one more bottleneck when there are several image keys (video_frame_keys), as the function load_from_videos contains a for loop (https://github.com/huggingface/lerobot/blob/main/lerobot/common/datasets/video_utils.py#L45)
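One way to picture the concern is the sketch below, which contrasts a per-key loop with submitting one decode task per key to a thread pool. This is not lerobot's implementation: decode_sequential, decode_parallel, fake_decode and the camera key names are all hypothetical, and threads would only help if the real decoder releases the GIL while decoding.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence

# Hypothetical stand-in for a per-key decoder: (video_key, timestamps) -> frames.
DecodeFn = Callable[[str, Sequence[float]], List[str]]


def decode_sequential(keys: Sequence[str], timestamps: Sequence[float],
                      decode: DecodeFn) -> Dict[str, List[str]]:
    """Decode one key after another, like a plain for loop."""
    return {key: decode(key, timestamps) for key in keys}


def decode_parallel(keys: Sequence[str], timestamps: Sequence[float],
                    decode: DecodeFn, max_workers: int = 4) -> Dict[str, List[str]]:
    """Submit one decode task per key and collect the results by key."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {key: pool.submit(decode, key, timestamps) for key in keys}
        return {key: future.result() for key, future in futures.items()}


def fake_decode(key: str, timestamps: Sequence[float]) -> List[str]:
    # Placeholder: a real decoder would return image tensors, not strings.
    return [f"{key}@{ts:.2f}" for ts in timestamps]


keys = ["observation.images.cam_left", "observation.images.cam_right"]  # made-up keys
timestamps = [0.0, 0.1]
result = decode_parallel(keys, timestamps, fake_decode)
assert result == decode_sequential(keys, timestamps, fake_decode)
```

Whether this pays off in practice would need measuring, since data loader workers already run in parallel and extra threads can oversubscribe the CPU.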
Thank you for sharing!
These results on higher resolution look much more on par with what I got in terms of size and load time ratios. However, SSIM doesn't look that great (MSE is okay though). I'm curious about the content of your custom_data. Any chance you could share it? If not, what kind is it? Is it simulation rendering or real world? Rather dynamic or static?
I guess storing as video has one more bottleneck when there are several image keys (video_frame_keys), as the function load_from_videos contains a for loop
Yes, we will probably improve that part when we update the decoder.
@aliberts Sure, I have uploaded the dataset https://huggingface.co/datasets/rusrakhimov/put_grapes_into_a_bowl
It contains several episodes, stored in the images format.
Thanks, I reproduced your results:
What "reassures" me is that it's also bad (even worse, in fact) with the other codecs (x264, x265). I found your images quite dark (i.e. brightness is low), and after a bit of digging it seems that video encoding quality can be quite impacted by this factor. Nothing to worry too much about here, as >90% SSIM is still considered good quality and MSE and PSNR seem to agree. But even taking video encoding/decoding out of the equation, it's good practice to have a well-lit setup to make the training of your policy easier.
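The remark that MSE and PSNR agree can be checked with the standard PSNR formula, 10 * log10(MAX^2 / MSE). The sketch below assumes pixel values are normalized to [0, 1] (so MAX = 1.0), which is an assumption suggested by the magnitude of the reported MSE rather than something stated in the thread; psnr_from_mse is a helper written for this example.

```python
import math


def psnr_from_mse(mse: float, max_value: float = 1.0) -> float:
    """PSNR in dB for a given mean squared error and peak pixel value."""
    return 10.0 * math.log10(max_value ** 2 / mse)


# avg_mse and avg_psnr copied from the first custom_data row (480 x 640, 1_frame).
avg_mse = 0.0002039659873175
reported_avg_psnr = 36.91625817381973

psnr = psnr_from_mse(avg_mse)
print(f"PSNR from avg_mse: {psnr:.2f} dB (reported avg_psnr: {reported_avg_psnr:.2f} dB)")
```

The two numbers differ slightly, presumably because the benchmark averages per-frame PSNR values, and the mean of per-frame PSNRs is not the same as the PSNR of the mean MSE.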
Here's one setup we use, they are cheap and you can easily source them online:
@aliberts I see, thank you. I will take a look! By the way, learning policies that are robust to lighting conditions is also a goal 😄
By the way, learning policies that are robust to lighting conditions is also a goal 😄
@rakhimovv to help with that, you can use data augmentation on images as well.
Simply add training.image_transforms.enable=true when running train.py.
You can use this script lerobot/scripts/visualize_image_transforms.py (1) to visualize the transforms and help you fine-tune their parameters to your liking.
You'll also find more info in the default.yaml config ;)
(1): It's broken right now but should be fixed very soon (#333) EDIT: Fixed
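For intuition about what a brightness augmentation does to dark frames, here is a minimal pure-Python sketch. It is not lerobot's transform code: jitter_brightness, the (0.8, 1.2) range and the tiny 2 x 2 frame are made up for illustration, and the actual parameters live in the default.yaml config mentioned above.

```python
import random
from typing import List

Image = List[List[float]]  # grayscale image with pixel values in [0, 1]


def jitter_brightness(image: Image, low: float, high: float, rng: random.Random) -> Image:
    """Scale every pixel by one random factor in [low, high], clamped to [0, 1]."""
    factor = rng.uniform(low, high)
    return [[min(1.0, max(0.0, px * factor)) for px in row] for row in image]


rng = random.Random(0)  # fixed seed so the example is repeatable
dark_frame = [[0.10, 0.20], [0.05, 0.15]]  # made-up dark pixels

# Each call draws a new factor, so the policy sees the same scene at several brightness levels.
augmented = [jitter_brightness(dark_frame, 0.8, 1.2, rng) for _ in range(3)]
for frame in augmented:
    print(frame)
```

Training on such variations is one common way to make a policy less sensitive to lighting, on top of having a well-lit setup.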
@aliberts Awesome, thank you!
System Info
Information
Reproduction
Good day! Thanks for this amazing project.
I converted my custom dataset class to the lerobot format and ran the video benchmark. The results show that, regardless of the encoding parameters, the video_images_load_time_ratio is greater than 1.0. I wonder if this is because my video is 15 fps or due to something else.
Expected behavior
The expectation is that, at least for the default params, the video_images_load_time_ratio would be less than 1.0.