huggingface / lerobot

🤗 LeRobot: Making AI for Robotics more accessible with end-to-end learning
Apache License 2.0
6.79k stars 606 forks

Using video format is slower than images format #321

Closed rakhimovv closed 2 months ago

rakhimovv commented 2 months ago

System Info

- `lerobot` version: 0.1.0
- Platform: Linux-5.4.0-169-generic-x86_64-with-glibc2.35
- Python version: 3.11.8
- Huggingface_hub version: 0.23.4
- Dataset version: 2.20.0
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.3.1+cu121 (True)
- Cuda version: 12010
- Using GPU in script?: <fill in>

Information

Reproduction

Good day! Thanks for this amazing project.

I converted my custom dataset class to the lerobot format and ran the video benchmark. The results show that, regardless of the encoding parameters, the video_images_load_time_ratio is greater than 1.0.

I wonder if this is because my video is 15 fps or due to something else.

[image attachment: benchmark results]

Expected behavior

The expectation is that at least for default params the video_images_load_time_ratio would be less than 1.0.

Cadene commented 2 months ago

Hello @rakhimovv , how does it compare against the numbers we report for the same encoding/decoding parameters?

rakhimovv commented 2 months ago

If I understood the question right: for `libsvtav1,yuv420p,g=2,crf=30,fast_decode=0` I get a ratio of 4.43.

I also tried to reproduce your results by running:

python benchmarks/video/run_video_benchmark.py \
    --output-dir outputs/video_benchmark \
    --repo-ids \
        lerobot/pusht_image \
    --vcodec libsvtav1 \
    --pix-fmt yuv420p \
    --g 2 \
    --crf 30 \
    --timestamps-modes 1_frame 2_frames 6_frames \
    --backends pyav \
    --num-samples 50 \
    --num-workers 5 \
    --save-frames 1

The result is the following: [image attachment]

Maybe this is a hardware problem, I'm not sure.

Cadene commented 2 months ago

@rakhimovv Sorry I cant access the image: "This private-user-images.githubusercontent.com page can’t be found"

On the same dataset (pusht_image for instance) what are the results we report vs your results, for libsvtav1,yuv420p,g=2,crf=30,fast_decode=0.

Thanks!

rakhimovv commented 2 months ago

Apologies, I'm not sure why it isn't visible.

Here is the full output:

repo_id,resolution,num_pixels,vcodec,pix_fmt,g,crf,timestamps_mode,backend,video_size_bytes,images_size_bytes,video_images_size_ratio,avg_load_time_video_ms,avg_load_time_images_ms,video_images_load_time_ratio,avg_mse,avg_psnr,avg_ssim
lerobot/pusht_image,96 x 96,9216,libsvtav1,yuv420p,2,,1_frame,pyav,30479,186411,0.1635042996389698,106.26960419118404,10.823768712580204,9.818170270736612,0.0001853929452761,37.539657364505366,0.9893503785133362
lerobot/pusht_image,96 x 96,9216,libsvtav1,yuv420p,2,,2_frames,pyav,30479,186411,0.1635042996389698,56.85018114745617,6.004360169172287,9.468149735476825,0.0001750499522579,37.7474555727703,0.989688277244568
lerobot/pusht_image,96 x 96,9216,libsvtav1,yuv420p,2,,6_frames,pyav,30479,186411,0.1635042996389698,16.059761804838978,5.140606140096983,3.124098864445583,0.0001833448846958,37.55468415844686,0.98959481716156
lerobot/pusht_image,96 x 96,9216,libsvtav1,yuv420p,2,30.0,1_frame,pyav,34913,186411,0.1872904495979314,104.07841052860022,7.879405990242958,13.208915831660429,0.0001555320585905,38.18878129792595,0.9902451038360596
lerobot/pusht_image,96 x 96,9216,libsvtav1,yuv420p,2,30.0,2_frames,pyav,34913,186411,0.1872904495979314,49.16207356378436,7.039123009890318,6.984119114655221,0.0001540893986671,38.239272464438216,0.9900858402252196
lerobot/pusht_image,96 x 96,9216,libsvtav1,yuv420p,2,30.0,6_frames,pyav,34913,186411,0.1872904495979314,17.191315609961748,5.277618449181318,3.2574002413206893,0.0001560885518263,38.17960813947649,0.9904525876045228

In particular, for the one-frame setting the video_images_load_time_ratio is 14.24 (pusht_image).
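As a sanity check on how this metric is defined, video_images_load_time_ratio in the CSV above is just avg_load_time_video_ms divided by avg_load_time_images_ms. A standard-library sketch recomputing it from the first pusht_image data row:

```python
import csv
import io

# Header and first data row from the benchmark CSV above,
# truncated to the three relevant columns.
csv_text = """avg_load_time_video_ms,avg_load_time_images_ms,video_images_load_time_ratio
106.26960419118404,10.823768712580204,9.818170270736612
"""

row = next(csv.DictReader(io.StringIO(csv_text)))
ratio = float(row["avg_load_time_video_ms"]) / float(row["avg_load_time_images_ms"])

# Recomputed ratio matches the reported column.
assert abs(ratio - float(row["video_images_load_time_ratio"])) < 1e-9
print(round(ratio, 2))  # ≈ 9.82
```

So for this row, decoding a single frame from video is roughly 10x slower than reading the corresponding image file, which is what the ratio column summarizes.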

aliberts commented 2 months ago

Hi, could you paste here basic info about your hardware (cpu model, ram) as well as your ffmpeg version? I'll add them to this google sheet (Different hardware tab). From what I'm seeing, your figures would indicate a hardware issue.

Keep in mind that the smaller the image size (resolution), the higher that ratio is going to get while still being okay when you look at the absolute values. For higher resolutions though, the gains should be significantly better. Pusht is particular in the sense that it has very small resolution (96x96). I'd suggest looking at higher resolutions as well.

Also, note that the primary reason we started doing video encoding is not loading time — although important — it's to reduce the size of our datasets which can get pretty big.
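To put a number on the size argument using the pusht_image figures reported above (video_size_bytes vs. images_size_bytes):

```python
# Size figures taken from the pusht_image benchmark rows above.
video_size_bytes = 30_479
images_size_bytes = 186_411

ratio = video_size_bytes / images_size_bytes  # the video_images_size_ratio column
print(f"video is {ratio:.1%} of the image size, i.e. ~{1 / ratio:.0f}x smaller")
```

Even at this tiny 96x96 resolution the video file is about 6x smaller; the 480x640 custom_data rows further down show ratios near 0.02, i.e. 40-50x smaller.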

rakhimovv commented 2 months ago

Hi,

lscpu gives the following:

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  96
  On-line CPU(s) list:   0-95
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz
    CPU family:          6
    Model:               85
    Thread(s) per core:  2
    Core(s) per socket:  24
    Socket(s):           2
    Stepping:            7
    CPU max MHz:         4000.0000
    CPU min MHz:         1200.0000
    BogoMIPS:            6000.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtp
                         r pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a
                          avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke avx512_vnni md_clear flush_l1d arch_capabilities
Virtualization features: 
  Virtualization:        VT-x
Caches (sum of all):     
  L1d:                   1.5 MiB (48 instances)
  L1i:                   1.5 MiB (48 instances)
  L2:                    48 MiB (48 instances)
  L3:                    71.5 MiB (2 instances)
NUMA:                    
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-23,48-71
  NUMA node1 CPU(s):     24-47,72-95
Vulnerabilities:         
  Gather data sampling:  Mitigation; Microcode
  Itlb multihit:         KVM: Mitigation: Split huge pages
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Mitigation; Clear CPU buffers; SMT vulnerable
  Retbleed:              Mitigation; Enhanced IBRS
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds:                 Not affected
  Tsx async abort:       Mitigation; TSX disabled

ffmpeg info:

ffmpeg version 6.1.1 Copyright (c) 2000-2023 the FFmpeg developers
  built with gcc 12.3.0 (conda-forge gcc 12.3.0-5)
  libavutil      58. 29.100 / 58. 29.100
  libavcodec     60. 31.102 / 60. 31.102
  libavformat    60. 16.100 / 60. 16.100
  libavdevice    60.  3.100 / 60.  3.100
  libavfilter     9. 12.100 /  9. 12.100
  libswscale      7.  5.100 /  7.  5.100
  libswresample   4. 12.100 /  4. 12.100
  libpostproc    57.  3.100 / 57.  3.100

hwinfo --memory gives the following:

01: None 00.0: 10102 Main Memory                                
  [Created at memory.74]
  Unique ID: rdCR.CxwsZFjVASF
  Hardware Class: memory
  Model: "Main Memory"
  Memory Range: 0x00000000-0xbc99d78fff (rw)
  Memory Size: 768 GB
  Config Status: cfg=new, avail=yes, need=no, active=unknown

Regarding the bigger image size, indeed the situation is better:

repo_id,resolution,num_pixels,vcodec,pix_fmt,g,crf,timestamps_mode,backend,video_size_bytes,images_size_bytes,video_images_size_ratio,avg_load_time_video_ms,avg_load_time_images_ms,video_images_load_time_ratio,avg_mse,avg_psnr,avg_ssim
custom_data,480 x 640,307200,libsvtav1,yuv420p,2,,1_frame,pyav,1464198,82449519,0.017758720945358,61.17562886327505,524.97538395226,0.1165304712055569,0.0002039659873175,36.91625817381973,0.909278929233551
custom_data,480 x 640,307200,libsvtav1,yuv420p,2,,2_frames,pyav,1464198,82449519,0.017758720945358,35.76316695660353,28.92831712961197,1.2362684907099273,0.0002014465445714,36.96356132919017,0.9099832773208618
custom_data,480 x 640,307200,libsvtav1,yuv420p,2,,6_frames,pyav,1464198,82449519,0.017758720945358,12.109326478093864,20.366764316956203,0.5945630974878189,0.0002011011616727,36.9718784283328,0.9096388816833496
custom_data,480 x 640,307200,libsvtav1,yuv420p,2,30.0,1_frame,pyav,2117661,82449519,0.0256843341924165,57.204018384218216,29.468527175486088,1.9411902754272832,0.0001782149818392,37.504091172213954,0.9204418659210204
custom_data,480 x 640,307200,libsvtav1,yuv420p,2,30.0,2_frames,pyav,2117661,82449519,0.0256843341924165,37.78688319027424,25.17366273328662,1.5010482817150568,0.000175699656707,37.56065305206734,0.9211216568946838
custom_data,480 x 640,307200,libsvtav1,yuv420p,2,30.0,6_frames,pyav,2117661,82449519,0.0256843341924165,14.621594349543251,20.854213082542017,0.7011338328456821,0.0001753976071578,37.56902574530079,0.9208369851112366

I guess one more bottleneck: when data is stored as video, loading several image keys (video_frame_keys) is slow because the function load_from_videos contains a for loop over the keys (https://github.com/huggingface/lerobot/blob/main/lerobot/common/datasets/video_utils.py#L45)
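One way that loop could be amortized is by grouping the queried timestamps per video file, so each file is opened and decoded once instead of once per key. A minimal sketch of the grouping step (the query tuples here are hypothetical stand-ins, not the actual lerobot API):

```python
from collections import defaultdict

def group_queries(queries):
    """Group (video_path, timestamp) queries so each video is decoded once.

    Hypothetical illustration of batching frame requests; the real
    load_from_videos in lerobot currently loops over keys instead.
    """
    by_video = defaultdict(list)
    for path, ts in queries:
        by_video[path].append(ts)
    # Sorted timestamps let a decoder seek forward through each file once.
    return {path: sorted(ts_list) for path, ts_list in by_video.items()}

queries = [
    ("ep0_cam_high.mp4", 0.2),
    ("ep0_cam_low.mp4", 0.2),
    ("ep0_cam_high.mp4", 0.0),
]
print(group_queries(queries))
# {'ep0_cam_high.mp4': [0.0, 0.2], 'ep0_cam_low.mp4': [0.2]}
```

The decoding itself would then run per file on the grouped timestamps, paying the open/seek cost once per video rather than once per frame key.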

aliberts commented 2 months ago

Thank you for sharing! These results on higher resolution look much more on par with what I got in terms of size and load time ratios. However SSIM doesn't look that great (MSE is okay though). I'm curious about the content of your custom_data. Any chance you could share it? If not, what kind is it? Is it simulation rendering or real world? Rather dynamic or static?

I guess one more bottleneck: when data is stored as video, loading several image keys (video_frame_keys) is slow because the function load_from_videos contains a for loop over the keys

Yes, we will probably improve that part when we update the decoder.

rakhimovv commented 2 months ago

@aliberts Sure, I have uploaded the dataset https://huggingface.co/datasets/rusrakhimov/put_grapes_into_a_bowl

Several episodes, using images format

aliberts commented 2 months ago

Thanks, I reproduced your results:

[image attachment: reproduced benchmark results]

What's "reassuring" me is that it's also bad — even worse — with the other codecs (x264, x265). I found your images quite dark (i.e. brightness is low) and after a bit of digging it seems that video encoding quality can be quite impacted by this factor. Nothing to worry too much about here as >90% SSIM is still considered good quality and MSE and PSNR seem to agree. But even taking out video encoding/decoding out of the equation, it's good practice to have a well-lit setup to make the training of your policy easier.

Here's one setup we use, they are cheap and you can easily source them online:

[image attachment: lighting setup (IMG_0554)]

rakhimovv commented 2 months ago

@aliberts I see, thank you. I will take a look! By the way, learning policies that are robust to lighting conditions is also a goal 😄

aliberts commented 2 months ago

By the way, learning policies that are robust to lighting conditions is also a goal 😄

@rakhimovv to help with that, you can use data augmentation on images as well. Simply add `training.image_transforms.enable=true` to your train.py command.
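For intuition, brightness jitter is the kind of augmentation such transforms apply. A plain-NumPy illustration (not the actual lerobot transform implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def jitter_brightness(img, min_factor=0.8, max_factor=1.2):
    """Scale pixel intensities by a random factor and clip to [0, 255].

    Illustration only; the real transforms are configured via
    training.image_transforms, not this function.
    """
    factor = rng.uniform(min_factor, max_factor)
    return np.clip(img.astype(np.float32) * factor, 0.0, 255.0).astype(np.uint8)

# A random 96x96 RGB frame as a stand-in for a dataset image.
img = rng.integers(0, 256, size=(96, 96, 3), dtype=np.uint8)
augmented = jitter_brightness(img)
assert augmented.shape == img.shape and augmented.dtype == np.uint8
```

Applying a random brightness factor per sample exposes the policy to a range of lighting conditions during training, which is exactly the robustness goal mentioned above.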

You can use the script `lerobot/scripts/visualize_image_transforms.py` (1) to visualize the transforms and help you fine-tune their parameters to your liking.

You'll also find more info in the default.yaml config ;)

(1): It's broken right now but should be fixed very soon (#333) EDIT: Fixed

rakhimovv commented 2 months ago

@aliberts Awesome, thank you!