memorysafety / rav1d

An AV1 decoder in Rust.
BSD 2-Clause "Simplified" License
245 stars 15 forks source link

Performance regressions against dav1d master on x86_64 #805

Open negge opened 5 months ago

negge commented 5 months ago

Similar to https://github.com/memorysafety/rav1d/issues/804, the performance of target/release/dav1d on x86_64 is worse than dav1d 1.4.0-83-g872e470 on both 8-bit and 10-bit videos in both single-threaded and multi-threaded configurations. There is something especially odd going on with 10-bit multi-threaded where the CPU utilization is almost a 1/3 lower than dav1d, and thus takes 154% longer to decode.

Single-Threaded

dav1d 1.4.0-83-g872e470 rav1d 966d63c1 % delta
8-bit Wall time (s) 1:41.30 1:49.85 8.44%
10-bit Wall time (s) 2:15.80 2:25.11 6.86%
8-bit RSS (kbytes) 35024 36352 3.79%
10-bit RSS (kbytes) 62228 63628 2.24%

Multi-Threaded

dav1d 1.4.0-83-g872e470 rav1d 966d63c1 % delta
8-bit Wall time (s) 0:18.98 0:20.77 9.43%
10-bit Wall time (s) 0:21.56 0:54.86 154.45%
8-bit RSS (kbytes) 170752 214948 25.88%
10-bit RSS (kbytes) 276168 291084 5.40%

Single-Threaded Data

nathan@vulpes ~/git/dav1d/build $ /usr/bin/time -v tools/dav1d --threads 1 -i ~/Videos/Chimera/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null
dav1d 1.4.0-83-g872e470 - by VideoLAN
Decoded 8929/8929 frames (100.0%) - 88.15/23.98 fps (3.68x)
    Command being timed: "tools/dav1d --threads 1 -i /home/nathan/Videos/Chimera/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null"
    User time (seconds): 101.09
    System time (seconds): 0.11
    Percent of CPU this job got: 99%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 1:41.30
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 35024
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 8381
    Voluntary context switches: 1
    Involuntary context switches: 38645
    Swaps: 0
    File system inputs: 0
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
nathan@vulpes ~/git/rav1d $ /usr/bin/time -v target/release/dav1d --threads 1 -i ~/Videos/Chimera/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null
dav1d 966d63c1 - by VideoLAN
Decoded 8929/8929 frames (100.0%) - 81.29/23.98 fps (3.39x)
    Command being timed: "target/release/dav1d --threads 1 -i /home/nathan/Videos/Chimera/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null"
    User time (seconds): 109.65
    System time (seconds): 0.10
    Percent of CPU this job got: 99%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 1:49.85
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 36352
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 8483
    Voluntary context switches: 1
    Involuntary context switches: 41780
    Swaps: 0
    File system inputs: 0
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
nathan@vulpes ~/git/dav1d/build $ /usr/bin/time -v tools/dav1d --threads 1 -i ~/Videos/Chimera/Chimera-AV1-10bit-1920x1080-6191kbps.ivf -o /dev/null
dav1d 1.4.0-83-g872e470 - by VideoLAN
Decoded 8929/8929 frames (100.0%) - 65.75/23.98 fps (2.74x)
    Command being timed: "tools/dav1d --threads 1 -i /home/nathan/Videos/Chimera/Chimera-AV1-10bit-1920x1080-6191kbps.ivf -o /dev/null"
    User time (seconds): 135.53
    System time (seconds): 0.14
    Percent of CPU this job got: 99%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 2:15.80
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 62228
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 15117
    Voluntary context switches: 1
    Involuntary context switches: 51894
    Swaps: 0
    File system inputs: 0
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
nathan@vulpes ~/git/rav1d $ /usr/bin/time -v target/release/dav1d --threads 1 -i ~/Videos/Chimera/Chimera-AV1-10bit-1920x1080-6191kbps.ivf -o /dev/null
dav1d 966d63c1 - by VideoLAN
Decoded 8929/8929 frames (100.0%) - 61.53/23.98 fps (2.57x)
    Command being timed: "target/release/dav1d --threads 1 -i /home/nathan/Videos/Netflix/Chimera/Chimera-AV1-10bit-1920x1080-6191kbps.ivf -o /dev/null"
    User time (seconds): 144.84
    System time (seconds): 0.13
    Percent of CPU this job got: 99%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 2:25.11
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 63628
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 15243
    Voluntary context switches: 1
    Involuntary context switches: 55258
    Swaps: 0
    File system inputs: 0
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096

Multi-Threaded Data

nathan@vulpes ~/git/dav1d/build $ /usr/bin/time -v tools/dav1d -i ~/Videos/Chimera/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null
dav1d 1.4.0-83-g872e470 - by VideoLAN
Decoded 8929/8929 frames (100.0%) - 470.88/23.98 fps (19.64x)
    Command being timed: "tools/dav1d -i /home/nathan/Videos/Chimera/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null"
    User time (seconds): 236.45
    System time (seconds): 7.00
    Percent of CPU this job got: 1282%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:18.98
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 170752
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 69097
    Voluntary context switches: 1945017
    Involuntary context switches: 81866
    Swaps: 0
    File system inputs: 0
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
nathan@vulpes ~/git/rav1d $ /usr/bin/time -v target/release/dav1d -i ~/Videos/Chimera/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null
dav1d 966d63c1 - by VideoLAN
Decoded 8929/8929 frames (100.0%) - 437.71/23.98 fps (18.26x)
    Command being timed: "target/release/dav1d -i /home/nathan/Videos/Chimera/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null"
    User time (seconds): 252.68
    System time (seconds): 6.26
    Percent of CPU this job got: 1246%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:20.77
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 214948
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 90262
    Voluntary context switches: 1583336
    Involuntary context switches: 89461
    Swaps: 0
    File system inputs: 612456
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
nathan@vulpes ~/git/dav1d/build $ /usr/bin/time -v tools/dav1d -i ~/Videos/Chimera/Chimera-AV1-10bit-1920x1080-6191kbps.ivf -o /dev/null
dav1d 1.4.0-83-g872e470 - by VideoLAN
Decoded 8929/8929 frames (100.0%) - 414.52/23.98 fps (17.29x)
    Command being timed: "tools/dav1d -i /home/nathan/Videos/Chimera/Chimera-AV1-10bit-1920x1080-6191kbps.ivf -o /dev/null"
    User time (seconds): 351.48
    System time (seconds): 10.43
    Percent of CPU this job got: 1677%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:21.56
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 276168
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 109138
    Voluntary context switches: 2469059
    Involuntary context switches: 118065
    Swaps: 0
    File system inputs: 0
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
nathan@vulpes ~/git/rav1d $ /usr/bin/time -v target/release/dav1d -i ~/Videos/Chimera/Chimera-AV1-10bit-1920x1080-6191kbps.ivf -o /dev/null
dav1d 966d63c1 - by VideoLAN
Decoded 8929/8929 frames (100.0%) - 163.76/23.98 fps (6.83x)
    Command being timed: "target/release/dav1d -i /home/nathan/Videos/Chimera/Chimera-AV1-10bit-1920x1080-6191kbps.ivf -o /dev/null"
    User time (seconds): 330.14
    System time (seconds): 9.94
    Percent of CPU this job got: 619%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:54.86
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 291084
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 111240
    Voluntary context switches: 2342865
    Involuntary context switches: 125011
    Swaps: 0
    File system inputs: 562912
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
fbossen commented 5 months ago

I suspect #763 contributes significantly to the poor performance for 10-bit.

fbossen commented 5 months ago

Interestingly, profiling data for x64_64 looks quite different from aarch64 (summer_nature_1080p):

rav1d
  11.12%  dav1d    dav1d             [.] rav1d::src::recon::decode_coefs
   9.21%  dav1d    dav1d             [.] rav1d::src::decode::decode_b_inner
   5.10%  dav1d    dav1d             [.] rav1d::src::refmvs::rav1d_refmvs_find
   3.92%  dav1d    dav1d             [.] rav1d::src::recon::mc
   3.85%  dav1d    dav1d             [.] dav1d_msac_decode_symbol_adapt4_sse2
   3.60%  dav1d    dav1d             [.] rav1d::src::refmvs::load_tmvs_c
   3.45%  dav1d    dav1d             [.] rav1d::src::recon::rav1d_recon_b_inter
   3.35%  dav1d    dav1d             [.] rav1d::src::refmvs::add_temporal_candidate
   2.72%  dav1d    dav1d             [.] rav1d::src::refmvs::add_spatial_candidate
   2.47%  dav1d    dav1d             [.] dav1d_msac_decode_symbol_adapt4_sse2.renorm5
   1.84%  dav1d    dav1d             [.] dav1d_msac_decode_bool_adapt_sse2
   1.57%  dav1d    dav1d             [.] dav1d_prep_8tap_8bpc_avx512icl.hv_loop
   1.56%  dav1d    dav1d             [.] ..@1032.branch_instr
   1.51%  dav1d    dav1d             [.] rav1d::src::decode::decode_sb
   1.19%  dav1d    libc.so.6         [.] __memset_evex_unaligned_erms
   1.16%  dav1d    dav1d             [.] dav1d_lpf_h_sb_y_8bpc_avx512icl.loop
   1.04%  dav1d    dav1d             [.] dav1d_msac_decode_symbol_adapt4_sse2.end
   1.01%  dav1d    dav1d             [.] ..@1385.branch_instr
   1.00%  dav1d    dav1d             [.] rav1d::src::refmvs::scan_col
   0.98%  dav1d    dav1d             [.] rav1d::src::cdef_apply::rav1d_cdef_brow
dav1d 1.4.0
   7.41%  dav1d    libdav1d.so.7.0.0  [.] decode_b                                                                           
   7.19%  dav1d    libdav1d.so.7.0.0  [.] decode_coefs                                                                       
   6.24%  dav1d    libdav1d.so.7.0.0  [.] dav1d_refmvs_find                                                                  
   3.66%  dav1d    libdav1d.so.7.0.0  [.] mc                                                                                 
   3.57%  dav1d    libdav1d.so.7.0.0  [.] dav1d_recon_b_inter_8bpc                                                           
   3.49%  dav1d    libdav1d.so.7.0.0  [.] dav1d_msac_decode_symbol_adapt4_sse2                                               
   3.38%  dav1d    libdav1d.so.7.0.0  [.] decode_coefs.constprop.0                                                           
   3.19%  dav1d    libdav1d.so.7.0.0  [.] add_temporal_candidate                                                             
   2.96%  dav1d    libdav1d.so.7.0.0  [.] dav1d_msac_decode_symbol_adapt4_sse2.renorm5                                       
   2.07%  dav1d    libdav1d.so.7.0.0  [.] dav1d_msac_decode_bool_adapt_sse2                                                  
   2.01%  dav1d    libdav1d.so.7.0.0  [.] add_spatial_candidate                                                              
   1.78%  dav1d    libdav1d.so.7.0.0  [.] dav1d_prep_8tap_8bpc_avx512icl.hv_loop                                             
   1.73%  dav1d    libdav1d.so.7.0.0  [.] ..@1032.branch_instr                                                               
   1.69%  dav1d    libdav1d.so.7.0.0  [.] dav1d_create_lf_mask_inter                                                         
   1.69%  dav1d    libdav1d.so.7.0.0  [.] dav1d_load_tmvs_sse4.xloop                                                         
   1.50%  dav1d    libdav1d.so.7.0.0  [.] mask_edges_inter                                                                   
   1.44%  dav1d    libdav1d.so.7.0.0  [.] scan_col                                                                           
   1.40%  dav1d    libdav1d.so.7.0.0  [.] dav1d_msac_decode_symbol_adapt4_sse2.end                                           
   1.28%  dav1d    libdav1d.so.7.0.0  [.] dav1d_lpf_h_sb_y_8bpc_avx512icl.loop                                               
   1.27%  dav1d    libdav1d.so.7.0.0  [.] scan_row                                                                           
rinon commented 5 months ago

~I expect the multithreaded regressions here to be partially related to #763, so I plan to start there.~ Scrolled too far and missed that @fbossen already said this. I need to dig into that commit but I do expect it to be a significant culprit.