memorysafety / rav1d

An AV1 decoder in Rust.
BSD 2-Clause "Simplified" License

Performance regressions against dav1d master on aarch64 #804

Open negge opened 5 months ago

negge commented 5 months ago

The target/aarch64-unknown-linux-gnu/release/dav1d binary built from rav1d 966d63c1 takes 5.9% more user time and 22.7% more memory (maximum resident set size) to decode 8-bit video than dav1d-1.4.0-83-g872e470, and 5.3% more user time and 6.7% more memory to decode 10-bit video.

| Metric | dav1d 1.4.0-83-g872e470 | rav1d 966d63c1 | % delta |
| --- | --- | --- | --- |
| 8-bit User time (s) | 606.34 | 641.91 | 5.87% |
| 10-bit User time (s) | 1002.20 | 1055.09 | 5.28% |
| 8-bit RSS (kbytes) | 201076 | 246724 | 22.70% |
| 10-bit RSS (kbytes) | 306708 | 327140 | 6.66% |
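
For anyone who wants to re-check the % delta column, here is a minimal Python sketch. The helper name `percent_delta` and the dictionary layout are illustrative assumptions; the numbers are copied from the table above.

```python
def percent_delta(baseline: float, candidate: float) -> float:
    """Relative increase of candidate over baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# (dav1d, rav1d) measurements copied from the table above.
measurements = {
    "8-bit user time (s)": (606.34, 641.91),
    "10-bit user time (s)": (1002.20, 1055.09),
    "8-bit max RSS (kbytes)": (201076, 246724),
    "10-bit max RSS (kbytes)": (306708, 327140),
}

deltas = {
    name: percent_delta(dav1d, rav1d)
    for name, (dav1d, rav1d) in measurements.items()
}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}%")
```

The baseline is always the dav1d figure, so a positive value means rav1d used more of that resource.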

Full command lines and output data below:

negge@arm1:~/git/dav1d/build# /usr/bin/time -v tools/dav1d -i ~/Videos/Chimera/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null
dav1d 1.4.0-83-g872e470 - by VideoLAN
Decoded 8929/8929 frames (100.0%) - 181.09/23.98 fps (7.55x)
    Command being timed: "tools/dav1d -i /home/negge/Videos/Chimera/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null"
    User time (seconds): 606.34
    System time (seconds): 43.91
    Percent of CPU this job got: 1316%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:49.41
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 201076
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 162896
    Voluntary context switches: 2333840
    Involuntary context switches: 1822071
    Swaps: 0
    File system inputs: 0
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
negge@arm1:~/git/rav1d# /usr/bin/time -v target/aarch64-unknown-linux-gnu/release/dav1d -i ~/Videos/Chimera/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null
dav1d 966d63c1 - by VideoLAN
Decoded 8929/8929 frames (100.0%) - 170.58/23.98 fps (7.11x)
    Command being timed: "target/aarch64-unknown-linux-gnu/release/dav1d -i /home/negge/Videos/Chimera/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null"
    User time (seconds): 641.91
    System time (seconds): 51.00
    Percent of CPU this job got: 1320%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:52.47
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 246724
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 4
    Minor (reclaiming a frame) page faults: 232651
    Voluntary context switches: 2243979
    Involuntary context switches: 1968275
    Swaps: 0
    File system inputs: 0
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
negge@arm1:~/git/dav1d.jeffv/build# /usr/bin/time -v tools/dav1d -i ~/Videos/Chimera/Chimera-AV1-10bit-1920x1080-6191kbps.ivf -o /dev/null
dav1d 1.4.0-83-g872e470 - by VideoLAN
Decoded 8929/8929 frames (100.0%) - 114.57/23.98 fps (4.78x)
    Command being timed: "tools/dav1d -i /home/negge/Videos/Chimera/Chimera-AV1-10bit-1920x1080-6191kbps.ivf -o /dev/null"
    User time (seconds): 1002.20
    System time (seconds): 60.66
    Percent of CPU this job got: 1355%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 1:18.39
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 306708
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 3
    Minor (reclaiming a frame) page faults: 374204
    Voluntary context switches: 2633828
    Involuntary context switches: 2819241
    Swaps: 0
    File system inputs: 562920
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
negge@arm1:~/git/rav1d# /usr/bin/time -v target/aarch64-unknown-linux-gnu/release/dav1d -i ~/Videos/Chimera/Chimera-AV1-10bit-1920x1080-6191kbps.ivf -o /dev/null
dav1d 966d63c1 - by VideoLAN
Decoded 8929/8929 frames (100.0%) - 42.27/23.98 fps (1.76x)
    Command being timed: "target/aarch64-unknown-linux-gnu/release/dav1d -i /home/negge/Videos/Chimera/Chimera-AV1-10bit-1920x1080-6191kbps.ivf -o /dev/null"
    User time (seconds): 1055.09
    System time (seconds): 63.88
    Percent of CPU this job got: 529%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 3:31.40
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 327140
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 523375
    Voluntary context switches: 3074044
    Involuntary context switches: 2714008
    Swaps: 0
    File system inputs: 0
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
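
The summary compares user time, but the same logs also report elapsed wall-clock time, which can be compared the same way. The sketch below is only an illustration: `parse_elapsed` is a hypothetical helper for the "h:mm:ss or m:ss" format printed by /usr/bin/time -v, and the elapsed strings are copied from the four runs above.

```python
def parse_elapsed(text: str) -> float:
    """Convert an 'h:mm:ss' or 'm:ss' string from /usr/bin/time -v to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        # Each field to the left is worth 60x the field to its right.
        seconds = seconds * 60.0 + float(part)
    return seconds


# (dav1d, rav1d) elapsed wall-clock strings copied from the logs above.
elapsed = {
    "8-bit": ("0:49.41", "0:52.47"),
    "10-bit": ("1:18.39", "3:31.40"),
}

wall_ratio = {
    name: parse_elapsed(rav1d) / parse_elapsed(dav1d)
    for name, (dav1d, rav1d) in elapsed.items()
}

for name, ratio in wall_ratio.items():
    print(f"{name}: rav1d wall-clock time is {ratio:.2f}x dav1d")
```

Wall-clock time depends on how well the threads stay busy (the "Percent of CPU" line), so it can diverge from the user-time delta.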
fbossen commented 5 months ago

@negge Thank you for sharing. Indeed, there is currently a performance gap between dav1d and rav1d. We haven't really spent much time trying to close it, but I've been tracking performance to make sure the gap doesn't grow larger.

I ran a quick profiling test with summer_nature_1080p. There are clearly some Rust functions that are slower than their C equivalents.

dav1d 1.4.0
        msac_decode_symbol_adapt4_neon  (in libdav1d.7.dylib)        4176
        decode_coefs  (in libdav1d.7.dylib)        2995
        prep_8tap_neon  (in libdav1d.7.dylib)        2019
        decode_b  (in libdav1d.7.dylib)        1092
        dav1d_refmvs_find  (in libdav1d.7.dylib)        879
        load_tmvs_c  (in libdav1d.7.dylib)        862
        add_temporal_candidate  (in libdav1d.7.dylib)        667
        mc  (in libdav1d.7.dylib)        635
        dav1d_recon_b_inter_8bpc  (in libdav1d.7.dylib)        633
        put_8tap_neon  (in libdav1d.7.dylib)        610
        add_spatial_candidate  (in libdav1d.7.dylib)        561
        prep_neon  (in libdav1d.7.dylib)        471
        dav1d_create_lf_mask_inter  (in libdav1d.7.dylib)        394
        wiener_filter7_hv_8bpc_neon  (in libdav1d.7.dylib)        324
        msac_decode_bool_adapt_neon  (in libdav1d.7.dylib)        275
        avg_8bpc_neon  (in libdav1d.7.dylib)        229
        decode_sb  (in libdav1d.7.dylib)        224
        wiener_filter5_hv_8bpc_neon  (in libdav1d.7.dylib)        215
        cdef_filter8_sec_edged_8bpc_neon  (in libdav1d.7.dylib)        205
        msac_decode_hi_tok_neon  (in libdav1d.7.dylib)        193
rav1d a46bb72f
        msac_decode_symbol_adapt4_neon  (in dav1d)        4057
        rav1d::src::recon::decode_coefs::h81e1bc840e33f180  (in dav1d)        2899
        prep_8tap_neon  (in dav1d)        2079
        rav1d::src::decode::decode_b_inner::h9fd83b148c970197  (in dav1d)        1850
        rav1d::src::refmvs::add_temporal_candidate::h01ce0e51e98e92ff  (in dav1d)        935
        rav1d::src::refmvs::load_tmvs_c::ha11d4ff5a82bb433  (in dav1d)        933
        rav1d::src::refmvs::rav1d_refmvs_find::hc47eb9832700db67  (in dav1d)        761
        rav1d::src::refmvs::add_spatial_candidate::h2dd8a7df8b9924f2  (in dav1d)        729
        rav1d::src::recon::mc::h2ba26a7da1b07206  (in dav1d)        711
        rav1d::src::recon::rav1d_recon_b_inter::h54eca88c13ef1aa0  (in dav1d)        629
        _platform_memset  (in libsystem_platform.dylib)        625
        put_8tap_neon  (in dav1d)        566
        prep_neon  (in dav1d)        433
        wiener_filter7_hv_8bpc_neon  (in dav1d)        358
        rav1d::src::decode::decode_sb::h2372da60409d4a19  (in dav1d)        293
        cdef_filter8_sec_edged_8bpc_neon  (in dav1d)        268
        msac_decode_bool_adapt_neon  (in dav1d)        260
        rav1d::src::refmvs::scan_row::h5d3369bc5f56f722  (in dav1d)        206
        wiener_filter5_hv_8bpc_neon  (in dav1d)        197
        rav1d::src::recon::rav1d_recon_b_intra::h050feaff11ff5100  (in dav1d)        194
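
To see which Rust functions lag their C equivalents the most, the two listings can be lined up by function. This is a rough sketch rather than a rigorous analysis. The counts are copied from the profiles above in whatever unit the profiler reports, and pairing dav1d's decode_b with rav1d's decode_b_inner is an assumption based on the names.

```python
# (dav1d count, rav1d count) for functions that appear in both listings.
counts = {
    "decode_coefs": (2995, 2899),
    "decode_b": (1092, 1850),  # assumption: paired with rav1d's decode_b_inner
    "refmvs_find": (879, 761),
    "load_tmvs_c": (862, 933),
    "add_temporal_candidate": (667, 935),
    "mc": (635, 711),
    "recon_b_inter": (633, 629),
    "add_spatial_candidate": (561, 729),
    "decode_sb": (224, 293),
}

# Sort by the rav1d/dav1d ratio, largest slowdown first.
ratios = sorted(
    ((rav1d / dav1d, name) for name, (dav1d, rav1d) in counts.items()),
    reverse=True,
)

for ratio, name in ratios:
    print(f"{name:24s} {ratio:.2f}x")
```

A ratio above 1 means the rav1d version accumulated more counts than the dav1d one. Entries that appear in only one listing, such as _platform_memset on the rav1d side, are left out of this comparison.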
kkysen commented 5 months ago

Hi @negge, thanks for benchmarking this! As @fbossen said, we haven't had much time to look at performance yet, as we've been focused primarily on making everything safe first (while trying not to introduce any performance regressions). Once that's accomplished, we'll turn to closing the performance gap.