mpv-player / mpv

🎥 Command line video player
https://mpv.io
Other
28.72k stars 2.93k forks source link

scaletempo2 takes too much CPU #8848

Closed lcksk closed 3 years ago

lcksk commented 3 years ago

Important Information

Provide following Information:

If you're not using git master or the latest release, update. Releases are listed here: https://github.com/mpv-player/mpv/releases

Reproduction steps

Recently, when I use 2x speed playback, the CPU is more than twice as high as when I use 1x speed playback. When 1x , cpu= 18.2 % ./mpv --no-config /home/test/Videos/1080-test.mp4 then if 2x,cpu = 48.2 % ./mpv --no-config --speed=2.0 /home/test/Videos/1080-test.mp4

Expected behavior

There shouldn't be such a high CPU usage. Compared with scaletempo, my impression is close to 10% less than scaletempo 2 (but I'm not sure, it's just an inaccurate recollection of the previous impression -( )

Actual behavior

Log file

The issue will be closed for ignoring the issue template.

Sample files

Sample files needed to reproduce this issue can be uploaded to https://0x0.st/ or similar sites. (Only needed if the issue cannot be reproduced without it.) Do not use garbage like "cloud storage", especially not Google Drive.

lcksk commented 3 years ago

Here is a perf flame diagram for 2x playback: 2x_playback_highcpu

avih commented 3 years ago

I don't understand what "too much" means. Compared to what? and by who's standards?

Also, why did you fail to follow the report template? the log file is required, and the "expected behavior" is needed too.

If you want to compare it to the older scaletempo, then just use --af=scaletempo. If you want to compare it to without audio at all, just use --no-audio.

I'm closing this for now.

Feel free to request to open it once there's actual info/data which suggests an issue.

CounterPillow commented 3 years ago

scaletempo2 isn't even showing in that perf flamegraph. The decoder is though, because you are decoding twice as much data in the same amount of time. Maybe that's what you're seeing.

lcksk commented 3 years ago

I'm sorry that I didn't collect logs because I posted them on my mobile phone for some reasons. Now I've done some further work. I hope it's useful,

./mpv --msg-level=all=trace --log-file=~/Desktop/scaletempo.log --hwdec=vaapi --vo=gpu --af=scaletempo --speed=2.0 /home/luckych/Videos/test_1080P.mov top -d 1 |grep mpv 2>&1 |tee ~/Desktop/scaletempo.log

./mpv --msg-level=all=trace --log-file=~/Desktop/scaletempo2.log --hwdec=vaapi --vo=gpu --af=scaletempo2 --speed=2.0 /home/luckych/Videos/test_1080P.mov top -d 1 |grep mpv 2>&1 |tee ~/Desktop/scaletempo2.log

cvlc --rate 2.0 --file-logging --logfile ~/Desktop/vlc_2x.log --log-verbose 3 /home/luckych/Videos/test_1080P.mov top -d 1 |grep vlc 2>&1 |tee ~/Desktop/vlc_cpu.log logs & CPU data of scaletempo, scaletempo2 and VLC are here: trace_cpu.zip cpu_result

VLC is just for comparison. Maybe the decoder and VO used by VLC are different, so the CPU of VLC is very low compared with MPV. I think this may be related to the command line... Here I don't want to discuss configuration, but I just want to compare the performance differences between scaletempo and scaletempo2. Under the same conditions, the differences are obvious from the data results.

lcksk commented 3 years ago

scaletempo2 isn't even showing in that perf flamegraph.

No, it's here, and if I understand it correctly, its flame graph frequency is almost the same as the soft decoding function call stack.…╯﹏╰

CounterPillow commented 3 years ago

I see it now, decode_slice is about 3 times bigger than scaletempo2 though.

I can look into SIMDing it, but I'm only familiar with AArch64 assembly

CounterPillow commented 3 years ago

Found an easy optimisation, didn't bother to profile it but feel free to if you have the tooling all set up.

haasn commented 3 years ago

@rolandjon can you attach a log of both configurations?

lcksk commented 3 years ago

Wow, it’s amazing, with @haasn's patch,there is definitely performance improvement, about 2~3 times improvement in my case.

I don’t have an aarch64 device, maybe someone else can help verify @CounterPillow' patch and post feedback here.🙂

CounterPillow commented 3 years ago

Currently I'm trying to get a gcc bug reporting account (they have automatic signups disabled, I needed to e-mail them) so that I can tell them they are not very good at generating vector code on aarch64. If they fix it we can just use haasn's patch and throw mine away, otherwise we can merge both together I guess and use mine on aarch64.

Glad to know this works well for you, and thank you for nerdsniping us into writing some SIMD.

lcksk commented 3 years ago

@rolandjon can you attach a log of both configurations?

@haasn Sorry I might have missed this comment,you required log should be here https://github.com/mpv-player/mpv/files/6526698/trace_cpu.zip

BTW, I tested it (https://github.com/mpv-player/mpv/commit/0d75b7a192abc1c36e5842fc7c64fcdcd2655114) with an Android device today. If drop HAVE_VECTOR macro, your code will actually have an immediate effect on the Android platform I tested. (with https://github.com/mpv-android/mpv-android project ). So, I'm thinking whether there is something wrong with wscript detecting the vector instruction in waf, in fact, there should be more platforms that benefit. -)

lcksk commented 3 years ago

Currently I'm trying to get a gcc bug reporting account (they have automatic signups disabled, I needed to e-mail them) so that I can tell them they are not very good at generating vector code on aarch64. If they fix it we can just use haasn's patch and throw mine away, otherwise we can merge both together I guess and use mine on aarch64.

Glad to know this works well for you, and thank you for nerdsniping us into writing some SIMD.

Loving community.You are our superheros. ^-^

lcksk commented 3 years ago

Another advantage of running a few more platforms is that I almost found another function (multi_channel_moving_block_energies) with performance issures,

Flamegraph of an Android device: p