renpy / renpy-build

Build system for the Ren'Py visual novel engine. (The engine itself, not games.)

Excessive CPU usage during playback of VP8/9 videos since commit c64c03478fac7f6b8733e2703035a71ecf0244ff #90

Closed gojira667 closed 1 year ago

gojira667 commented 1 year ago

Despite the comment in c64c03478fac7f6b8733e2703035a71ecf0244ff, it seems MMX is still important to the ffmpeg build. With it disabled, decoding requires significantly more CPU. Effectively reverting that commit on Win/Linux can be done with something like:

diff --git a/tasks/ffmpeg.py b/tasks/ffmpeg.py
index 936fa7a..f98bedd 100644
--- a/tasks/ffmpeg.py
+++ b/tasks/ffmpeg.py
@@ -96,8 +96,10 @@ def build(c: Context):
         --enable-w32threads
 {% endif %}

+{% if c.platform == "mac" %}
         --disable-mmx
         --disable-mmxext
+{% endif %}

         --enable-ffmpeg
         --enable-ffplay

This allows for noticeably reduced CPU usage.
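For illustration, the condition the patch introduces can be mirrored in plain Python. This is only a sketch: `mmx_configure_flags` is a hypothetical helper and not part of renpy-build, which expresses the same condition in the template inside `tasks/ffmpeg.py`. The platform names other than `"mac"` are assumptions.

```python
def mmx_configure_flags(platform: str) -> list[str]:
    """Return the MMX-related ffmpeg configure flags for a platform.

    Hypothetical helper mirroring the patched template: only the "mac"
    platform keeps MMX disabled, every other platform passes no
    MMX-related flag and so gets ffmpeg's default (MMX enabled).
    """
    if platform == "mac":
        return ["--disable-mmx", "--disable-mmxext"]
    return []


if __name__ == "__main__":
    # "linux" and "windows" are assumed platform names for this sketch.
    for name in ("linux", "windows", "mac"):
        print(name, mmx_configure_flags(name))
```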

Testing was done with @PastryIRL's vp9_test repo and captured via pidstat.

pidstat
LC_NUMERIC=C.UTF-8 pidstat -dru -Hh -C 'pythonw' 1 >> renpy-$(date +%s).pidstat

Here's a comparison between the default renpy-8.0.3-sdk (no MMX) and one built with MMX.

Old A6-3500 Llano system (Debian sid)

In Ren'Py, anything above 270% CPU is too much for this system and it starts dropping frames, e.g. 60 FPS VP9 movies turn into a slide show. The CPU drop halfway through was the switch over to the VP8 vid.
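To put numbers on a capture like this, the pidstat log can be summarised with a few lines of Python. This is a minimal sketch, not the tooling used for the graphs here. It assumes the log was produced with `-h` and `-H` as in the command above (one record per line, epoch timestamps, and a header line starting with `#` that names a `%CPU` column). The `limit` default of 270 is simply the ceiling mentioned for this system.

```python
from statistics import mean


def cpu_samples(lines):
    """Yield %CPU values from `pidstat -h -H` style output.

    The %CPU column is located by name from the most recent header
    line (lines starting with '#'), so the exact column order does
    not have to be hard-coded.
    """
    cpu_idx = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            header = line.lstrip("#").split()
            cpu_idx = header.index("%CPU") if "%CPU" in header else None
            continue
        if cpu_idx is None:
            continue  # banner line before the first header
        fields = line.split()
        if len(fields) <= cpu_idx:
            continue
        try:
            yield float(fields[cpu_idx])
        except ValueError:
            continue  # not a data record


def summarize(lines, limit=270.0):
    """Return sample count, mean, peak and samples above `limit`."""
    samples = list(cpu_samples(lines))
    if not samples:
        return None
    return {
        "samples": len(samples),
        "mean": mean(samples),
        "peak": max(samples),
        "over_limit": sum(1 for s in samples if s > limit),
    }
```

Running `summarize(open(path))` on a default-build log and on an MMX-build log gives two dictionaries that can be compared directly.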

CPU flags `cat /proc/cpuinfo`

```
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt cpb hw_pstate vmmcall arat npt lbrv svm_lock nrip_save pausefilter
```
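When comparing machines, it can help to check a flags line like this for specific SIMD extensions instead of reading it by eye. A minimal sketch, assuming the usual Linux `/proc/cpuinfo` layout of `key : value` lines. `parse_flags` and `simd_report` are hypothetical names, and the default list of extensions checked is just an example.

```python
def parse_flags(cpuinfo_text):
    """Return the set of CPU flags from the first 'flags' line found."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def simd_report(cpuinfo_text, wanted=("mmx", "mmxext", "sse2", "ssse3", "avx2")):
    """Map each extension in `wanted` to True/False for this CPU."""
    flags = parse_flags(cpuinfo_text)
    return {name: name in flags for name in wanted}


if __name__ == "__main__":
    # Linux only: /proc/cpuinfo does not exist on other platforms.
    with open("/proc/cpuinfo") as fh:
        print(simd_report(fh.read()))
```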

Default lib: [CPU usage graph: lib-def-Llano-8 0 3-sdk]

MMX: [CPU usage graph: lib-mmx-Llano-8 0 3-sdk]

The Llano with the MMX-enabled Ren'Py build can handle 60 FPS VP9 1080p vids just fine as long as the bitrate isn't out of control. Though sometimes devs encode it way higher than needed for 1080p...

Now the same comparison on a current CPU: the default renpy-8.1.1-sdk (no MMX) and again one built with MMX.

New Ryzen system (Debian sid)

CPU flags `cat /proc/cpuinfo`

```
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca fsrm flush_l1d
```

Default lib: [CPU usage graph: lib-def-Zen4-8 1 1-sdk]

MMX: [CPU usage graph: lib-mmx-Zen4-8 1 1-sdk]

This system has no playback issues. It's much faster per core and has plenty of CPU to spare. As you can see, the MMX build still offers a significant reduction in CPU usage even with the newer flags available. This naturally holds true for 4K video as well.

It would be nice to get MMX enabled again for at least the Win/Linux builds. This allows projects that use VP9 videos in particular to perform well on "marginal" systems and, in the case of 4K, expands the pool of systems able to play it at all.

gojira667 commented 1 year ago

Of course I forgot that the test VP9 video is YUV444... You still get a nice reduction in CPU usage with VP9 and YUV420 movies.

gojira667 commented 1 year ago

I made a sample 4k project using Tears of Steel as that's what Google uses in their examples. Downloaded it and took a short clip near the start:

ffmpeg -ss 00:00:09 -to 00:00:25 -i tearsofsteel_4k.mov -c copy tearsofsteel_4k_09s-25s.mov

Butchered it to 60 FPS and YUV420, then transcoded it to VP8, VP9 and later AV1, each at both 4K and 1080p (the encodes are not equivalent to each other and may contain unused options):

ffmpeg transcodes:

```sh
ffmpeg -i tearsofsteel_4k_09s-25s.mov -vf scale=3840x2160 -tile-columns 3 -g 240 -c:v libvpx -r 60 -row-mt 1 -crf 12 -b:v 23552K -pass 1 -pix_fmt yuv420p -an -f null /dev/null &&
ffmpeg -i tearsofsteel_4k_09s-25s.mov -vf scale=3840x2160 -tile-columns 3 -c:v libvpx -row-mt 1 -crf 12 -b:v 23552K -pass 2 -pix_fmt yuv420p -g 240 -r 60 -c:a libopus tearsofsteel_4k_09s-25s-60.webm

ffmpeg -i tearsofsteel_4k_09s-25s.mov -vf scale=3840x2160 -tile-columns 3 -g 240 -c:v libvpx-vp9 -r 60 -row-mt 1 -crf 12 -b:v 23552K -pass 1 -pix_fmt yuv420p -an -f null /dev/null &&
ffmpeg -i tearsofsteel_4k_09s-25s.mov -vf scale=3840x2160 -tile-columns 3 -c:v libvpx-vp9 -r 60 -row-mt 1 -crf 12 -b:v 23552K -pass 2 -pix_fmt yuv420p -g 240 -c:a libopus tearsofsteel_4k_09s-25s-vp9-60.webm

ffmpeg -i tearsofsteel_4k_09s-25s.mov -vf scale=3840x2160 -g 300 -c:v libaom-av1 -r 60 -row-mt 1 -tiles 4x1 -crf 16 -b:v 23552K -pass 1 -pix_fmt yuv420p -an -f null /dev/null &&
time ffmpeg -i tearsofsteel_4k_09s-25s.mov -vf scale=3840x2160 -c:v libaom-av1 -row-mt 1 -tiles 4x1 -crf 16 -b:v 23552K -pass 2 -pix_fmt yuv420p -g 300 -r 60 -c:a libopus tearsofsteel_4k_09s-25s-av1-60.webm

ffmpeg -i tearsofsteel_4k_09s-25s.mov -vf scale=1920x1080 -tile-columns 3 -g 240 -c:v libvpx -r 60 -row-mt 1 -crf 12 -b:v 23552K -pass 1 -pix_fmt yuv420p -an -f null /dev/null &&
ffmpeg -i tearsofsteel_4k_09s-25s.mov -vf scale=1920x1080 -tile-columns 3 -c:v libvpx -row-mt 1 -crf 12 -b:v 23552K -pass 2 -pix_fmt yuv420p -g 240 -r 60 -c:a libopus tearsofsteel_1080_09s-25s-60.webm

ffmpeg -i tearsofsteel_4k_09s-25s.mov -vf scale=1920x1080 -tile-columns 3 -g 240 -c:v libvpx-vp9 -r 60 -row-mt 1 -crf 12 -b:v 23552K -pass 1 -pix_fmt yuv420p -an -f null /dev/null &&
ffmpeg -i tearsofsteel_4k_09s-25s.mov -vf scale=1920x1080 -tile-columns 3 -c:v libvpx-vp9 -r 60 -row-mt 1 -crf 12 -b:v 23552K -pass 2 -pix_fmt yuv420p -g 240 -c:a libopus tearsofsteel_1080_09s-25s-vp9-60.webm

ffmpeg -i tearsofsteel_4k_09s-25s.mov -vf scale=1920x1080 -g 300 -c:v libsvtav1 -preset 1 -r 60 -row-mt 1 -tiles 4x1 -crf 16 -b:v 23552K -pass 1 -pix_fmt yuv420p -an -f null /dev/null &&
time ffmpeg -i tearsofsteel_4k_09s-25s.mov -vf scale=1920x1080 -c:v libsvtav1 -preset 1 -row-mt 1 -tiles 4x1 -crf 16 -b:v 23552K -pass 2 -pix_fmt yuv420p -g 300 -r 60 -c:a libopus tearsofsteel_1080_09s-25s-av1-stv-60.webm
```
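Since the transcodes all follow the same two-pass pattern, the command lines can be generated instead of hand-edited. The following is a rough sketch and not the script used for these encodes: `two_pass_commands` is a hypothetical helper, it only builds argument lists (nothing is executed), and it deliberately leaves out the codec-specific tiling and threading options, which can be passed through `extra`.

```python
import shlex


def two_pass_commands(src, dst, codec, size, crf=12, bitrate="23552K",
                      gop=240, fps=60, extra=()):
    """Build argv lists for a two-pass ffmpeg encode.

    Pass 1 discards its output (no audio, null muxer to /dev/null);
    pass 2 writes `dst` with Opus audio. Only options that appear in
    the commands above are used.
    """
    common = [
        "ffmpeg", "-i", src,
        "-vf", f"scale={size}",
        "-c:v", codec,
        "-r", str(fps),
        "-g", str(gop),
        "-crf", str(crf),
        "-b:v", bitrate,
        "-pix_fmt", "yuv420p",
        *extra,
    ]
    first = common + ["-pass", "1", "-an", "-f", "null", "/dev/null"]
    second = common + ["-pass", "2", "-c:a", "libopus", dst]
    return first, second


if __name__ == "__main__":
    src = "tearsofsteel_4k_09s-25s.mov"
    for codec, out in (("libvpx", "tearsofsteel_1080_09s-25s-60.webm"),
                       ("libvpx-vp9", "tearsofsteel_1080_09s-25s-vp9-60.webm")):
        first, second = two_pass_commands(src, out, codec, "1920x1080")
        # Print a shell line equivalent to "pass 1 && pass 2".
        print(shlex.join(first), "&&", shlex.join(second))
```

Returning argument lists instead of a single string keeps quoting out of the picture until the moment a shell line is actually printed.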

Well, `-vf scale=` didn't do quite what I was expecting. Played via `renpy.movie_cutscene`, looping once each time; playback order was 4K: VP8, VP9, AV1, then 1080p: VP8, VP9, AV1:

renpy.movie_cutscene()

```python
$ renpy.movie_cutscene("images/tearsofsteel_4k_09s-25s-60.webm", loops=1)
$ renpy.movie_cutscene("images/tearsofsteel_4k_09s-25s-vp9-60.webm", loops=1)
$ renpy.movie_cutscene("images/tearsofsteel_4k_09s-25s-av1-60.webm", loops=1)
$ renpy.movie_cutscene("images/tearsofsteel_1080_09s-25s-60.webm", loops=1)
$ renpy.movie_cutscene("images/tearsofsteel_1080_09s-25s-vp9-60.webm", loops=1)
$ renpy.movie_cutscene("images/tearsofsteel_1080_09s-25s-av1-stv-60.webm", loops=1)
```

New Ryzen system (Debian sid)

Default renpy-8.1.1-sdk lib:

[CPU usage graph: 4kt-lib-def-Zen4-8 1 1-sdk]

MMX: [CPU usage graph: 4kt-lib-mmx-Zen4-8 1 1-sdk]

As you can see, both VP8 and VP9 playback see a significant reduction in CPU usage with an MMX-enabled build. AV1 is largely unaffected on this CPU.

Then I actually read the Tears of Steel about page. I did not include the credit scroll, so I'm not entirely certain how kosher it is to share it like this.

But it's trivial to reproduce as it applies to all VP8/VP9 videos.

renpytom commented 1 year ago

That's convincing. I've re-enabled MMX for now.