giongto35 / cloud-game

Web-based Cloud Gaming service for Retro Game
https://www.youtube.com/watch?v=GUBrJGAxZZg
Apache License 2.0

using cpu capabilities: none! --- H264 performance issue #274

Closed hwwxj closed 3 years ago

hwwxj commented 3 years ago

Hi, when I use H264 to encode the video, I can see the video start to stutter.

The log shows this:
I0202 16:25:33.807307 1926 media.go:134] Video Encoder: H264
x264 [info]: using cpu capabilities: none!
x264 [info]: profile Constrained Baseline, level 5.0

Is the performance issue caused by the "using cpu capabilities: none!" line?

sergystepanov commented 3 years ago

Hi @hwwxj,

This is actually interesting. I didn't know that the x264 lib wrapper is compiled from an old source bundled in the wrapper's repo, and that extlib use is broken right now. Sure, without SIMD CPU instructions it'll cause a noticeable performance hit. However, it shouldn't matter for relatively small game resolutions (<640x480). Seeing profile level 5 in your log, are you, by any chance, using HD resolutions, 720p or so? Since cloud-game encoders are just minimal usable prototypes, they should be very slow at near-HD resolutions and above. Or, maybe, you just have a slow CPU.

hwwxj commented 3 years ago

@sergystepanov yes, I am using 1080p resolution, which is how I found this issue. Is there any chance I can use hardware to accelerate the encoding process by means of an external x264 lib, as mentioned in this issue: gen2brain/x264-go#11

giongto35 commented 3 years ago

A side question: H264 has hardware support but VPX doesn't, right? Is that why you are exploring x264?

hwwxj commented 3 years ago

hardware support

yes

sergystepanov commented 3 years ago

What was mentioned there (gen2brain/x264-go#11) is that it's possible to enable support for those advanced CPU instructions (hello, using cpu capabilities: none!) by linking x264-go with an external libx264 lib, where the optimizations are enabled by default. But, according to the author, this option was broken some time ago by new library versions, and now it may work only if you manually compile exactly the same lib version as in gen2brain's wrapper source with assembler optimizations enabled, or find this version properly compiled somewhere. The version is 20180214-2245-stable. You can look at the latest issue there about Ubuntu extlib for more info on that.

Now, about hardware acceleration. First, libx264 doesn't support it. You need a different lib and a GPU in order to make it work. Manually writing a wrapper for some other lib will be quite hard, but it should be possible with pipe-based encoders like FFmpeg or GStreamer. Cloud-retro is supposed to run inside containers, and I don't even know how hard it is to enable GPU acceleration for them, or whether it's even possible. So, hw-acceleration in cloud-retro won't happen any time soon, or at all. It's more likely to be implemented in cloud-morph, really. It is, probably, worth trying to fix extlib, though, or migrate to GStreamer.
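As a rough illustration of the pipe-based approach mentioned here, the sketch below builds an FFmpeg command that reads raw RGBA frames from stdin and writes an H.264 stream to stdout. The helper names ffmpegArgs and newEncoderCmd are made up for this example and are not part of cloud-game, and it assumes an ffmpeg binary is available on PATH. Passing a hardware encoder name such as h264_nvenc or h264_vaapi instead of libx264 is only an option when the local FFmpeg build and GPU drivers provide it.

```go
package main

import (
	"fmt"
	"os/exec"
)

// ffmpegArgs returns FFmpeg arguments for encoding raw RGBA frames read from
// stdin (pipe:0) into a raw H.264 stream written to stdout (pipe:1).
func ffmpegArgs(width, height, fps int, codec string) []string {
	return []string{
		"-f", "rawvideo", // input is headerless raw video
		"-pixel_format", "rgba",
		"-video_size", fmt.Sprintf("%dx%d", width, height),
		"-framerate", fmt.Sprint(fps),
		"-i", "pipe:0",
		"-c:v", codec, // e.g. libx264; a hardware encoder is an assumption about the local build
		"-f", "h264",
		"pipe:1",
	}
}

// newEncoderCmd (hypothetical helper) prepares the process without starting it.
func newEncoderCmd(width, height, fps int, codec string) *exec.Cmd {
	return exec.Command("ffmpeg", ffmpegArgs(width, height, fps, codec)...)
}

func main() {
	cmd := newEncoderCmd(1920, 1080, 60, "libx264")
	// A real pipeline would call cmd.StdinPipe and cmd.StdoutPipe, then
	// cmd.Start, write frames to stdin and read encoded data from stdout.
	fmt.Println(cmd.String())
}
```

The trade-off of this design is an extra process and pipe copies per frame in exchange for not maintaining a cgo wrapper.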

hwwxj commented 3 years ago

It is, probably, worth trying to fix extlib, though

thanks a lot!

sergystepanov commented 3 years ago

So, I forked the wrapper lib and was able to fix linking with libx264. I'm gonna compare the performance gains, if there are any, with and without asm optimizations when I have time.

@giongto35, if it's significantly faster, what do you think: is it better to write our own h264 wrapper for the external lib only, inside cloud-retro? That would be just a wrapper for the *.h structs and a couple of lib functions, plus you wrote a fast YCbCr converter which is used in the encoder. Or, in a fork, update the C sources and fix both the Go wrapper and the extlib option? In the first case, it is most likely they will break it again in the future with incompatible changes; in the second, it's more work but we would have a fallback option to an unoptimized encoder built from source.

giongto35 commented 3 years ago

@sergystepanov So you mean:
1: write for Cloud Retro specifically, less work
2: write for the fork, update the C source and Go wrapper + extlib, more work
I think you should go with 2 and contribute back to the library, or you can make it standalone; people will come to it when they need it. CloudRetro is a customer of the lib.

giongto35 commented 3 years ago

In Cloud Morph, I indeed use FFmpeg for the encoder, so it's easier to use a hardware encoder. However, I haven't tried because I don't have a GPU machine. One thing I wonder about: when I use the H264 encoder, the stream doesn't work on mobile. That's why I'm using VPX right now, but VPX doesn't have hardware acceleration :-<

hwwxj commented 3 years ago

@sergystepanov

fast YCbCr converter

FYI, I found that the efficiency of the YCbCr converter in x264-go is quite low:

I0204 23:05:53.512131 31980 encoder.go:79] H264: start encode one image
before convert to YCbCr. the time is %v 2021-02-04 23:05:53.512142829 +0800 CST m=+17.347932716
after convert to YCbCr. the time is %v 2021-02-04 23:05:53.663298327 +0800 CST m=+17.499088233

I0204 23:05:53.669515 31980 encoder.go:79] H264: start encode one image
before convert to YCbCr. the time is %v 2021-02-04 23:05:53.66953006 +0800 CST m=+17.505319930
after convert to YCbCr. the time is %v 2021-02-04 23:05:53.826771366 +0800 CST m=+17.662561280

1080p convert time: almost 151 ms; convert + encode time: 157 ms
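To give a sense of the per-pixel work such a conversion involves, here is a minimal sketch of an RGBA to YCbCr 4:2:0 conversion using only Go's standard image packages. It is not the converter from x264-go or cloud-game; solidRGBA, toYCbCr420 and planeSizes are names invented for this example, and chroma is simply taken from the top-left pixel of each 2x2 block. A 1080p frame has 1920 x 1080 = 2,073,600 pixels, so the inner loop body runs that many times per frame.

```go
package main

import (
	"fmt"
	"image"
	"image/color"
)

// solidRGBA (hypothetical helper) creates a w x h frame filled with one color.
func solidRGBA(w, h int, r, g, b uint8) *image.RGBA {
	img := image.NewRGBA(image.Rect(0, 0, w, h))
	for i := 0; i < len(img.Pix); i += 4 {
		img.Pix[i], img.Pix[i+1], img.Pix[i+2], img.Pix[i+3] = r, g, b, 255
	}
	return img
}

// toYCbCr420 converts an RGBA frame to planar YCbCr with 4:2:0 subsampling.
// It assumes the frame's bounds start at (0, 0). Chroma is sampled from the
// top-left pixel of every 2x2 block instead of being averaged.
func toYCbCr420(src *image.RGBA) *image.YCbCr {
	b := src.Bounds()
	dst := image.NewYCbCr(b, image.YCbCrSubsampleRatio420)
	for y := 0; y < b.Dy(); y++ {
		for x := 0; x < b.Dx(); x++ {
			i := src.PixOffset(x, y)
			yy, cb, cr := color.RGBToYCbCr(src.Pix[i], src.Pix[i+1], src.Pix[i+2])
			dst.Y[dst.YOffset(x, y)] = yy
			if x%2 == 0 && y%2 == 0 {
				ci := dst.COffset(x, y)
				dst.Cb[ci], dst.Cr[ci] = cb, cr
			}
		}
	}
	return dst
}

// planeSizes reports the byte size of each output plane.
func planeSizes(img *image.YCbCr) string {
	return fmt.Sprintf("Y=%d Cb=%d Cr=%d", len(img.Y), len(img.Cb), len(img.Cr))
}

func main() {
	frame := solidRGBA(1920, 1080, 200, 100, 50)
	fmt.Println(planeSizes(toYCbCr420(frame)))
}
```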

sergystepanov commented 3 years ago

I think you should go with 2 and contribute back to the library, or you can make it standalone; people will come to it when they need it. CloudRetro is a customer of the lib.

(╯°□°)╯︵ ┻━┻

In Cloud Morph, I indeed use FFmpeg for the encoder, so it's easier to use a hardware encoder. However, I haven't tried because I don't have a GPU machine.

Huh? Just for hacking, even second-gen Intel Cores should do. Otherwise, get a thick wallet ready for a cloud GeForce. I still don't get your idée fixe about GPU-accelerated encoding. It's highly unlikely that some hardware encoder would magically fix all the issues with HD streaming. Maybe it brings adequate performance gains, but on the other hand, it'd require a significant bitrate and bandwidth increase in order to compensate for the video quality loss at the initial target bitrate compared to software encoders. For some reason, the main codec for Stadia is software VP9; maybe it can handle 4K just fine. Soon they will switch to the more demanding AV1, trying to decrease streaming traffic.

One thing I wonder about: when I use the H264 encoder, the stream doesn't work on mobile. That's why I'm using VPX right now, but VPX doesn't have hardware acceleration :-<

As far as I know, it should work on Android and iOS. Maybe some encoding params (profiles) are not compatible?

FYI, I found that the efficiency of the YCbCr converter in x264-go is quite low

Yes, memory suffers too. And that's a lot of pixels, actually. In cloud-retro, it's implemented in C. Here is an idiomatic way of measuring/printing elapsed time in Go.
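One common idiom for this is to capture time.Now() before the work and report time.Since(start) afterwards, often through defer. The sketch below is only an illustration: logElapsed, measure and convertFrame are hypothetical names, and the sleep stands in for the real conversion. The literal %v in the logs quoted earlier in the thread suggests a format verb was passed to a log call that does not format its arguments; a Printf-style call avoids that.

```go
package main

import (
	"log"
	"time"
)

// logElapsed logs how long an operation took since start and returns the duration.
func logElapsed(name string, start time.Time) time.Duration {
	elapsed := time.Since(start)
	log.Printf("%s took %v", name, elapsed)
	return elapsed
}

// measure runs fn and returns how long it took.
func measure(fn func()) time.Duration {
	start := time.Now()
	fn()
	return time.Since(start)
}

// convertFrame is a hypothetical stand-in for the work being measured.
func convertFrame() {
	// Arguments of a deferred call are evaluated immediately, so time.Now()
	// records the start, and logElapsed runs when convertFrame returns.
	defer logElapsed("convert to YCbCr", time.Now())
	time.Sleep(5 * time.Millisecond)
}

func main() {
	convertFrame()
	log.Printf("measured again: %v", measure(convertFrame))
}
```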

giongto35 commented 3 years ago

@sergystepanov I'm not obsessed with HW encoding (and I just realized it can be done on a CPU. Thanks :D). I just want to know how much it will improve things. So, have you successfully measured how long each part of the pipeline takes, from user input -> emulator -> user? With that, I could imagine the performance gain better. user input -> event capture -> frame render -> encode time => network time => decode time? That would be ideal.

sergystepanov commented 3 years ago

@sergystepanov I'm not obsessed with HW encoding (and I just realized it can be done on a CPU. Thanks :D). I just want to know how much it will improve things.

Why not just try an OBS stream or Handbrake, with and without it, on any PC or laptop to get some approximations?

So, have you successfully measured how long each part of the pipeline takes, from user input -> emulator -> user? With that, I could imagine the performance gain better. user input -> event capture -> frame render -> encode time => network time => decode time? That would be ideal.

I have not. I know how to measure the whole chain together but not each part separately. Why do you need strictly client-related measurements? RTT can be extracted from the WebRTC stats, and the server code can easily be profiled with Go's tooling.
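For the server side, a minimal sketch of CPU profiling with the standard runtime/pprof package could look like the following. The busyWork function is a hypothetical stand-in for encoding work, and profileToBytes and writeProfile are names invented for this example; none of this is taken from the cloud-game code.

```go
package main

import (
	"bytes"
	"os"
	"runtime/pprof"
)

// busyWork is a hypothetical stand-in for server-side work such as encoding.
func busyWork(n int) uint64 {
	var sum uint64
	for i := 0; i < n; i++ {
		sum += uint64(i) * uint64(i)
	}
	return sum
}

// profileToBytes records a CPU profile while work runs and returns its bytes.
func profileToBytes(work func()) ([]byte, error) {
	var buf bytes.Buffer
	if err := pprof.StartCPUProfile(&buf); err != nil {
		return nil, err
	}
	work()
	pprof.StopCPUProfile() // returns after the profile has been written to buf
	return buf.Bytes(), nil
}

// writeProfile saves the recorded profile to path.
func writeProfile(path string, work func()) error {
	data, err := profileToBytes(work)
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0o644)
}

func main() {
	if err := writeProfile("cpu.prof", func() { busyWork(200_000_000) }); err != nil {
		panic(err)
	}
}
```

The resulting file can then be inspected with go tool pprof cpu.prof.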

Btw, I recommend you to read High Performance Browser Networking as a quick overview of some networking problems. Just a good book if you have nothing to (do) read (:

sergystepanov commented 3 years ago

Writing a new h264 encoder. Now it looks perfect to me (%

@giongto35, that old h264 encoder bundled with the wrapper is more like a raw idea; I wouldn't dare use it in production. I think that if we don't use super-tuned encoding and AV manipulation code with handwritten assembler-level optimizations, then there is no need for raw codec libs (vpx, x264, and so on) with so many ways to shoot yourself in the foot. FFmpeg or GStreamer would be the right direction. When I'm done rewriting the encoder for libx264, I'll add a write-up about some issues with the h264 lib.

sergystepanov commented 3 years ago

Testing new encoder with synthetic data (pre-randomized RGB images)

And here are properly measured results for 1080p (which are insane):

-------------------------------------------------------------------
Intel(R) Core(TM) i7-2620M CPU @ 2.70GHz
-------------------------------------------------------------------

name    old time/op    new time/op    delta
H264-4     282ms ± 2%      23ms ±23%  -91.70%  (p=0.000 n=10+9)

name    old alloc/op   new alloc/op   delta
H264-4    25.7MB ± 0%     0.8MB ±12%  -96.72%  (p=0.000 n=10+10)

name    old allocs/op  new allocs/op  delta
H264-4     5.18M ± 0%     0.09M ±12%  -98.31%  (p=0.000 n=10+10)

Found out about some interesting behavior of Go's bench start/stop timer.
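The table above appears to be benchstat-style output comparing two benchmark runs. As a rough sketch of how such a benchmark with pre-randomized frames might be structured, the example below generates the input once and calls b.ResetTimer so that the setup is excluded from the measurement. The encodeFrame function is a hypothetical placeholder, not the real encoder, and in a project this would normally live in a _test.go file as a BenchmarkXxx function run with go test -bench; testing.Benchmark is used here only to keep the example self-contained.

```go
package main

import (
	"fmt"
	"math/rand"
	"testing"
)

var sink int // keeps the compiler from discarding the result

// encodeFrame is a hypothetical placeholder for a real encoder call.
func encodeFrame(rgba []byte) int {
	sum := 0
	for _, v := range rgba {
		sum += int(v)
	}
	return sum
}

// randomFrames pre-generates n random RGBA frames of size w x h.
func randomFrames(n, w, h int) [][]byte {
	rng := rand.New(rand.NewSource(1)) // fixed seed for repeatable input
	frames := make([][]byte, n)
	for i := range frames {
		frames[i] = make([]byte, w*h*4)
		rng.Read(frames[i])
	}
	return frames
}

func benchEncode(b *testing.B) {
	frames := randomFrames(8, 320, 240) // setup, excluded by ResetTimer below
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		sink = encodeFrame(frames[i%len(frames)])
	}
}

// runBenchmark runs the benchmark outside of the go test command.
func runBenchmark() testing.BenchmarkResult {
	return testing.Benchmark(benchEncode)
}

// summary formats the main figures of a benchmark result.
func summary(r testing.BenchmarkResult) string {
	return fmt.Sprintf("%d iterations, %d ns/op, %d B/op", r.N, r.NsPerOp(), r.AllocedBytesPerOp())
}

func main() {
	fmt.Println(summary(runBenchmark()))
}
```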

sergystepanov commented 3 years ago

@hwwxj, I think the new encoder is ready for you to test if you are interested. The branch is feature/x264 (don't forget to install the libx264 lib). I added some config options to the config.yaml file. For 1080p on my hardware, it's just a 1-frame delay with h264 and 5 frames with the vp8 codec.

Now the bad part. h264 is noticeably worse visually for encoding low-res frames in realtime compared to vpx codecs. Yes, it's faster, but there are a lot of visual artifacts you have to deal with. Right now I have strange color shimmering on jumping Mario (Super Mario) and horizontal lines in Sushi The Cat (I'll add clips to the PR when I have time). These artifacts are present in the old h264 encoder as well. No idea how to get rid of them yet.