moonlight-stream / moonlight-qt

GameStream client for PCs (Windows, Mac, Linux, and Steam Link)
GNU General Public License v3.0

M2 Mac decode time longer than M2 iPad decode time #1087

Closed lprhodes closed 5 months ago

lprhodes commented 1 year ago

Describe the bug: I've noticed that a decode time of over 3ms introduces noticeable lag when playing FPS games with a mouse and keyboard. Thankfully, the Moonlight client on an ROG Ally and on an Intel-based gaming laptop shows a decode time of less than 1ms, and everything feels amazing.

A Samsung Galaxy Tab 9 Ultra with a Snapdragon Gen 2, however, has a 9ms decode time, which introduces noticeable lag and a "motion blur" effect and makes FPS games much harder to play.

I then tried Moonlight on an M2 Mac and noticed the same lag and motion blur feel (due to the ~4ms to ~6ms decode time). I also tested the Windows ARM build of Moonlight under Parallels, which had similar results for H.264 decoding, although it was using a software decoder in this case.

I finally tried Moonlight for iOS on an M2 iPad and the same issue didn't exist. I can't see the decode time, however it feels as responsive as the ROG Ally and Intel based gaming laptop so I would assume it's ~1ms.

What I'd like to find out is why an M2 iPad is performing better than an M2 MacBook (and also an M2 Ultra Mac Studio) when in theory it's using the same hardware decoders. I'd subsequently like to find out why Macs are performing so poorly when their decoders are known to be pretty decent.

I'm an iOS/Mac software engineer and would be happy to look into this myself, but any kind of pointers would be much appreciated.

Also, is the Snapdragon Gen 2 decoder really so bad that it should be reaching ~9ms for decoding?

Note: I used the same ethernet connection for all devices with <1ms network latency.

Steps to reproduce: Stream from Sunshine using Moonlight on an M2 Mac (or M2 Max, or M2 Ultra) and compare with streaming to an M2 iPad.

Affected games: BF2042, COD MW2, MW3

cgutman commented 1 year ago

It's probably related to the different renderers in use on iOS vs macOS. On the iOS client, we feed an AVSampleBufferDisplayLayer the undecoded H.264 and HEVC bitstream in CMSampleBuffers. It internally decodes the compressed video stream to YUV data and renders that to the layer.

On the macOS client, we feed the video data into the VideoToolbox APIs (via FFmpeg) which gives us back a CVPixelBufferRef of decoded YUV data that we then feed into an AVSampleBufferDisplayLayer.

I have not been very impressed with AVSampleBufferDisplayLayer's performance on the Mac. It seems to introduce additional display latency that's not present on other platforms (and not counted by our performance stats). I intend to work on a real Metal renderer sometime soon (though I certainly wouldn't object if you wanted to take a stab at that). The CVPixelBuffer should be mappable to a MTLTexture via CVMetalTextureCacheCreateTextureFromImage(), then that can be rendered via Metal.

I'd be curious if you see differences in behavior if you set the VT_FORCE_INDIRECT=1 environment variable and then launch Moonlight. Rather than passing the CVPixelBuffer directly to AVSampleBufferDisplayLayer (theoretically a zero-copy operation), we will instead use SDL's Metal renderer but it requires us to map and lock the CVPixelBuffer, read all the data out, and upload that to an SDL texture to render it. It's less efficient (though not as bad on unified memory architectures like Apple Silicon) but it may avoid the latency penalty of AVSampleBufferDisplayLayer.
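
One way to try this, sketched in Python: set the variable only for the process that launches Moonlight. The binary path below is an assumption (a default install under /Applications), not something stated in this thread, so adjust it to match your install. Keep in mind that a variable exported in a shell profile only reaches programs started from that shell, so an app opened from Finder or the Dock will generally not see it.

```python
import os
import subprocess

# Assumed install location; not taken from the thread.
MOONLIGHT_BIN = "/Applications/Moonlight.app/Contents/MacOS/Moonlight"


def indirect_env(base=None):
    """Return a copy of the environment with VT_FORCE_INDIRECT=1 set."""
    env = dict(os.environ if base is None else base)
    env["VT_FORCE_INDIRECT"] = "1"
    return env


def launch_moonlight(binary=MOONLIGHT_BIN):
    """Start Moonlight with the variable applied to that process only."""
    return subprocess.Popen([binary], env=indirect_env())
```

Calling `launch_moonlight()` starts the app with the variable set for that one run. Prefixing the same binary path with `VT_FORCE_INDIRECT=1` on a terminal command line should have the same effect.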

belthesar commented 10 months ago

I've been dealing with some pretty bad latency connecting via local network from my M1 MacBook Pro to my Windows PC via Moonlight. Largely, the issue is due to mouse lag, regardless of using RDP mode or not. On a lark though, I decided to try launching Moonlight with this environment variable, and a significant amount of the lag I'm experiencing is gone. Still need to dig through other issues to see if there is more I can do to improve performance here, but this certainly helps quite a bit.

maxiedaniels commented 6 months ago

Sorry, how did you try VT_FORCE_INDIRECT=1? I exported it in my .zshrc file and restarted Moonlight, but no luck. Do you have to launch Moonlight with the variable as a parameter or something?

cgutman commented 6 months ago

The latest nightly builds have a new Metal-based renderer that should avoid the need for VT_FORCE_INDIRECT=1.

You can test it from here: https://ci.appveyor.com/project/cgutman/moonlight-qt/builds/49795697/job/q2f3adjbrm2uhjap/artifacts

belthesar commented 6 months ago

Exciting! I've downloaded the build, and I'll try it out over the next week or so to see how things work! Thanks for putting time into improving rendering on ARM Macs!

maxiedaniels commented 6 months ago

Is there any way to make sure it's running on Metal?

cgutman commented 6 months ago

In the Moonlight log file (located in /tmp), you will see "Using VideoToolbox Metal renderer" if it's using the new one or "Using VideoToolbox AVSampleBufferDisplayLayer renderer" if it's using the old one.

The Metal renderer should be used by default for all Macs on macOS 11 or later, except for old Mac Pro and iMac Pro systems that don't have integrated graphics.
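
The log check described above can be scripted. This is a minimal sketch: the two message strings are the ones quoted in this comment, while the log filename pattern is an assumption (the comment only says the file is in /tmp) and the function names are made up for illustration.

```python
import glob
import os

METAL_LINE = "Using VideoToolbox Metal renderer"
LEGACY_LINE = "Using VideoToolbox AVSampleBufferDisplayLayer renderer"


def detect_renderer(log_text):
    """Return 'metal', 'avsbdl', or None, based on the last matching line."""
    result = None
    for line in log_text.splitlines():
        if METAL_LINE in line:
            result = "metal"
        elif LEGACY_LINE in line:
            result = "avsbdl"
    return result


def newest_log(pattern="/tmp/Moonlight-*.log"):
    """Return the most recently modified matching log path, or None.

    The filename pattern is an assumption; adjust it to the real file name.
    """
    paths = glob.glob(pattern)
    return max(paths, key=os.path.getmtime) if paths else None
```

To use it, read the file returned by `newest_log()` and pass its contents to `detect_renderer()`. A result of `None` means neither message was found, for example because no stream has been started yet.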

belthesar commented 6 months ago

Just getting around to testing so far, but early testing feels like a significant improvement over the previous renderer pipeline. Only doing basic RD-style tasks at the moment, but about to fire up a game to test more.

moi952 commented 6 months ago

I tried your version. I think I notice a change when I move my mouse, but I'm not sure. Decoding latency remains high (5ms). Are Apple silicon chips bad at decoding compared to Intel / Nvidia?

belthesar commented 6 months ago

I've been driving this for about 24 hours now, and there are definitely some major improvements here. Unlike the previous renderer, I do not receive frequent stutters on the new renderer. Because of the lack of stutters, I also seem to get better mouse movements.

An issue I still have that I thought would be related to bad renderer interruptions is that when holding a key on my keyboard for movement, the longer I hold it, the slower it seems to respond to a release, leading me to believe there's a buffering issue there. I haven't checked for other issues to see if that's a reported problem, but I'm guessing it's a separate issue.

I haven't checked decode delay itself (it definitely feels like there's a >1ms delay there), but the lack of stutters makes it more than fine for my current use case.

lprhodes commented 6 months ago

I no longer have a Mac to verify on, unfortunately, but before I sold it I ended up building and running the iPad app on macOS, which ran amazingly compared to the Mac-specific build.

5ms decoding latency doesn’t seem high FWIW. I couldn’t perceive anything under 20ms, but I guess it’d be different for everyone.

belthesar commented 6 months ago

5ms decode time is only part of the equation, though. On a wired LAN it's reasonable to assume ~1ms of transit time between the encoder and the decoder (1-5ms over WLAN), plus some ms of encode time (I don't have data from my Sunshine instance to know what the encode times look like there). If the total pushes above 16ms, you're in one-frame-of-lag territory (assuming 60 FPS).

In my experience, I'm not seeing better than 5ms decode times, regardless of codec. What I am experiencing, however, is a complete elimination of stutters and freezes with the new rendering method.
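
The latency budget above can be made concrete with a small calculation. This is only a sketch: it simply adds the three stages named in the comment, and the encode time used in the example is a hypothetical placeholder because no measured value is given in this thread.

```python
import math


def frame_interval_ms(fps):
    """Time between frames in milliseconds at the given frame rate."""
    return 1000.0 / fps


def pipeline_latency_ms(encode_ms, network_ms, decode_ms):
    """Sum of the three stages discussed above (display latency not included)."""
    return encode_ms + network_ms + decode_ms


def whole_frames_behind(total_ms, fps=60):
    """Number of full frame intervals covered by the total latency."""
    return math.floor(total_ms / frame_interval_ms(fps))


# Hypothetical 6ms encode, ~1ms wired LAN transit, 5ms decode.
total = pipeline_latency_ms(encode_ms=6.0, network_ms=1.0, decode_ms=5.0)
frames = whole_frames_behind(total, fps=60)
```

At 60 FPS the frame interval is about 16.7ms, so the 12ms total in this example stays under one frame. An additional 5ms anywhere in the chain would push it past that threshold.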

cgutman commented 5 months ago

New Metal renderer released in v6.0.0