mamedev / mame

MAME
https://www.mamedev.org/

Add new Lagless VSYNC ON Algorithm developed by Blur Busters #3344

Closed blurbusters closed 6 years ago

blurbusters commented 6 years ago

An algorithm for a tearingless VSYNC OFF that creates a lagless VSYNC ON.

An approximate synchronization of the real-world raster to the emulated raster, with a forgiving jitter margin (<1ms).

Essentially, a de facto rolling-window scanline buffer achieved via high-buffer-swap-rate VSYNC OFF (redundant buffer swaps) -- with the emulator raster scanning ahead of the real-world raster, with a padding margin generous enough to absorb performance jitter, making scanline-exact sync unnecessary -- just approximate raster sync (within ~0.1ms to ~0.2ms, practical with C/C++/asm programming).

There are raster-polling functions on some platforms (e.g. RasterStatus.ScanLine as well as D3DKMTGetScanLine() ...) to let you know of the real-world raster. (Other platforms may need to extrapolate based on time intervals between VSYNC heartbeats). This can be used to synchronize to the emulator's raster within a rolling-window margin.
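For platforms without a raster register, the extrapolation mentioned above can be sketched in a few lines. This is a minimal illustration, not any platform's actual API -- the function name and parameters are mine; a real implementation would feed it timestamps from the platform's VSYNC heartbeat and timing data from the modeline:

```cpp
#include <cassert>

// Estimate the current real-world scanline from the time elapsed since the
// last VSYNC heartbeat. Needs the display's Vertical Total (all scanlines,
// including the VBI) and the refresh period, both obtainable from the
// modeline. Illustrative sketch only; names are not a real API.
int estimate_scanline(double seconds_since_vsync,
                      int vertical_total,
                      double refresh_period_seconds)
{
    // Fraction of the refresh cycle elapsed; wrap in case a heartbeat was missed.
    double phase = seconds_since_vsync / refresh_period_seconds;
    phase -= static_cast<int>(phase);                // keep fractional part
    return static_cast<int>(phase * vertical_total); // 0 .. vertical_total-1
}
```

For a 60 Hz signal with a Vertical Total of 525, halfway through the refresh period this lands on line 262 -- accurate enough, given the jitter margin, to beam-race without ever touching a hardware register.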

Simplified diagram: Lagless VSYNC ON

Long Explanation: https://www.blurbusters.com/blur-busters-lagless-raster-follower-algorithm-for-emulator-developers/

This is a 1/5th frame version, but in reality, frame slices can be as tiny as 1 or 2 scanlines tall - or a few scanlines tall, computer-performance-permitting.

Tests show that I can successfully do roughly 7,000 redundant buffer swaps per second (about 2 NTSC scanlines each) of 2560x1440 framebuffers on a GTX 1080 Ti, even in a high-level garbage-collected language (C#).

With proper programming technique, darn near 1:1 sync of real raster to virtual raster is now possible. But exact sync is unnecessary thanks to the jitter margin, and the raster position can be computed (to within a ~0.5-1ms margin) by extrapolating between VSYNC timings if the platform has no raster-scanline register available.

Even C# was able to do this within a ~0.2-0.3ms jitter margin, except during garbage-collection events (which cause a brief momentary surge of tearing artifacts).

But C++ or C or assembler would have no problem, and could probably do it within <0.1ms -- permitting sync within +/- 1 scanline of NTSC (15.6 KHz scan rate). Input lag will be roughly equivalent to ~2x the jitter margin you choose -- and if your performance is good enough for 1-emulated-scanline sync, that's literally 2/15625 second of input lag (less than 0.2ms). All of this was successfully achieved with standard Direct3D or OpenGL APIs during VSYNC OFF operation, on platforms that give you access to polling the graphics card's current-raster register. The jitter margin can be automatic-adaptive or set in a configuration file.

This can be made compatible with HLSL (though a larger jitter margin will be essential, e.g. 1ms granularity rather than 0.1ms granularity) since you're forcing the GPU to continually reprocess HLSL. As a performance optimization, one could modify the HLSL to only process a frameslice's worth at a time, but I'd imagine that would be a difficult rearchitecturing.

Most MAME emulators emulate graphics in a raster-based fashion, and would be compatible with this lagless VSYNC ON workflow, with some minor (eventual, long-term) architectural changes to provide the necessary hooks for mid-frame busywaiting on realworld raster + mid-frame buffer swaps.
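The mid-frame busywait + mid-frame buffer swap workflow can be simulated in portable C++. This is a hedged sketch of the loop shape only: `get_raster` stands in for a platform raster poll (or the extrapolation above) and `present` stands in for a VSYNC OFF buffer swap -- both names are illustrative, not MAME hooks:

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Simulated beam-racing loop: render the frame in horizontal slices, each
// presented just as the real-world raster enters the slice above it, so the
// emulated raster stays roughly one slice (the jitter margin) ahead.
// Illustrative sketch; get_raster and present are placeholder callbacks.
std::vector<int> race_frame(int total_lines, int slices,
                            const std::function<int()> &get_raster,
                            const std::function<void(int)> &present)
{
    std::vector<int> presented;
    const int slice_height = total_lines / slices;
    for (int s = 0; s < slices; ++s)
    {
        // Busy-wait until the real raster reaches this slice's top edge.
        while (get_raster() < s * slice_height) { /* spin */ }
        present(s);              // redundant VSYNC OFF swap with slice s drawn
        presented.push_back(s);
    }
    return presented;
}
```

With a fake raster that advances monotonically, the loop presents slices strictly in scan order -- which is the whole invariant the jitter margin exists to protect.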

To read more about this algorithm and its programming considerations, see https://www.blurbusters.com/blur-busters-lagless-raster-follower-algorithm-for-emulator-developers/

blurbusters commented 6 years ago

FYI, this is also called "beam racing" for others who are familiar with this term.

It's currently used in some applications like VR: https://www.imgtec.com/blog/reducing-latency-in-vr-by-using-single-buffered-strip-rendering/

That article has some neat animated graphics at low granularity (4 frameslices) that will help make it easier to understand the concept of beam-racing techniques. Fine granularity is quite doable (10-scanline frameslices and smaller) with tight racebehind.

The algorithm would be much simpler and lower-bandwidth with front-buffer rendering (just add rasters), but the bandwidth of VSYNC OFF framebuffer transmissions is now high enough on modern computers to emulate front-buffer rendering without access to the front buffer -- even down to single-emulator-scanline granularity (though realistically, with a forgiving jitter margin of a few scanlines).

rb6502 commented 6 years ago

This would dramatically degrade the performance of post-processing shaders and GPU artwork compositing, it would break GSync and FreeSync, scanline polling is aggressively non-portable, and it's completely incompatible with our goal of eventually rasterizing polygonal games on the GPU. So it's an easy "no".

blurbusters commented 6 years ago

Are you sure it's a forever no? From a video game preservation perspective, it is important to preserve the original latency of 8-bit games -- not worsen it or close the door on it. And it's not that non-portable.

It's already implemented as an optional mod to MAME

An experimental implementation for MAME: https://forums.blurbusters.com/viewtopic.php?f=10&p=31750#p31750 In addition, WinUAE has agreed to implement this algorithm. Stay tuned...

The algorithm can be made portable

Use Linux modelines or Windows QueryDisplayConfig() to get the exact horizontal scan rate AND the Vertical Total, then subtract the vertical resolution from the Vertical Total to get the size of the VBI in scanlines. With these two numbers (VBI size divided by horizontal scan rate), you can compute the exact VBI time to the sub-microsecond on any graphics card. Knowing your computer's exact VBI time (and fortunately, standard Windows APIs provide it -- it works!), it is possible to begin beam racing only a few scanlines ahead of the real raster.

Incidentally, platforms that don't give you a raster register but do give you a VBI heartbeat can successfully extrapolate the predicted raster value (to single-NTSC-scanline accuracy) from just (A) the Vertical Total, (B) the vertical resolution, and (C) the VSYNC heartbeat.

All of this is available on many platforms (standard Linux modelines, standard Win32 API calls, etc). A module could black-box this to scanline accuracy without needing to poll a real raster register (for fixed-Hz displays too).
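The VBI arithmetic described above (Vertical Total minus vertical resolution, divided by horizontal scan rate) is a one-liner. A minimal sketch, assuming the caller has already pulled the three numbers from a modeline or a QueryDisplayConfig-style API:

```cpp
#include <cassert>

// Duration of the vertical blanking interval, computed from numbers any
// modeline provides: Vertical Total (all scanlines per refresh), visible
// vertical resolution, and horizontal scan rate in Hz. Sketch only.
double vbi_seconds(int vertical_total, int vertical_resolution,
                   double horizontal_scan_rate_hz)
{
    int vbi_lines = vertical_total - vertical_resolution;  // blanking lines
    return vbi_lines / horizontal_scan_rate_hz;            // seconds
}
```

For NTSC-style timing (Vertical Total 525, 480 visible lines, ~15734 Hz scan rate), this yields 45 blanking lines, roughly 2.86 milliseconds of VBI per refresh.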

You can guess the raster to the nearest scanline WITHOUT polling hardware

I'm a wizard at Linux modelines & custom resolutions, and I'm able to extrapolate a line-accurate raster from a VBI heartbeat (vblank interval) -- so you do not have to poll a hardware raster register, provided (A) you know the mathematics of video signal timings, (B) you have access to a microsecond-accurate clock, and (C) you have access to a VBI heartbeat.

Even Android apps are doing some beamchasing (for virtual reality) so beam chasing is cross-platform already today in real-world implementations. Android beamchasing here: https://www.imgtec.com/blog/reducing-latency-in-vr-by-using-single-buffered-strip-rendering/

NVIDIA is also making this a more widely accessible API: the latency demands of virtual reality rendering have made beam chasing necessary in 3D rendering, so APIs are emerging too (above and beyond the already cross-platform techniques I know).

So this can be made cross-platform today if you glue together all the scattered information I've described and merge it into a cross-platform module that waterfalls from direct hardware polls down to an approximated poll, etc.

At the end of the day, a roughly 80%-cross-platform solution is possible: a RasterScanLine.(dll / lib / so / etc) library that does the following:

A standalone portable raster module can cascade down a waterfall from hardware polls to software emulation of the raster register, with configurability (tile size / strip size / granularity). The only job for an emulator module would be to make a single call every time a raster is rendered -- such a theoretical module/library would handle the rest.
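The waterfall idea can be sketched as a tiny class: try the hardware scanline poll first, and fall back to software extrapolation when no poll exists. Everything here is hypothetical scaffolding -- the class name and both callbacks are stand-ins for platform-specific implementations, not an existing library:

```cpp
#include <cassert>
#include <functional>

// Illustrative "waterfall" raster source. A hardware poll (e.g. wrapping a
// platform scanline register) is preferred; when absent or failing, the
// source falls back to VSYNC-heartbeat extrapolation. Sketch only.
class raster_source
{
public:
    raster_source(std::function<bool(int &)> hw_poll,
                  std::function<int()> extrapolate)
        : m_hw_poll(std::move(hw_poll))
        , m_extrapolate(std::move(extrapolate))
    { }

    int current_scanline()
    {
        int line = 0;
        if (m_hw_poll && m_hw_poll(line))  // hardware register, if present
            return line;
        return m_extrapolate();            // software fallback
    }

private:
    std::function<bool(int &)> m_hw_poll;
    std::function<int()> m_extrapolate;
};
```

An emulator core would then call `current_scanline()` and never care which tier of the waterfall answered -- which is exactly the black-boxing argued for above.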

Maybe it doesn't belong in MAME this year, but don't close the door to future years -- after all, HLSL took a long time.

It is achievable with garden variety APIs

This algorithm is compatible with garden-variety front buffer rendering or high-frequency VSYNC OFF page flipping. Also, DisplayPort, DVI, HDMI are all raster-based outputs with top-to-bottom scanning!

Those are pretty industry-standard API workflows that work with almost any 3D graphics language. It is NOT as unportable as you think.

It's simple video-timings mathematics (for those who understand video timings), and the same concept works on Linux, Mac, and PC. Obviously, this optional mode should only be enabled in Hz-for-Hz mode.

Different sizes of VBI doesn't break things

Also: VBI size apparently doesn't matter, and scaling doesn't matter (as long as you convert values accordingly, to keep the same relative physical margin). Beam racing works fine with scaling (e.g. line 540 of 1080p corresponding to line 120 of 240p) and with non-linear scaling (e.g. HLSL), with only hundreds-of-microseconds degradation in original-lag accuracy. If you want border effects, just scale a little differently; the important thing is that the emulated raster stays physically below the real raster, however you decide to map out the emulator layout (HLSL effects, raster fuzz effects, border effects, whatnot). For curved HLSL scanlines, the real raster would just have to stay above the highest pixel of that HLSL raster -- somewhat of an upward vertical shift (a bigger jitter margin).

There may be slight divergences in lag linearity if the VBI-to-active ratio of the emulator versus the real world differs, but typically this is hundred-microsecond-scale stuff. Besides, you can use CRU (Custom Resolution Utility) to make sure the VBI keeps a 480:525 ratio at whatever higher resolution you want (e.g. VT1181 for 1080p, which many newer LCD monitors will sync to -- since 1080:1181 is almost identical to NTSC's 480:525 active-to-total ratio), if you indeed want an exact-ratio VBI (but that's cherry-picking microseconds at this stage -- at least you have the option).

Regardless, the beam-chasing algorithm is VBI-size-independent; any VBI-time-ratio differences only introduce minor vertical lag-gradient nonlinearities (on hundreds-of-microseconds timescales) between the emu "signal" and the real signal. And you gain the option of an exact timings match (no scaling), or an exact ratio match.
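The "540 of 1080p corresponding to 120 of 240p" mapping above is plain proportional arithmetic; a hypothetical helper (names mine, optional margin parameter included to model the jitter-margin shift) might look like:

```cpp
#include <cassert>

// Map an emulated scanline to the corresponding real-world scanline when the
// output resolution differs (e.g. 240-line emulation on a 1080p signal).
// Proportional mapping preserves the relative beam position; margin_lines
// optionally shifts the target downward so the real raster stays safely
// below the emulated one. Illustrative sketch.
int map_scanline(int emu_line, int emu_height, int real_height,
                 int margin_lines = 0)
{
    return emu_line * real_height / emu_height + margin_lines;
}
```

So line 120 of a 240-line frame maps to line 540 of a 1080-line signal, and a margin of a few real scanlines implements the forgiving jitter window discussed earlier.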

Compositing performance

Some graphics cards have incredibly high compositing performance now, when running uncapped, some MAME games composite HLSL effects at 600+ frames per second! Compositing performance will still work with a lower granularity (e.g. 10 tiles or 20 tiles per refresh cycle).

Remember, today's iPhones are blitting an insane number of pixels, and graphics bandwidth is continually increasing. Compositing performance would eventually not be an issue, especially if you subdivide the screen only into tenths.

I recall that MAME once refused HLSL. Now, HLSL is part of MAME. The same may happen to realtime beam chasing. MAME historically didn't need a GPU to run, but a GPU is now recommended.

And did you know GPU's are adding beam racing implementations because of virtual reality?

GSYNC/FreeSync is still essentially variable-size VBI

As you know, Blur Busters was the one that popularized the lag mechanics of GSYNC/FreeSync, and its huge benefits to emulators, so believe me, I know!

Also (except for weird audio distortions), beam chasing is not fundamentally incompatible with FreeSync/GSYNC -- it can actually reduce lag further with them too. The raster register still works (for the visible lines); FreeSync is simply a variable-size VBI, so you can still beam-race the scanout, beginning with the first buffer flip. It would require GSYNC+VSYNC OFF or FreeSync+VSYNC OFF combined: you enable beam chasing beginning with the first buffer flip that triggers the new refresh cycle (once the emu raster is a few lines ahead), then you beam-chase that one refresh cycle as if that specific (variable) refresh cycle were simply a fixed refresh cycle.

Beam chasing will also work with high-acceleration scanout (e.g. fast-scanning the refresh) with long VBI pauses between refresh cycles -- which will cause some audio distortions unless the audio is retimed on the fly.

That said, beam chasing probably should (for simplicity) be disabled automatically when running in high-scan-velocity VRR mode (e.g. 60fps with 1/240sec scanout velocities on a 240Hz monitor). I only mention this to demonstrate that it's not as "incompatible" as you make it out to be.

That said, people do sometimes enable weird config combos, like speedups/slowdowns of MAME modules (e.g. 50/60 Hz speedups/slowdowns), and this would just be a new quirky side effect.

MAME does have a 60fps synchronized execution mode, and that's the mode that can benefit from the optional enabling of a raster-chasing module.

It can be an optional feature

Real-time beam chasing can be an optional feature, available only to raster-based game modules. You already have modules that only enable themselves for certain games -- e.g. the 6502 module or the Zilog Z80 module; the Nintendo 64 never uses these. The raster-chaser module could likewise be optional. Think differently.

From what you are saying, you're willing to degrade 8-bit device lag to support more modern 3D games? To close the door forever on the original lag-feel?

This and your goals are not necessarily mutually exclusive. I'd suggest an "incubate", "low priority", "needs tests", or "defer" status for this item, rather than a Closed state. There are architectures that would turn this into a minor modification of MAME.

You may say no, but other emu authors have agreed to implement this algorithm, so keep an open mind, at least as a long-term possibility -- as long as the hooks are not too messy and the work is sufficiently self-contained in a portable module (as described above).

The same API hooks may also work with future rendering architectures -- e.g. 1000 Hz displays coming circa 2025 via coarse scanning. Think of a future HLSL that does real-time CRT scanning emulation on future, more powerful GPUs (at least at an alpha-blended tile level, like the frames of a high-speed video of a CRT), creating a low-persistence scanned HLSL emulation on a 1000Hz display (like software-based rolling-scan BFI). So the raster hooks don't have to be used only for real-time beam chasing, but also for mapping 60Hz creatively onto ultra-high-Hz displays. The door SHOULD remain open to raster hooks for game-preservationist purposes.

Again, what I am saying is that raster hooks are becoming more and more portable and game-preservationist friendly, if you read all of what I've written.

After all, virtual reality is bringing back a lot more beam racing APIs back to many platforms!

MooglyGuy commented 6 years ago

What RB is saying, albeit cryptically and a little bit too bluntly, is that the one graphics programmer on the team - me - has no desire to implement this feature request, as he's got quite a lot of non-graphics-related things on his plate already.

If you code with the same level of alacrity with which you pitch this technology, then we all eagerly await your pull request. But no level of salesmanship is going to interest me in spending the rather considerable amount of time necessary to implement this in an appropriately OS-agnostic, platform-agnostic, and efficient way.

rb6502 commented 6 years ago

But that's not all I'm saying. Calamity has implemented this for GroovyMAME and if I'm understanding correctly it breaks every driver that doesn't support partial updating, which is most of them. (Also, Calamity told this guy to stop bothering us and you see how well that worked).

Moreover it can't work with OpenGL, on Linux, or on MacOS. He yells "modelines" to cover Linux, at a time when X11 and modelines are halfway down the toilet, and when people not running GroovyMAME are using auto-generated video modes created from EDID data so no modelines exist.

blurbusters commented 6 years ago

MooglyGuy

What RB is saying, albeit cryptically and a little bit too bluntly, is that the one graphics programmer on the team - me - has no desire to implement this feature request, as he's got quite a lot of non-graphics-related things on his plate already.

Fair enough! I'm talking to a few (eager ones) about the possibility of a small open-source cross-platform beam chasing discovery library for any "VSYNC OFF" tearing compatible API (including OpenGL) to hide all the mess. Oculus already has an equivalent for Android. There is some code in Blur Busters Strobe Utility I'm going to donate to the cause. In a couple years, hopefully it will gradually lower the complexity bar for the VR/emu/etc devs who have interest in it.

(Also, Calamity told this guy to stop bothering us and you see how well that worked).

Apologies -- mail crossed each other. Calamity posted that message after my last post (timestamps are in different timezones). I won't bother you anymore; apologies. You do a great job as it is!

This is my last message on this issue item, to apologize.

mdrejhon commented 4 years ago

Just to update everyone, the ghost is me.

I converted my account from personal to business, and it ghosted my previous username. Ouch. Oh well. But I'm the original creator of this thread.