libretro / RetroArch

Cross-platform, sophisticated frontend for the libretro API. Licensed GPLv3.

http://www.libretro.com

GNU General Public License v3.0

10.38k stars 1.84k forks source link

Add Beam Racing/Scanline Sync to RetroArch (aka Lagless VSYNC) #6984

Open blurbusters opened 6 years ago

blurbusters commented 6 years ago

Feature Request Description

A new lagless VSYNC technique has been developed that is already implemented in some emulators. This should be added to RetroArch too.

Bounty available

There is currently a BountySource of about $500 to add the beam racing API to RetroArch plus support at least 2 emulator modules (scroll below for bounty trigger conditions). RetroArch is a C / C++ project.

Synchronize emu raster with real world raster to reduce input lag

It is achieved via synchronizing the emulator's raster to the real world's raster. It is successfully implemented in some emulators, and uses less processing power than RunAhead, and is more forgiving than expected thanks to a "jitter margin" technique that has been invented by a group of us (myself and a few emulator authors).

For lurkers/readers: Don't know what a "raster" or "beam racing" is? Read WIRED Magazine's Racing the beam article. Many 8-bit and 16-bit computers, consoles and arcade machines utilized similar techniques for many tricks, and emulators typically implement them

Already Proven, Already Working

WinUAE -- GitHub Issue
WinUAE -- Download Announcement GroovyMAME -- Dropbox .7z file: Successful experiment via unsubmitted patch by Calamity (and thread)

There is currently discussion between other willing emulator authors behind the scenes for adding lagless VSYNC (real-world beam racing support).

Preservationist Friendly. Preserves original input lag accurately.

Beam racing preserves all original latencies including mid-screen input reads.

Less horsepower needed than RunAhead.

RunAhead is amazing! That said, there are other lag-reducing tools that we should also make available too.

Android and Pi GPUs (too slow for RunAhead in many emulators) even work with this lag-reducing technique.

Beam racing works on PI/Android, allows slower cycle exact emulators to have dramatic lag reductions, We have found it scales in both direction. Including Android and PI. Powerful computers can gain ultra-tight beam racing margins (sync between emuraster and realraster can be sub-millisecond on GTX 1080 Ti). Slower computers can gain very forgiving beam racing margins. The beam racing margin is adjustable -- can be up to 1 refresh cycle in size.

In other words, graphics are essentially raster-streamed to the display practically real-time (through a creative tearingless VSYNC OFF trick that works with standard Direct3D/OpenGL/Metal/etc), while the emulator is merrily executing at 1:1 original speed.

Diagrammatic Concept

Lagless VSYNC

Lagless VSYNC jitter margin

Just like duplicate refresh cycles never have tearlines even in VSYNC OFF, duplicate frameslices never have tearlines either. We're simply subdividing frames into subframes, and then using VSYNC OFF instead.

We don't even need a raster register (it can help, but we've come up with a different method), since rasters can be a time-based offset from VSYNC, and that can still be accurate enough for flawless sub-millisecond latency difference between emulator and original machine.

Emulators can merrily run at original machine speed. Essentially streaming pixels darn-near-raster-realtime (submillisecond difference). What many people don't realize is 1080p and 4K signals still top-to-bottom scan like an old 60Hz CRT in default monitor orientation -- we're simply synchronizing to cable scanout, the scanout method of serializing 2D images to a 1D cable is fundamnetally unchanged. Achieving real raster sync between the emulator raster and real raster!

Many emulators already render 1 scanline at a time to an offscreen framebuffer. So 99% of the beam racing work is already done.

Simple Pre-Requisites

Distilling down to minimum requirements makes rasters cross-platform:

Platform supports a VSYNC OFF mode
Platforms is able to provide VSYNC timestamps
Platform supports high-precision counters (sub-millisecond-accuracy counters) Such as RTDSC or QueryPerformanceCounter or std::chrono::high_resolution_clock
PC, Mac, Android, Pi, Radeon, GeForce, Intel, all supports beamraced frame slice technique

We use beam racing to hide tearlines in the jitter margin, creating a tearingless VSYNC OFF (lagless VSYNC ON) with a very tight (but forgiving) synchronization between emulator raster and real raster.

The simplified retro_set_raster_poll API Proposal

Proposing to add an API -- retro_set_raster_poll -- to allow this data to be relayed to an optional centralized beamracing module for RetroArch to implement realworld sync between emuraster and realraster via whatever means possible (including frameslice beam racing & front buffer beam racing, and/or other future beam racing sync techniques).

The goal of this API simply allows the centralized beamracing module to do an early peak at the incomplete emulator refresh cycle framebuffer every time a new emulator scan line has been plotted to it.

This minimizes modifications to emulators, allowing centralization of beam racing code.

The central code handle its own refresh cycle scanout synchronization (busylooping to pace correctly to real world's raster scan line number which can be extrapolated in a cross-platform manner as seen below!) without the emulator worrying about any other beam racing specifics.

Further Detail

Basically it's a beam-raced VSYNC OFF mode that looks exactly like VSYNC ON (perfect tearingless VSYNC OFF). The emulator can merrily render at 1:1 speed while realtime streaming graphics to the display, without surge-execution needed. This requires far less horsepower on the CPU, works with "cycle-exact" emulators (unlike RunAhead) and allows ultra low lag on Raspberry PI and Android processors. Frame-slice beam racing is already used for Android Virtual Reality too, but works successfully for emulators.

Which emulators does this benefit?

This lag reduction technique will benefit any emulator that already does internal beam racing (e.g. to support original raster interrupts). Nearly all retro platforms -- most 8-bit and 16-bit platforms -- can benefit.

This lag-reduction technique does not benefit high level emulation.

Related Raster Work on GPUs

Doing actual "raster interrupts" style work on Radeon/GeForces/Intels is actually surprisingly easy: tearlines are just rasters -- see YouTube video.

This provide the groundwork for lagless VSYNC operation, synchronization of realraster and emuraster. With the emulator method, the tearlines are hidden via the jittermargin approach.

Common Developer Misconceptions

First, to clear up common developer misconceptions of assumed "showstoppers"...

Yes, it can work with 60Hz, 120Hz, 180Hz, 240Hz (simply beam racing cherrypicked refresh cycles -- requires surge execution for beam racing "fast" refresh cycles), works in WinUAE
Yes, it's more forgiving than expected of computer performance fluctuations (jitter margin technique)
Yes, it can work simultaneously with RunAhead (if need be, though not necessary). Simply beam race the final/visible frame.
Yes, it works simultaneously with variable refresh rate (see this post), works in WinUAE
Yes, you can easly enter/exit beamracing mode on the fly (e.g. screen rotation to incompatible scan direction, switch to windowed operation)
Yes, it works with scaled and HLSL/shaders/fuzzylines, as it always works in WinUAE. It does slow things down, and requires optimizations to speed up again (but this can be solved as a separate optimization). Any distortions (e.g. curves, or line fuzz) can be hidden in the jitter margin height technique, to be 100% artifactless
Yes, it can be used in conjunction with black frame insertion (including for the 31KHz 240p compatibility mode for MAME arcade machines; though that will require 2x surge-execute during a fast 1/120sec scanout of the visible refresh cycle).

Proposal

Recommended Hook

Add the per-raster callback function called "retro_set_raster_poll"
The arguments are identical to "retro_set_video_refresh"
Do it to one emulator module at a time (begin with the easiest one).

It calls the raster poll every emulator scan line plotted. The incomplete contents of the emulator framebuffer (complete up to the most recently plotted emulator scanline) is provided. This allows centralization of frameslice beamracing in the quickest and simplest way.

Cross-Platform Method: Getting VSYNC timestamps

You don't need a raster register if you can do this! You can extrapolate approximate scan line numbers simply as a time offset from a VSYNC timestamp. You don't need line-exact accuracy for flawless emulator frameslice beamracing.

For the cross-platform route -- the register-less method -- you need to listen for VSYNC timestamps while in VSYNC OFF mode.

These ideally should become your only #ifdefs -- everything else about GPU beam racing is cross platform.

PC Version

Get your primary display adaptor URL such as \.\\DISPLAY1 .... For me in C#, I use Screen.PrimaryScreen.DeviceName to get this, but in C/C++ you can use EnumDisplayDevices() ...
Next, call D3DKMTOpenAdapterFromHdc() with this info to open the hAdaptor handle
For listening to VSYNC timestamps, run a thread with D3DKMTWaitForVerticalBlankEvent() on this hAdaptor handle. Then immediately record the timestamp. This timestamp represents the end of a refresh cycle and beginning of VBI.

Mac Version

Other platforms have various methods of getting a VSYNC event hook (e.g. Mac CVDisplayLinkOutputCallback) which roughly corresponds to the Mac's blanking interval. If you are using the registerless method and generic precision clocks (e.g. RTDSC wrappers) these can potentially be your only #ifdefs in your cross platform beam racing -- just simply the various methods of getting VSYNC timestamps. The rest have no platform-specificness.

Linux Version

See GPU Driver Documentation. There is a get_vblank_timestamp() available, and sometimes a get_scanout_position() (raster register equivalent). Personally I'd only focus on the obtaining VSYNC timestamping -- much simpler and more guaranteed on all platforms.

Getting the current raster scan line number

For raster calculation you can do one of the two:

(A) Raster-register-less-method: Use RTDSC or QueryPerformanceCounter or std::chrono::high_resolution_clock to profile the times between refresh cycle. On Windows, you can use known fractional refresh rate (from QueryDisplayConfig) to bootstrap this "best-estimate" refresh rate calculation, and refine this in realtime. Calculating raster position is simply a relative time between two VSYNC timestamps, allowing 5% for VBI (meaning 95% of 1/60sec for 60Hz would be a display scanning out). NOTE: Optionally, to improve accuracy, you can dejitter. Use a trailing 1-second interval average to dejitter any inaccuracies (they calm to 1-scanline-or-less raster jitter), ignore all outliers (e.g. missed VSYNC timestamps caused by computer freezes). Alternatively, just use jittermargin technique to hide VSYNC timestamp inaccuracies.

(B) Raster-register-method: Use D3DKMTGetScanLine to get your GPU's current scanline on the graphics output. Wait at least 1 scanline between polls (e.g. sleep 10 microseconds between polls), since this is an expensive API call that can stress a GPU if busylooping on this register.

NOTE: If you need to retrieve the "hAdaptor" parameter for D3DKMTGetScanLine -- then get your adaptor URL such as \.\\DISPLAY1 via EnumDisplayDevices() ... Then call D3DKMTOpenAdapterFromHdc() with this adaptor URL in order to open the hAdaptor handle which you can then finally pass to D3DKMTGetScanLine that works with Vulkan/OpenGL/D3D/9/10/11/12+ .... D3DKMT is simply a hook into the hAdaptor that is being used for your Windows desktop, which exists as a D3D surface regardless of what API your game is using, and all you need is to know the scanline number. So who gives a hoot about the "D3DKMT" prefix, it works fine with beamracing with OpenGL or Vulkan API calls. (KMT stands for Kernel Mode Thunk, but you don't need Admin priveleges to do this specific API call from userspace.)

Improved VBI size monitoring

You don't need raster-exact precision for basic frameslice beamracing, but knowing VBI size makes it more accurate to do frameslice beamracing since VBI size varies so much from platform to platform, resolution to resolution. Often it just varies a few percent, and most sub-millisecond inaccuracies is easily hidden within jittermargin technique.

But, if you've programmed with retro platforms, you are probably familiar with the VBI (blanking interval) -- essentially the overscan space between refresh cycles. This can vary from 1% to 5% of a refresh cycle, though extreme timings tweaking can make VBI more than 300% the size of the active image (e.g. Quick Frame Transport tricks -- fast scan refresh cycles with long VBIs in between). For cross platform frameslice beamracing it's OK to assume ~5% being the VBI, but there are many tricks to know the VBI size.

QueryDisplayConfig() on Windows will tell you the Vertical Total. (easiest)
Or monitoring the ratio of .INVBlank = true versus .INVBlank = false ... (via D3DKMTGetScanLine) by monitoring the flag changes (wait a few microseconds between polls, or 1 scanline delay -- D3DKMTGetScanLine is an 'expensive' API call)

Turning The Above Data into Real Frameslice Beamracing

For simplicity, begin with emu Hz = real Hz (e.g. 60Hz)

Have a configuration parameter of number of frameslices (e.g. 10 frameslices per refresh cycle)
Let's assume 10 frameslices for this exercise.
Actual screen 1080p means 108 real pixel rows per frameslice.
Emulator screen 240p means 24 emulator pixel rows per frameslice.
Your emulator module calls the centralized raster poll (retro_set_raster_poll) right after every emulator scan line. The centrallized code (retro_set_raster_poll) counts the number of emulator pixel rows completed to fill a frameslice. The central code will do either (5a) or (5b): (5a) Returns immediately to emulator module if not yet a full new framesliceful have been appended to the existing offscreen emulator framebuffer (don't do anything to the partially completed framebuffer). Update a counter, do nothing else, return immediately. (5b) However once you've got a full frameslice worth built up since the last frameslice presented, it's now time to frameslice the next frameslice. Don't return right away. Instead, immediately do an intentional CPU busyloop until the realraster reaches roughly 2 frameslice-heights above your emulator raster (relative screen-height wise). So if your emulator framebuffer is filled up to bottom edge of where frameslice 4 is, then do a busyloop until realraster hits the top edge* of frameslice 3. Then immediately Present() or glutSwapBuffers() upon completing busyloop. Then Flush() right away. NOTE: The tearline (invisible if unchanged graphics at raster are) will sometimes be a few pixels below the scan line number (the amount of time for a memory blit - memory bandwidth dependant - you can compensate for it, or you can just hide any inaccuracy in jittermargin) NOTE2: This is simply the recommended beamrace margin to begin experimenting with: A 2 frameslice beamracing margin is very jitter-margin friendly.

Example

Note: 120Hz scanout diagram from a different post of mine. Replace with emu refresh rate.matching real refresh rate, i.e. monitor set to 60 Hz instead. This diagram is simply to help raster veterans conceptualize how modern-day tearlines relates to raster position as a time-based offset from VBI

Lagless VSYNC jitter margin

Bottom line: As long as you keep repeatedly Present()-ing your incompletely-rasterplotted (but progressively more complete) emulator framebuffer ahead of the realraster, the incompleteness of the emulator framebuffer never shows glitches or tearlines. The display never has a chance to display the incompleteness of your emulator framebuffer, because the display's realraster is showing only the latest completed portions of your emulator's framebuffer. You're simply appending new emulator scanlines to the existing emulator framebuffer, and presenting that incomplete emulator framebuffer always ahead of real raster. No tearlines show up because the already-refreshed-part is duplicate (unchanged) where the realraster is. It thusly looks identical to VSYNC ON.

Precision Assumptions:

Scaling doesn't have to be exact.
The two frameslice offset gives you a one-frameslice-ahead jitter margin
You can vary the height of consecutive frameslices if you want, slightly, or lots, or for rounding errors.
No artifacts show because the frameslice seams are well into the jitter margin.

Special Note On HLSL-Style Filters: You can use HLSL/fuzzyline style shaders with frameslices. WinUAE just does a full-screen redo on the incomplete emu framebuffer, but one could do it selectively (from just above the realraster all the way to just below the emuraster) as a GPU performance-efficiency optimization.

Adverse Conditions To Detect To Automatically disable beamracing

Optional, but for user-friendly ease of use, you can automatically enter/exit beamracing on the fly if desired. You can verify common conditions such as making sure all is me:

Rotation matches (scan direction same) = true
Supported refresh rate = true
Module has a supported raster hook = true
Emulator performance is sufficient = true

Exiting beamracing can be simply switching to "racing the VBI" (doing a Present() between refresh cycles), so you're just simulating traditional VSYNC ON via VSYNC OFF via that manual VSYNC'ing. This is like 1-frameslice beamracing (next frame response). This provides a quick way to enter/exit beamracing on the fly when conditions change dynamically. A Surface Tablet gets rotated, a module gets switched, refresh rate gets changed mid-game, etc.

Questions?

I'd be happy to answer questions.

blurbusters commented 6 years ago

Additional timesaver notes:

General Best Practices

Debugging raster problems can be frustrating, so here's knowledge by myself/Calamity/Toni Wilen/Unwinder/etc. These are big timesaver tips:

Raster error manifests itself as tearline jitter.
If jitter is within raster jittermargin technique, no tearing or artifacts shows up.
It's an amazing performance profiling tool; tearline jitter makes your performance fluctuations very visible. In debug mode, use color-coded tints for your frameslices, to help make normally-hidden raster jitter more visible (WinUAE uses this technique).
Raster error is more severe at top edge than bottom edge. This is because GPU is more busy during this region (e.g. scheduled Windows compositing thread, stuff that runs every VSYNC event in the Windows Kernel, etc). It's minor, but it means you need to make sure your beam racing margin accomodate sthis.
GPU power management. If your emulator is very light on a powerful GPU, your GPU fluctuating power management will amplify raster error. Which may mean having too frameslices will have amplified tearline jitter. Fixes include (A) configure more frameslices (B) simply detect when GPU is too lightly loaded and make it busy one way or another (e.g. automatically use more frameslices). The rule of thumb is don't let GPU idle for more than a millisecond if you want scanline-exact rasters. Or you can just merely simply use a bigger jittermargin to hide raster jitter.
If you're using D3DKMTGetScanLine... do not busyloop on it because it stresses the GPU. Do a CPU busyloop of a few microseconds before polling the raster register again.
Do a Flush() before your busyloop before your precision-timed Present(). This massively increases accuracy of frameslice beamracing. But it can decrease performance.
Thread-switching on some older CPUs can cause RTDSC or QueryPerformanceCounter backwards clock ticking unexpectedly. So keep QueryPerformanceCounter polls to the same CPU thread with a BeginThreadAffinity. You probably already know this from elsewhere in the emulator, but this is mentioned here as being relevant to beamracing.
Instead of rasterplotting emulator scanlines into a blank framebuffer, rasterplot on top of a copy of the the emulator previous refresh cycle's framebuffer. That way, there's no blank/black area underneath the emulator raster. This will greatly reduce visibility of glitches during beamrace fails (falling outside of jitter margin -- too far behind / too far ahead) -- no tearing will appear unless within 1 frameslice of realraster, or 1 refresh cycle behind. A humongous jitter margin of almost one full refresh cycle. And this plot-on-old-refresh technique makes coarser frameslices practical -- e.g. 2-frameslice beamracing practical (e.g. bottom-half screen Present() while still scanning out top half, and top-half screen Present() while scanning out bottom half). When out-of-bounds happens, the artifact is simply brief instantaneous tearing only for that specific refresh cycle. Typically, on most systems, the emulator can run artifactless identical looking to VSYNC ON for many minutes before you might see brief instantaneous tearline from a momentary computer freeze, and instantly disappear when beamrace gets back in sync.
Some platforms supports microsecond-accurate sleeping, which you can use instead of busylooping. Some platforms can also set the granularity of the sleep (there's an undocumented Windows API call for this). As a compromise, some of us just do a normal thread sleep until a millisecond prior, then doing a busyloop to align to the raster.
Don't worry about mid-scanline splits (e.g. HSYNC timings). We don't have to worry about such sheer accuracy. The GPU transceiver reads full pixel rows at a time. Being late for a HSYNC simply means the tearline moves down by 1 pixel. Still within your raster jitter margin. We can jitter quite badly when using a forgiving jitter margin -- (e.g. 100 pixels amplitude raster jitter will never look different from VSYNC ON). Precision requirement is horizontal scanrate (e.g. 67KHz means 1/67000sec precision needed for scanline-exact tearlines -- which is way overkill for 10-frameslice beamracing which only needs 1/600sec precision at 60Hz).
Use multimonitor. Debugging is way easier with 2 monitors. Use your primary is exclusive full screen mode, with the IDE on a 2nd monitor. (Not all 3D frameworks behave well with that, but if you're already debugging emulators, you've probably made this debugging workflow compatible already anyway). You can do things like write debug data to a console window (e.g. raster scanline numbers) when debugging pesky raster issues.
Some digital display outputs exhibit micropacketization behavior (DisplayPort at lower resolutions especially, where multiple rows of pixels seem to squeeze into the same packet -- my suspicion). So your raster jitter might vibrate in 2 or 4 scan line multiples rather than single-scanline multiples. This may or may not happen more often with interleaved data (DisplayPort cable handling 2 displays or other PCI-X data) but they are still pretty raster-accurate otherwise, the raster inaccuracies are sub-millisecond, and fall far within jitter margin. Advanced algorithms such as DSC (Display Stream Compression of new DisplayPort implementations) can amplify raster jitter a bit. But don't worry; all known micro-packetization inaccuracies, fall far well within jittermargin technique, so no problem. I only mention this is you find raster-jitter differences between different video outputs.
Become more familiar with how the jitter-margin technique saves your ass. If you do Best-Practice #9, you gain a full wraparound jittermargin (you see, step #9 allows you to Present() the previous refresh cycle on bottom half of screen, while still rendering the top half...). If you use 10 frameslices at 1080p, your jitter safety margin becomes (1080 - 108) = 972 scanlines before any tearing artifacts show up! No matter where the real raster is, you're jitter margin is full wraparound to previous refresh cycle. The earliest bound is pageflip too late (more than 1 refresh cycle ago) or pageflip too soon (into the same frameslice still not completed scanning-out onto display). Between these two bounds is one full refresh cycle minus one frameslice! So don't worry about even a 25 or 50 scanline jitter inaccuracy (erratic beamracing where margin between realraster and emuraster can randomly vary) in this case... It still looks like VSYNC ON perfectly until it goes out of that 972-scanline full-wraparound jitter margin. For minimum lag, you do want to keep beam racing margin tight (you could make beamrace margin adjustable as a config value, if desired -- though I just recommend "aim the Present() at 2 frameslice margin" for simplicity), but you can fortunately surge ahead slightly or fall behind lots, and still recover with zero artifacts. The clever jittermargin technique that permanently hides tearlines into jittermargin makes frameslice beam-racing very forgiving of transitient background activity._
Get familiar with how it scales up/down well to powerful and underpowered platforms. Yes, it works on Raspberry PI. Yes, it works on Android. While high-frameslice-rate beamracing requires a powerful GPU, especially with HLSL filters, low-frameslice beamracing makes it easier to run cycle-exact emulation at a very low latency on less powerful hardware - the emulator can merrily emulate at 1:1 speed (no surge execution needed) spending more time on power-consuming cycle-exactness or ability to run on slower mobile GPUs. You're simply early-presenting your existing incomplete offscreen emulator framebuffer (as it gets progressively-more-complete). Just adjust your frameslice count to an equilibrium for your specific platform. 4 is super easy on the latest Androids and Raspberry PI (Basically 4 frameslice beam racing for 1/4th frame subrefresh input lag -- still damn impressive for a PI or Android) while only adding about 10% overhead to the emulator.
If you are on a platform with front buffer rendering (single buffer rendering), count yourself lucky. You can simply rasterplot new pixel rows directly into the front buffer instead of keeping the buffer offscreen (As you already are)! And plot on top of existing graphics (overwrite previous refresh cycle) for a jitter margin of a full refresh cycle minus 1-2 pixel rows! Just provide config parameter of of beamrace margin (vertical screen height percentage difference between emuraster + realraster), to adjust tightness of beamracing. You can support frameslicing VSYNC OFF technique & frontbuffer technique with the same suggested API, retro_set_raster_poll suggestion -- it makes it futureproof to future beamracing workflows.
Yes, it works with curved scanlines in HLSL/filter type algorithms. Simply adjust your beamracing margin to prevent the horizontally straight realraster from touching the top parts of curved emurasters. Usually a few pixel rows will do the job. You can add a scanlines-offset-adjustment parameter or a frameslice-count-offset adjustment parameter.
You may have to sometimes temporarily turn off debug output when programming/debugging real world beam racing. Some environments have too many raster glitches when a console window is running -- the IDE's console is surprisingly slow/inefficient. So, when running in debug mode, it may be better to create your own built-in graphics console overlay instead of a separate console window -- don't use debug console-writing to IDE or separate shell window during beam racing. It can glitch massively if you generate lots of debug output to a console window. Instead, display debug text directly in the 3D framebuffer instead and try to buffer your debug-text-writing till your blanking interval, and then display it as a block of text at top of screen (like a graphics console overlay). Even doing the 3D API calls to draw a thousand letters of text on screen, will cause far less glitches than trying to run a 2nd separate window of text (IDE debug overheads & shell window overheads) can cause massive beam-racing glitches if you try to output debug text -- some Debug output commands can cause >16ms stall -- I suspect that some IDE's are programmed in garbage-collected language and sometimes the act of writing console output causes a garbage-collect event to occur. Or some other really nasty operating-system / IDE environment overheads. So if you're running in debug mode while debugging raster glitches, then temporarily turn off the usual debug output mechanism, and output instead as a graphics-text overlay on your existing 3D framebuffer. Even if it means redundant re-drawing of a line of debugging text at the top edge of the screen every frame.
NOTE: Debug mode seems okay (good test of amplified raster jitter sometimes) on fast machines if debug output is temporarily disabled (or only used vary sparingly - rasters severely glitches when using Visual Studio debug console unless you've got a massively multithreaded CPU -- If you're only using a 2-core or 4-core CPU and need to debug raster-exactness problems, it is preferable to redraw onscreen characters (e.g. SpriteFonts) every frameslice instead as your onscreen graphical debug console -- that actually is less disruptive. Hopefully you don't need to do this, but be prepared to.

Hopefully these best practices reduce the amount of hairpulling during frameslice beamracing.

Special Notes

Special Note about Rotation Emulator devices already should report their screen orientation (portrait, landscape) which generally also defines scan direction. QueryDisplayConfig() will tell you real screen orientation. Default orientation is always top-to-bottom scan on all PC/Mac GPUs. 90 degree counterclockwise display rotation changes scan direction into left-to-right. If emulating Galaxian, this is quite fine if you're rotating your monitor (left-right scan) and emulating Galaxian (left-right scan) -- then beamracing works._
Special Note about Unsupported Refresh Rates Begin KISS and worry about 50Hz/60Hz only first. Start easy. Then iterate in adding support to other refresh rates like multiples. 120Hz is simply cherrypicking every other refresh cycle to beam race. For the in-between refresh cycles, just leave up the existing frame up (the already completed frame) until the refresh cycle that you want to beamrace is about to begin. In reality, there's very few unbeamraceable refresh rates -- even beamracing 60fps onto 75Hz is simply beamracing cherrypicked refresh cycles (it'll still stutter like 60fps@75Hz VSYNC ON though)._
Advanced Note about VRR Beam Racing Before beam racing variable refresh rate modes (e.g. enabling GSYNC or FreeSync and then beamracing that) -- wait until you've mastered all the above before you begin to add VRR compatibility to your beamracing. So for now, disable VRR when implementing frameslice beamracing for the first time. Add this as a last step once you've gotten everything else working reasonably well. It's easy to do once you understand it, but the conceptual thought of VRR beamracing is a bit tricky to grasp at first. VRR+VSYNC OFF supports beamracing on VRR refresh cycles. The main considerations are, the first Present() begins the manually-triggered refresh cycle (.INVBlank becomes false and ScanLine starts incrementing), and you can then frameslice beamrace that normally like an individual fixed-Hz refresh cycle. Now, one additional very special, unusual consideration is the uncontrolled VRR repeat-refresh. One will need to do emergency catchup beamraces on VRR displays if a display decides to do an uncommanded refresh cycle (e.g. when a display+GPU decides to do a repeat-refresh cycle -- this often happens when a display's framerates go below VRR range). These uncommanded refresh cycles also automatically occur below VRR range (e.g. under 30fps on a 30Hz-144Hz VRR display). Most VRR displays will repeat-refresh automatically until it's fully displayed an untorn refresh cycle. If this happens and you've already begun emulating a new emulator refresh cycle, you have to immediately start your beamrace early (rather than at the wanted precise time). So if you do a frameslice beamrace of a VRR refresh cycle, the GPU will send a repeat-refresh to the display automatically immediately. There might be an API call to suppress this behavior, but we haven't found one, so this behavior is unwanted so this kind of makes beamraced 60fps onto a 75Hz FreeSync display difficult to do stutter-free. But it works fine for 144Hz VRR displays - we find it's easy to be stutterfree when the VRR max is at least twice the emulator Hz, since we don't care about those automatic-repeat-refresh cycles that aren't colliding with timing of the next beamrace.

blurbusters commented 6 years ago

$120 Funds Now Added to BountySource

Added $120 BountySource -- permanent funds -- no expiry.

https://www.bountysource.com/issues/60853960-lagless-vsync-support-add-beam-racing-to-retroarch

Trigger for BountySource completion:

Add optional retro_set_raster_poll API and centralized beam racing code (or mutually agreed easier compromise)
Make any two emulator modules successfully work with it (on 3 platforms: on PC, on Mac, on Linux. See above for a list of API calls available on all 3 platforms)

Minimum refresh rate required: Native refresh rate.

Emulator Support Preferances: Preferably including either NES and/or VICE Commodore 64, but you can choose any two emulators that is easiest to add beam-racing to).

Notes: GSYNC/FreeSync compatible beam racing is nice (works in WinUAE) but not required for this BountySource award; can be a stretch goal later. Must support at least native refresh rate (e.g. 50Hz, 60Hz) but would be a bonus to also support multiple thereof (e.g. 100Hz or 120Hz) -- as explained, this is done via automatically cherrypicking which refresh cycles to beamrace (WinUAE style algorithm or another mutually agreed algorithm).

Effort Assessment

Assessment is that Item 1 will probably require about a thousand-ish lines of code, while item 3 (modification to individual emulator modules) can be as little as 10 lines or thereabouts. 99% of the beam racing is already implemented by most 8-bit and 16-bit emulators and emulator modules, it's simply the missing 1% (sync between emuraster and realraster) that is somewhat 'complicated' to grasp.

The goal is to simplify and centallize as much complexity of the beam racing centrally as possible, and minimize emulator-module work as much as possible -- and achieve original-machine latencies (e.g. software emulator with virtually identical latencies as an original machine) which has already been successfully achieved with this technique.

Most of the complexity is probably testing/debugging the multiple platforms.

It's Easier Than Expected. Learning/Debugging Is The Hard Part

Tony of WinUAE said it was easier than expected. It's simply learning that's hard. 90% of your work will be learning how to realtime-beamrace a modern GPU. 10% of your time coding. Few people (Except Blur Busters) understand the "black box" between Present() and photons hitting eyes. But follow the Best Practices and you'll realize it is as easy as an E=mc^2 Eureka Moment if you've ever programmed an Amiga Copper or Commodore 64 raster interrupt, that modern GPUs are surprisingly crossplatform-beamraceable now via "VSYNC OFF tearlines are simply rasters. All tearlines created in humankind are simply rasters" technical understanding.

BountySource Donation Dollar Match Thru the $360 Level

DOLLAR MATCH CHALLENGE -- Until End of September 2018 I will match dollar-for-dollar all additional donations by other users up to another $120. Growing my original donation to $240 in addition to $120 other people's donations = $360 BountySource!

EDIT: Dollar match maxed out 2018/07/17 -- I've donated $360

ghost commented 6 years ago

How could this possibly be done reliably on desktop OSes (non-hard-realtime) where scheduling latency is random?

blurbusters commented 6 years ago

How could this possibly be done reliably on desktop OSes (non-hard-realtime) where scheduling latency is random?

See above. It's already in an emulator. It's already successfully achieved.

That's not a problem thanks to the jittermargin technique.

Lagless VSYNC jitter margin

Look closely at the labels in Frame 3.

As long as the Present() occurs with a tearline inside that region, there is NO TEARING, because it's a duplicate frameslice at the screen region that's currently scanning-out onto the video cable. (As people already know, a static image never has a tearline -- tearlines only occurs with images in motion). The jitter margin technique works, is proven, is already implemented, and is already in a downloadable emulator, if you wish to see for yourself. In addition, I also have video proof below:

Remember, I am the inventor of TestUFO.com and founder of BlurBusters.com

If you've seen the UFO or similar tests in any website (RTings, TFTCentral, PCMonitors, etc, they are likely using one of my display-testing inventions that I've got a peer-reviewed conference paper with NIST.gov, NOKIA, Keltek. So my reputation precedes me, and now that out of the way:

As a result, I know what I am talking about.

You can adjust the jittermargin to give as much as 16.7ms of error margin (Item 9 of Best Practices above). Error margin with zero artifacts is achieved via jittermargin (duplicate frameslice = no tearline). Testing shows we can go ~1/4 refresh cycle errormargin on PI/Android and sub-millisecond errormargin on GTX 1080Ti + i7 systems.

Some videos I've created of my Tearline Jedi Demo --

Here's YouTube video proof of stable rasters on GeForces/Radeons: THREAD: https://www.pouet.net/topic.php?which=11422&page=1

And the world's first real-raster cross platform Kefrens Bars demo

(8000 frameslices per second -- 8000 tearlines per second -- way overkill for an emulator -- 100 tearlines per refresh cycle with 1-pixel-row framebuffers stretched vertically between tearlines. I also intentionally glitch it at the end by moving around a window; demonstrating GPU-processing scheduling interference).

Now, its much more forgiving for emulators because the tearlines (That you see in this) is all hidden in the jittermargin technique. Duplicate refresh cycles (and duplicate frameslices / scanline areas) have no tearlines. You just make sure that the emulator raster stays ahead of real raster, and frameslice new slices onto the screen in between the emuraster & realraster.

As long as you keep adding frameslices ahead of realraster -- no artifacts or tearing shows up. Common beam racing margins with WinUAE successfully is approximately 20-25 scanlines during 10 frameslice operation in WinUAE emulator. So the margin can safely jitter (computer performance problems) without artifacts.

Lagless VSYNC jitter margin

If you use 10 frameslices (1/10th screen height) -- at 60Hz for 240p, that's approximately a 1.67ms jitter margin -- most newer computers can handle that just fine. You can easily increase jitter margin to almost a full refresh cycle by adding distance between realraster & emuraster -- to give you more time to add new frameslices in between.

And even if there was a 1-frame mis-performance, (e.g. computer freeze), the only artifact is a brief sudden reappearance of tearing before it disappears.

Also, Check the 360-degree jittermargin technique as part of Step 9 and 14 of Best Practices, that can massively expand the jitter margin to a full wraparound refresh cycle's worth:

Instead of rasterplotting emulator scanlines into a blank framebuffer, rasterplot on top of a copy of the the emulator previous refresh cycle's framebuffer. That way, there's no blank/black area underneath the emulator raster. This will greatly reduce visibility of glitches during beamrace fails (falling outside of jitter margin -- too far behind / too far ahead) -- no tearing will appear unless within 1 frameslice of realraster, or 1 refresh cycle behind. A humongous jitter margin of almost one full refresh cycle. And this plot-on-old-refresh technique makes coarser frameslices practical -- e.g. 2-frameslice beamracing practical (e.g. bottom-half screen Present() while still scanning out top half, and top-half screen Present() while scanning out bottom half). When out-of-bounds happens, the artifact is simply brief instantaneous tearing only for that specific refresh cycle. Typically, on most systems, the emulator can run artifactless identical looking to VSYNC ON for many minutes before you might see brief instantaneous tearline from a momentary computer freeze, and instantly disappear when beamrace gets back in sync.

AND

Become more familiar with how the jitter-margin technique saves your ass. If you do Best-Practice 9, you gain a full wraparound jittermargin (you see, step 9 allows you to Present() the previous refresh cycle on bottom half of screen, while still rendering the top half...). If you use 10 frameslices at 1080p, your jitter safety margin becomes (1080 - 108) = 972 scanlines before any tearing artifacts show up! No matter where the real raster is, you're jitter margin is full wraparound to previous refresh cycle. The earliest bound is pageflip too late (more than 1 refresh cycle ago) or pageflip too soon (into the same frameslice still not completed scanning-out onto display). Between these two bounds is one full refresh cycle minus one frameslice! So don't worry about even a 25 or 50 scanline jitter inaccuracy (erratic beamracing where margin between realraster and emuraster can randomly vary) in this case... It still looks like VSYNC ON perfectly until it goes out of that 972-scanline full-wraparound jitter margin. For minimum lag, you do want to keep beam racing margin tight (you could make beamrace margin adjustable as a config value, if desired -- though I just recommend "aim the Present() at 2 frameslice margin" for simplicity), but you can fortunately surge ahead slightly or fall behind lots, and still recover with zero artifacts. The clever jittermargin technique that permanently hides tearlines into jittermargin makes frameslice beam-racing very forgiving of transitient background activity._

And single-refresh-cycle beam racing mis-sync artifacts are not really objectionable (an instantaneous one-refresh-cycle reappearance of a tearline that disappears when the beam racing "catches up" and goes back to the jitter margin tolerances.)

240p scaled onto 1080p is roughly 4.5 real scanlines per 1 emulator scanline. Obviously, the real raster "Register" will increment scan line number roughly 4.5 times faster. But as you have seen, Tearline Jedi successfully beam-races a Radeon/GeForce on both PC/Mac without a raster register simply by using existing precision counter offsets. Sure, there's 1-scanline jittering as seen in YouTube video. But tearing never shows in emulators because that's 100% fully hidden in the jittermargin technique making it 100% artifactless even if it is 1ms ahead or 1ms behind (If you've configured those beam racing tolerances for example -- can be made an adjustable slider -- tighter for super fast more-realtime systems -- a looser for slower/older systems).

But we're only worried about screen-height distance between the two. We need to merely simply make sure the emuraster is at least 1 frameslice (or more) below the realraster, relative-screen-height-wise -- and we can continue adding VSYNC OFF frameslices in between emu raster and real raster -- creating a tearingless VSYNC OFF mode, because the framebuffer swap (Present() or glutSwapBuffers()) is a duplicate screen area, no pixels changed, so no tearline is visible. It's conceptually easy to understand once you have the "Eureka" moment.

Lagless VSYNC jitter margin

There's already high speed video proof of sub-frame latencies (same-frame-response) achieved with this technique. e.g. mid-screen input reads for bottom-of-screen reactions are possible, replicating original's machine latency (to an error margin of one frameslice).

As you can see, the (intentionally-visible) rasters in the earlier videos are so stable and falls within common jittermargin sizes (for intentionally-invisible tearlines). With this, you create a (16.7ms - 1.67ms = 15ms jitter margin). That means with 10 frameslices with the refresh-cycle-wraparound jitter margin technique -- your beamracing can go too fast or too slow in a much wider and much safer 15ms range. Today, Windows scheduling is sub-1ms and PI schecduling is sub-4ms, so it's not a problem.

The necessary accuracy to do realworld beamracing happened 8-to-10 years ago already.

Yes, nobody really did it for emulators because it took someone to apply all the techniques together (1) Understanding how to beamrace a GPU, (2) Initially understanding the low level black box of Present()-to-Photons at least to the video output port signal level. (3) Understanding the techniques to make it very forgiving, and (4) Experience with 8-bit era raster interrupts.

In tests, WinUAE beam racing actually worked on a year-2010 desktop with an older GPU, at lower frameslice granularities -- someone also posted screenshots of an older Intel 4000-series GPU laptop in the WinUAE beamracing thread. Zero artifacts, looked perfectly like VSYNC ON but virtually lagless (well -- one frameslice's worth of lag).

Your question is understandable, but the fantastic new knowledge we all now have, now compensates totally for it -- a desktop with a GeForce GTX Titan about ~100x the accuracy margin needed for sub-refresh-latency frameslice beam racing.

So as a reminder, the accuracy requirements necessary to pull off this technical feat, already occured 8-to-10 years ago and the WinUAE emulator successfully is beamracing on an 8-year-old computer today in tests. I implore you to reread our research (especially the 18-point Best Practices), watch the videos, and view the links, to understand that it is actually quite forgiving thanks to the jittermargin technique.

(Bet you are surprised to learn that we are already so far past the rubicon necessary for this reliable accuracy, as long as the Best Practices are followed.)

blurbusters commented 6 years ago

BountySource now $140

Someone added $10, so I also added $10.

NOTE: I am currently dollar-matching donations (thru the $360 level) until end of September. Contribute to the pot: https://www.bountysource.com/issues/60853960-lagless-vsync-support-add-beam-racing-to-retroarch

blurbusters commented 6 years ago

BountySource now $200

Twinphalex added $30, so I also added $30.

blurbusters commented 6 years ago

$850 on BountySource

Wow! bparker06 just generously donated $650 to turn this into an $850 bounty

(bparker06, if you're reading this, reach out to me, will you? -- mark@blurbusters.com -- And to reconfirm you were previously aware that I'm currently dollar-matching only up to the BountySource $360 commitment -- Thanks!)

blurbusters commented 6 years ago

Now $1050 BountySource

I've topped up; and have donated $360 totalled -- the dollar-for-dollar matching limit promise I said earlier.

This is now number 32 biggest pot on BountySource.com at the moment!

blurbusters commented 6 years ago

So..... since this is getting to be serious territory, I might as well post multiple references that may be of interest, to help jumpstart any developers who may want to begin working on this:

Useful Links

Videos of GroovyMAME lagless VSYNC experiment by Calamity: https://forums.blurbusters.com/viewtopic.php?f=22&t=3972&start=10#p31851 (You can see the color filters added in debug mode, to highlight separate frameslices)

Screenshots of WinUAE lagless VSYNC running on a laptop with Intel GPU: http://eab.abime.net/showthread.php?p=1231359#post1231359 (OK: approx 1/6th frame lag, due to coarse 6 frameslice granularity.)

Corresponding (older) Blur Busters Forums thread: https://forums.blurbusters.com/viewtopic.php?f=22&t=3972

Corresponding LibRetro lag investigation thread (Beginning at post #628 onwards): https://forums.libretro.com/t/an-input-lag-investigation/4407/628

The color filtered frame slice debug mode (found in WinUAE, plus the GroovyMAME patch) is a good validation method of realtimeness -- visually seeing how close your realraster is to emuraster -- I recommend adding this debugging technique to the RetroArch beam racing module to assist in debugging beam racing.

Minimum Pre-Requisites for Cross-Platform Beam Racing

As a reminder, our research has successfully simplified the minimum system requirements for cross-platform beam racing to just simply the following three items:

Platform supports VSYNC OFF (aka "vblank disabled" mode)
Platform supports getting VSYNC timestamps
Platform supports high-precision counters (e.g. RTDSC or std::chrono::high_resolution_clock or QueryPerformanceCounter etc)

If you can meet (1) and (2) and (3) then no raster register is required. VSYNC OFF tearlines are just rasters, and can be "reliably-enough" controlled (when following 18-point Best Practices list above) simply as precision timed Present() or glutSwapBuffers() as precision-time-offsets from a VSYNC timestamp, corresponding to predicted scanout position.

Quick Reference Of Available VSYNC timestamping APIs

While mentioned earlier, I'll resummarize compactly: These "VSYNC timestamp" APIs have suitable accuracies for the "raster-register-less" cross platform beam racing technique. Make sure to filter any timestamp errors and freezes (missed vsyncs) -- see Best Practices above.

Windows: D3DKMTWaitForVerticalBlankEvent() (Works with OpenGL/Metal too)
MacOS: CVDisplayLinkOutputCallback()
Linux GPU driver API: get_vblank_timestamp()

If you 100% strictly focus on the VSYNC timestamp technique, these may be among the only #ifdefs that you need.

Other workarounds for getting VSYNC timestamps in VSYNC OFF mode

As tearlines are just rasters, it's good to know all the relevant APIs if need be. These are optional, but may serve as useful fallbacks, if need be (be sure to read Best Practices, e.g. expensiveness of API calls that we've discovered, and some mitigation techniques that we've discovered).

Be noted, it is necessary to use VSYNC OFF to use beam raced frameslicing. All known platforms (PC, Mac, Linux, Android) have methods that can access VSYNC OFF. On some platforms, this may interfere with your ability to get VSYNC timestamps. As a workaround you may have to instead poll the "In VBlank" flag (or busyloop in a separate thread for the bit-state change, and timestamp immediately after) -- in order to get VSYNC timestamps while in VSYNC OFF mode. Here are alternative APIs that helps you work around this, if absolutely necessary.

Windows D3DKMTGetScanLine() -- Windows equivalent of raster register. Also can be used to poll the "In VBLANK" status too. However, we found it unnecessary to do this, due to the existence of D3DKMTWaitForVBlank() which still works in VSYNC OFF mode. On the other hand, it may reduce need for tieing up a CPU core in precision-busylooping.
Linux get_scanout_position() -- Linux equivalent of raster register. Also can be used to poll the "In VBLANK" status too. drm_calc_vbltimestamp_from_scanoutpos() -- Linux Direct Rendering Manager (DRM) calculating VSYNC timestamps from scanout position. This can be quite handy to also do on Windows, as a low-CPU-method (no busylooping needed) method of generating VSYNC timestamps.

Currently, it seems implementations of get_vblank_timestamp() tend to call drm_calc_vbltimestamp_from_scanoutpos() so you may not need to do this. However, this additional information is provided to help speed up your research when developing for this bounty.

blurbusters commented 6 years ago

Bounty now $1132
Any platform or module with already-implemented beam racing technique -- is allowed to be rolled into the bounty as long as it helps meets the bounty conditions (e.g. ported to retro_set_raster_poll technique)
Bounty may be split between multiple programmers (if all stakeholders mutually agree). I understand not everyone can program all platforms.

As you remember, retro_set_raster_poll is supposed to be called every time after an emulator module plots a scanline to its internal framebuffer.

retro_set_raster_poll API proposal

As written earlier, basically retro_set_raster_poll (if added) simply allows the central RetroArch screen rendering code to optionally an "early peek" at the incompletely-rendered offscreen emulator buffer, every time the emulator modules plots a new scanline.

That allows the central code to beam-race scanlines (whether tightly or loosely, coarsely or ultra-zero-latency realtimeness, etc) onto the screen. It is not limited to frameslice beamracing.

By centralizing it into a generic API, the central code (future implementations) can decide how its wants to realtime-stream scanlines onto the screens (bypassing preframebuffering). This maximizes future flexibility.

VSYNC OFF frameslicing (add frameslices on the fly in the margin between emuraster and realraster)
Front buffer rendering (write emurasters directly to onscreen buffer, ahead of realraster)
Generous-jittermargins (see this post) versus single-scanline tightness (for virtually perfect original lag).
Single-scanline streaming (advanced RTOS techniques)
Mobile half-screen rendering (e.g. display bottom while rendering top, display top while rendering bottom) with almost zero extra CPU/GPU via the 2-frameslice tearingless VSYNC OFF technique.
Etc! Any other beam racing / beam chasing workflows -- including those not dreamed yet.

The bounty doesn't even ask you to implement all of this. Just 1 technique per 3 platforms (one for PC, one for Mac, one for Linux). The API simply provides flexibility to add other beamracing workflows later. VSYNC OFF frameslicing (essentially tearingless VSYNC OFF / lagless VSYNC ON) is the easiest way to achieve.

Each approach has their pros/cons. Some are very forgiving, some are very platform specific, some are ultra-low-lag, and some work on really old machines. I simply suggest VSYNC OFF frameslice beamracing because that can be implemented in exactly the same way on Windows+Mac+Linux, so is the easiest. But one realizes there's a lot of flexibility.

The proposed retro_set_raster_poll API call would be called at roughly the horizontal scanrate (excluding VBI scanlines). Which means for 480p, that API call would be called almost ~31,500 times per second. Or 240p that API would be called almost ~15000 times per second.

While high -- the good news is that this isn't a problem because most API calls would be an immediate return for coarse frameslicing. For example, WinUAE defaults at 10 frameslices per refresh cycle, 600 frameslices per second. So retro_set_raster_poll would simply do nothing (return immediately) until 1/10th of a screen height's worth of emulator scanlines are built up. And then will execute.

So out of all those tens of thousands of retro_set_raster_poll calls, only 600 would be 'expensive' if RetroArch is globally configured to be limited to 10-frameslice-per-refresh beam racing (1/10th screen lag due to beam chase distance between emuraster + realraster). The rest of the calls would simply be immediate returns (e.g. not a framesliceful built up yet).

Some emulator modules only need roughly 10 lines of modification

The complexity is centralized.

The emulator module is simply modified (hopefully as little as 10 line modification for the easiest emulator modules, such as NES) to call retro_set_raster_poll on all platforms. The beam racing complexity is all hidden centrally.

Nearly all 8-bit and 16-bit emulator modules already beamrace into their own internal framebuffers. Those are the 'easy' ones to add the retro_set_raster_poll API. So those would be easy. The bounty only needs 2 emulators to be implemented.

The central would decide how to beam race obviously (but frameslice beam racing would be the most crossplatform method, but it doesn't have to be the only method). Platform doesn't support it yet? Automatically disable beamracing (return immediately from retro_set_raster_poll). Screen rotation doesn't match emulator scan direction? Ditto, return immediately too. Whatever code a platform has implemented for beam racing synchronization (emuraster to realraster), it can be hidden centrally.

That's what part of bounty also pays for: Add the generic crossplatform API call so the rest of us can have fun adding various kinds of beam-racing possibilities that are appropriate for specific platforms. Obviously, the initial 3 platforms need to be supported (One for Windows, one for Mac, and one for Linux) but the fact that an API gets added, means additional platforms can be later supported.

The emulators aren't responsible for handling that complexity at all -- from a quick glance, it is only a ~10 line change to NES, for example. No #ifdefs needed in emulator modules! Instead, most of the beam racing sync complexity is centrallized.

asimonf commented 6 years ago

Would the behavior need to be adjusted for emulators that output interlaced content momentarily?

The SNES can switch from interlaced output to progressive during a vblank. Both NTSC and PAL are actually interlaced signals and the console is usually just rendering even lines (or is it odd lines? I don't recall now) using a technique commonly referred to double-strike.

aliaspider commented 6 years ago

I don't see why that would matter, the only requirement here is that the core can be run on a per scanline basis, and that the vertical refresh rate is constant and close to the monitor rate.

asimonf commented 6 years ago

I'm still wrapping my head around it, but yeah, now I see it. Interlaced content would be handled internally by the emulator as it already does.

blurbusters commented 6 years ago

About Interlaced Modes

No, behaviour doesn't need to be adjusted for interlaced.

Interlaced is still 60 temporal images per second, basically half-fields spaced 1/60 sec apart.

Conceptually, it's like frames that contains only odd scanlines, then a frame containing only even scanlines

Conceptually, you can think of interlaced 480i as the following:

T+0/60sec = the 240 odd scanlines T+1/60sec = the 240 even scanlines T+2/60sec = the 240 odd scanlines T+3/60sec = the 240 even scanlines T+4/60sec = the 240 odd scanlines T+5/60sec = the 240 even scanlines

Etc.

Since interlaced was designed in the analog era where scanlines can be arbitrarily vertically positioned anywhere on a CRT tube -- 8-bit-era computer/console makers found a creative way to simply overlap the even/odd scanlines instead of offset them (between each other) -- via a minor TV signal timing modification -- creating a 240p mode out of 480i. But 240p and 480i still contains exactly 60 temporal images of 240 scanlines apiece, regardless.

Note: With VBI, it is sometimes called "525i" instead of "480i"

Terminologically, 480i was often called "30 frames per second" but NTSC/PAL temporal resolution was always permanently 60 fullscreen's worth of scanouts per second, regardless of interlaced or progressive. "Frame" terminology is when one cycle of full (static-image) resolution is built up. However, motion resolution was always 60, since you can display a completely different image in the second field of 480i -- and Sports/Soap operas always did that (60 temporal images per second since ~1930s).

Deinterlacers may use historical information (the past few fields) to "enhance" the current field (i.e. converting 480i into 480p). Often, "bob" deinterlace are beam racing friendly. For advanced deinterlacing algorithms, what may be displayed is an input-lagged result (e.g. lookforward deinterlacer that displays the intermediate middle combined result of a 3-frame or 5-frame history -- adding 1 frame or 2 frames lag). Beam racing this will still have a lagged result like any good deinterlacer may have, albiet with slightly less lag (up to 1 frame less lag).

Now, if there's no deinterlacing done (e.g. original interlacing preserved to output) then deinterlacing lag (for lookforward+lookbackward deinterlacers) isn't applicable here.

Emulators typically generally handle 480i as 60 framebuffers per second. That's the proper way to do it, anyway -- whether you do simple bob deinterlace, or any advanced deinterlace algorithms.

I used to work in the home theater industry, being the moderator of the AVSFORUM Home Theater Computers forums, and have worked with vendors (including working for RUNCO as a consultant) on their video processor & scaler products. So I understand my "i" and "p" stuff...

If all these concepts this is too complicated, just add it as an additional condition to automatically disable beam racing ("If in interlaced mode instead of progressive mode, disable the laggy deinterlacer or disable beam racing").

Most retro consoles used 240p instead of 480i. Even NTSC 480i (real interlacing) is often handled as 60 framebuffers per second in an emulator, even if some sources used to call it "480i/30" (two temporal fields per frame, offset 1/60sec apart).

Note: One can simply visually seamlessly enter/exit beamracing on the fly (in real time) -- there might be one tiny microstutter during the enter/exit (1/60sec lag increase-decrease) but that's an acceptable penalty during, say, a screen rotation or a video mode change (most screens take time to catch up in mode changes anyway). This is accomplished by using one VBI-synchronized full buffer Present()s per refresh (software-based VBI synchronization) instead of mid-frame Present()s (true beam racing). e.g. during screen rotation when scanout directions diverge (realworld vs emu scanout) but could include the entering/exiting interlaced mode in the SNES module, if SNES module is chosen to be the two first modules to support beam racing as part of the bounty requirements. Remember, you only need to support two emulator modules to claim the bounty. If you choose an SNES module as part of the bounty, then the SNES module would still count towards the bounty even if beamracing was automatically disabled during interlaced mode (if too complex to wrap your head around it).

For simplicity, supporting beam racing during interlaced modes is not a mandatory requirement for claiming this bounty -- however it is easy to support or to add later (by a programmer who understands interlacing & deinterlacing).

blurbusters commented 6 years ago

Formerly someone (Burnsedia) started working on this BountySource issue until they realized this was a C/C++ project. I'm updating the original post to be clear that this is a C/C++ skilled project.

m4xw commented 6 years ago

@Burnsedia Your past track record on bountysource came to my attention, you marked 5 bounties as "solving", yet all of them are still open. Since I expect you to have a solid understanding of C and the required knowledged of how graphics API's work internally, could you please elaborate on how you would implement this feature? If you can't answer this, I will need you to refrain yourself from taking our bounties on, as I fear you could lock up high value bounties for no reason - effectively stalling progress on this or other bounties.

TheDeadFish commented 6 years ago

Has anyone tried this on Nvidia 700 or 900 series cards. I have had major issues with these cards and inconstant timing of the frame-buffer. The time at which the frame-buffer is actually sampled can vary by as much as half a frame making racing the beam completely impossible.

The problem stems from an over-sized video output buffer and also memory compression of some kind. As soon as the active scan starts the output buffer is filled at an unlimited rate (really fast), this causes the read position in the frame-buffer to pull way ahead of the real beam position. The output buffer seems to store compressed pixels, for a screen of mostly solid color about half a frame can fit in the output buffer, for a screen of incompressible noise only a small number of lines can fit and therefor has much more normal timing.

This issue has plagued my mind for several years (I gave my 960 away because it bothered me so much), but I have yet to see any other mentions of this issue. I only post this here now because its relevant.

blurbusters commented 6 years ago

Bountysource increased to $1142.

ghost commented 5 years ago

Someone should close this issue and apologize to bakers.

inactive123 commented 5 years ago

@casdevel I'm sorry, why? What we need instead is for a bounty hunter to take this up, or for more people to fund it.

You did good work in the past, so I legitimately am dumbfounded by this response.

What we need instead is some optimism here and somebody who would like to see this happen bring it into fruition.

ghost commented 5 years ago

I'm just sharing my opinion and trying to be a realistic, word "lag-less" shouldn't exist in programmer vocabulary.

inactive123 commented 5 years ago

OK, I'll rewrite it to 'beam racing' then. Or perhaps 'scanline sync' is a more appropriate term.

For reference, Riva Tuner (RTSS) has an implementation similar to this, it's called Scanline Sync.

inactive123 commented 5 years ago

Renamed. Here is more info about Scanline Sync - https://forums.blurbusters.com/viewtopic.php?f=10&t=4916

BartTerpstra commented 5 years ago

this was a fascinating read. is the title still accurate?

robertos677 commented 5 years ago

Yes, it can work simultaneously with RunAhead (if need be, though not necessary). Simply beam race the final/visible frame.

Simply doing that negates the advantages of raster syncing, though.

To properly combine them, you would do this (ignoring the v-blanking period for simplification):

Run the emulator for f scanlines; save state Emulate N frames, and beamsync the last f scanlines; load state

where N is the runahead value and f is the number of scanlines per frameslice.

This effectively multiplies the amount of work you have to do per frame by the number of visible frameslices per emulated v-blanking interval. So if the screen is divided into four frameslices, you do four times the work you would do with runahead alone.

Essentially, the runahead algorithm remains unchanged, but we operate on scan lines as opposed to discrete frames. You could even make it possible to specify runahead values in scan lines.

blurbusters commented 5 years ago

this was a fascinating read. is the title still accurate?

Title is no longer accurate, I'll edit. Someone retracted their share of BountySource, alas. My donations/contributions remain valid though. Please refer to the BountySource link for the most updated already-donated pot.

Once someone "Starts" the project, it locks the bounty until completion (if within reasonable time).

blurbusters commented 5 years ago

This effectively multiplies the amount of work you have to do per frame by the number of visible frameslices per emulated v-blanking interval. So if the screen is divided into four frameslices, you do four times the work you would do with runahead alone.

Very interesting concept to do sub-frame runahead! Yes, this would potentially work, but a bit complicated considering beam racing is a bit complicated by itself for some.

However, practically: -- It'd be simpler to keep runahead at frame intervals -- it'd be simpler to keep beamracing as its own sync mode -- But work in a way that allows each other to layer on top of each other (getting even further reduced latency by combining runahead+beamracing, but only up to subframe-reductions)

One big raison d'etre of beamraced sync is that it's very scaleable from low-end to high-end. It's just a simply a matter of upgrading/downgrading your raster timing precision and frameslice count. Even the lowest-end Android GPUs can handle beamraced sync when the parameters are configured accordingly.

blurbusters commented 5 years ago

I'm just sharing my opinion and trying to be a realistic, word "lag-less" shouldn't exist in programmer vocabulary.

As Einstein says, "It's all relative". Lagless is relative to original hardware as a differential. As in emulators that add no extra lag relative to original machine.

Assuming you do 15,625 frameslices per second (beamrace at NTSC scanrate), using single-scanline frameslices output directly to a front buffer (bypassing VSYNC OFF), even going to per-pixel inside-scanline (which is possible with some frontbuffer architectures), with no jitter safety margin -- it essentially becomes virtually 0 difference between emulator and original machine. Software based emulators achieving FPGA latency symmetry. In theory anyway.

I donated several hundred dollars in this bounty prize that still remains open. Googling "lagless VSYNC" brings up this item. By renaming it, you affect something that I contributed money to -- so I have done a compromise-renameback to preserve a reasonable modicum of Google SEO...

Even though GPUs cannot achieve that many frameslices per second or synchronous updates per second at such precisions, you'll be able to get closer and closer to original machine. Some GPUs manage to pull of near per-line accuracy, as already seen in the videos.

Your comment came more than a year after this was already created. There is already Google SEO and media articles that use "Lagless VSYNC" nomenclature. Although the prevailing terminology is now "Beam Raced Sync" or "Scanline Sync" they are synonyms to "Lagless VSYNC" as being the mathematically lowest-possible-lag-penalty Direct3D/OpenGL workflows for raster-based emulators that uses direct emulation (no runaheads).

Blur Busters was the one who convinced Guru3D to add the Scanline Sync mode to RTSS, and it has became a popular alternate sync mode in lowering input lag, when used as the right tool for the right job. Alas, that is framebuffer based (like 1-frameslice beamracing), so doesn't reduce lag to subframe levels.

For emulator frameslice beamracing -- achieving the zero differential feels impossible like trying to reach the speed of light, but one can get closer [5ms, 1ms, 0.5ms, 0.1ms....] to latency symmetry with original machines. Frame-granularity emulation can never do that.

While the terminology "Lagless VSYNC" may now be deprecated in favour of other more popular terms like "Scanline Sync" -- it does requires qualifications (approach zero relative to original -- in "Einstein is relative" manner). It is the only mathematical path that can approach zero differential between original and emulator.

Beamraced sync is the most mathematically perfect way for traditional software-based emulators to achieve latency symmetry with original machines via single-pass execution [i.e. no runahead].

blurbusters commented 5 years ago

Oh, and by the way, use the power management API to turn off power management during beam racing. Power management interferes a lot -- LOT. I find beam racing improves a lot if power management is turned off. It's also possible to self-monitor when beam racing is erratic and do corrective actions (e.g. notify user of poor quality sync between emuraster / realraster).

Now, that doesn't mean always disable power management. Just provide as an option for high-quality beam racing. For the low-power route, if you use low-granularity frameslices and have 1ms accurate timers, even 4-frameslice or 6-frameslice beamracing is still precise enough with low-quality 1ms timers (A 60Hz refresh cycle is 16.7ms, and a single frameslice is a quarter of that -- about 4ms. Doable with low precision 1ms timers)

Beam racing can be reliable even on Raspberry PI when tweaked properly. WinUAE beamracing works reasonably well on very unpowered laptops with 10-year-old Intel GPUs when configured properly. It uses less CPU than runahead, so beamraced sync is a lower-processing-cost lag lowering technique if you find an equilibrium (e.g. low frameslice count such as 4 to 10). Trying to do frameslice-level runahead will more than eliminate this capability.

Getting ultralow-lag out of embedded GPU/CPUs or processor-heavy for taxing emulator modules is easier with beamraced sync only (at low frameslice granularity, or by enabling front-buffer mode.

Properly done, beamraced sync only adds less than 1.1x to 1.5x processing power. Basically needing only 10% to 50% more CPU to beamrace (you can even turn off busywaits and simply use millisecond timers as described above! That's less than even basic runahead! Assuming you properly adjust to an equilibrium. The precision can be selectable: Generic 1ms Timer Event | Precise Timer Event | Ultra Precision Busywaits ..... Even 4-frameslice coarse-granularity beam raced sync, easily achieved using plain old fashioned 1ms timers, is still subframe latency achieved without runahead. To help out laptop memory (shared memory), you can do partial-framebuffer blit flags, so instead of hammering the RAM, you just blit only the framesliceful region.

Low lying apple tricks can turn beamraced sync into a low-overhead operation. Blitting 4 fractions of a refresh cycle is the same amount of bytes transferred as one full framebuffer full; many graphics drivers let you do that too. There's flags that already exist in Windows drivers. This keeps things gentle on shared RAM, to the point where even an old Intel GPU was successfully able to do 10-frameslice-count WinUAE (600 frameslices per second), including enough processing room left to add a CRT filter.

There's flags to make Present() cause only a partial memory transfer between the software and the display RAM -- flags already exist. That works to our favour in transferring beamraced frameslices on slow shared-memory systems. So 600fps is really 600 tenth-frames per second instead, with the same number of bytes transferred, you know!

In my tests, with low-frameslice-granularity emuraster-realraster sync, it definitely uses less overhead than runahead, so this could be excellent lag-eliminator for Android boxes and low-cost emulators, should someone take upon this project.

Runahead and beamraced-sync both have their respective pros/cons, but having sub-2x overhead isn't one of runahead's advantages which has limited the ability to use runahead on emulator modules that use 60%-70% CPU (i.e. on lower end CPUs like Raspberry PI CPU, or a powerful cycle-exact emulator). In that situation you need beamraced sync as a practical purist method of reducing input latency with less CPU overhead.

GPU power management tip: Make sure you don't have more than approximately 0.5ms to 1ms of idle time between frameslices. For some reason, emulators tend to aggressively trigger GPU power management, because GPUs love to sleep for many milliseconds (>8ms) at such random times when it detects an idle moment. This totally messes up beamraced sync. Setting "Performance Mode" completely solves this. But often "Balanced Mode" is sometimes enough /provided/ you use enough frameslice count. Basically, adjust your frameslice count to a goldilocks count. Using too low a frameslice count can cause power management to automatically occur (adding many milliseconds of mis-sync between emuraster and realraster). This unexpected behaviour creates an addition to best practices: Don't use too low a frameslice count if your hardware is capable enough. (Even increasing frameslice count from 4 to 6 to 8 on a 10-year old GPU sometimes improved things!). You could even design things to make frameslice count dynamic, automatically adapt to your hardware capabilities. GPU/CPU low %....automatically increase frameslices. CPU/GPU high %....automatically decrease frameslices. Frameslices don't even have to be exact same sizes. The frameslice can be a varying thickness throughout the refresh cycle, even! The chasebehind would simply be a scanlines-number offset, that is roughly optimized to prevent artifacts from appearing (a slider can easily adjust this vertical chase distance between emuraster and realraster ... you'd simply eyeball for artifacts during horizontal scrollers, to calibrate for reasonably low artifacts-free lag). Anyway, this is just an idea. For now, keep things simple, but just saying...

Benchmarking efficiency of beamraced sync is extremely hard because a lot of the horsepower is consumed by memory transfers of high-rate Present() as well as the busywaits. But all of that is optional

Parameters for Low-Overhead Beam Raced Sync

(for mobile, Raspberry PI, and high-%-CPU emulator modules)

Configurable Option: Old fashioned timer mode (1ms precision) instead of high precision timer or busywaits. Suitable for low-frameslice beamracing
Configurable Option: Lower frameslice granularity to between 4 frameslices to 10 frameslices.
Configurable Option: Make Present() or glxxSwapBuffers() do partial framebuffer transfers. Good for shared-memory GPUs, so bytes/sec blitted CPU->GPU is unchanged regardless of frameslice count.
Configurable Option: Jitter margin (Frameslice chasebehind) configured to approximately 2/10th refresh cycle or 3/10th refresh cycle (the vertical distance between emuraster-realraster) for 10 frameslice beam racing. This gives more jitter safety margin for imprecise timers (1ms timers), while still achieving really-subframe latency.
Configurable Option: Moderate control over the "Disable power management" API. Or at least automatically detect problematic power management modes.

Note about power management modes: Battery Saver Mode will kill beam racing precision, but Balanced mode will work on some hardware, while CPU 100%, GPU 100%, High Performance Mode on highend GPUs. Mobile processors can get latency-symmetry to originals/FGPA engines to within approximately ~3.2ms error (10 frameslice at 2 frameslice jitter margin), while high end desktop GPUs on i7 can achieve latency-symmetry to originals/FPGA engines to within less than 1ms error.

Once configured to these parameters, I've seen CPU overheads surge to only approximately 1.1x

Aside: Hmmmmm (sudden lightbulb idea). Although this wasn't why I created this item. I wonder if a profit motive by a portable emulator hardware developer would be willing to take upon this project to gain near-FPGA-league latency in RetroArch or RetroPIE on more heavy emulators on cheaper CPU/GPU hardware. If they knew, they'd dump hundreds on this github bounty in a "Take My Money" rush. If only the emulator hardware manufacturers knew how valuable this github item was! Running more powerful emulators at ultralow lag on cheap-BOM devices that can sell at bigger profit margins. Maybe some of you want to ask them to contribute to this BountySource? ;) ...

I continue to be impressed at how scalable up/down beamraced sync is. It's surprising horsepower-heavy at some settings, but surprisingly horsepower-miserly when optimized with some compromise settings. Brilliant as Runahead is, using Runahead can never be as low-overhead as efficiency-optimized beamraced sync!

TL;DR: Beam raced sync can use only 10% more CPU overhead than plain old VSYNC ON, with the right optimizations.

mdrejhon commented 4 years ago

Update: This issue is owned by me.

The ghost account is because when I switched from personal account to a corporate account, this original account became 'ghost'. Ugh -- an unintended consequence.

inactive123 commented 4 years ago

Hi there, I'm still trying to find willing bounty hunters able to take on this task. It's still on my radar and I'm still trying to sweeten the pot for whichever would-be developer would like to take on the challenge.

mdrejhon commented 4 years ago

I noticed that a few did withdraw their share of bounty, but my bounty certainly still remains --

Update, @TomHarte added beam racing to the Mac branch of his crossplatform CLK emulator https://arstechnica.com/civis/viewtopic.php?p=38773471#p38773471

He uses the technique of a time offset between CVDisplayLink vertical timing events, to guesstimate the raster to an accuracy margin sufficient enough for 600-frameslice beam racing (sub-refresh-cycle latency) for the majority of Macs he has tried it on.

I'm willing to propose simplifying the requirements (to remove Mac and Linux for now) if nobody else contributes to the BountySource. However, it should be architectured in a way that future Linux and Mac support should be easily added.

TomHarte (Thomas Harte) successfully added beam racing to the Mac branch of his cross-platform emulator CLK, so there's already some sample code available.

mdrejhon commented 4 years ago

I am cross posting some messages from the Ars comments section:

Why beam raced sync latency is much more accurate than RunAhead latency: https://arstechnica.com/civis/viewtopic.php?p=38773166#p38773166

And recommended simple method of debugging beam racing (striped strips on border): https://arstechnica.com/civis/viewtopic.php?p=38773501#p38773501

The benefit of putting this type of work into more emulators is FPGA-league latency identicalness.

Even RetroArch's RunAhead (ArsTechnica article) which is amazing, can't achieve faithful latency like FPGA, due to things like latency-nonlinearity. For RunAhead, emulated scan rate in RunAhead is always faster than realtime, generating rasters faster than realtime to offscreen frame buffers. This distorts time between mid-screen inputreads (sawtooth latency effects at 60 sawtooths per second, distorting latency extremes to different points of the 1/60sec = 16.7ms time windows), even if you manage to match average latency to original machine or FPGA machine. Also Game A may inputread at end of VBI, and another Game B may inputread at beginning of VBI. And Game C may inputread at raster # 002 or # 199 or whatever. So input lag differentials between a RunAhead emulator and the original machine (at same RunAhead setting), will vary in a window of [0..16.7ms] because of the latency distortions within the RunAhead algorithm. Compare this 16.7ms of lag nonlinearity to the 1ms lag-behavior symmetry achieved by WinUAE's GPU-beamraced sync.

With RunAhead and photodiode oscilloscope, one can attempt calibrate lag of one game to match original machine. But a different game will diverge in lag from original machine as a result of that original calibration. So worst-case lag non-faithfulness difference is 16.7ms between Game X (vs original machine) and Game Y (vs original machine) with identical RunAhead settings. So RunAhead can't create universal latency faithfulness. It's a stunning emulator innovation, but latency purists know it does not duplicate FPGA latency behaviors.

Thus, while RunAhead is an amazing invention, it can never replicate "original input latency" or replicate "FPGA league latency" in a software emulator as well as beam raced sync algorithms (emuraster-realraster synchronization).

So there are now PC emulators and Mac emulators using beamraced sync now,

TomHarte commented 4 years ago

Sorry, to be fully clear: I've implemented all the parts of raster racing, but not yet flipped the switch for a combination of factors. But, of specific interest:

I[use dispatch_source_set_timer for timing, feeding a dedicated dispatch queue. It not only takes a nanosecond-precision period but actually seems to do a pretty good job of honouring it, at least if comparing to std::chronological::high_resolution_clock is valid.

CVDisplayLink provides retrace notifications, naturally. It provides a retrace period and a frequency; the only thing I initially got wrong was not making sure to create a new link each time my application moves to a new display. I have both my laptop's built-in display and an external monitor, with different display rates; using CVDisplayLinkCreateWithActiveCGDisplays to create "a display link capable of being used with all active displays" gave me a synthetic timer not actually tied to either display's retraces.

The process I settled on was just setting the dispatch source to a moderately-high number; each timer window is treated as a single discrete time step except for those in which a display-link callback has fallen. Those are split into two parts, before the callback and afterwards.

If the display's rate and emulated machine's rate are sufficiently compatible then a phase-locked loop attempts to pull the emulated machine's vertical sync into phase with the host machine's. That's a permanent, ongoing process that occurs purely through observation of the video signal because several of the machines I attempt to emulate have variable, programmatic output rates.

At present I have chickened out and I just do a buffer copy at the identified divide point in the relevant timer window. For fixed-precision beam racing I'd just need to switch to doing a copy at the end of every timer window and use a fixed offset so that my PLL drags the emulated machine into phase one timer window ahead.

The reasons I haven't yet done are objectively insubstantial — I certainly don't see any technical barriers on macOS.

mdrejhon commented 4 years ago

UPDATE

I'll now permit this alternative feature addition to claim the BountySource pot, if a programmer thinks this method is easier. If other BountySource donators agree, we can expand the qualifying criteria to choose this alternative approach (~$500 BountySource claim at https://www.bountysource.com/issues/60853960-add-beam-racing-scanline-sync-to-retroarch-aka-lagless-vsync ...)

Beam Racing Concept: Temporally Emulate a CRT Electron Gun Too, Not Just Spatially.

(And optionally, simultaneously beam race it -- basically using brute Hz as a coarse emuraster=realraster sync method)

Long-term, I’d like to see some emulators start to consider temporally emulating an electron gun. The sheer brute-force of refresh cycles (240Hz, 360Hz) can be used to create a granular CRT electron gun emulation.

Also, I posted a suggestion about a future “software-based rolling scan” for 240Hz and 360Hz monitors at the GroovyMAME forum – aka Temporal HLSL where you use the brute refresh rate to emulate a CRT electron gun at sub-refresh levels, but I should probably create a new forum thread for it. I also posted a issue at the MAME GitHub too as well.

High refresh rate HDR displays are good for 60Hz emulation because of:

Lower input lag
Better BFI (because the increased hertzroom improves quality of low-Hz software BFI)
Opportunities to emulate CRT electron gun via rolling-bar BFI
HDR creates brightness headroom for the dimming of CRT emulation
Software-based rolling-bar emulation can be beamraced (in sync with emulator raster)

However, I think this should become a new RetroArch issue being open too, as a long-term incubation. These are two separate issues (one or both of which I'm willing to finance incubation of, as a fan of emulators).

Theoretically, it is easier to implement than beamraced VSYNC, since we only need to worry about the display at the full refresh cycle level (plain ordinary old-fashioned VSYNC). The refresh rate race to retina refresh rates are producing a boom of high-Hz monitor.

We’re looking forward to the upcoming DELL 360Hz IPS monitor (AW2521H without the F suffix), which will allow high-quality 6-segment rolling bar emulation of a CRT electron gun.

The same proposed “retro_set_raster_poll” could still be added to RetroArch, to benefit this initiative too (not just beamraced VSYNC). Since that’s a univeral API for futureproofed beamracing techniques including this alternative “beam racing via brute Hz” approach.

Another bonus, software-based rolling-bar BFI doesn't need to care about display scanout direction, so it'd work on all display rotations.

If one is smart enough to architecture well... The frameslice beamracing workflow can be made futureproof.

Modular enough to output to either a hardware beamracer (e.g. VSYNC OFF frameslice beamracing the real raster) or software beamracer (e.g. multiple real refresh cycles per emulator refresh, including rolling-bar BFI).

That's why we need a retro_set_raster_poll API to be added to RetroArch, even pre-emptively (even without adding beam racing support yet). It opens-up an entire universe of possible real-world beamracing temporal preservation methods.

TomHarte commented 4 years ago

I suspect we've strayed beyond where any input from me is helpful, and I don't think I've fully implemented what is being asked, but in CLK all machines output a linear video stream which a virtual CRT transposes into 2d by the usual means of sync separation plus PLLs, so that exits the machine as a list of 2d raster scans with 1d streams of data attached. The whole thing is rendered as geometry, being at least one quad per line of output.

The scans are fed out in real time, I haven't done anything to make frames atomic — it's only if the host and emulated refresh rates are compatible and the two syncs are nudged into phase that you get something like traditional indivisible frame output.

However I deviate from what mdrejhon describes in that I blend each set of new scans on top of the old, because I was primarily fixated on the 50Hz @ 60Hz scenario rather than e.g. 60Hz @ 240Hz. So I mentally phrased this as motion aliasing versus softness and went with softness. You can see some tearing in high-speed 50Hz games but it's less offensive than it might have been because it's not a hard tear, and also avoids the extreme latency that would otherwise accrue if I sometimes effectively held back 50Hz input for two complete 60Hz frames.

I agree it would be smarter when output rates are much higher than the host machine to skip the blending — especially if/when wider dynamic ranges are available, and subject to having enough buffered that you can pause emulation on a complete display, of course. I'm not sure we'd necessarily agree on blending as I currently use it, but that is what I have currently implemented.

inactive123 commented 4 years ago

@TomHarte Would you be interested in taking on the bounty? We could add $200 to it as an additional sweetener.

mdrejhon commented 4 years ago

of course. I'm not sure we'd necessarily agree on blending as I currently use it, but that is what I have currently implemented.

Actually, this is still useful information!

In an ideal world, blending is not normally necessary for hardware-based VSYNC OFF frameslice beamracing, if you use the jittermargin technique. (e.g. emulator-plotting the emulator raster ahead of real-raster). The blending never becomes visible, as long as the emulator raster stays ahead of real raster. In other words, we're simply using high-framerate VSYNC OFF as a stand-in for a front-buffer.

That said, blending might reduce the "tearing artifacts during computer slowdown" situation. If the real raster runs ahead of emulator raster, then you get VSYNC OFF tearing artifacts. So alphblending the boundaries of frameslices of refresh cycle ontop of the old refresh cycle, can in theory soften/reduce these "beamrace failure" artifacts. If you're finding blending mandatory because artfacts continually happen, then that's a problem to diagnose, too, in addition.

As a debug-assister, blending should be an option that can be temporarily disabled, so that things can be tweaked so that the beamraced sync is good enough that blending never helps. So the non-blending becomes a good beamrace debugger. If artifacts disappear and blending-vs-nonblending looks identical, emuraster is correctly permanently staying ahead of realraster.

That said, however blending would be absolutely mandatory if you use sheer-Hz VSYNC ON beamracing (e.g. rolling-scan beamracing of 60Hz emulator onto 240Hz LCD). Basically present the full frame buffers, which the display has to scanout anyway.

I think far ahead of most people, and just want to conceptualize a generic cross-platform beamraced frameslice delivery mechanism that supports both hardware-based beam racing (syncing emuraster to realraster) and software-based beam racing (software based rolling scan piggybacking on sheer Hz)

Heck, in fact, you could do both simultaneously anyway, so the venn diagram of hardware-based beam racing can overlap the software-based beam racing! During realHz=emuHz situation, you'd flywheel the sync appropriately and it's hardware beamraced sync, and blending becomes redundant. During "realHz far above emuHz" situation, you're pure software-based beamracing to the Hz granularity, and using the alphablend to hide the scanout seams between the refresh cycles of destination Hz.

TomHarte, is that what you're actually already doing? If so, then that's freaking brilliant. It's like a rolling-scan emulator, except without the blackframe portion being added yet (not yet emulating a CRT electron gun phosphor fade).

So just refinements+merger to ideas really. You might need to tweak it a bit, to add some jitter-margin awareness + adjustable-height blend gradient (vertical dimension), so that it goes tearingless / blendless when the flywheel sync is within the beamraced jitter margin above the blendarea. And less objectionable tearing artifacts when it does emuHz-vs-realHz does go out of sync.

I think I need to essentially rewrite the Beamraced VSYNC textbook, so that one beamrace approach could potentially covers all.

However I deviate from what mdrejhon describes in that I blend each set of new scans on top of the old, because I was primarily fixated on the 50Hz @ 60Hz scenario rather than e.g. 60Hz @ 240Hz

Actually, this isn't mutually exclusive! And a perfectly fantastic idea. See the "Concept of Hz-Agnostic Rolling Scan BFI" section at http://forum.arcadecontrols.com/index.php/topic,162926.0.html (scroll down to post #9) ... Except you're simply doing full persistence without a black fadebehind.

It would scale as little as you wish (like 50fps @ 60Hz blending) and scale as far as you wish (like 60fps at 1000Hz), without requiring the source Hz and destination Hz to be divisible.

Concept of Hz-Agnostic Rolling Scan BFI (CRT Scanning Emulation)

Situation Example of 60Hz CRT emulation onto a 200Hz LCD

Emulator Refresh Cycle 1 ....Real Refresh 1: full 60/200th height bar (30% screen height), at 0%-30% vertical position ....Real Refresh 2: full 60/200th height bar (30% screen height), at 30%-60% vertical position ....Real Refresh 3: fuill 60/200th height bar (30% screen height), at 60%-90% vertical position ....Real Refresh 4: 1/3 of 60/200th height bar (10% screen height), at 90%-100% vertical position

Emulator Refresh Cycle 2 ....Real Refresh 5: 2/3 of 60/200th height bar (20% screen height), at 0%-20% vertical position ....Real Refresh 6: full 60/200th height bar (30% screen height), at 20%-50% vertical position ....Real Refresh 7: full 60/200th height bar (30% screen height), at 50%-80% vertical position ....Real Refresh 8: 2/3 of of 60/200th height bar (20% screen height), at 80%-100% vertical position

Emulator Refresh Cycle 3 ....Real Refresh 9: 1/3 of 60/200th height bar (10% screen height), at 0%-10% vertical position ....Real Refresh 10: full 60/200th height bar (30% screen height), at 10%-40% vertical position ....Real Refresh 11: full 60/200th height bar (30% screen height), at 40%-70% vertical position ....Real Refresh 12: full 60/200th height bar (30% screen height), at 70%-100% vertical position

In this situation, you don't have to worry about the hardware raster position. Just worry about ordinary 200fps VSYNC ON, just simply making sure there's no framedrops, and emulate your beamrace into that, while appropriately alpha-blending the seams (as TomHarte says). At sheer refresh rates like 360Hz, you can permit a little extra frame queue depth (e.g. 2 frames) to de-stutter any erratic computer performance, knowing it's a mere 2/360sec latency penalty. Since the effort of alphablending two frames is really fast (my GPU can do it about 1000 times a second).

For full persistence, all bars are illuminated (like emulator 50Hz on real 60Hz, or vice-versa), but for reduced persistence, you'd add blackness to create the rolling-bar black frame insertion, and appropriately alphablend from the frameslice towards black (Rather than alphablend to previous refresh cycle).

For bad mismatches, you don't have to bother to flywheel-sync the real-raster and emu-raster. For close matches (e.g. 59.94fps for a 59.97Hz display), it would be perfectly fine to flywheel sync (and speedup/slowdown audio very fractionally).

Essentially, the flywheel enable/disable feature would essentially glue together both the Lagless VSYNC approach and the Beam-Raced Temporal HLSL approach.

So the TomHarte approach and BlurBusters approach is mergeable! Beamracing via sheer Hz is much more crossplatform.

Proposal: Easier BountySource Claim Flexibility

I now propose that the bounty qualifies for ANY kind of beamraced output, (A) The TomHarte approach of full persistence Hz-Agnostic rolling scan beam race method (B) The Blur Busters Lagless VSYNC Concept (C) The Beam Race Via Sheer Hz Concept (D) The Merged Version Thereof (BFIv3 "Temporal HLSL" that's also optionally capable of full persistence rolling-scan sample-and-hold, ala TomHarte)

The venn diagram of all the above, is essentially mergeable. But I would propose to prevent complicating things for the BountySource, I'd say, (A) or (B) or (C) or (D) alone, is sufficient enough to qualify for the bounty. Thoughts?

@TomHarte Would you be interested in taking on the bounty? We could add $200 to it as an additional sweetener.

Either way, for RetroArch, the first step is a retro_set_raster_poll hook, which will unlock any beamraced-output approaches, whether it be hardware-dependent or pure-software. Would you be able to add that?

Preprogramming this simple hook to would make it easier for anybody to add any beam raced output approachs to RetroArch.

mdrejhon commented 4 years ago

Here you go:

#10757 (BFIv3) Emulate a CRT Electron Gun Via Rolling-Scan BFI

Theoretically, both this GitHub item (#6984) and the BFIv3 (#10757) can essentially become an identical task.

This may be helpful for people who find BFIv3 conceptually easier to program than this GitHub item, though it would need a rolling full-persistence option too.

If programming from that angle, and later adding this GitHub item as a subset of #10757, perhaps using a @TomHarte-derived flywheel sync algorithm that triggers only when emuHz-realHz is close enough.

mdrejhon commented 4 years ago

Task Breakdown Simplification:

GitHub item #10758 is a pre-requisite for this.

I broke out the retro_set_raster_poll pre-requisite separately, because it's a universal requirement for all possible beamraceable output techniques.

jayare5 commented 3 years ago

Hey, I'd love to contribute some money to the bounty! But I see that it hasnt had anything added since 2018 and Im feeling hesitant, Is it worth doing it? Also it would be cool to promote in some way, I'm surprised I don't hear more people talking about it!

mdrejhon commented 3 years ago

Hey, I'd love to contribute some money to the bounty! But I see that it hasnt had anything added since 2018 and Im feeling hesitant, Is it worth doing it? Also it would be cool to promote in some way, I'm surprised I don't hear more people talking about it!

It's still a valid bounty. Most of the funds are mine -- and this bounty will be honored.

There was a bit of talk about it in 2018, but currently quiet on these fronts at the moment.

The buzz can be restarted at pretty much any time, if a small group of us of similar interests can start a buzz campaign about this. Some of us have jobs though, or got affected by pandemic affecting work, and have to work harder to compensate, etc. But I'm pretty much 100% behind seeing this happen.

BTW, the new "240 Hz IPS" monitors are spectacular for RetroArch (even for 60Hz operation).

johnnovak commented 2 years ago

I find it so weird that there aren't dozens of devs jumping at the opportunity to implement this... More than 4 years have passed since this ticket was created and still no working implementation?! Huh?!

Input lag is one of THE most pressing issues that needs addressing in emulators, and WinUAE has proven that this technique works extremely well in practice. With the "lagless vsync" feature enabled in WinUAE with a frame-slice of 4, I really see zero reason to bother with real hardware anymore. The best of all — it works flawlessly with complex shaders! It's a huge game-changer, and I'm quite disappointed that developers of other emulators are so incredibly slow at adapting this brilliant technique.

For the record, I don't care about RetroArch at all, otherwise I'd be doing this. But I started nagging the VICE devs about it; their C64 emulator badly needs it (some C64 games are virtually unplayable with the current 60-100ms lag). Might follow my own advice and will implement it myself, eventually...

LibretroAdmin commented 2 years ago

This bounty is solely for a RetroArch implementation.

We also regret that nobody has picked this up yet. We have tried funding it with money, clearly that is not enough. It has to come from the heart from someone passionate enough and capable to do it.

mdrejhon commented 2 years ago

Yes. WinUAE has led the way, having already implemented this.

Someone needs to add retro_set_raster_poll placeholders (see #10758).
Then this task becomes much simpler.

As a reminder to all -- this techinique is really truly the only way to organically get universal organic original machine latency in an emulator (universal native-machine / FPGA-league latency originality). VSYNC OFF frameslice beam racing is the closest you can get to raster-plotting directly to front-buffer, one row at a time, in real time, in sync with the real world raster.

Same latency as original machine, to the error margin of 1-2 frameslices (subrefresh segments). Some of the faster GPUs can now exceed 10,000 frameslices per second.

We are rapidly approaching an era where we may be able to do full fine granularity NTSC scanrate too! (1-pixel tall VSYNC OFF frameslices -- e.g. each pixel row is its separate tearline)

johnnovak commented 2 years ago

Yes. WinUAE has led the way, having already implemented this.

Someone needs to add retro_set_raster_poll placeholders (see #10758). Then this task becomes much simpler.

Talked to the VICE people today about it. They're considering it, but some large scale refactorings will come first, which might take years.

LibretroAdmin commented 2 years ago

I'd like to at least start implementing some of the auxiliary things which would be needed to get the whole thing going.

Thankfully blurbusters provided a lot of documentation and I feel like it should be possible to maybe break up all that has to be done into chunks. If we get some of these chunks done, even without a working implementation the entire thing might not seem so daunting to do.

hizzlekizzle commented 2 years ago

As I've mentioned elsewhere, I believe one of the major hurdles for RetroArch/libretro in implementing this is that we typically work in full-frame chunks. That is, the core runs long enough to generate a frame's worth of audio and video, then passes it to the frontend. For this, we'll need to pass along much smaller chunks and sync them more often.

I suspect the cores that already use libco to hop between the core and libretro threads are probably going to be the lowest-hanging fruit. IIRC someone (maybe RealNC?) tinkered with this awhile back unsuccessfully, but I don't recall what exactly fell short.

mdrejhon commented 2 years ago

As I've mentioned elsewhere, I believe one of the major hurdles for RetroArch/libretro in implementing this is that we typically work in full-frame chunks. That is, the core runs long enough to generate a frame's worth of audio and video, then passes it to the frontend. For this, we'll need to pass along much smaller chunks and sync them more often.

I suspect the cores that already use libco to hop between the core and libretro threads are probably going to be the lowest-hanging fruit. IIRC someone (maybe RealNC?) tinkered with this awhile back unsuccessfully, but I don't recall what exactly fell short.

That's exactly what the retro_set_raster_poll is designed to do. Please look at #10758. I've already addressed this.

Several emulators (e.g. NES) already render line-based.

We simply need to add callbacks there, and it will be highly configurable for the future, with any one or more of the following:

Optional frameslice beam racing (very easy, already done by WinUAE)
Optional line based beam racing
Optional long term future CRT simulators (e.g. shaders that simulate a CRT electron beam in real time (e.g. using 8 refresh cycles on a 500Hz monitor to simulate one 60Hz CRT refresh cycle in a rolling-scan style BFI with simulated phosphor fadebehind)- #10757

I actually already spent dozens of hours researching RetroArch's source code. It's simpler thank you think. The first step is adding the raster scan line callback to the existing RetroArch callback APIs -- header it out to template it in, even if no module is "activated" yet.

Then it is a simple matter of activation one module at a time (on modules that already render line-based)

The flow is

Add the line-based callback placeholders, according to instructions at #10758 which does precisely what you just described.
It's just merely simply a modified header file, and all the modules need to have dummy empty functions added. That's it.
Add VSYNC OFF frameslice beamracing (any graphics API capable of tearlines can do it).
Then implement it on ONE module (one that already renders line-based, like the NES module).

Step 1 is easier than you think, if you ANY raster interrupt experience at all. Step 2 simply needs to gain some

johnnovak commented 2 years ago

☝🏻 I'm 99% sure the answer is similarly simple with VICE. The problem there is more the infrastructure side of things; now it's tightly coupled with GTK and it uses some vsync mechanism provided by GTK (well, the GTK3 version, at least; the SDL one would be easier to hack I assume).

Raster interrupts are common on the C64, so it's either already rendering by rasterline internally, or it would be trivial to add (haven't read the source yet).

People are vastly overestimating the difficulty of implementing this technique, I think... Okay, maybe in a very generic framework like RA it could be a little bit trickier.

libretro / RetroArch

Add Beam Racing/Scanline Sync to RetroArch (aka Lagless VSYNC) #6984

Feature Request Description

Bounty available

Synchronize emu raster with real world raster to reduce input lag

Already Proven, Already Working

Preservationist Friendly. Preserves original input lag accurately.

Less horsepower needed than RunAhead.

Diagrammatic Concept

Simple Pre-Requisites

The simplified retro_set_raster_poll API Proposal

Further Detail

Which emulators does this benefit?

Related Raster Work on GPUs

Common Developer Misconceptions

Proposal

Questions?

$120 Funds Now Added to BountySource

Trigger for BountySource completion:

Effort Assessment

It's Easier Than Expected. Learning/Debugging Is The Hard Part

BountySource Donation Dollar Match Thru the $360 Level

EDIT: Dollar match maxed out 2018/07/17 -- I've donated $360

Remember, I am the inventor of TestUFO.com and founder of BlurBusters.com

As a result, I know what I am talking about.

AND

BountySource now $140

BountySource now $200

$850 on BountySource

Now $1050 BountySource

Useful Links

Minimum Pre-Requisites for Cross-Platform Beam Racing

Quick Reference Of Available VSYNC timestamping APIs

Other workarounds for getting VSYNC timestamps in VSYNC OFF mode

Bounty now $1132

retro_set_raster_poll API proposal

Some emulator modules only need roughly 10 lines of modification

About Interlaced Modes

For simplicity, supporting beam racing during interlaced modes is not a mandatory requirement for claiming this bounty -- however it is easy to support or to add later (by a programmer who understands interlacing & deinterlacing).

Parameters for Low-Overhead Beam Raced Sync

TL;DR: Beam raced sync can use only 10% more CPU overhead than plain old VSYNC ON, with the right optimizations.

Concept of Hz-Agnostic Rolling Scan BFI (CRT Scanning Emulation)

Proposal: Easier BountySource Claim Flexibility

#10757 (BFIv3) Emulate a CRT Electron Gun Via Rolling-Scan BFI

Task Breakdown Simplification: