Open blurbusters opened 6 years ago
Additional timesaver notes:
General Best Practices
Debugging raster problems can be frustrating, so here's knowledge by myself/Calamity/Toni Wilen/Unwinder/etc. These are big timesaver tips:
Hopefully these best practices reduce the amount of hairpulling during frameslice beamracing.
Special Notes
Special Note about Rotation: Emulators should already report their screen orientation (portrait, landscape), which generally also defines scan direction. QueryDisplayConfig() will tell you the real screen orientation. The default orientation is always a top-to-bottom scan on all PC/Mac GPUs; a 90-degree counterclockwise display rotation changes the scan direction to left-to-right. If emulating Galaxian, this works out fine if you're rotating your monitor (left-right scan) to match Galaxian (left-right scan) -- then beamracing works.
Special Note about Unsupported Refresh Rates: Begin KISS and worry about 50Hz/60Hz only at first. Start easy. Then iterate, adding support for other refresh rates such as multiples. 120Hz is simply cherrypicking every other refresh cycle to beamrace. For the in-between refresh cycles, just leave the existing (already completed) frame up until the refresh cycle you want to beamrace is about to begin. In reality, there are very few unbeamraceable refresh rates -- even beamracing 60fps onto 75Hz is simply beamracing cherrypicked refresh cycles (it'll still stutter like 60fps at 75Hz VSYNC ON, though).
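To make the cherrypicking concrete, here is a minimal sketch in C (this thread's project language) of deciding which refresh cycles to beamrace when the display refresh rate is an integer multiple of the emulator refresh rate. The helper name is hypothetical, not an existing RetroArch function:

```c
#include <stdbool.h>

/* Hypothetical helper: when the display refresh rate is an integer
 * multiple of the emulator refresh rate (e.g. 120Hz display, 60Hz
 * emulator), beamrace every Nth refresh cycle and leave the completed
 * frame up during the in-between cycles. */
static bool should_beamrace_cycle(unsigned refresh_cycle,
                                  unsigned display_hz,
                                  unsigned emu_hz)
{
    if (display_hz % emu_hz != 0)
        return true;   /* non-integer multiples (e.g. 60fps onto 75Hz)
                          need a different cherrypicking rule -- not
                          covered by this sketch */
    unsigned ratio = display_hz / emu_hz;   /* e.g. 120/60 = 2 */
    return (refresh_cycle % ratio) == 0;    /* every other cycle at 120Hz */
}
```

At 120Hz this races cycles 0, 2, 4, ...; the odd cycles simply redisplay the completed frame.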
Advanced Note about VRR Beam Racing: Before beam racing variable refresh rate modes (e.g. enabling GSYNC or FreeSync and then beamracing that), wait until you've mastered all the above. So for now, disable VRR when implementing frameslice beamracing for the first time, and add VRR compatibility as a last step once you've gotten everything else working reasonably well. It's easy to do once you understand it, but the concept of VRR beamracing is a bit tricky to grasp at first. VRR+VSYNC OFF supports beamracing on VRR refresh cycles. The main consideration is that the first Present() begins the manually-triggered refresh cycle (.InVBlank becomes false and ScanLine starts incrementing), and you can then frameslice-beamrace it normally, like an individual fixed-Hz refresh cycle. One additional, very unusual consideration is the uncontrolled VRR repeat-refresh: you will need to do emergency catchup beamraces on VRR displays when the display decides to do an uncommanded refresh cycle (e.g. when a display+GPU decides to do a repeat-refresh cycle, which often happens when framerates go below VRR range -- e.g. under 30fps on a 30Hz-144Hz VRR display). Most VRR displays will repeat-refresh automatically until an untorn refresh cycle has been fully displayed. If this happens and you've already begun emulating a new emulator refresh cycle, you have to start your beamrace immediately (rather than at the intended precise time), because if you do a frameslice beamrace of a VRR refresh cycle, the GPU will immediately and automatically send a repeat-refresh to the display. There might be an API call to suppress this behavior, but we haven't found one, and this unwanted behavior makes beamraced 60fps onto a 75Hz FreeSync display difficult to do stutter-free.
But it works fine on 144Hz VRR displays -- we find it's easy to be stutter-free when the VRR maximum is at least twice the emulator Hz, since we don't care about automatic repeat-refresh cycles that don't collide with the timing of the next beamrace.
Added $120 BountySource -- permanent funds -- no expiry.
https://www.bountysource.com/issues/60853960-lagless-vsync-support-add-beam-racing-to-retroarch
Minimum refresh rate required: Native refresh rate.
Emulator Support Preferences: Preferably including either NES and/or VICE Commodore 64, but you can choose any two emulators that are easiest to add beam racing to.
Notes: GSYNC/FreeSync compatible beam racing is nice (works in WinUAE) but not required for this BountySource award; it can be a stretch goal later. Must support at least the native refresh rate (e.g. 50Hz, 60Hz), but it would be a bonus to also support multiples thereof (e.g. 100Hz or 120Hz) -- as explained, this is done by automatically cherrypicking which refresh cycles to beamrace (WinUAE-style algorithm or another mutually agreed algorithm).
Assessment is that Item 1 will probably require about a thousand-ish lines of code, while item 3 (modification to individual emulator modules) can be as little as 10 lines or thereabouts. 99% of the beam racing is already implemented by most 8-bit and 16-bit emulators and emulator modules, it's simply the missing 1% (sync between emuraster and realraster) that is somewhat 'complicated' to grasp.
The goal is to centralize as much of the beam racing complexity as possible and minimize emulator-module work -- and to achieve original-machine latencies (i.e. a software emulator with virtually identical latency to the original machine), which has already been successfully achieved with this technique.
Most of the complexity is probably testing/debugging the multiple platforms.
Toni of WinUAE said it was easier than expected. It's the learning that's hard: 90% of your work will be learning how to realtime-beamrace a modern GPU, and only 10% will be coding. Few people (except Blur Busters) understand the "black box" between Present() and photons hitting eyes. But follow the Best Practices and -- especially if you've ever programmed an Amiga Copper or a Commodore 64 raster interrupt -- you'll have the Eureka moment that modern GPUs are surprisingly crossplatform-beamraceable now, via the technical understanding that "VSYNC OFF tearlines are simply rasters; all tearlines ever created are simply rasters."
DOLLAR MATCH CHALLENGE -- Until End of September 2018 I will match dollar-for-dollar all additional donations by other users up to another $120. Growing my original donation to $240 in addition to $120 other people's donations = $360 BountySource!
How could this possibly be done reliably on desktop OSes (non-hard-realtime) where scheduling latency is random?
See above. It's already in an emulator. It's already successfully achieved.
That's not a problem thanks to the jittermargin technique.
Look closely at the labels in Frame 3.
As long as the Present() occurs with a tearline inside that region, there is NO TEARING, because it's a duplicate frameslice at the screen region that's currently scanning-out onto the video cable. (As people already know, a static image never has a tearline -- tearlines only occur with images in motion.) The jitter margin technique works, is proven, is already implemented, and is already in a downloadable emulator, if you wish to see for yourself. In addition, I also have video proof below:
If you've seen the UFO tests or similar on any website (RTings, TFTCentral, PCMonitors, etc.), they are likely using one of my display-testing inventions, for which I have a peer-reviewed conference paper with NIST.gov, NOKIA, and Keltek. So my reputation precedes me; now, with that out of the way:
You can adjust the jittermargin to give as much as 16.7ms of error margin (Item 9 of Best Practices above). Error margin with zero artifacts is achieved via the jittermargin (duplicate frameslice = no tearline). Testing shows we can get a ~1/4 refresh cycle error margin on Pi/Android and a sub-millisecond error margin on GTX 1080 Ti + i7 systems.
Some videos I've created of my Tearline Jedi Demo --
Here's YouTube video proof of stable rasters on GeForces/Radeons: THREAD: https://www.pouet.net/topic.php?which=11422&page=1
And the world's first real-raster cross platform Kefrens Bars demo
(8000 frameslices per second -- 8000 tearlines per second -- way overkill for an emulator -- 100 tearlines per refresh cycle with 1-pixel-row framebuffers stretched vertically between tearlines. I also intentionally glitch it at the end by moving around a window; demonstrating GPU-processing scheduling interference).
Now, it's much more forgiving for emulators, because the tearlines (that you see in this demo) are all hidden by the jittermargin technique. Duplicate refresh cycles (and duplicate frameslices / scanline areas) have no tearlines. You just make sure that the emulator raster stays ahead of the real raster, and frameslice new slices onto the screen in between the emuraster & realraster.
As long as you keep adding frameslices ahead of the realraster, no artifacts or tearing show up. A common beam racing margin in WinUAE is approximately 20-25 scanlines during 10-frameslice operation, so the margin can safely jitter (from computer performance problems) without artifacts.
If you use 10 frameslices (1/10th screen height) -- at 60Hz for 240p, that's approximately a 1.67ms jitter margin -- most newer computers can handle that just fine. You can easily increase jitter margin to almost a full refresh cycle by adding distance between realraster & emuraster -- to give you more time to add new frameslices in between.
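As a small illustration of that arithmetic (the function name is illustrative only, not existing code):

```c
/* Illustrative only: per-frameslice jitter margin in milliseconds.
 * At 60Hz with 10 frameslices per refresh, each slice spans 1/600 sec
 * of scanout, i.e. roughly 1.67ms of margin. */
static double frameslice_margin_ms(double refresh_hz, int frameslices)
{
    return 1000.0 / (refresh_hz * (double)frameslices);
}
```

So `frameslice_margin_ms(60.0, 10)` gives approximately 1.67, matching the figure above; fewer frameslices widen the margin at the cost of more lag.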
And even if there was a 1-frame mis-performance, (e.g. computer freeze), the only artifact is a brief sudden reappearance of tearing before it disappears.
Also, check the 360-degree jittermargin technique, part of Steps 9 and 14 of the Best Practices, which can massively expand the jitter margin to a full wraparound refresh cycle's worth:
- Instead of rasterplotting emulator scanlines into a blank framebuffer, rasterplot on top of a copy of the emulator's previous refresh cycle's framebuffer. That way, there's no blank/black area underneath the emulator raster. This greatly reduces the visibility of glitches during beamrace fails (falling outside of the jitter margin -- too far behind / too far ahead): no tearing will appear unless you are within 1 frameslice of the realraster, or 1 refresh cycle behind. A humongous jitter margin of almost one full refresh cycle. This plot-on-old-refresh technique also makes coarser frameslices practical -- e.g. 2-frameslice beamracing (bottom-half-screen Present() while still scanning out the top half, and top-half-screen Present() while scanning out the bottom half). When out-of-bounds happens, the artifact is simply brief instantaneous tearing for that specific refresh cycle only. Typically, on most systems, the emulator can run artifactless, looking identical to VSYNC ON, for many minutes before you might see a brief instantaneous tearline from a momentary computer freeze, which instantly disappears when the beamrace gets back in sync.
- Become more familiar with how the jitter-margin technique saves your ass. If you do Best-Practice 9, you gain a full wraparound jittermargin (you see, step 9 allows you to Present() the previous refresh cycle on the bottom half of the screen, while still rendering the top half...). If you use 10 frameslices at 1080p, your jitter safety margin becomes (1080 - 108) = 972 scanlines before any tearing artifacts show up! No matter where the real raster is, your jitter margin wraps around fully to the previous refresh cycle. The earliest bound is a pageflip too late (more than 1 refresh cycle ago); the latest is a pageflip too soon (into the same frameslice that has not yet finished scanning-out onto the display). Between these two bounds is one full refresh cycle minus one frameslice! So don't worry about even a 25 or 50 scanline jitter inaccuracy (erratic beamracing where the margin between realraster and emuraster can randomly vary) in this case... It still looks perfectly like VSYNC ON until it goes outside that 972-scanline full-wraparound jitter margin. For minimum lag, you do want to keep the beam racing margin tight (you could make the beamrace margin adjustable as a config value, if desired -- though I just recommend "aim the Present() at a 2-frameslice margin" for simplicity), but you can fortunately surge ahead slightly or fall behind a lot, and still recover with zero artifacts. The clever jittermargin technique, which permanently hides tearlines inside the jittermargin, makes frameslice beam racing very forgiving of transient background activity.
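The wraparound-margin arithmetic above can be sketched as a tiny helper (illustrative only, not actual RetroArch code):

```c
/* Illustrative helper: with the plot-on-previous-refresh technique
 * (Best Practice 9), the safe jitter window is one full refresh cycle
 * minus one frameslice, expressed here in scanlines. */
static int wraparound_jitter_margin(int total_scanlines, int frameslices)
{
    int slice_height = total_scanlines / frameslices;  /* 1080/10 = 108 */
    return total_scanlines - slice_height;             /* 1080-108 = 972 */
}
```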
And single-refresh-cycle beam racing mis-sync artifacts are not really objectionable (an instantaneous one-refresh-cycle reappearance of a tearline that disappears when the beam racing "catches up" and goes back to the jitter margin tolerances.)
240p scaled onto 1080p is roughly 4.5 real scanlines per 1 emulator scanline. Obviously, the real raster "register" will increment its scan line number roughly 4.5 times faster. But as you have seen, Tearline Jedi successfully beam-races a Radeon/GeForce on both PC and Mac without a raster register, simply by using existing precision counter offsets. Sure, there's 1-scanline jittering, as seen in the YouTube video. But tearing never shows in emulators because it's 100% fully hidden by the jittermargin technique, making it 100% artifactless even if it is 1ms ahead or 1ms behind (if you've configured those beam racing tolerances, for example -- this can be made an adjustable slider: tighter for super-fast, more-realtime systems, looser for slower/older systems).
But we're only worried about screen-height distance between the two. We merely need to make sure the emuraster is at least 1 frameslice (or more) below the realraster, relative-screen-height-wise -- and we can continue adding VSYNC OFF frameslices in between the emu raster and real raster -- creating a tearingless VSYNC OFF mode, because the framebuffer swap (Present() or glutSwapBuffers()) is a duplicate screen area with no pixels changed, so no tearline is visible. It's conceptually easy to understand once you have the "Eureka" moment.
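The core no-tearline condition can be sketched as a hypothetical check (names illustrative; real code would also handle refresh-cycle wraparound per Best Practice 9):

```c
#include <stdbool.h>

/* Sketch of the "tearingless VSYNC OFF" condition: a frameslice can be
 * presented without a visible tearline when (a) the emulator raster has
 * finished rendering the slice, and (b) the real raster has not yet
 * scanned into it -- every pixel at and above the tearline is then a
 * duplicate of what is already on screen. */
static bool frameslice_safe_to_present(int emu_scanline,  /* emulator raster position */
                                       int real_scanline, /* predicted GPU scanout line */
                                       int slice_start,   /* first scanline of the slice */
                                       int slice_end)     /* one past the last scanline */
{
    bool slice_rendered   = emu_scanline >= slice_end;
    bool beam_above_slice = real_scanline < slice_start;
    return slice_rendered && beam_above_slice;
}
```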
There's already high speed video proof of sub-frame latencies (same-frame response) achieved with this technique, e.g. mid-screen input reads for bottom-of-screen reactions are possible, replicating the original machine's latency (to an error margin of one frameslice).
As you can see, the (intentionally-visible) rasters in the earlier videos are so stable that they fall within common jittermargin sizes (for intentionally-invisible tearlines). With this, you create a 16.7ms - 1.67ms = 15ms jitter margin. That means with 10 frameslices plus the refresh-cycle-wraparound jitter margin technique, your beamracing can go too fast or too slow within a much wider and much safer 15ms range. Today, Windows scheduling is sub-1ms and Pi scheduling is sub-4ms, so it's not a problem.
The necessary accuracy to do realworld beamracing happened 8-to-10 years ago already.
Yes, nobody really did it for emulators, because it took someone to apply all the techniques together: (1) understanding how to beamrace a GPU; (2) understanding the low-level black box of Present()-to-photons, at least down to the video-output-port signal level; (3) understanding the techniques that make it very forgiving; and (4) experience with 8-bit-era raster interrupts.
In tests, WinUAE beam racing actually worked on a year-2010 desktop with an older GPU, at lower frameslice granularities -- someone also posted screenshots of an older Intel 4000-series GPU laptop in the WinUAE beamracing thread. Zero artifacts, looked perfectly like VSYNC ON but virtually lagless (well -- one frameslice's worth of lag).
Your question is understandable, but the fantastic new knowledge we all now have compensates totally for it -- a desktop with a GeForce GTX Titan has about ~100x the accuracy margin needed for sub-refresh-latency frameslice beam racing.
So as a reminder, the accuracy necessary to pull off this technical feat already arrived 8-to-10 years ago, and the WinUAE emulator is successfully beamracing on an 8-year-old computer today in tests. I implore you to reread our research (especially the 18-point Best Practices), watch the videos, and view the links, to understand that it is actually quite forgiving thanks to the jittermargin technique.
(Bet you are surprised to learn that we are already so far past the rubicon necessary for this reliable accuracy, as long as the Best Practices are followed.)
Someone added $10, so I also added $10.
NOTE: I am currently dollar-matching donations (thru the $360 level) until end of September. Contribute to the pot: https://www.bountysource.com/issues/60853960-lagless-vsync-support-add-beam-racing-to-retroarch
Twinphalex added $30, so I also added $30.
Wow! bparker06 just generously donated $650 to turn this into an $850 bounty.
(bparker06, if you're reading this, reach out to me, will you? -- mark@blurbusters.com -- And to reconfirm you were previously aware that I'm currently dollar-matching only up to the BountySource $360 commitment -- Thanks!)
I've topped up and have now donated $360 total -- the dollar-for-dollar matching limit I promised earlier.
This is now the 32nd-biggest pot on BountySource.com at the moment!
So..... since this is getting to be serious territory, I might as well post multiple references that may be of interest, to help jumpstart any developers who may want to begin working on this:
Videos of GroovyMAME lagless VSYNC experiment by Calamity: https://forums.blurbusters.com/viewtopic.php?f=22&t=3972&start=10#p31851 (You can see the color filters added in debug mode, to highlight separate frameslices)
Screenshots of WinUAE lagless VSYNC running on a laptop with Intel GPU: http://eab.abime.net/showthread.php?p=1231359#post1231359 (OK: approx 1/6th frame lag, due to coarse 6 frameslice granularity.)
Corresponding (older) Blur Busters Forums thread: https://forums.blurbusters.com/viewtopic.php?f=22&t=3972
Corresponding LibRetro lag investigation thread (Beginning at post #628 onwards): https://forums.libretro.com/t/an-input-lag-investigation/4407/628
The color filtered frame slice debug mode (found in WinUAE, plus the GroovyMAME patch) is a good validation method of realtimeness -- visually seeing how close your realraster is to emuraster -- I recommend adding this debugging technique to the RetroArch beam racing module to assist in debugging beam racing.
As a reminder, our research has successfully simplified the minimum system requirements for cross-platform beam racing to just simply the following three items:
If you can meet (1) and (2) and (3) then no raster register is required. VSYNC OFF tearlines are just rasters, and can be "reliably-enough" controlled (when following 18-point Best Practices list above) simply as precision timed Present() or glutSwapBuffers() as precision-time-offsets from a VSYNC timestamp, corresponding to predicted scanout position.
While mentioned earlier, I'll resummarize compactly: These "VSYNC timestamp" APIs have suitable accuracies for the "raster-register-less" cross platform beam racing technique. Make sure to filter any timestamp errors and freezes (missed vsyncs) -- see Best Practices above.
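A minimal sketch of the raster-register-less prediction this describes -- estimating the current scanout line purely from a VSYNC timestamp and the refresh period. All names are illustrative; a real implementation should also filter timestamp errors and missed vsyncs, as the Best Practices advise:

```c
/* Illustrative only: estimate the current scanout position from the
 * most recent VSYNC timestamp.  A precision-timed Present() aimed at a
 * tearline is then just a wait until this prediction reaches the
 * desired scanline offset. */
static int predict_scanline(double now_sec,
                            double last_vsync_sec,
                            double refresh_period_sec,
                            int total_scanlines)   /* visible + VBI lines */
{
    double phase = (now_sec - last_vsync_sec) / refresh_period_sec;
    phase -= (double)(int)phase;                /* wrap into [0,1) */
    return (int)(phase * (double)total_scanlines);
}
```

For example, halfway through a 60Hz refresh with 1125 total scanlines (a common 1080p vertical total), the prediction lands near scanline 562.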
If you 100% strictly focus on the VSYNC timestamp technique, these may be among the only #ifdefs that you need.
As tearlines are just rasters, it's good to know all the relevant APIs if need be. These are optional, but may serve as useful fallbacks, if need be (be sure to read Best Practices, e.g. expensiveness of API calls that we've discovered, and some mitigation techniques that we've discovered).
Note: it is necessary to use VSYNC OFF for beam raced frameslicing. All known platforms (PC, Mac, Linux, Android) have methods of accessing VSYNC OFF. On some platforms, this may interfere with your ability to get VSYNC timestamps. As a workaround, you may have to instead poll the "In VBlank" flag (or busyloop in a separate thread waiting for the bit-state change, and timestamp immediately afterward) in order to get VSYNC timestamps while in VSYNC OFF mode. Here are alternative APIs that help you work around this, if absolutely necessary.
Windows D3DKMTGetScanLine() -- the Windows equivalent of a raster register. It can also be used to poll the "In VBLANK" status. However, we found this unnecessary, due to the existence of D3DKMTWaitForVBlank(), which still works in VSYNC OFF mode and may reduce the need to tie up a CPU core in precision busylooping.
Linux get_scanout_position() -- the Linux equivalent of a raster register. It can also be used to poll the "In VBLANK" status. drm_calc_vbltimestamp_from_scanoutpos() -- the Linux Direct Rendering Manager (DRM) function for calculating VSYNC timestamps from scanout position. The same approach can be quite handy on Windows too, as a low-CPU (no busylooping needed) method of generating VSYNC timestamps.
Currently, it seems implementations of get_vblank_timestamp() tend to call drm_calc_vbltimestamp_from_scanoutpos() so you may not need to do this. However, this additional information is provided to help speed up your research when developing for this bounty.
Any platform or module with an already-implemented beam racing technique is allowed to be rolled into the bounty, as long as it helps meet the bounty conditions (e.g. ported to the retro_set_raster_poll technique).
Bounty may be split between multiple programmers (if all stakeholders mutually agree). I understand not everyone can program all platforms.
As you remember, retro_set_raster_poll is supposed to be called every time after an emulator module plots a scanline to its internal framebuffer.
As written earlier, retro_set_raster_poll (if added) simply allows the central RetroArch screen rendering code to optionally take an "early peek" at the incompletely-rendered offscreen emulator buffer, every time the emulator module plots a new scanline.
That allows the central code to beam-race scanlines onto the screen (whether tightly or loosely, coarsely or with ultra-low-latency realtimeness, etc.). It is not limited to frameslice beamracing.
By centralizing it into a generic API, the central code (future implementations) can decide how it wants to realtime-stream scanlines onto the screen (bypassing pre-framebuffering). This maximizes future flexibility.
The bounty doesn't even ask you to implement all of this. Just one technique on each of the three platforms (one for Windows, one for Mac, one for Linux). The API simply provides flexibility to add other beamracing workflows later. VSYNC OFF frameslicing (essentially tearingless VSYNC OFF / lagless VSYNC ON) is the easiest way to achieve this.
Each approach has its pros/cons. Some are very forgiving, some are very platform-specific, some are ultra-low-lag, and some work on really old machines. I simply suggest VSYNC OFF frameslice beamracing because it can be implemented in exactly the same way on Windows+Mac+Linux, so it is the easiest. But as you can see, there's a lot of flexibility.
The proposed retro_set_raster_poll API call would be called at roughly the horizontal scanrate (excluding VBI scanlines). Which means for 480p, that API call would be called almost ~31,500 times per second. Or 240p that API would be called almost ~15000 times per second.
While high, this isn't a problem, because most of those calls would return immediately for coarse frameslicing. For example, WinUAE defaults to 10 frameslices per refresh cycle, i.e. 600 frameslices per second. So retro_set_raster_poll would simply do nothing (return immediately) until 1/10th of a screen height's worth of emulator scanlines has built up -- and only then do its work.
So out of all those tens of thousands of retro_set_raster_poll calls, only 600 would be 'expensive' if RetroArch is globally configured to be limited to 10-frameslice-per-refresh beam racing (1/10th screen lag due to beam chase distance between emuraster + realraster). The rest of the calls would simply be immediate returns (e.g. not a framesliceful built up yet).
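A hedged sketch of that early-return behavior -- retro_set_raster_poll is the proposed (not yet existing) API name from this thread, and the body shown is purely illustrative:

```c
/* Illustrative sketch only: retro_set_raster_poll is the proposed API
 * name from this thread.  A real implementation would live in central
 * RetroArch code and, on the expensive path, beamrace a Present() of
 * the newly completed frameslice. */
static int g_scanlines_pending = 0;
static int g_slices_presented  = 0;

static void retro_set_raster_poll(int slice_height)  /* e.g. 1080/10 = 108 */
{
    if (++g_scanlines_pending < slice_height)
        return;                  /* cheap early-out: no full frameslice yet */

    g_scanlines_pending = 0;
    g_slices_presented++;        /* expensive path: present the new slice */
}

/* Simulate one refresh cycle's worth of per-scanline calls. */
static int slices_presented_per_refresh(int total_scanlines, int slice_height)
{
    g_scanlines_pending = 0;
    g_slices_presented  = 0;
    for (int line = 0; line < total_scanlines; line++)
        retro_set_raster_poll(slice_height);
    return g_slices_presented;
}
```

With 1080 emulator scanlines and 108-line slices, only 10 of the 1080 calls per refresh do real work; the rest return immediately.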
The complexity is centralized.
The emulator module is simply modified (hopefully as little as a 10-line modification for the easiest emulator modules, such as NES) to call retro_set_raster_poll on all platforms. The beam racing complexity is all hidden centrally.
Nearly all 8-bit and 16-bit emulator modules already beamrace into their own internal framebuffers. Those are the 'easy' ones to add the retro_set_raster_poll API. So those would be easy. The bounty only needs 2 emulators to be implemented.
The central code would decide how to beam race (frameslice beam racing would be the most crossplatform method, but it doesn't have to be the only one). Platform doesn't support it yet? Automatically disable beamracing (return immediately from retro_set_raster_poll). Screen rotation doesn't match emulator scan direction? Ditto -- return immediately. Whatever code a platform has implemented for beam racing synchronization (emuraster to realraster) can be hidden centrally.
That's part of what the bounty also pays for: adding the generic crossplatform API call so the rest of us can have fun adding the various kinds of beam-racing possibilities that are appropriate for specific platforms. Obviously, the initial three platforms need to be supported (one for Windows, one for Mac, and one for Linux), but the fact that an API gets added means additional platforms can be supported later.
The emulators aren't responsible for handling that complexity at all -- from a quick glance, it is only a ~10 line change to NES, for example. No #ifdefs needed in emulator modules! Instead, most of the beam racing sync complexity is centralized.
Would the behavior need to be adjusted for emulators that output interlaced content momentarily?
The SNES can switch from interlaced output to progressive during a vblank. Both NTSC and PAL are actually interlaced signals, and the console is usually just rendering even lines (or is it odd lines? I don't recall now) using a technique commonly referred to as double-strike.
I don't see why that would matter, the only requirement here is that the core can be run on a per scanline basis, and that the vertical refresh rate is constant and close to the monitor rate.
I'm still wrapping my head around it, but yeah, now I see it. Interlaced content would be handled internally by the emulator as it already does.
No, behaviour doesn't need to be adjusted for interlaced.
Interlaced is still 60 temporal images per second, basically half-fields spaced 1/60 sec apart.
Conceptually, it's like frames that contain only odd scanlines, then a frame containing only even scanlines.
Conceptually, you can think of interlaced 480i as the following:
T+0/60sec = the 240 odd scanlines
T+1/60sec = the 240 even scanlines
T+2/60sec = the 240 odd scanlines
T+3/60sec = the 240 even scanlines
T+4/60sec = the 240 odd scanlines
T+5/60sec = the 240 even scanlines
Etc.
Since interlaced was designed in the analog era where scanlines can be arbitrarily vertically positioned anywhere on a CRT tube -- 8-bit-era computer/console makers found a creative way to simply overlap the even/odd scanlines instead of offset them (between each other) -- via a minor TV signal timing modification -- creating a 240p mode out of 480i. But 240p and 480i still contains exactly 60 temporal images of 240 scanlines apiece, regardless.
Note: With VBI, it is sometimes called "525i" instead of "480i"
Terminologically, 480i was often called "30 frames per second" but NTSC/PAL temporal resolution was always permanently 60 fullscreen's worth of scanouts per second, regardless of interlaced or progressive. "Frame" terminology is when one cycle of full (static-image) resolution is built up. However, motion resolution was always 60, since you can display a completely different image in the second field of 480i -- and Sports/Soap operas always did that (60 temporal images per second since ~1930s).
Deinterlacers may use historical information (the past few fields) to "enhance" the current field (i.e. converting 480i into 480p). Often, "bob" deinterlacing is beam-racing friendly. Advanced deinterlacing algorithms may display an input-lagged result (e.g. a lookforward deinterlacer that displays the intermediate middle result combined from a 3-frame or 5-frame history, adding 1 or 2 frames of lag). Beam racing this will still have a lagged result, like any good deinterlacer may have, albeit with slightly less lag (up to 1 frame less).
Now, if there's no deinterlacing done (e.g. original interlacing preserved to output) then deinterlacing lag (for lookforward+lookbackward deinterlacers) isn't applicable here.
Emulators typically handle 480i as 60 framebuffers per second. That's the proper way to do it anyway -- whether you do simple bob deinterlacing or any advanced deinterlace algorithm.
I used to work in the home theater industry, being the moderator of the AVSFORUM Home Theater Computers forums, and have worked with vendors (including working for RUNCO as a consultant) on their video processor & scaler products. So I understand my "i" and "p" stuff...
If all these concepts are too complicated, just add it as an additional condition for automatically disabling beam racing ("if in interlaced mode instead of progressive mode, disable the laggy deinterlacer or disable beam racing").
Most retro consoles used 240p instead of 480i. Even NTSC 480i (real interlacing) is often handled as 60 framebuffers per second in an emulator, even if some sources used to call it "480i/30" (two temporal fields per frame, offset 1/60sec apart).
Note: One can seamlessly enter/exit beamracing on the fly (in real time). There might be one tiny microstutter during the transition (a 1/60sec lag increase/decrease), but that's an acceptable penalty during, say, a screen rotation or a video mode change (most screens take time to catch up on mode changes anyway). This is accomplished by using one VBI-synchronized full-buffer Present() per refresh (software-based VBI synchronization) instead of mid-frame Present()s (true beam racing) -- e.g. during screen rotation when scanout directions diverge (real-world vs. emu scanout). It could also cover entering/exiting interlaced mode in the SNES module, if SNES is chosen as one of the first two modules to support beam racing as part of the bounty requirements. Remember, you only need to support two emulator modules to claim the bounty. If you choose an SNES module, it would still count towards the bounty even if beamracing is automatically disabled during interlaced mode (if that is too complex to wrap your head around).
Formerly someone (Burnsedia) started working on this BountySource issue until they realized this was a C/C++ project. I'm updating the original post to be clear that this is a C/C++ skilled project.
@Burnsedia Your past track record on BountySource came to my attention: you marked 5 bounties as "solving", yet all of them are still open. Since I expect you to have a solid understanding of C and the required knowledge of how graphics APIs work internally, could you please elaborate on how you would implement this feature? If you can't answer this, I will need you to refrain from taking on our bounties, as I fear you could lock up high-value bounties for no reason -- effectively stalling progress on this and other bounties.
Has anyone tried this on Nvidia 700 or 900 series cards? I have had major issues with these cards and inconsistent timing of the frame-buffer. The time at which the frame-buffer is actually sampled can vary by as much as half a frame, making racing the beam completely impossible.
The problem stems from an over-sized video output buffer and also memory compression of some kind. As soon as the active scan starts, the output buffer is filled at an unlimited rate (really fast); this causes the read position in the frame-buffer to pull way ahead of the real beam position. The output buffer seems to store compressed pixels: for a screen of mostly solid color, about half a frame can fit in the output buffer; for a screen of incompressible noise, only a small number of lines fit, which therefore gives much more normal timing.
This issue has plagued my mind for several years (I gave my 960 away because it bothered me so much), but I have yet to see any other mention of this issue. I only post this here now because it's relevant.
Bountysource increased to $1142.
Someone should close this issue and apologize to backers.
@casdevel I'm sorry, why? What we need instead is for a bounty hunter to take this up, or for more people to fund it.
You did good work in the past, so I legitimately am dumbfounded by this response.
What we need instead is some optimism here, and somebody who would like to see this happen to bring it to fruition.
I'm just sharing my opinion and trying to be realistic; the word "lag-less" shouldn't exist in a programmer's vocabulary.
OK, I'll rewrite it to 'beam racing' then. Or perhaps 'scanline sync' is a more appropriate term.
For reference, Riva Tuner (RTSS) has an implementation similar to this, it's called Scanline Sync.
Renamed. Here is more info about Scanline Sync - https://forums.blurbusters.com/viewtopic.php?f=10&t=4916
this was a fascinating read. is the title still accurate?
Yes, it can work simultaneously with RunAhead (if need be, though not necessary). Simply beam race the final/visible frame.
Simply doing that negates the advantages of raster syncing, though.
To properly combine them, you would do this (ignoring the v-blanking period for simplification):
1. Run the emulator for f scanlines; save state
2. Emulate N frames, and beamsync the last f scanlines
3. Load state
where N is the runahead value and f is the number of scanlines per frameslice.
This effectively multiplies the amount of work you have to do per frame by the number of visible frameslices per emulated v-blanking interval. So if the screen is divided into four frameslices, you do four times the work you would do with runahead alone.
Essentially, the runahead algorithm remains unchanged, but we operate on scan lines as opposed to discrete frames. You could even make it possible to specify runahead values in scan lines.
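As a sanity check on the workload claim, here is a small counter simulation (purely illustrative; all names are hypothetical, not RetroArch API) comparing scanlines emulated per displayed frame for plain runahead versus the per-frameslice variant described above:

```c
#include <assert.h>

/* Hypothetical sketch: count emulation work (in scanlines) per displayed
 * frame, for plain runahead vs. per-frameslice runahead as described above.
 * SCANLINES, N (runahead depth) and S (frameslice count) are illustrative. */
enum { SCANLINES = 240 };

/* Plain runahead: emulate the current frame, plus N hidden frames ahead. */
static long runahead_work(int n)
{
    return (long)(1 + n) * SCANLINES;
}

/* Per-frameslice runahead: for each of S slices, run f scanlines,
 * save state, emulate N hidden frames (beamsyncing the last slice),
 * then load state -- following the quoted algorithm. */
static long sliced_runahead_work(int n, int s)
{
    int f = SCANLINES / s;           /* scanlines per frameslice */
    long work = 0;
    for (int slice = 0; slice < s; slice++) {
        work += f;                   /* advance the real machine one slice */
        /* save state */
        work += (long)n * SCANLINES; /* hidden lookahead frames */
        /* load state */
    }
    return work;
}
```

With N = 2 and S = 4, the sliced variant costs 240 + 4*2*240 = 2160 scanlines versus 720 for runahead alone; the ratio approaches S (the frameslice count) as N grows, matching the "four times the work" estimate above.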
this was a fascinating read. is the title still accurate?
Title is no longer accurate, I'll edit. Someone retracted their share of BountySource, alas. My donations/contributions remain valid though. Please refer to the BountySource link for the most updated already-donated pot.
Once someone "Starts" the project, it locks the bounty until completion (if within reasonable time).
This effectively multiplies the amount of work you have to do per frame by the number of visible frameslices per emulated v-blanking interval. So if the screen is divided into four frameslices, you do four times the work you would do with runahead alone.
Very interesting concept to do sub-frame runahead! Yes, this would potentially work, but a bit complicated considering beam racing is a bit complicated by itself for some.
However, practically:
-- It'd be simpler to keep runahead at frame intervals
-- It'd be simpler to keep beamracing as its own sync mode
-- But they should work in a way that allows each to layer on top of the other (getting even further reduced latency by combining runahead+beamracing, but only up to subframe reductions)
One big raison d'etre of beamraced sync is that it's very scalable from low-end to high-end. It's simply a matter of upgrading/downgrading your raster timing precision and frameslice count. Even the lowest-end Android GPUs can handle beamraced sync when the parameters are configured accordingly.
I'm just sharing my opinion and trying to be realistic; the word "lag-less" shouldn't exist in a programmer's vocabulary.
As Einstein says, "It's all relative". Lagless is relative to original hardware as a differential. As in emulators that add no extra lag relative to original machine.
Assuming you do 15,625 frameslices per second (beamracing at PAL horizontal scanrate; NTSC is ~15,734), using single-scanline frameslices output directly to a front buffer (bypassing the need for VSYNC OFF), even going per-pixel within a scanline (which is possible with some front-buffer architectures), with no jitter safety margin -- the difference between emulator and original machine becomes virtually zero. Software-based emulators achieving FPGA latency symmetry. In theory, anyway.
I donated several hundred dollars in this bounty prize that still remains open. Googling "lagless VSYNC" brings up this item. By renaming it, you affect something that I contributed money to -- so I have done a compromise-renameback to preserve a reasonable modicum of Google SEO...
Even though GPUs cannot achieve that many frameslices per second, or synchronous updates at such precision, you'll be able to get closer and closer to the original machine. Some GPUs manage to pull off near per-line accuracy, as already seen in the videos.
Your comment came more than a year after this was already created. There is already Google SEO, and there are media articles that use the "Lagless VSYNC" nomenclature. Although the prevailing terminology is now "Beam Raced Sync" or "Scanline Sync", they are synonyms of "Lagless VSYNC" as the mathematically lowest-possible-lag-penalty Direct3D/OpenGL workflow for raster-based emulators that use direct emulation (no runahead).
Blur Busters was the one who convinced Guru3D to add the Scanline Sync mode to RTSS, and it has become a popular alternate sync mode for lowering input lag, when used as the right tool for the right job. Alas, that is framebuffer-based (like 1-frameslice beamracing), so it doesn't reduce lag to subframe levels.
For emulator frameslice beamracing -- achieving the zero differential feels impossible, like trying to reach the speed of light, but one can get closer and closer [5ms, 1ms, 0.5ms, 0.1ms...] to latency symmetry with original machines. Frame-granularity emulation can never do that.
While the terminology "Lagless VSYNC" may now be deprecated in favour of more popular terms like "Scanline Sync" -- it does require qualification (approaching zero relative to the original, in an "Einstein is relative" manner). It is the only mathematical path that can approach zero differential between original and emulator.
Beamraced sync is the most mathematically perfect way for traditional software-based emulators to achieve latency symmetry with original machines via single-pass execution [i.e. no runahead].
Oh, and by the way, use the power management API to turn off power management during beam racing. Power management interferes a lot -- a LOT. I find beam racing improves a lot if power management is turned off. It's also possible to self-monitor when beam racing is erratic and take corrective actions (e.g. notify the user of poor-quality sync between emuraster / realraster).
Now, that doesn't mean always disable power management. Just provide it as an option for high-quality beam racing. For the low-power route, if you use low-granularity frameslices and have 1ms-accurate timers, even 4-frameslice or 6-frameslice beamracing is still precise enough with low-quality 1ms timers (a 60Hz refresh cycle is 16.7ms, and a single frameslice is a quarter of that -- about 4ms -- doable with low-precision 1ms timers).
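To see why 1ms timers suffice at low frameslice counts, here is a hypothetical pacing helper (illustrative names, not from any emulator) that maps time elapsed within a refresh cycle to the frameslice that is due next:

```c
#include <assert.h>

/* Hypothetical sketch: with a 60Hz refresh (~16.667ms) divided into S
 * frameslices, one slice lasts 16667/S microseconds -- about 4.2ms at
 * S=4, so even a 1ms-granularity timer paces slices accurately enough. */
#define REFRESH_US 16667    /* one 60Hz refresh cycle, in microseconds */

/* Which frameslice should be presented, given microseconds elapsed
 * since the start of the current refresh cycle? */
static int target_slice(long elapsed_us, int slice_count)
{
    long slice_us = REFRESH_US / slice_count;
    int slice = (int)(elapsed_us / slice_us);
    return slice < slice_count ? slice : slice_count - 1; /* clamp at end */
}
```

At S=4 a slice spans ~4166us, so a worst-case 1000us timer error is only about a quarter-slice of jitter, well within a sensible jitter safety margin.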
Beam racing can be reliable even on a Raspberry Pi when tweaked properly. WinUAE beamracing works reasonably well on very underpowered laptops with 10-year-old Intel GPUs when configured properly. It uses less CPU than runahead, so beamraced sync is a lower-processing-cost lag-lowering technique if you find an equilibrium (e.g. a low frameslice count such as 4 to 10). Trying to do frameslice-level runahead would more than eliminate this capability.
Getting ultralow lag out of embedded GPUs/CPUs, or out of processor-heavy emulator modules, is easier with beamraced sync alone (at low frameslice granularity, or by enabling front-buffer mode).
Properly done, beamraced sync adds only about 1.1x to 1.5x processing load -- basically needing only 10% to 50% more CPU to beamrace (you can even turn off busywaits and simply use millisecond timers as described above). That's less than even basic runahead, assuming you properly adjust to an equilibrium. The precision can be selectable: Generic 1ms Timer Event | Precise Timer Event | Ultra Precision Busywaits. Even 4-frameslice coarse-granularity beamraced sync, easily achieved using plain old-fashioned 1ms timers, is still subframe latency achieved without runahead. To help out laptop memory (shared memory), you can use partial-framebuffer blit flags, so instead of hammering the RAM, you blit only the frameslice region.
Low-hanging-fruit tricks can turn beamraced sync into a low-overhead operation. Blitting 4 fractions of a refresh cycle transfers the same number of bytes as one full framebuffer; many graphics drivers let you do that too -- such flags already exist in Windows drivers. This keeps things gentle on shared RAM, to the point where even an old Intel GPU was able to run 10-frameslice WinUAE (600 frameslices per second), with enough processing room left to add a CRT filter.
There are flags to make Present() perform only a partial memory transfer between software and display RAM -- these flags already exist. That works in our favour for transferring beamraced frameslices on slow shared-memory systems. So 600fps is really 600 tenth-frames per second, with the same number of bytes transferred as 60 full frames, you know!
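The byte math behind the shared-memory argument, as a tiny illustrative helper (hypothetical names; assumes the screen height divides evenly into slices):

```c
#include <assert.h>

/* Bytes pushed per second: presenting S partial frameslices per refresh
 * transfers the same total bytes as one full framebuffer per refresh,
 * provided each Present() only uploads its own slice's region. */
static long long bytes_per_second(int width, int height, int bytes_per_px,
                                  int hz, int slices)
{
    long long slice_bytes =
        (long long)width * (height / slices) * bytes_per_px;
    return slice_bytes * slices * hz;   /* == one full frame per refresh */
}
```

So a 10-slice 60Hz setup moves exactly the same bandwidth as plain 60fps full-frame presentation, despite 600 Present() calls per second.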
In my tests with low-frameslice-granularity emuraster-realraster sync, it definitely uses less overhead than runahead, so this could be an excellent lag eliminator for Android boxes and low-cost emulators, should someone take on this project.
Runahead and beamraced sync both have their respective pros/cons, but sub-2x overhead isn't one of runahead's advantages, which has limited the ability to use runahead with emulator modules that use 60%-70% CPU (i.e. on lower-end CPUs like the Raspberry Pi's, or with a demanding cycle-exact emulator). In that situation you need beamraced sync as a practical purist method of reducing input latency with less CPU overhead.
GPU power management tip: Make sure you don't have more than approximately 0.5ms to 1ms of idle time between frameslices. Emulators tend to aggressively trigger GPU power management, because GPUs love to sleep for many milliseconds (>8ms) at random times whenever they detect an idle moment. This totally messes up beamraced sync. Setting "Performance Mode" completely solves this, but "Balanced Mode" is sometimes enough /provided/ you use a high enough frameslice count.

Basically, adjust your frameslice count to a goldilocks count. Using too low a frameslice count can cause power management to kick in automatically (adding many milliseconds of mis-sync between emuraster and realraster). This unexpected behaviour creates an addition to best practices: Don't use too low a frameslice count if your hardware is capable enough. (Even increasing the frameslice count from 4 to 6 to 8 on a 10-year-old GPU sometimes improved things!)

You could even make the frameslice count dynamic, automatically adapting to your hardware capabilities: GPU/CPU % low -- automatically increase frameslices; CPU/GPU % high -- automatically decrease frameslices. Frameslices don't even have to be exactly the same size; frameslice thickness can even vary throughout the refresh cycle. The chasebehind would simply be a scanline-count offset, roughly optimized to prevent artifacts from appearing (a slider can easily adjust this vertical chase distance between emuraster and realraster -- you'd simply eyeball for artifacts during horizontal scrollers, to calibrate for reasonably artifact-free low lag). Anyway, this is just an idea. For now, keep things simple, but just saying...
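The dynamic-frameslice idea could look something like this minimal sketch (the thresholds and names are invented purely for illustration):

```c
#include <assert.h>

/* Hypothetical adaptive frameslice count, per the idea above:
 * raise the slice count when CPU/GPU load is low (finer granularity, and
 * enough GPU activity to dodge power-management sleeps), lower it when
 * the system is struggling. Thresholds are made-up illustrative values. */
enum { MIN_SLICES = 2, MAX_SLICES = 32 };

static int adapt_slice_count(int slices, int load_percent)
{
    if (load_percent < 50 && slices < MAX_SLICES)
        return slices * 2;      /* plenty of headroom: finer slices */
    if (load_percent > 85 && slices > MIN_SLICES)
        return slices / 2;      /* struggling: coarser slices */
    return slices;              /* goldilocks zone: leave it alone */
}
```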
Benchmarking the efficiency of beamraced sync is extremely hard, because a lot of the horsepower is consumed by the memory transfers of high-rate Present() calls as well as the busywaits. But all of that is optional (for mobile, Raspberry Pi, and high-CPU% emulator modules).
Note about power management modes: Battery Saver mode will kill beam racing precision; Balanced mode will work on some hardware; use High Performance mode (CPU 100%, GPU 100%) on high-end GPUs. Mobile processors can get latency symmetry to originals/FPGA engines to within approximately ~3.2ms error (10 frameslices at a 2-frameslice jitter margin), while high-end desktop GPUs on an i7 can achieve latency symmetry to originals/FPGA engines to within less than 1ms error.
Once configured to these parameters, I've seen CPU overhead of only approximately 1.1x.
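The ~3.2ms and sub-1ms figures follow directly from slice count and jitter margin. A hedged back-of-envelope helper (illustrative arithmetic, not a measurement):

```c
#include <assert.h>

/* Rough worst-case beamrace error: the jitter margin expressed as a
 * fraction of one refresh cycle. E.g. a 2-frameslice margin at a
 * 10-frameslice count on 60Hz is (2/10) * 16.7ms = ~3.3ms, in line with
 * the ~3.2ms quoted above; 40 slices at a 2-slice margin gives <1ms. */
static double beamrace_error_ms(double refresh_ms, int slice_count,
                                int jitter_margin_slices)
{
    return refresh_ms * jitter_margin_slices / slice_count;
}
```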
Aside: Hmmmmm (sudden lightbulb idea). Although this wasn't why I created this item, I wonder if a portable emulator hardware developer with a profit motive would be willing to take on this project, to gain near-FPGA-league latency in RetroArch or RetroPie with heavier emulators on cheaper CPU/GPU hardware. If they knew, they'd dump hundreds on this github bounty in a "Take My Money" rush. If only the emulator hardware manufacturers knew how valuable this github item was! Running more powerful emulators at ultralow lag on cheap-BOM devices that can sell at bigger profit margins. Maybe some of you want to ask them to contribute to this BountySource? ;) ...
I continue to be impressed at how scalable up/down beamraced sync is. It's surprisingly horsepower-heavy at some settings, but surprisingly horsepower-miserly when optimized with some compromise settings. Brilliant as Runahead is, it can never be as low-overhead as efficiency-optimized beamraced sync!
Update: This issue is owned by me.
The ghost account is because when I switched from personal account to a corporate account, this original account became 'ghost'. Ugh -- an unintended consequence.
Hi there, I'm still trying to find willing bounty hunters able to take on this task. It's still on my radar and I'm still trying to sweeten the pot for whichever would-be developer would like to take on the challenge.
I noticed that a few did withdraw their share of bounty, but my bounty certainly still remains --
Update, @TomHarte added beam racing to the Mac branch of his crossplatform CLK emulator https://arstechnica.com/civis/viewtopic.php?p=38773471#p38773471
He uses the technique of a time offset from CVDisplayLink vertical timing events to guesstimate the raster position, with an accuracy margin sufficient for 600-frameslice beam racing (sub-refresh-cycle latency) on the majority of Macs he has tried it on.
I'm willing to propose simplifying the requirements (removing Mac and Linux for now) if nobody else contributes to the BountySource. However, it should be architected in a way that lets future Linux and Mac support be easily added.
TomHarte (Thomas Harte) successfully added beam racing to the Mac branch of his cross-platform emulator CLK, so there's already some sample code available.
I am cross posting some messages from the Ars comments section:
Why beam raced sync latency is much more accurate than RunAhead latency: https://arstechnica.com/civis/viewtopic.php?p=38773166#p38773166
And recommended simple method of debugging beam racing (striped strips on border): https://arstechnica.com/civis/viewtopic.php?p=38773501#p38773501
The benefit of putting this type of work into more emulators is FPGA-league latency parity.
Even RetroArch's RunAhead (see the ArsTechnica article), which is amazing, can't achieve FPGA-faithful latency, due to things like latency nonlinearity. With RunAhead, the emulated scan rate is always faster than realtime, generating rasters faster than realtime into offscreen frame buffers. This distorts the timing between mid-screen input reads (sawtooth latency effects at 60 sawtooths per second, distorting latency extremes to different points of the 1/60sec = 16.7ms time window), even if you manage to match average latency to the original machine or an FPGA machine. Also, Game A may input-read at the end of VBI, Game B at the beginning of VBI, and Game C at raster #002 or #199 or whatever. So input lag differentials between a RunAhead emulator and the original machine (at the same RunAhead setting) will vary within a window of [0..16.7ms] because of the latency distortions within the RunAhead algorithm. Compare this 16.7ms of lag nonlinearity to the 1ms lag-behavior symmetry achieved by WinUAE's GPU-beamraced sync.
With RunAhead and a photodiode oscilloscope, one can attempt to calibrate the lag of one game to match the original machine. But a different game will then diverge in lag from its original machine as a result of that calibration. So the worst-case lag-faithfulness difference is 16.7ms between Game X (vs original machine) and Game Y (vs original machine) at identical RunAhead settings. RunAhead can't create universal latency faithfulness. It's a stunning emulator innovation, but latency purists know it does not duplicate FPGA latency behaviors.
Thus, while RunAhead is an amazing invention, it can never replicate "original input latency" or replicate "FPGA league latency" in a software emulator as well as beam raced sync algorithms (emuraster-realraster synchronization).
So there are now PC emulators and Mac emulators using beamraced sync.
Sorry, to be fully clear: I've implemented all the parts of raster racing, but not yet flipped the switch for a combination of factors. But, of specific interest:
I use dispatch_source_set_timer for timing, feeding a dedicated dispatch queue. It not only takes a nanosecond-precision period but actually seems to do a pretty good job of honouring it, at least if comparing against std::chrono::high_resolution_clock is valid.
CVDisplayLink provides retrace notifications, naturally. It provides a retrace period and a frequency; the only thing I initially got wrong was not making sure to create a new link each time my application moves to a new display. I have both my laptop's built-in display and an external monitor, with different display rates; using CVDisplayLinkCreateWithActiveCGDisplays to create "a display link capable of being used with all active displays" gave me a synthetic timer not actually tied to either display's retraces.
The process I settled on was just setting the dispatch source to a moderately high rate; each timer window is treated as a single discrete time step, except for those in which a display-link callback has fallen. Those are split into two parts: before the callback and after.
If the display's rate and emulated machine's rate are sufficiently compatible then a phase-locked loop attempts to pull the emulated machine's vertical sync into phase with the host machine's. That's a permanent, ongoing process that occurs purely through observation of the video signal because several of the machines I attempt to emulate have variable, programmatic output rates.
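A phase-locked loop of this flavour can be sketched generically (this is an illustration, not CLK's actual code; all names are made up): each host retrace, measure the phase error between emulated and host vblank, and nudge the emulated phase by a fraction of it.

```c
#include <assert.h>
#include <math.h>

/* Generic PLL sketch (not CLK's implementation): phases are in refresh
 * cycles [0,1); each host retrace we nudge the emulated machine's phase
 * by gain * error, so the emulated vblank drifts into lockstep with the
 * host vblank purely through observation. */
static double pll_step(double emu_phase, double host_phase, double gain)
{
    double error = host_phase - emu_phase;
    /* wrap error into [-0.5, 0.5) so we always nudge the short way round */
    if (error >= 0.5)  error -= 1.0;
    if (error < -0.5)  error += 1.0;
    emu_phase += gain * error;
    if (emu_phase >= 1.0) emu_phase -= 1.0;   /* keep phase in [0,1) */
    if (emu_phase < 0.0)  emu_phase += 1.0;
    return emu_phase;
}

/* Iterate the loop, then report the remaining (wrapped) phase error. */
static double pll_settle(double emu_phase, double host_phase,
                         double gain, int steps)
{
    for (int i = 0; i < steps; i++)
        emu_phase = pll_step(emu_phase, host_phase, gain);
    double error = host_phase - emu_phase;
    if (error >= 0.5)  error -= 1.0;
    if (error < -0.5)  error += 1.0;
    return fabs(error);
}
```

A small gain converges smoothly without overshooting; in a real emulator the nudge would instead slightly stretch or shrink the emulated machine's clock rate (and audio resampling) rather than jump the phase.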
At present I have chickened out and I just do a buffer copy at the identified divide point in the relevant timer window. For fixed-precision beam racing I'd just need to switch to doing a copy at the end of every timer window and use a fixed offset so that my PLL drags the emulated machine into phase one timer window ahead.
The reasons I haven't done so yet are objectively insubstantial -- I certainly don't see any technical barriers on macOS.
UPDATE
I'll now permit this alternative feature addition to claim the BountySource pot, if a programmer thinks this method is easier. If other BountySource donators agree, we can expand the qualifying criteria to choose this alternative approach (~$500 BountySource claim at https://www.bountysource.com/issues/60853960-add-beam-racing-scanline-sync-to-retroarch-aka-lagless-vsync ...)
Beam Racing Concept: Temporally Emulate a CRT Electron Gun Too, Not Just Spatially.
(And optionally, simultaneously beam race it -- basically using brute Hz as a coarse emuraster=realraster sync method)
Long-term, I’d like to see some emulators start to consider temporally emulating an electron gun. The sheer brute-force of refresh cycles (240Hz, 360Hz) can be used to create a granular CRT electron gun emulation.
Also, I posted a suggestion about a future "software-based rolling scan" for 240Hz and 360Hz monitors at the GroovyMAME forum -- aka Temporal HLSL, where you use the brute refresh rate to emulate a CRT electron gun at sub-refresh levels -- but I should probably create a new forum thread for it. I also posted an issue on the MAME GitHub as well.
High refresh rate HDR displays are good for 60Hz emulation because of:
However, I think this should be opened as a new RetroArch issue too, as a long-term incubation. These are two separate issues (one or both of which I'm willing to finance the incubation of, as a fan of emulators).
Theoretically, it is easier to implement than beamraced VSYNC, since we only need to worry about the display at the full-refresh-cycle level (plain ordinary old-fashioned VSYNC). The refresh rate race to retina refresh rates is producing a boom of high-Hz monitors.
We’re looking forward to the upcoming DELL 360Hz IPS monitor (AW2521H without the F suffix), which will allow high-quality 6-segment rolling bar emulation of a CRT electron gun.
The same proposed "retro_set_raster_poll" could still be added to RetroArch to benefit this initiative too (not just beamraced VSYNC), since that's a universal API for futureproofed beamracing techniques, including this alternative "beam racing via brute Hz" approach.
Another bonus, software-based rolling-bar BFI doesn't need to care about display scanout direction, so it'd work on all display rotations.
If one is smart enough to architect it well, the frameslice beamracing workflow can be made futureproof.
Modular enough to output to either a hardware beamracer (e.g. VSYNC OFF frameslice beamracing the real raster) or software beamracer (e.g. multiple real refresh cycles per emulator refresh, including rolling-bar BFI).
That's why we need a retro_set_raster_poll API to be added to RetroArch, even pre-emptively (even before adding beam racing support). It opens up an entire universe of possible real-world beamracing temporal preservation methods.
I suspect we've strayed beyond where any input from me is helpful, and I don't think I've fully implemented what is being asked, but in CLK all machines output a linear video stream which a virtual CRT transposes into 2d by the usual means of sync separation plus PLLs, so that exits the machine as a list of 2d raster scans with 1d streams of data attached. The whole thing is rendered as geometry, being at least one quad per line of output.
The scans are fed out in real time, I haven't done anything to make frames atomic — it's only if the host and emulated refresh rates are compatible and the two syncs are nudged into phase that you get something like traditional indivisible frame output.
However I deviate from what mdrejhon describes in that I blend each set of new scans on top of the old, because I was primarily fixated on the 50Hz @ 60Hz scenario rather than e.g. 60Hz @ 240Hz. So I mentally phrased this as motion aliasing versus softness and went with softness. You can see some tearing in high-speed 50Hz games but it's less offensive than it might have been because it's not a hard tear, and also avoids the extreme latency that would otherwise accrue if I sometimes effectively held back 50Hz input for two complete 60Hz frames.
I agree it would be smarter when output rates are much higher than the host machine to skip the blending — especially if/when wider dynamic ranges are available, and subject to having enough buffered that you can pause emulation on a complete display, of course. I'm not sure we'd necessarily agree on blending as I currently use it, but that is what I have currently implemented.
@TomHarte Would you be interested in taking on the bounty? We could add $200 to it as an additional sweetener.
of course. I'm not sure we'd necessarily agree on blending as I currently use it, but that is what I have currently implemented.
Actually, this is still useful information!
In an ideal world, blending is not normally necessary for hardware-based VSYNC OFF frameslice beamracing if you use the jittermargin technique (i.e. plotting the emulator raster ahead of the real raster). The blending never becomes visible as long as the emulator raster stays ahead of the real raster. In other words, we're simply using high-framerate VSYNC OFF as a stand-in for a front buffer.
That said, blending might reduce the "tearing artifacts during computer slowdown" situation. If the real raster runs ahead of the emulator raster, you get VSYNC OFF tearing artifacts. So alphablending the boundaries of a refresh cycle's frameslices on top of the old refresh cycle can, in theory, soften/reduce these "beamrace failure" artifacts. If you're finding blending mandatory because artifacts continually happen, then that's also a problem to diagnose in its own right.
As a debug assister, blending should be an option that can be temporarily disabled, so that things can be tweaked until the beamraced sync is good enough that blending never helps. Non-blending thus becomes a good beamrace debugger: if artifacts disappear and blending-vs-nonblending looks identical, the emuraster is correctly staying permanently ahead of the realraster.
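The "emuraster stays ahead of realraster" condition reduces to a simple ordering test. A hedged sketch (hypothetical names; wraparound handling simplified):

```c
#include <assert.h>

/* Hypothetical jittermargin check: the emulated raster must stay at least
 * `margin` scanlines ahead of the real raster, without lapping it (the
 * half-refresh cap here is an arbitrary illustrative choice). If this
 * returns 0, the real raster has caught up and a tear line could become
 * visible -- time for an emergency catch-up (or blending). */
static int beamrace_safe(int emu_scanline, int real_scanline,
                         int margin, int total_scanlines)
{
    /* how far ahead the emulated raster is, modulo one refresh cycle */
    int lead = (emu_scanline - real_scanline + total_scanlines)
               % total_scanlines;
    return lead >= margin && lead < total_scanlines / 2;
}
```

With blending disabled, logging the frames where this check fails pinpoints exactly where the jitter margin needs widening.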
That said, blending would be absolutely mandatory if you use sheer-Hz VSYNC ON beamracing (e.g. rolling-scan beamracing of a 60Hz emulator onto a 240Hz LCD) -- basically presenting full frame buffers, which the display has to scan out anyway.
I think far ahead of most people, and just want to conceptualize a generic cross-platform beamraced frameslice delivery mechanism that supports both hardware-based beam racing (syncing emuraster to realraster) and software-based beam racing (software based rolling scan piggybacking on sheer Hz)
Heck, in fact, you could do both simultaneously, so the venn diagram of hardware-based beam racing can overlap software-based beam racing! In a realHz=emuHz situation, you'd flywheel the sync appropriately and it's hardware beamraced sync, and blending becomes redundant. In a "realHz far above emuHz" situation, you're purely software-beamracing to Hz granularity, using the alphablend to hide the scanout seams between refresh cycles of the destination Hz.
TomHarte, is that what you're actually already doing? If so, then that's freaking brilliant. It's like a rolling-scan emulator, except without the blackframe portion being added yet (not yet emulating a CRT electron gun phosphor fade).
So it's really just refinements + a merger of ideas. You might need to tweak it a bit to add jitter-margin awareness + an adjustable-height blend gradient (vertical dimension), so that it goes tearingless/blendless when the flywheel sync is within the beamraced jitter margin above the blend area, and shows less objectionable tearing artifacts when emuHz-vs-realHz does go out of sync.
I think I need to essentially rewrite the Beamraced VSYNC textbook, so that one beamrace approach could potentially cover all.
However I deviate from what mdrejhon describes in that I blend each set of new scans on top of the old, because I was primarily fixated on the 50Hz @ 60Hz scenario rather than e.g. 60Hz @ 240Hz
Actually, this isn't mutually exclusive! And a perfectly fantastic idea. See the "Concept of Hz-Agnostic Rolling Scan BFI" section at http://forum.arcadecontrols.com/index.php/topic,162926.0.html (scroll down to post #9) ... Except you're simply doing full persistence without a black fadebehind.
It would scale as little as you wish (like 50fps @ 60Hz blending) and scale as far as you wish (like 60fps at 1000Hz), without requiring the source Hz and destination Hz to be divisible.
Situation Example of 60Hz CRT emulation onto a 200Hz LCD
Emulator Refresh Cycle 1
....Real Refresh 1: full 60/200th-height bar (30% screen height), at 0%-30% vertical position
....Real Refresh 2: full 60/200th-height bar (30% screen height), at 30%-60% vertical position
....Real Refresh 3: full 60/200th-height bar (30% screen height), at 60%-90% vertical position
....Real Refresh 4: 1/3 of 60/200th-height bar (10% screen height), at 90%-100% vertical position

Emulator Refresh Cycle 2
....Real Refresh 5: 2/3 of 60/200th-height bar (20% screen height), at 0%-20% vertical position
....Real Refresh 6: full 60/200th-height bar (30% screen height), at 20%-50% vertical position
....Real Refresh 7: full 60/200th-height bar (30% screen height), at 50%-80% vertical position
....Real Refresh 8: 2/3 of 60/200th-height bar (20% screen height), at 80%-100% vertical position

Emulator Refresh Cycle 3
....Real Refresh 9: 1/3 of 60/200th-height bar (10% screen height), at 0%-10% vertical position
....Real Refresh 10: full 60/200th-height bar (30% screen height), at 10%-40% vertical position
....Real Refresh 11: full 60/200th-height bar (30% screen height), at 40%-70% vertical position
....Real Refresh 12: full 60/200th-height bar (30% screen height), at 70%-100% vertical position
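For what it's worth, the schedule above can be generated mechanically. Here is a small sketch (hypothetical names, integer percent arithmetic) that reproduces the listed bar positions by carrying the clipped remainder of each 30%-tall bar across the emulated-frame boundary into the next real refresh:

```c
#include <assert.h>

/* Reproduces the 60Hz-on-200Hz rolling-bar schedule listed above.
 * Each real refresh advances the bar by step = 60/200 = 30% of screen
 * height; when the bar would cross the emulated-frame boundary it is
 * clipped, and the remainder is painted at the top on the NEXT refresh. */
struct bar { int top, height; };   /* both in percent of screen height */

static struct bar rolling_bar(int refresh_index, int step_percent)
{
    struct bar b = { 0, 0 };
    int pos = 0, carry = 0;
    for (int i = 0; i <= refresh_index; i++) {
        if (carry > 0) {                       /* remainder from last time */
            b.top = 0;
            b.height = carry;
            pos = carry;
            carry = 0;
        } else if (pos + step_percent > 100) { /* clip at frame boundary */
            b.top = pos;
            b.height = 100 - pos;
            carry = step_percent - b.height;
            pos = 0;
        } else {                               /* ordinary full-height bar */
            b.top = pos;
            b.height = step_percent;
            pos += step_percent;
            if (pos == 100) pos = 0;
        }
    }
    return b;
}
```

Using 0-based refresh indices, index 3 yields the 10%-tall bar at 90% (the table's Real Refresh 4), and index 4 yields the 20%-tall bar at 0% (Real Refresh 5), matching the schedule above.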
In this situation, you don't have to worry about the hardware raster position. Just worry about ordinary 200fps VSYNC ON, simply making sure there are no framedrops, and emulate your beamrace into that, while appropriately alpha-blending the seams (as TomHarte says). At sheer refresh rates like 360Hz, you can permit a little extra frame queue depth (e.g. 2 frames) to de-stutter any erratic computer performance, knowing it's a mere 2/360sec latency penalty, since alphablending two frames is really fast (my GPU can do it about 1000 times a second).
For full persistence, all bars are illuminated (like emulator 50Hz on real 60Hz, or vice versa); for reduced persistence, you'd add blackness to create the rolling-bar black frame insertion, appropriately alphablending from the frameslice towards black (rather than to the previous refresh cycle).
For bad mismatches, you don't have to bother to flywheel-sync the real-raster and emu-raster. For close matches (e.g. 59.94fps for a 59.97Hz display), it would be perfectly fine to flywheel sync (and speedup/slowdown audio very fractionally).
The flywheel enable/disable feature would essentially glue together both the Lagless VSYNC approach and the Beam-Raced Temporal HLSL approach.
So the TomHarte approach and the Blur Busters approach are mergeable! Beamracing via sheer Hz is much more crossplatform.
I now propose that the bounty qualifies for ANY kind of beamraced output:
(A) The TomHarte approach: full-persistence Hz-agnostic rolling-scan beam racing
(B) The Blur Busters Lagless VSYNC concept
(C) The Beam Race Via Sheer Hz concept
(D) The merged version thereof (BFIv3 "Temporal HLSL" that's also optionally capable of full-persistence rolling-scan sample-and-hold, ala TomHarte)
The venn diagram of all the above is essentially mergeable. But to avoid complicating things for the BountySource, I'd say (A), (B), (C), or (D) alone is sufficient to qualify for the bounty. Thoughts?
@TomHarte Would you be interested in taking on the bounty? We could add $200 to it as an additional sweetener.
Either way, for RetroArch, the first step is a retro_set_raster_poll hook, which will unlock any beamraced-output approaches, whether it be hardware-dependent or pure-software. Would you be able to add that?
Preprogramming this simple hook would make it easier for anybody to add any beam-raced output approach to RetroArch.
Here you go:
Theoretically, both this GitHub item (#6984) and the BFIv3 (#10757) can essentially become an identical task.
This may be helpful for people who find BFIv3 conceptually easier to program than this GitHub item, though it would need a rolling full-persistence option too.
If programming from that angle, this GitHub item could later be added as a subset of #10757, perhaps using a @TomHarte-derived flywheel sync algorithm that triggers only when emuHz and realHz are close enough.
GitHub item #10758 is a pre-requisite for this.
I broke out the retro_set_raster_poll pre-requisite separately, because it's a universal requirement for all possible beamraceable output techniques.
Hey, I'd love to contribute some money to the bounty! But I see that it hasnt had anything added since 2018 and Im feeling hesitant, Is it worth doing it? Also it would be cool to promote in some way, I'm surprised I don't hear more people talking about it!
It's still a valid bounty. Most of the funds are mine -- and this bounty will be honored.
There was a bit of talk about it in 2018, but currently quiet on these fronts at the moment.
The buzz can be restarted at pretty much any time, if a small group of us of similar interests can start a buzz campaign about this. Some of us have jobs though, or got affected by pandemic affecting work, and have to work harder to compensate, etc. But I'm pretty much 100% behind seeing this happen.
BTW, the new "240 Hz IPS" monitors are spectacular for RetroArch (even for 60Hz operation).
I find it so weird that there aren't dozens of devs jumping at the opportunity to implement this... More than 4 years have passed since this ticket was created and still no working implementation?! Huh?!
Input lag is one of THE most pressing issues that needs addressing in emulators, and WinUAE has proven that this technique works extremely well in practice. With the "lagless vsync" feature enabled in WinUAE with a frame-slice of 4, I really see zero reason to bother with real hardware anymore. The best of all — it works flawlessly with complex shaders! It's a huge game-changer, and I'm quite disappointed that developers of other emulators are so incredibly slow at adapting this brilliant technique.
For the record, I don't care about RetroArch at all, otherwise I'd be doing this. But I started nagging the VICE devs about it; their C64 emulator badly needs it (some C64 games are virtually unplayable with the current 60-100ms lag). Might follow my own advice and will implement it myself, eventually...
This bounty is solely for a RetroArch implementation.
We also regret that nobody has picked this up yet. We have tried funding it with money, clearly that is not enough. It has to come from the heart from someone passionate enough and capable to do it.
Yes. WinUAE has led the way, having already implemented this.
Someone needs to add retro_set_raster_poll placeholders (see #10758).
Then this task becomes much simpler.
As a reminder to all -- this technique is truly the only way to organically achieve universal original-machine latency in an emulator (universal native-machine / FPGA-league latency originality). VSYNC OFF frameslice beam racing is the closest you can get to raster-plotting directly to front-buffer, one row at a time, in real time, in sync with the real world raster.
Same latency as original machine, to the error margin of 1-2 frameslices (subrefresh segments). Some of the faster GPUs can now exceed 10,000 frameslices per second.
We are rapidly approaching an era where we may be able to do full fine granularity NTSC scanrate too! (1-pixel tall VSYNC OFF frameslices -- e.g. each pixel row is its separate tearline)
Talked to the VICE people today about it. They're considering it, but some large scale refactorings will come first, which might take years.
I'd like to at least start implementing some of the auxiliary things which would be needed to get the whole thing going.
Thankfully blurbusters provided a lot of documentation and I feel like it should be possible to maybe break up all that has to be done into chunks. If we get some of these chunks done, even without a working implementation the entire thing might not seem so daunting to do.
As I've mentioned elsewhere, I believe one of the major hurdles for RetroArch/libretro in implementing this is that we typically work in full-frame chunks. That is, the core runs long enough to generate a frame's worth of audio and video, then passes it to the frontend. For this, we'll need to pass along much smaller chunks and sync them more often.
I suspect the cores that already use libco to hop between the core and libretro threads are probably going to be the lowest-hanging fruit. IIRC someone (maybe RealNC?) tinkered with this awhile back unsuccessfully, but I don't recall what exactly fell short.
That's exactly what the retro_set_raster_poll is designed to do. Please look at #10758. I've already addressed this.
Several emulators (e.g. NES) already render line-based.
We simply need to add callbacks there, and it will be highly configurable for the future, with any one or more of the following:
I actually already spent dozens of hours researching RetroArch's source code. It's simpler than you think. The first step is adding the raster scan line callback to the existing RetroArch callback APIs -- header it out to template it in, even if no module is "activated" yet.
Then it is a simple matter of activating one module at a time (on modules that already render line-based)
The flow is
Add the line-based callback placeholders, according to instructions at #10758 which does precisely what you just described.
It's simply a modified header file, plus dummy empty functions added to all the modules. That's it.
Add VSYNC OFF frameslice beamracing (any graphics API capable of tearlines can do it).
Then implement it on ONE module (one that already renders line-based, like the NES module).
Step 1 is easier than you think, if you have ANY raster interrupt experience at all. Step 2 simply needs to gain some
☝🏻 I'm 99% sure the answer is similarly simple with VICE. The problem there is more the infrastructure side of things; now it's tightly coupled with GTK and it uses some vsync mechanism provided by GTK (well, the GTK3 version, at least; the SDL one would be easier to hack I assume).
Raster interrupts are common on the C64, so it's either already rendering by rasterline internally, or it would be trivial to add (haven't read the source yet).
People are vastly overestimating the difficulty of implementing this technique, I think... Okay, maybe in a very generic framework like RA it could be a little bit trickier.
Feature Request Description
A new lagless VSYNC technique has been developed that is already implemented in some emulators. This should be added to RetroArch too.
Bounty available
There is currently a BountySource of about $500 to add the beam racing API to RetroArch plus support at least 2 emulator modules (scroll below for bounty trigger conditions). RetroArch is a C / C++ project.
Synchronize emu raster with real world raster to reduce input lag
It is achieved via synchronizing the emulator's raster to the real world's raster. It is successfully implemented in some emulators, and uses less processing power than RunAhead, and is more forgiving than expected thanks to a "jitter margin" technique that has been invented by a group of us (myself and a few emulator authors).
For lurkers/readers: Don't know what a "raster" or "beam racing" is? Read WIRED Magazine's Racing the beam article. Many 8-bit and 16-bit computers, consoles and arcade machines utilized similar techniques for many tricks, and emulators typically implement them.
Already Proven, Already Working
There is currently discussion between other willing emulator authors behind the scenes for adding lagless VSYNC (real-world beam racing support).
Preservationist Friendly. Preserves original input lag accurately.
Beam racing preserves all original latencies including mid-screen input reads.
Less horsepower needed than RunAhead.
RunAhead is amazing! That said, there are other lag-reducing tools that we should also make available too.
Android and Pi GPUs (too slow for RunAhead in many emulators) even work with this lag-reducing technique.
Beam racing works on Pi/Android, and allows slower cycle-exact emulators to have dramatic lag reductions. We have found it scales in both directions, including Android and Pi. Powerful computers can gain ultra-tight beam racing margins (sync between emuraster and realraster can be sub-millisecond on a GTX 1080 Ti), while slower computers can gain very forgiving beam racing margins. The beam racing margin is adjustable -- it can be up to 1 refresh cycle in size.
In other words, graphics are essentially raster-streamed to the display practically real-time (through a creative tearingless VSYNC OFF trick that works with standard Direct3D/OpenGL/Metal/etc), while the emulator is merrily executing at 1:1 original speed.
Diagrammatic Concept
Just like duplicate refresh cycles never have tearlines even in VSYNC OFF, duplicate frameslices never have tearlines either. We're simply subdividing frames into subframes, and then using VSYNC OFF instead.
We don't even need a raster register (it can help, but we've come up with a different method), since rasters can be a time-based offset from VSYNC, and that can still be accurate enough for flawless sub-millisecond latency difference between emulator and original machine.
Emulators can merrily run at original machine speed, essentially streaming pixels darn-near-raster-realtime (submillisecond difference). What many people don't realize is that 1080p and 4K signals still top-to-bottom scan like an old 60Hz CRT in default monitor orientation -- we're simply synchronizing to cable scanout; the scanout method of serializing 2D images to a 1D cable is fundamentally unchanged. Achieving real raster sync between the emulator raster and real raster!
Many emulators already render 1 scanline at a time to an offscreen framebuffer. So 99% of the beam racing work is already done.
Simple Pre-Requisites
Distilling down to minimum requirements makes rasters cross-platform:
We use beam racing to hide tearlines in the jitter margin, creating a tearingless VSYNC OFF (lagless VSYNC ON) with a very tight (but forgiving) synchronization between emulator raster and real raster.
The simplified retro_set_raster_poll API Proposal
Proposing to add an API -- retro_set_raster_poll -- to allow this data to be relayed to an optional centralized beamracing module for RetroArch to implement realworld sync between emuraster and realraster via whatever means possible (including frameslice beam racing & front buffer beam racing, and/or other future beam racing sync techniques).
The goal of this API is simply to allow the centralized beamracing module to take an early peek at the incomplete emulator refresh cycle framebuffer every time a new emulator scan line has been plotted to it.
This minimizes modifications to emulators, allowing centralization of beam racing code.
The central code handles its own refresh cycle scanout synchronization (busylooping to pace correctly to the real world's raster scan line number, which can be extrapolated in a cross-platform manner as seen below!) without the emulator worrying about any other beam racing specifics.
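As a sketch of what the proposed hook could look like in C -- the function name retro_set_raster_poll comes from this proposal, but the callback signature, parameter names, and helper function here are all illustrative assumptions, not the actual libretro API:

```c
#include <stddef.h>

/* Hypothetical callback signature (illustrative assumption).
 * The core invokes the poll once per plotted emulator scanline,
 * handing over the framebuffer that is complete up to that line. */
typedef void (*retro_raster_poll_t)(const void *framebuffer,
                                    unsigned last_plotted_scanline,
                                    unsigned width,
                                    unsigned height,
                                    size_t pitch);

static retro_raster_poll_t raster_poll_cb = NULL;

/* The frontend registers the centralized beamracing module here.
 * NULL means no beamracing module is active (the dummy default). */
void retro_set_raster_poll(retro_raster_poll_t cb)
{
   raster_poll_cb = cb;
}

/* Called from the core's per-scanline render loop.  When no module
 * is registered, this is a near-free no-op, so line-based cores can
 * ship the hook before any beamracing backend exists. */
static void on_scanline_plotted(const void *fb, unsigned line,
                                unsigned w, unsigned h, size_t pitch)
{
   if (raster_poll_cb)
      raster_poll_cb(fb, line, w, h, pitch);
}
```

The dummy-default design matches the proposal: every module adds an empty placeholder, and beamracing backends activate one module at a time.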
Further Detail
Basically it's a beam-raced VSYNC OFF mode that looks exactly like VSYNC ON (perfect tearingless VSYNC OFF). The emulator can merrily render at 1:1 speed while realtime streaming graphics to the display, without surge-execution needed. This requires far less horsepower on the CPU, works with "cycle-exact" emulators (unlike RunAhead) and allows ultra low lag on Raspberry PI and Android processors. Frame-slice beam racing is already used for Android Virtual Reality too, but works successfully for emulators.
Which emulators does this benefit?
This lag reduction technique will benefit any emulator that already does internal beam racing (e.g. to support original raster interrupts). Nearly all retro platforms -- most 8-bit and 16-bit platforms -- can benefit.
This lag-reduction technique does not benefit high level emulation.
Related Raster Work on GPUs
Doing actual "raster interrupts" style work on Radeon/GeForces/Intels is actually surprisingly easy: tearlines are just rasters -- see YouTube video.
This provides the groundwork for lagless VSYNC operation, synchronization of realraster and emuraster. With the emulator method, the tearlines are hidden via the jittermargin approach.
Common Developer Misconceptions
First, to clear up common developer misconceptions of assumed "showstoppers"...
Proposal
Recommended Hook
It calls the raster poll every emulator scan line plotted. The incomplete contents of the emulator framebuffer (complete up to the most recently plotted emulator scanline) are provided. This allows centralization of frameslice beamracing in the quickest and simplest way.
Cross-Platform Method: Getting VSYNC timestamps
You don't need a raster register if you can do this! You can extrapolate approximate scan line numbers simply as a time offset from a VSYNC timestamp. You don't need line-exact accuracy for flawless emulator frameslice beamracing.
For the cross-platform route -- the register-less method -- you need to listen for VSYNC timestamps while in VSYNC OFF mode.
These ideally should become your only #ifdefs -- everything else about GPU beam racing is cross platform.
PC Version
Mac Version
Other platforms have various methods of getting a VSYNC event hook (e.g. Mac CVDisplayLinkOutputCallback) which roughly corresponds to the Mac's blanking interval. If you are using the registerless method and generic precision clocks (e.g. RTDSC wrappers), these can potentially be your only #ifdefs in your cross-platform beam racing -- simply the various methods of getting VSYNC timestamps. The rest has no platform-specificness.
Linux Version
See GPU Driver Documentation. There is a get_vblank_timestamp() available, and sometimes a get_scanout_position() (raster register equivalent). Personally I'd only focus on obtaining VSYNC timestamps -- much simpler and more guaranteed on all platforms.
Getting the current raster scan line number
For raster calculation you can do one of the two:
(A) Raster-register-less method: Use RTDSC or QueryPerformanceCounter or std::chrono::high_resolution_clock to profile the times between refresh cycles. On Windows, you can use the known fractional refresh rate (from QueryDisplayConfig) to bootstrap this "best-estimate" refresh rate calculation, and refine it in realtime. Calculating raster position is simply a relative time between two VSYNC timestamps, allowing 5% for VBI (meaning 95% of 1/60sec for 60Hz would be the display scanning out). NOTE: Optionally, to improve accuracy, you can dejitter. Use a trailing 1-second interval average to dejitter any inaccuracies (they calm to 1-scanline-or-less raster jitter), and ignore all outliers (e.g. missed VSYNC timestamps caused by computer freezes). Alternatively, just use the jittermargin technique to hide VSYNC timestamp inaccuracies.
(B) Raster-register-method: Use D3DKMTGetScanLine to get your GPU's current scanline on the graphics output. Wait at least 1 scanline between polls (e.g. sleep 10 microseconds between polls), since this is an expensive API call that can stress a GPU if busylooping on this register.
NOTE: If you need to retrieve the "hAdaptor" parameter for D3DKMTGetScanLine -- then get your adaptor URL such as \\.\DISPLAY1 via EnumDisplayDevices() ... Then call D3DKMTOpenAdapterFromHdc() with this adaptor URL in order to open the hAdaptor handle which you can then finally pass to D3DKMTGetScanLine, which works with Vulkan/OpenGL/D3D 9/10/11/12+ .... D3DKMT is simply a hook into the hAdaptor that is being used for your Windows desktop, which exists as a D3D surface regardless of what API your game is using, and all you need is to know the scanline number. So who gives a hoot about the "D3DKMT" prefix -- it works fine with beamracing via OpenGL or Vulkan API calls. (KMT stands for Kernel Mode Thunk, but you don't need Admin privileges to do this specific API call from userspace.)
Improved VBI size monitoring
You don't need raster-exact precision for basic frameslice beamracing, but knowing the VBI size makes frameslice beamracing more accurate, since VBI size varies so much from platform to platform and resolution to resolution. Often it only varies by a few percent, and most sub-millisecond inaccuracies are easily hidden within the jittermargin technique.
But, if you've programmed with retro platforms, you are probably familiar with the VBI (blanking interval) -- essentially the overscan space between refresh cycles. This can vary from 1% to 5% of a refresh cycle, though extreme timings tweaking can make VBI more than 300% the size of the active image (e.g. Quick Frame Transport tricks -- fast scan refresh cycles with long VBIs in between). For cross platform frameslice beamracing it's OK to assume ~5% being the VBI, but there are many tricks to know the VBI size.
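When real mode timings are available (e.g. via QueryDisplayConfig on Windows), the VBI fraction falls straight out of vertical total vs. vertical active. A tiny sketch -- the function name is illustrative, and the sample numbers use the standard CEA-861 timing for 1920x1080@60, where vertical total is 1125 lines and 1080 are active (a ~4% VBI):

```c
/* Fraction of the refresh cycle spent in the blanking interval,
 * computed from mode timings: (VTotal - VActive) / VTotal. */
static double vbi_fraction(long v_total, long v_active)
{
   return (double)(v_total - v_active) / (double)v_total;
}
```

When timings aren't queryable, the ~5% assumption above remains a safe cross-platform fallback.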
Turning The Above Data into Real Frameslice Beamracing
For simplicity, begin with emu Hz = real Hz (e.g. 60Hz)
Note: 120Hz scanout diagram from a different post of mine. Replace with emu refresh rate matching real refresh rate, i.e. monitor set to 60 Hz instead. This diagram is simply to help raster veterans conceptualize how modern-day tearlines relate to raster position as a time-based offset from VBI.
Bottom line: As long as you keep repeatedly Present()-ing your incompletely-rasterplotted (but progressively more complete) emulator framebuffer ahead of the realraster, the incompleteness of the emulator framebuffer never shows glitches or tearlines. The display never has a chance to display the incompleteness of your emulator framebuffer, because the display's realraster is showing only the latest completed portions of your emulator's framebuffer. You're simply appending new emulator scanlines to the existing emulator framebuffer, and presenting that incomplete emulator framebuffer always ahead of real raster. No tearlines show up because the already-refreshed-part is duplicate (unchanged) where the realraster is. It thusly looks identical to VSYNC ON.
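The loop described above can be sketched schematically. This is a simplified simulation, not a working implementation: real_scanline(), emulate_lines() and present() are hypothetical stand-ins for the raster estimate, the emulator module, and the graphics API's Present(), and a real implementation would microsleep instead of hard-busylooping and would size the beam-race margin adaptively:

```c
#define TOTAL_LINES 240
#define SLICES      10
#define SLICE_H     (TOTAL_LINES / SLICES)

/* One beamraced refresh cycle.  Invariant: Present() only ever fires
 * while the emulator raster is AHEAD of the real raster, so the
 * display only scans out already-completed framebuffer rows and the
 * tearline always lands in the (unchanged) duplicate region.
 * Returns 0 on success, nonzero if the beam race was lost. */
static int beamrace_one_refresh(long (*real_scanline)(void),
                                void (*emulate_lines)(int first, int count),
                                void (*present)(void))
{
   int emu_lines_done = 0;

   for (int slice = 0; slice < SLICES; slice++) {
      /* 1. Emulate the next frameslice into the shared framebuffer. */
      emulate_lines(emu_lines_done, SLICE_H);
      emu_lines_done += SLICE_H;

      /* 2. Pace: wait until the real raster enters the previous slice,
       *    keeping roughly one frameslice of jitter margin. */
      while (real_scanline() < (long)(emu_lines_done - SLICE_H))
         ;  /* real code: sleep a few microseconds between polls */

      /* 3. Present the incomplete-but-ahead framebuffer. */
      if (real_scanline() >= emu_lines_done)
         return 1;  /* race lost: realraster overtook emuraster */
      present();
   }
   return 0;
}
```

The "race lost" branch is where a real implementation would fall back gracefully (e.g. skip ahead or temporarily exit beamracing), rather than ever letting the incompleteness become visible.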
Precision Assumptions:
Special Note On HLSL-Style Filters: You can use HLSL/fuzzyline style shaders with frameslices. WinUAE just does a full-screen redo on the incomplete emu framebuffer, but one could do it selectively (from just above the realraster all the way to just below the emuraster) as a GPU performance-efficiency optimization.
Adverse Conditions To Detect To Automatically disable beamracing
Optional, but for user-friendly ease of use, you can automatically enter/exit beamracing on the fly if desired. You can verify common conditions such as:
Exiting beamracing can simply be switching to "racing the VBI" (doing a Present() between refresh cycles), so you're just simulating traditional VSYNC ON via VSYNC OFF with manual VSYNC'ing. This is like 1-frameslice beamracing (next-frame response). This provides a quick way to enter/exit beamracing on the fly when conditions change dynamically -- a Surface tablet gets rotated, a module gets switched, the refresh rate gets changed mid-game, etc.
Questions?
I'd be happy to answer questions.