stella-emu / stella

A multi-platform Atari 2600 Emulator
https://stella-emu.github.io
GNU General Public License v2.0

Move post processing (Blargg and Phosphor) from main thread #410

Open DirtyHairy opened 5 years ago

DirtyHairy commented 5 years ago

Post-process frames in a pipeline on one or more separate threads as soon as they are available, not on the main thread. This will help slow systems, reduce the visible effect of dropped frames (they remain visible through their phosphor trace) and reduce temporal aliasing artefacts like #409.

sa666666 commented 5 years ago

I thought that with multi-threading enabled for Blargg, it was already outside the main loop?

thrust26 commented 5 years ago

Same here. 😕

DirtyHairy commented 5 years ago

While the Blargg calculation is spread over separate threads, the main thread still blocks until the calculation is complete: the image will be rendered only after Blargg is done, so the time required for Blargg (albeit split among several threads), phosphor and swapping the buffer all counts against the total time budget for rendering a frame on the main thread.

In addition, the main loop is currently designed such that the emulation worker continuously generates frames in real time, and the main thread samples them and blits them to the screen. If the main thread takes too long, a frame will be dropped, which currently means that it will not enter post processing, so not even a phosphor trace will be visible. This is basically an aliasing artefact: you are sampling a set of data points at a different frequency. With this ticket I want to improve on two ends:

sa666666 commented 5 years ago

Hmm, ends up sounding much more complicated than I thought. And all this work for the R77 only, since it's likely irrelevant for any desktop system currently in use.

I appreciate the hard work and effort, but I begin to wonder how much we can dedicate to the R77 specifically.

DirtyHairy commented 5 years ago

To be clear: I don't think this is specific in any way to the R77. I have had this in my mind for quite some time (basically since reworking the scheduler). I think the improvement will be there on every system that is slow, has problematic video drivers or an unlucky refresh rate (think of 60Hz NTSC games on a 50Hz display).
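The 60Hz-on-50Hz case can be made concrete with a small sketch. This is not Stella code: it assumes each emulated frame finishes exactly on schedule and that every display refresh simply shows the most recently finished frame. `dropped_frames` and its parameters are hypothetical names.

```python
def dropped_frames(emu_hz, display_hz, seconds=1):
    """Indices of emulated frames that never reach the screen.

    Assumption: frame i finishes at time (i + 1) / emu_hz, and each display
    refresh shows whichever frame finished most recently.
    """
    shown = set()
    for k in range(1, display_hz * seconds + 1):
        # Frames completed by the k-th refresh (integer math, no rounding error).
        completed = (k * emu_hz) // display_hz
        if completed > 0:
            shown.add(completed - 1)
    total = emu_hz * seconds
    return [i for i in range(total) if i not in shown]


print(dropped_frames(60, 50))  # [4, 10, 16, 22, 28, 34, 40, 46, 52, 58]
```

Under these assumptions one emulated frame in six is never sampled, and with post processing on the main thread those frames never reach phosphor either.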

DirtyHairy commented 5 years ago

I should add that I don't think this is required for solving #409 or releasing an R77 firmware 😏

thrust26 commented 5 years ago

Hm, I thought we already had the threads parallel as you describe for the future. I suppose that's from the multi-threading discussion.

For doing the phosphor, I am not sure when exactly this should be done. If the main thread already takes too long, how should we have time for phosphor on top?

DirtyHairy commented 5 years ago

My idea is to have phosphor and Blargg on a separate postprocessing thread. As soon as a frame becomes ready on the emulation thread, the postprocessing thread will start to process it. At this point, the main thread can still be busy swapping buffers. At the moment, post processing is done immediately before swapping the buffers and never in parallel.
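A rough sketch of that handoff, in Python rather than Stella's C++ and with every name hypothetical: `frames` stands in for the emulation worker's output, `postprocess` for Blargg plus phosphor, and `latest` for the slot the main thread would sample when it is ready to draw.

```python
import queue
import threading


def run_pipeline(frames, postprocess):
    """Feed every emulated frame through a separate post-processing thread."""
    handoff = queue.Queue()      # emulation thread -> post-processing thread
    latest = {"image": None}     # slot the main thread would sample from
    lock = threading.Lock()
    processed = []               # bookkeeping only: which frames were processed

    def emulation_worker():
        for frame in frames:
            handoff.put(frame)
        handoff.put(None)        # sentinel: no more frames

    def postprocessing_worker():
        while True:
            frame = handoff.get()    # blocks until a frame is ready
            if frame is None:
                break
            image = postprocess(frame)
            with lock:
                latest["image"] = image
            processed.append(frame)

    workers = [threading.Thread(target=emulation_worker),
               threading.Thread(target=postprocessing_worker)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()

    with lock:
        return processed, latest["image"]


processed, image = run_pipeline([1, 2, 3], lambda frame: frame * 10)
print(processed, image)
```

The point of the structure is that the post-processing thread consumes every frame in order, regardless of how often the main thread gets around to reading `latest`.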

thrust26 commented 5 years ago

Sorry, I don't get it.

Phosphor is applied on top of Blargg, how should this be done in separate threads? Or should we create Phosphor in one thread and Blargg+Phosphor on a 2nd thread and then decide which one we use?

And how should we start postprocessing before the buffers needed are there?

DirtyHairy commented 5 years ago

I guess I am not doing well at explaining what I have in my mind 😏

As soon as a buffer has been generated by the TIA, it can be processed by Blargg. As soon as it has been processed by Blargg, it can be processed by phosphor. As soon as it has been processed by phosphor, it can be rendered to the screen. And all these steps can happen simultaneously, provided suitable input is ready.

At the moment, a frame can only enter the "postprocessing chain" (Blargg + phosphor) when the last frame has been rendered to screen, as both steps happen on the main thread. And if rendering takes too long (and it takes macroscopic time, because rendering is synced with the screen refresh), then frames that have been generated by the emulation worker in the meantime will be lost (this is what I mean by "frame dropping"). If post processing happened on a separate thread, then they would at least contribute to the image via their phosphor trace, and the dropped frame would be less noticeable.
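A toy illustration of the phosphor trace argument, under loud assumptions: `blend` is a stand-in phosphor (brighter of the new pixel and the decayed old one, with an arbitrary decay of 0.5), not Stella's actual formula, and frames are just short lists of brightness values.

```python
def blend(previous, frame, decay=0.5):
    """Stand-in phosphor: keep the brighter of the new pixel and the decayed old one."""
    return [max(new, old * decay) for new, old in zip(frame, previous)]


def phosphor_image(frames, shown):
    """Image after the last frame, as (current behaviour, proposed behaviour).

    shown[i] is False when frame i was dropped by the main thread.
    """
    blank = [0.0] * len(frames[0])
    current = blank      # phosphor only sees frames that get rendered
    proposed = blank     # phosphor sees every emulated frame
    for frame, was_shown in zip(frames, shown):
        proposed = blend(proposed, frame)
        if was_shown:
            current = blend(current, frame)
    return current, proposed


frames = [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0]]   # a one-frame flash in pixel 0
current, proposed = phosphor_image(frames, shown=[True, False, True])
print(current)    # [0.0, 0.0]
print(proposed)   # [0.5, 0.0]
```

When the flash frame is dropped, the current arrangement loses it entirely, while the proposed one still shows its decayed trace in the next rendered image.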

thrust26 commented 5 years ago

Getting closer to understanding you. 😄

But I am not sure if that really helps. Because you still have 1/60s for the whole process. And if the overall process is too slow, you will definitely have to skip frames.

So now you have three options:

  1. accelerate the process, so that it fits into 1/60s
  2. parallelize
  3. make frame dropping less noticeable

> As soon as a buffer has been generated by the TIA, it can be processed by Blargg. As soon as it has been processed by Blargg, it can be processed by phosphor. As soon as it has been processed by phosphor, it can be rendered to the screen. And all these steps can happen simultaneously, provided suitable input is ready.

I cannot see how this can be done in parallel any better than now. If I e.g. want to do Blargg and Phosphor in parallel, I can only process image n with Blargg and apply Phosphor to n-1. And I can render n-2 to the screen. The result is lag.
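The schedule described here can be written out as a small sketch, assuming one new frame arrives per tick and each stage takes exactly one tick (`pipeline_schedule` is a hypothetical helper, not anything in Stella):

```python
def pipeline_schedule(num_ticks, stages=("Blargg", "Phosphor", "Render")):
    """For each tick, report which frame each stage works on (None = idle)."""
    schedule = []
    for tick in range(num_ticks):
        row = {}
        for depth, stage in enumerate(stages):
            frame = tick - depth          # stage `depth` runs `depth` frames behind
            row[stage] = frame if frame >= 0 else None
        schedule.append(row)
    return schedule


for tick, row in enumerate(pipeline_schedule(4)):
    print(tick, row)
```

At tick n the renderer is working on frame n-2, which is the lag in question.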

Or I split the image into smaller pieces and use the CPU power there (that's what multi threading is doing already).

And you describe option 3. in your last paragraph. But if phosphor is the last step, how can it contribute when the post processing is too slow?

DirtyHairy commented 5 years ago

I guess I may be miscommunicating my priorities with this issue 😏

My main concern with this issue is not performance, but ensuring that every frame enters the phosphor calculation, even if it is skipped. I think this will lead to a much smoother image on systems that skip on a regular basis.

In order to do this, post processing has to be moved from the main thread to a separate thread: as long as it runs on the main thread, a skipped frame will skip phosphor, too. As a side-effect, this may offer potential performance improvements, but those are not what I am aiming at, and whether they exist at all depends on the video driver.

It is true that there is only 1/60s between two frames. However, the main thread wastes a sizeable part of this time sleeping while the video driver waits for vsync and swaps buffers. The details of this delay depend on the driver implementation, but there is always some blocking involved. In a single iteration of the main loop, the emulation worker currently runs while the main thread does post processing and drawing. If both together take too long, the worker will emulate another frame, and the result of the first run will be skipped. Moving post processing away from the main thread will improve this, as Blargg and phosphor can run even while the frame is being drawn. Whether this actually leads to a performance improvement depends on the driver implementation.

The possibility of arranging Blargg and phosphor as a pipeline and running them on two separate threads is only an afterthought and may not be worthwhile. It would only help if Blargg + phosphor take longer than 1/60s, and the price would be one frame of lag. I think this price would be acceptable, but I don't think this is a realistic situation, as emulation is much more expensive than Blargg and phosphor.

thrust26 commented 5 years ago

I think the problem is on my side. I cannot see how you enter phosphor calculation when a frame is skipped. Maybe let's start by making me understand this first.