stella-emu / stella

A multi-platform Atari 2600 Emulator
https://stella-emu.github.io
GNU General Public License v2.0

Optimizing TIA rendering #185

Closed thrust26 closed 7 years ago

thrust26 commented 7 years ago

The code seems to be optimizable. Simply moving the vblank() check to the beginning improves maximum framerate by 20..25% in my tests.

And maybe it would be better to retrieve the colors in priority order and stop retrieving as soon as an enabled object is found?

EDIT: Looks like my tests are not reliable. The difference seems much smaller than I first thought (<5%).

BTW: I noticed that at maximum frame rate, the sound gets way behind.

thrust26 commented 7 years ago

I have to look at the code again, but I think Blargg works in independent 7-pixel horizontal chunks. So the threading would have to work in columns, not lines.

Anyway, I mainly thought about phosphor.

DirtyHairy commented 7 years ago

Dirty rectangles: I think the only way to do this is to compare framebuffers at the pixel level as we are rendering them; anything else is more or less impossible. The new core is a simulation of the TIA at the counter level and lacks any concept of the general image composition. I think this is the reason why it is so precise: it doesn't try to establish any concept of geometrical objects (and copies) and thus doesn't suffer issues in edge cases where these concepts break down (retriggering etc.).

If anyone comes up with a smart algorithm to identify dirty areas by examining the framebuffer at the pixel level it could be done at that level, but I don't think the complexity is worth the speedup.

Threading: I think the main danger with threading is the potential for data races. However, spreading the phosphor calculation over multiple cores shouldn't be too difficult. I would consider OpenMP for doing this; it takes away the overhead of spawning, pooling and managing threads. Basically, you just add pragmas to the loop that you want to spread over multiple cores and to the memory that is shared. I have used it in the past with FORTRAN, and it is pretty easy to add to existing code. I could take a stab at this at some point.

OpenGL: I have thought about this in the past, too. A shader implementation of phosphor should be easy enough to get going (I have implemented something similar in WebGL here, and I have it on the list of my TODOs for 6502.ts). I have no idea whether Blargg could be easily ported to a GPU, though, and the issues with portability remain.

sa666666 commented 7 years ago

The TIA class originally had two framebuffers, current and previous. They were used to implement phosphor in the old way (static palette only), but also in older versions to do exactly as @DirtyHairy says; track changes at the pixel level. Basically, each time a pixel was updated, an IF was done to check if something changed. It was eventually removed, and in general I can't see how checking every single pixel can somehow make anything faster. As soon as you've checked a pixel, you've done the work. The whole point was to not look at it in the first place :smiley:

I am of two minds on the move from OpenGL in Stella 3.x to the SDL2 abstractions in Stella 4.x and above. On one hand, it greatly decreased my workload and means the same code is used on all systems, with SDL itself picking the best possible toolkit for each platform.

On the other hand, it means not being able to take direct advantage of cool features of modern video cards (shaders, hugely parallel hardware, etc.), on which these algorithms would run much faster. I just wish there was one standard we could use. Windows is Direct3D only, and does everything it can to break other toolkits. OpenGL is mostly everywhere, but is also mostly substandard in Windows. And with Vulkan coming along, I'm not sure how long OpenGL will be around.

Many other emulators are taking the OpenGL-only approach and greatly benefiting from using OpenGL directly. But one of the design goals of Stella is to be able to run anywhere. Arrgh, I've had this conversation with myself many times over the years, and never really find a best solution.

thrust26 commented 7 years ago

Arrgh, I've had this conversation with myself many times over the years, and never really find a best solution.

Blame Microsoft!

sa666666 commented 7 years ago

Oh, I do, with the burning hatred of a thousand suns :smiling_imp: But at the end of the day, I'm still back where I started, and still have to figure out the best way forward.

thrust26 commented 7 years ago

Just for fun, I tried to implement multi threading for phosphor calculation. Code by Google. 😄 TIASurface_Multi.zip

DirtyHairy commented 7 years ago

Again, I'd strongly propose to use OpenMP: it will take care of spawning and pooling threads, of managing memory sharing, and it will automatically select an appropriate number of threads to use in order to ensure optimal use of the hardware. Also, you don't need any boilerplate apart from adding pragmas to the loop that should be optimized.

sa666666 commented 7 years ago

How did it affect performance?

thrust26 commented 7 years ago

No idea, I don't know how to profile that. But CPU usage went well above 30% on my quad core at maximum framerate.

thrust26 commented 7 years ago

VS only supports OpenMP 2.0. 😞

DirtyHairy commented 7 years ago

I think 2.0 should be enough 😉 I can try a sample implementation later (probably not today though).

sa666666 commented 7 years ago

@thrust26, but what does the CPU do at maximum framerate before you added this code? The reason I ask is that sometimes threading is deceiving, and adding it makes things slower. I haven't tested or really looked in any detail, but I notice that the thread is using a lambda with '=' capture specification, meaning to copy everything. Now it may be required to do that, but it's also a possible speed hit.

As for OpenMP vs. rolling our own, there's a certain appeal to using C++11 threads directly. The biggest reason is that we don't add another dependency to the codebase (OpenMP). But I really don't have much experience in this area either way, so I will wait for your feedback.

thrust26 commented 7 years ago

@sa666666 I can't see any difference. Is there any way to display the framerate?

@DirtyHairy Is it really that easy? Use the /openmp option and put #pragma omp parallel for above the loop?

EDIT: Nope, now the bottom half of the screen is not updated correctly.

thrust26 commented 7 years ago

As for OpenMP vs. rolling our own, there's a certain appeal to using C++11 threads directly

Agreed.

sa666666 commented 7 years ago

@thrust26, no there's no way to know the actual framerate at this time.

However, now that I think about it, threading isn't going to make things faster anyway, at least in terms of CPU usage. Inefficiencies aside, the work we do in one loop on 1 thread is the same total work as 1/4 of it on each of 4 threads.

I guess this is more beneficial for lower-clocked systems, where the CPU can't do all the work in the 16.7ms per frame and as a result frames get skipped. But overall, the work is still being done, just spread 'sideways'. This is different from our previous optimizations, which resulted in code that actually did less work.

sa666666 commented 7 years ago

I also want to add that threading has come a long way since I last looked at it. OpenMP seems dead-easy, but even the C++11 version is very simple compared to the last time I used it (C and pthreads). So my apparent disdain for threading is completely out of date.

thrust26 commented 7 years ago

For sure threading doesn't reduce the total amount of work. It just reduces execution time.

Do you have an overview on what platforms Stella is used? Especially at the lower end.

sa666666 commented 7 years ago

The 'big 3' (Linux, OSX, and Windows) on commodity hardware probably make up 95% or more of usage. As with most apps, Windows probably is the majority (I would estimate > 80%), and OSX and Linux the remaining. So on those systems, none of these optimizations are really necessary anyway. But IMO, we should always strive for maximum efficiency.

However, there are many emulation boxes that run Linux on lesser hardware. These tend to be ARM based, but I don't know if they have multiple cores. Probably most of the current ones do. The RasPI is a very common one. I've also been contacted by one of those 'emulation box' companies about using Stella in a fully-licensed, GPL-approved way (more discussion by PM if you're interested). On that particular system, they note that their proprietary 250Mhz system is too slow, but a 400Mhz RasPI is sufficient. Of course this is the general emulation itself I'm talking about here, so that's without phosphor or TV effects at all.

I think getting full emulation, TV effects and phosphor mode (and the new sound core) running on a 400Mhz or above RasPI would be a good thing to aim for.

EDIT: But I don't want to sacrifice compatibility to achieve the RasPI goal. IMO, emulation accuracy trumps runtime efficiency.

thrust26 commented 7 years ago

Which RasPi model?

And yes, I am interested in that 'emulation box'. 😄

sa666666 commented 7 years ago

I suspect most people are trying it on the latest model; so version 3b. I guess this would be the one to aim for. But since it's running at 1.2GHz, I think it's not a problem anyway. The really low-end ones are the ones used in the emulation boxes, which is generally why they have crappy emulation.

Anyway, I have to step away for a few hours; have to get some work done.

DirtyHairy commented 7 years ago

@thrust26 it's been some 5 years since I last used OpenMP, but basically, yes 😉

thrust26 commented 7 years ago

I optimized the phosphor loops by unrolling (2x and 8x, ~130 bytes each) and smoothing the code. TIASurface::render() went from ~11.2% to ~9.0% in profiling.

TIASurface_unrolled.zip

DirtyHairy commented 7 years ago

@thrust26 I think what is missing for OpenMP is pragmas to define the memory semantics. Iirc, variables declared outside the parallel loop are shared between all threads by default, so every thread is racing on the same bufofs; that's why the lower half of the screen is not updated.

sa666666 commented 7 years ago

@thrust26, it's unclear to me why you use 2 and 8; what is the significance of those magic numbers? Or more to the point, why aren't they the same (since the two blocks of phosphor code are essentially the same)?

thrust26 commented 7 years ago

I changed the code into this. It works better than before, but now there are distortions. Any suggestion?

      Int32 bufofs;
      // Note: The code assumes that AtariNTSC::outWidth(kTIAW) == outPitch
      #pragma omp parallel for
      for (Int32 y = 0; y < height; ++y)
      {
        bufofs = y * AtariNTSC::outWidth(kTIAW); 
        for(Int32 x = AtariNTSC::outWidth(kTIAW) / 8; x > 0; --x)
        {
          // Store into out buffer and back into displayed frame buffer (for next frame)
          rgbIn[bufofs++] = out[bufofs] = getRGBPhosphor(out[bufofs], rgbIn[bufofs]);
          rgbIn[bufofs++] = out[bufofs] = getRGBPhosphor(out[bufofs], rgbIn[bufofs]);
          rgbIn[bufofs++] = out[bufofs] = getRGBPhosphor(out[bufofs], rgbIn[bufofs]);
          rgbIn[bufofs++] = out[bufofs] = getRGBPhosphor(out[bufofs], rgbIn[bufofs]);
          rgbIn[bufofs++] = out[bufofs] = getRGBPhosphor(out[bufofs], rgbIn[bufofs]);
          rgbIn[bufofs++] = out[bufofs] = getRGBPhosphor(out[bufofs], rgbIn[bufofs]);
          rgbIn[bufofs++] = out[bufofs] = getRGBPhosphor(out[bufofs], rgbIn[bufofs]);
          rgbIn[bufofs++] = out[bufofs] = getRGBPhosphor(out[bufofs], rgbIn[bufofs]);
        }        
      }

Update: I found a working solution. It seems that the inner loop can be split between threads too (the documentation says otherwise). Unfortunately the solution requires quite frequent extra math. I need a pragma that prevents splitting the inner loop. EDIT: Nope, see below!

      Int32 bufofs;
      // Note: The code assumes that AtariNTSC::outWidth(kTIAW) == outPitch
      #pragma omp parallel for
      for (Int32 y = 0; y < height; ++y)
      {
        for (uInt32 x = 0; x < AtariNTSC::outWidth(kTIAW) >> 3; ++x)
        {
          bufofs = y * AtariNTSC::outWidth(kTIAW) + (x << 3);
          // Store into out buffer and back into displayed frame buffer (for next frame)
          rgbIn[bufofs++] = out[bufofs] = getRGBPhosphor(out[bufofs], rgbIn[bufofs]);
          rgbIn[bufofs++] = out[bufofs] = getRGBPhosphor(out[bufofs], rgbIn[bufofs]);
          rgbIn[bufofs++] = out[bufofs] = getRGBPhosphor(out[bufofs], rgbIn[bufofs]);
          rgbIn[bufofs++] = out[bufofs] = getRGBPhosphor(out[bufofs], rgbIn[bufofs]);
          rgbIn[bufofs++] = out[bufofs] = getRGBPhosphor(out[bufofs], rgbIn[bufofs]);
          rgbIn[bufofs++] = out[bufofs] = getRGBPhosphor(out[bufofs], rgbIn[bufofs]);
          rgbIn[bufofs++] = out[bufofs] = getRGBPhosphor(out[bufofs], rgbIn[bufofs]);
          rgbIn[bufofs++] = out[bufofs] = getRGBPhosphor(out[bufofs], rgbIn[bufofs]);
        }        
      }
thrust26 commented 7 years ago

@thrust26, it's unclear to me why you use 2 and 8

I found that more than 8 times unrolling doesn't give any relevant improvement. And since Blargg lines are 3.5 times wider than non-Blargg lines, I thought less unrolling (8/3.5 = ~2) would be OK. But you can unroll as often as you want.

sa666666 commented 7 years ago

Isn't it possible that this loop unrolling will start to interfere with threading? Maybe we should test both methods before doing any further commits.

@thrust26, so the 2 and 8 are essentially from empirical evidence?

sa666666 commented 7 years ago

@thrust26, each line now depends on the other one. In particular, the value of bufofs in one line is incremented and then used in the next line. So there is now a dependency between the lines.

Also, on my compiler at least, your code (without threading) is complaining about sequence points. I think that's related to doing an increment and using the bufofs variable in the same line.

thrust26 commented 7 years ago

@sa666666 Yes, 8 is empirical (2 is just derived from 8).

As far as I can tell, the inner loop is not split inside. So the unrolling is fine. But the inner loop itself is split too. That caused the problems. Fixed above.

sa666666 commented 7 years ago

@thrust26, not related to threading, but you need to split each line from

rgbIn[bufofs++] = out[pos++] = getRGBPhosphor(out[bufofs], rgbIn[bufofs]);

to

rgbIn[bufofs] = out[pos] = getRGBPhosphor(out[bufofs], rgbIn[bufofs]);
bufofs++;  pos++;

Otherwise, you're actually violating a rule of the language. In particular, it's not guaranteed whether the '++' takes effect before or after the variable is used later in the line. I don't know how much, if any, this will affect the performance.

thrust26 commented 7 years ago

I wondered about that ++ too. It works, but I'd better change that.

Meanwhile I solved my problem. Here is the code which only threads the outer loop and causes no problems (bufofs has to be local inside the outer loop). I think this is more efficient than doing the math in the inner loop and allowing thread splits there too. But I can't measure it.

      // Note: The code assumes that AtariNTSC::outWidth(kTIAW) == outPitch
      #pragma omp parallel for
      for (Int32 y = 0; y < height; ++y)
      {
        Int32 bufofs = y * AtariNTSC::outWidth(kTIAW);
        for (uInt32 x = AtariNTSC::outWidth(kTIAW) / 8; x; --x)
        {
          // Store into out buffer and back into displayed frame buffer (for next frame)
          rgbIn[bufofs] = out[bufofs] = getRGBPhosphor(out[bufofs], rgbIn[bufofs]);
          bufofs++;
          rgbIn[bufofs] = out[bufofs] = getRGBPhosphor(out[bufofs], rgbIn[bufofs]);
          bufofs++;
          rgbIn[bufofs] = out[bufofs] = getRGBPhosphor(out[bufofs], rgbIn[bufofs]);
          bufofs++;
          rgbIn[bufofs] = out[bufofs] = getRGBPhosphor(out[bufofs], rgbIn[bufofs]);
          bufofs++;
          rgbIn[bufofs] = out[bufofs] = getRGBPhosphor(out[bufofs], rgbIn[bufofs]);
          bufofs++;
          rgbIn[bufofs] = out[bufofs] = getRGBPhosphor(out[bufofs], rgbIn[bufofs]);
          bufofs++;
          rgbIn[bufofs] = out[bufofs] = getRGBPhosphor(out[bufofs], rgbIn[bufofs]);
          bufofs++;
          rgbIn[bufofs] = out[bufofs] = getRGBPhosphor(out[bufofs], rgbIn[bufofs]);
          bufofs++;
        }
      }
thrust26 commented 7 years ago

Ouch! When using threading, the CPU utilization goes to ~70%!!! 😨 (usually 4% only)

Any idea what's happening? Too much overhead from OpenMP? Or maybe cache problems?

sa666666 commented 7 years ago

The '++' only 'works' in the sense that Visual Studio doesn't complain about it, and it happens to do the right thing. But it is against the spec. I typically compile in Visual Studio, gcc and clang when testing new features, and there are quite a few times that one compiler picks up something that the others don't.

As for testing the efficacy of the new approach, let's see what @DirtyHairy has to say. He seems to have more experience with OpenMP and threading in general, so maybe he can suggest a possible test.

sa666666 commented 7 years ago

Yep, that's what I was afraid of with threads. We sit down and hammer out an approach. Everything seems fine, makes logical sense, etc. And then we test it, and it ends up being slower :disappointed:

thrust26 commented 7 years ago

True. But my hand-written code works nicely. So we are fighting the framework here, not threading in general.

sa666666 commented 7 years ago

When you say "hand written code", I assume you mean the version using C++11 threads directly?

Also, again I'm not sure about exactly how threading works, but isn't it the same as calling a bunch of functions? So maybe there's the function call overhead?? :confused:

thrust26 commented 7 years ago

Yup, my Google code. :smiley:

Yes, there is function overhead. And some math overhead in our example too. So the total CPU time will increase. But at least in my own, handwritten code, the overhead is minimal.

BTW: Having a chat channel as @DirtyHairy suggested would be really nice.

sa666666 commented 7 years ago

You're only using 2 threads, and OpenMP is probably using every thread on your system (4 to 8, depending on your CPU; 16 for me :smile:). Maybe that has some effect, like going over a certain number makes it worse. It wouldn't be the first time that I've heard of x threads being fast, but x+1 being slower.

As for the chat channel, how/where is that set up?

thrust26 commented 7 years ago

I just tried to tell OpenMP how many threads to use: 1 thread = 6%, 2 threads = 30%. So CPU utilization already jumps with just one extra thread. I like my own code more and more. 😆

@DirtyHairy mentioned a chat in the trackball thread:

As a sidenote, @sa666666 , I've never tried it before, but gitter is a chat that is connected to github projects and might be more useful than tickets for discussions like this :smirk: I'm tempted to do an experiment and set it up for Stella, if this is OK for you.

DirtyHairy commented 7 years ago

@sa666666 You can set up gitter on the web page. I just tried to do so, but I would need administrative access for the stella-emu organization :smirk:

Concerning the threading: in my experience, if used correctly, the overhead from OpenMP is negligible. It mainly consists of the bookkeeping for spawning and pooling threads and for distributing workloads. I have used OpenMP for number crunching before, and the 70% CPU is certainly not normal. What I'd like to do is try my own OpenMP implementation of the phosphor code --- not because I dislike Thomas', but I want to look at the code with fresh eyes :smirk: I'll see whether I can come up with something in the next hour.

thrust26 commented 7 years ago

I tried threading the outer loop only, the inner loop only or both loops. Always the same CPU load.

Now I want to know your results before I go to bed. 😄

thrust26 commented 7 years ago

@sa666666 16 cores? Or 8 cores with Hyper-Threading? What CPU is that? Ryzen?

sa666666 commented 7 years ago

I just created the gitter thing and invited both of you. I'm not sure how to use it yet, or if it's set up correctly. Also not sure how it improves on discussing things here, but we will see ...

@thrust26, Ryzen CPU, 8 cores + hyperthreading = 16 total. Stella compiles in 25 seconds :grinning:

I'd also like to know what's going on wrt threading. But I'm soon logging off too. It's almost 7PM here, and I've been working on Stella today longer than my actual job :smile:

thrust26 commented 7 years ago

Otherwise, you're actually violating a rule of the language. In particular, it's not guaranteed that result of '++' is done before or after it is used later in the line.

Would that be OK or is it still undefined? I would assume that the expression is always evaluated first, before the result is assigned. And the code looks nicer that way. :smiley:

rgbIn[bufofs++] = getRGBPhosphor(out[bufofs], rgbIn[bufofs]);

DirtyHairy commented 7 years ago

That's what my code looks like:

    case Filter::Phosphor:
    {
      uInt8*  tiaIn = myTIA->frameBuffer();
      uInt32* rgbIn = myRGBFramebuffer;

      #pragma omp parallel for
      for(uInt32 y = 0; y < height; ++y)
      {
        const uInt32 offset = y * width, offsetOut = y * outPitch;

        for(uInt32 x = 0; x < width; ++x)
        {
          const uInt32 bufofs = offset + x;
          const uInt8 c = tiaIn[bufofs];
          const uInt32 retVal = getRGBPhosphor(myPalette[c], rgbIn[bufofs]);

          // Store back into displayed frame buffer (for next frame)
          rgbIn[bufofs] = out[offsetOut + x] = retVal;
        }
      }
      break;
    }

It works fine, but, unfortunately, I see the same as @thrust26: there is a constant load of 20% per core if OMP is active, even if I completely gut the loop to its bare skeleton. I must admit that I have no idea what is going on here; I have used OMP before without problems :angry:

The only difference to my previous applications is that those were parallelizing code that never slept and tried to crunch numbers as fast as possible. I suspect that the issue is related to Stella sleeping between frames: sleep only puts the main thread to sleep, so it could be that the other threads are caught in a spinlock. I'll do a profile out of curiosity when I find time, but I think that's it for OMP in our use case :white_flag: Sorry for the detour.

DirtyHairy commented 7 years ago

Would that be OK or is it still undefined?

Afaik it is undefined, but @sa666666 is the authoritative source on that :smirk:

thrust26 commented 7 years ago

We should not give up that early. Maybe there is a configuration for OpenMP which changes the behavior?

I tried setenv("OMP_WAIT_POLICY", "PASSIVE", 1) (the default is ACTIVE, which seems to keep the threads alive and busy for a while), but that didn't make it any better. 😢 Though M$ says differently, e.g. here.

And this seems to indicate that OMP_WAIT_POLICY is not supported in VS 2017.

thrust26 commented 7 years ago

@DirtyHairy Another idea (for TIA class): In TIA::renderPixel() we now check the objects in priority order. This means that low priority objects are not checked when they are not visible. Could this be used to optimize the tick() methods?

Maybe, if we reorder e.g. the tick() calls like in TIA::renderPixel(), we could at least save calculating collisions if the object is hidden. And maybe more?

The savings won't be big (if any), but maybe it's worth a look.

sa666666 commented 7 years ago

Afaik it is undefined, but @sa666666 is the authoritative source on that :smirk:

I wouldn't say authority, it's just that I make sure to compile everything on 3-4 compilers, to catch any possible issue. Based on my understanding, the sequence point issue is because you're using the variable on both sides of the '=' sign, but on one side it is doing a post-increment. So saying x = x++ would also complain.

So the easiest way to fix it is:

rgbIn[bufofs] = getRGBPhosphor(out[bufofs], rgbIn[bufofs]);
bufofs++;
thrust26 commented 7 years ago

OK then.

BTW: Instead of copying into out[] one by one, I am using memcpy() now. 😄