mfld-fr / emu86

Intel IA16 emulator for embedded development
35 stars 6 forks source link

Refine MDA attributes and add IPS option #67

Closed mfld-fr closed 3 years ago

mfld-fr commented 3 years ago

Move the attribute color logic out of the character rendering function, and make the BIOS attribute agnostic. Also little cleanup of the PC/XT/AT BIOS, and new option to enable IPS. Tested with FreeDOS HELP (8086tiny version) in both 3 and 7 modes.

ghaerr commented 3 years ago

A quick review of this change seems to show that you're needlessly adding lots of extra processing in a bottleneck graphics routine (especially with the MDA code rewrite, when the original worked fine and is much faster) that is called thousands of times with every scroll operation... for EMU86 it will just slow things down greatly, and for any deployed graphics output would be unacceptable. But that's ok, it is your project :)

Separately, unless I'm missing something, I don't understand the change to automatically convert attribute 0x00 to 0x07 late in a display routine. Any PC program can use attribute 0x00 (passed to INT 10h AH=0Eh, or in scroll/clear functions) to stop characters from being displayed. This change breaks that and is strictly incompatible. And, of course, your rewritten MDA routine is now non-compliant with the specification you mentioned in https://www.seasip.info/VintagePC/mda.html.

Summary - moving the attribute logic mapping out of the display function is a good idea, but the forced attribute mapping from 0 -> 7 is incorrect, and I don't think the MDA rewrite improves anything. Overall graphics output is slowed down needlessly. Finally, cursor code and entire PR needs careful testing on ELKS vi to ensure compatibility.

ghaerr commented 3 years ago

Also, because I'm a git newbie, my original working code in #64 didn't get committed to master. With all the other changes going on, and your rewriting of my code before even being committed to master, I will find it hard to test exactly what might be broken. I would much prefer that my original submission be committed, so I have an easier fallback for testing on ELKS with your rewrites.

Thank you.

mfld-fr commented 3 years ago

How to test a branch before it is merged to master:

mfld-fr commented 3 years ago

Agree that using 0 as the implicit default attribute is not a good idea whenever 'black on black' is actually used by some programs to hide output... let me rework this...

ghaerr commented 3 years ago

Thank you for the git help, I know most, but not all, of that. I usually understand enough to pull changes from PRs and test, but, as you saw with branches and squashing, it is beyond my level of git knowledge. I am sorry for the mess on #64.

My approach to writing code is a bit more incremental then yours... sometimes, smaller changes are better, even though they they don't complete the project at hand. I suppose that's a problem for a maintainer, e.g. when to accept smaller, incomplete changes.

Nonetheless, I still would prefer you add my original scrolling EGA/MDA commit to master so that we have a saved working reference that I have tested thoroughly, and then have your rewrites posted as PRs, possibly incrementally, so that we don't cause ourselves extra work testing, while you refine the architecture and design you require for EMU86. It is far easier for me to rollback a commit on master than pull prior branches, when trying to debug rewritten code.

Finally, while I see that you desire the idea of keeping the attribute separate from the character code value: I think that is a great idea in BIOS code (and thank you for changing my code there in previous commits), but I'm not sure its as good an idea inside the SDL backend, where video RAM is mostly looked at as word_t's, not byte_t's. Also, my graphics experience shows that we need to be even more careful in graphics output designs, with an emphasis for speed (even though EMU86 is NOT built nor designed for speed, I get it).

Thank you!

mfld-fr commented 3 years ago

I am sorry, but I don't see where could be the performance regression ? I just watched the IPS while making the screen to scroll in both mode 3 and 7, and I noticed no sensible difference compared to latest next. Could you please point out what are the lines that could slow down the things greatly, and that I might have missed ?

I agree that, on a slow CPU, accessing the video memory by words is faster than by bytes, but EMU86 is running on fast CPU with a big cache. Not only the memory is accessed by cache lines; but also the cache is large enough to hold all the emulator working set, including the emulated video memory (L2 : 6 MB on my Core i3). This is why I prefer simple code without bit combining & masking, rather than optimization without sensible impact on the user. Again unless I missed something ?..

ghaerr commented 3 years ago

There are still bugs with regards to the default attribute in con-sdl.c with your changes.

More points on the MDA implementation, which it seems you insist on using a routine you wrote a few days ago, which I see as problematic:

IMO, the attribute conversion routine should stay simple, and remain in a single routine, switched on video mode. Very special cases currently "handled" aren't in fact being tested. (We need a user program that writes all attribute values for that - I have one for EGA color, but it uses ANSI escape sequences, not INT 10h AH=0EH). Computing RGB values directly should be changed.

Please test all changes with ELKS vi and kilo for color, as I'm the one who is likely going to have to be maintaining this for complex cases.

ghaerr commented 3 years ago

I am sorry, but I don't see where could be the performance regression ? I just watched the IPS while making the screen to scroll in both mode 3 and 7, and I noticed no sensible difference compared to latest next. Could you please point out what are the lines that could slow down the things greatly, and that I might have missed ?

I will post, but quick review showed lots of added shifts and multiplies within inner for loops, which is always a bad idea for graphics or intensive I/O routines. The offender in MDA was four extra multiplies by MDA_WHITE, which I now see would be optimized out by compiler. Still bad idea, see above post. Agreed that for fast systems, no one will likely notice, but I still don't like see of the changes.

I agree that, on a slow CPU, accessing the video memory by words is faster than by bytes, but EMU86 is running on fast CPU with a big cache. Not only the memory is accessed by cache lines; but also the cache is large enough to hold all the emulator working set, including the emulated video memory (L2 : 6 MB on my Core i3).

Agreed, @mfld-fr, but consider the case when running interpreted in a browser, where EMU86 is compiled into WASM, which is also interpreted, and then the browser converts the entire SDL backend output into its own graphical output, etc etc. This is very much slower than desktop PC!! With graphics, you'll find this is usually the case - there are layers and layers depending on exact platform, so I always try to be very sure to write fast code with graphics, or you get a slow result and get to rewrite for next platform. I wrote all the SDL backend originally for the browser only (which is in fact too slow already), and here we are now using it for an enhanced mainline EMU86 feature, which will also slow down browser emulation more.

So - my comments aren't just about word vs byte lookups, but about the overall bottlenecked design speed of our graphics pipeline.

This is why I prefer simple code without bit combining & masking, rather than optimization without sensible impact on the user. Again unless I missed something ?..

Admittedly, I didn't see the benefit of splitting character and attribute in the BIOS code initially, but now think it makes for more readable and understandable BIOS. I do not see the benefits, particularly in the SDL backend code, where, when we are both done with it, should be completely independent from the BIOS code and built for high speed. I'm not even done with the BIOS/SDL implementation, and we're already in the middle changing it. All BIOS ram manipulation needs to move up out of SDL into BIOS level for strict emulation compatibility. But that's another subject.

mfld-fr commented 3 years ago

Mmm... yes, I agree, the browser hosting is the worst case, and I have no benchmark and no performance data in that configuration to build my opinion. Could it be possible to use the IPS facility to produce some data when running in a browser ? That would help me to understand the magnitude of the performance impact in that use case.

I also started to profile EMU86, and I think, rather than the MDA attribute routine, you could gain a lot by not rendering the character matrix when that character is null or space, but just filling the bytes with the background color, to save many loops ?

ghaerr commented 3 years ago

Quick review of latest version:

The MDA attribute routine is much improved from your first revision: thank you for listening and realizing that my original MDA code was both fast and correct, I appreciate it!! :) I suppose I can't complain about an RGB table not being used, since it appears you want to display in monochrome, and the MDA routine is now faster than the EGA routine. (Removing the switch statement greatly improves the speed, due to branch prediction not possible using switch on modern cached processors).

Line 280 in con-sdl.c still incorrect - use ATTR_DEFAULT. Line 812 of rom-pcxtat.c should default to EGA (3), not MDA (7).

I'd like to see some other speedups, but can wait on that.

Mmm... yes, I agree, the browser hosting is the worst case, and I have no benchmark and no performance data in that configuration to build my opinion. Could it be possible to use the IPS facility to produce some data when running in a browser ? That would help me to understand the magnitude of the performance impact in that use case.

I suggest you first compile up the Emscripten version, given the instructions I added to README.md. The browser version will output SDL in the top (square) window, and printf output will appear in the lower (horizontal) window. I haven't yet added the ability to read from the lower window, but that should not affect your IPS output. I'm not sure signals are handled in Emscripten/browser, but you might get IPS to work in another way, which would be great.

I also started to profile EMU86, and I think, rather than the MDA attribute routine, you could gain a lot by not rendering the character matrix when that character is null or space, but just filling the bytes with the background color, to save many loops ?

Yes, that is a possibly a good idea, but would require a slightly different approach to the current pipeline: that is, the entire screenful the bounding rect of non-blank characters would be redisplayed each major cycle, after painting the bounding rect background once. (And yes, in this case it would be much faster to do a word compare of attribute and character!) All the code for maintaining a bounding rect would also be unused and thrown out. I have used this approach before, but it is quite intensive, could work when typing single-characters-at-a-time, like in an editor, where just two characters need to be updated. So this needs to be thought through a bit before jumping into it, IMO. The current implementation is optimized for smaller changes, but is still fairly fast for whole-screen (or any scrolled) output. A problem with this approach occurs when the overall screen background color is not the same as the ATTR_DEFAULT color (such as when a program-wide option is added to map attributes to non-standard colors, and the background chosen is not black). A big speed benefit of this idea could also be not drawing the background bits of each character (similar to the current OR operation), when the background color is already the same as the cleared-to background color. This would be the case when drawing any character whose BG attribute (0x70) == 0 (DEFAULT_BG).

If thinking of going this route for speed, we should probably rename the 'or' flag to 'nobkgnd' (don't draw background) in sdl_drawbitmap, as I realize now it is mis-named - it doesn't OR anything, it just skips the background pixel draws, which is exactly what will be needed for part of this optimization.

mfld-fr commented 3 years ago

Line 280 in con-sdl.c still incorrect - use ATTR_DEFAULT. Line 812 of rom-pcxtat.c should default to EGA (3), not MDA (7).

Fixed. Thanks for the review !