Random ideas to reduce code complexity, overhead

jhhoward / MegaDOOM

Megadrive DOOM BSP renderer

Random ideas to reduce code complexity, overhead #2

Open maxxoccupancy opened 7 months ago

maxxoccupancy commented 7 months ago

src/i_video.h looks like it contains code for polling the mouse, which I don't believe happens on these consoles.

There is a look up table that right shifts a 19 bit number down to 13 to get results back from a LUT. The 68000 is a bit slow performing these simple shifts. 19 shifts within the register would give us something like 114 cycles lost every time we attempt this. It's faster to just read the upper 16 bits of a register even if that means a LUT of 64K entries, or 128KB. Even if we used two or three of these, maximum cartridge size is 15MB. Unless we're planning on putting this game on a cart that has the 4MB limit, it might save us quite a bit of time to skip this step. "Once converted to angle screen space X coordinate are retrieved via a lookup table (viewangletox). Because BAMs were int, angles were first scaled down from 32 bits to 13 bits via a 19bits right shift in order to fit the 8k lookup table." https://fabiensanglard.net/doomIphone/doomClassicRenderer.php
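To make the trade-off concrete, here is a minimal C sketch of the two indexing schemes. The names (`index_13bit`, `index_16bit`, `lut_bytes`) are hypothetical and not taken from MegaDOOM; the table sizes follow the figures in the paragraph above, assuming 2-byte entries.

```c
#include <stdint.h>

typedef uint32_t angle_t;   /* 32-bit binary angle (BAM) */

/* Original scheme: a 19-bit right shift yields a 13-bit index (8192 entries). */
static inline uint16_t index_13bit(angle_t a)
{
    return (uint16_t)(a >> 19);
}

/* Proposed scheme: use the upper 16 bits directly (65536 entries). */
static inline uint16_t index_16bit(angle_t a)
{
    return (uint16_t)(a >> 16);
}

/* ROM cost of a table with 2^index_bits entries of entry_size bytes each. */
static inline uint32_t lut_bytes(unsigned index_bits, unsigned entry_size)
{
    return ((uint32_t)1 << index_bits) * entry_size;
}
```

With 2-byte entries, `lut_bytes(13, 2)` is 16KB and `lut_bytes(16, 2)` is 128KB. Whether the compiler turns `a >> 16` into a single `swap` or a plain word read depends on the compiler and flags, so the generated code is worth checking before committing to the larger table.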

Because the 386 and 486 have I-caches, DOOM's original code draws the image rotated 90 degrees. It then rotates the image and displays that to the screen. Since the Genesis/MD has no caching in the system, I wonder if we could skip this step, and whether that would speed up the transfer of data from RAM->VRAM.

The Genesis/MD has a rarely used 256x224 video mode, increasing the amount of time available in HBlank, IIRC. The VDP even supports the older Master System 256x192 mode. I don't know enough about those old modes to know if that would buy us any more time on the data transfer. MegaDOOM is only using 144x100, IIRC.

There are still a number of examples in the code where it looks like "int" is being used instead of 16 bit int.

New C compilers support features like Boolean (only uses one bit and can often be optimized away) and constexpr. I see a few functions here and there where this might be used. uint16 also sometimes creates opportunities for optimizations that would otherwise not happen with int.

I still believe we need to test the code out using -Ofast for compiler settings. We won't know if this is an option unless we compile this way and test out the edge cases.

Inline functions: compilers (especially older ones) can't always find inlining candidates on their own.

For the inline assembly, I am not seeing a hard and fast reason to use 'volatile,' which only prevents more aggressive optimization by the compiler. https://gist.github.com/flamewing/ad17bf22875be36ad4ae26f159a94f8b

This version of the code looks quite old. I don't see any use of keywords like 'unreachable' or 'static_assert' or 'assume.' Those have been shown to shave off an instruction or two in real world code.

My gut feeling is that there's some register contention and spill-reloads occurring, since the 68k only has 16 registers. Anything that cuts down the number of variables or even their size can reduce the number of unnecessary 32-bit loads and spills.

I'm going to go over the code once more to see if I can find some unnecessary sections of code that just lead to bloat. Sometimes the Genesis/MD sees a slight speedup if a game can cut down on the number of 512KB/1MB segment boundaries (I can't remember which one) that have to be crossed at runtime.

C++20 has a ton of new features that allow for more optimizations and performance tricks. For example, C++ compilers can make more aggressive use of 'constexpr' than C can, especially for all of these hash tables. I'm not sure if SGDK is compatible with C++, but it may be worth a shot to make use of this. Any feedback on these issues and questions would be greatly appreciated.

I'm sure that I saw one other major potential time/code saver. I just can't recall it right this minute.

maxxoccupancy commented 7 months ago

Profiling reveals what's wolfing down our CPU cycles:

- 29.2% R_DrawColumn
- 20.8% R_DrawSpan
- 7% R_RenderSegLoop
- 5% R_MakeSpans
- 3.4% R_GetColumn

About 65% total for the main drawing routines. Checking other versions of the source code to see if they've come up with other speed tricks.

Another recoding that I discovered while doing my research. I haven't checked yet to see if this faster method of searching has been implemented here.

Trivia: Visplanes hardcoded limit (MAXVISPLANES 128) was a plague for modders as the game would crash and go back to DOS. Two issues could arise: "R_FindPlane: no more visplanes": The total number of different visplanes materials (height, texture and illumination level) is over 128. "R_DrawPlanes: visplane overflow (%i)": Visplanes fragmentation is important and number of visplanes is over 128. Why limit to 128? Two stages in the renderer pipeline requested to search in the list of visplanes (via R_FindPlane), this was done via linear search and it was probably too expensive beyond 128. Lee Killough later lifted this limitation, replacing linear search with a chained hash table implementation.
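For readers unfamiliar with the technique named in the quote, the sketch below shows a chained hash lookup keyed on (height, texture, light). It is only an illustration of the general idea, not Lee Killough's actual code; the struct layout, bucket count, and hash function are all assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_PLANES   128   /* pool size, mirroring MAXVISPLANES in the quote */
#define HASH_BUCKETS 64    /* hypothetical bucket count (power of two) */

typedef struct plane {
    int16_t height;
    uint8_t texture;
    uint8_t light;
    struct plane *next;    /* next plane in the same bucket */
} plane_t;

static plane_t  pool[MAX_PLANES];
static size_t   pool_used;
static plane_t *buckets[HASH_BUCKETS];

/* Hypothetical hash: any cheap mix of the three key fields would do. */
static unsigned plane_hash(int16_t height, uint8_t texture, uint8_t light)
{
    unsigned h = (uint16_t)height * 31u + texture * 7u + light * 3u;
    return h & (HASH_BUCKETS - 1u);
}

/* Returns the matching plane, creating it if needed; NULL when the pool is full. */
static plane_t *find_plane(int16_t height, uint8_t texture, uint8_t light)
{
    unsigned b = plane_hash(height, texture, light);
    for (plane_t *p = buckets[b]; p != NULL; p = p->next) {
        if (p->height == height && p->texture == texture && p->light == light)
            return p;               /* only one bucket's chain is scanned */
    }
    if (pool_used == MAX_PLANES)
        return NULL;
    plane_t *p = &pool[pool_used++];
    p->height = height;
    p->texture = texture;
    p->light = light;
    p->next = buckets[b];           /* push onto the bucket's chain */
    buckets[b] = p;
    return p;
}
```

The lookup cost then depends on the length of one chain rather than on the total number of planes.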

ehaliewicz commented 7 months ago

There are no visplanes in this megadrive version, as there are no textured floors or ceilings.

Visplanes are used to merge ceilings or floors that have the same texture, light level, and height relative to player, so that perspective and lighting calculations can be done once for merged sets of visplanes. Here, floors and ceilings are drawn with single color columns, one for each wall or portal column.

As for the framebuffer drawing order, this code also does no rotation, but drawing to contiguous columns would save 4 cycles per byte/word framebuffer write (at 208x144 resolution, 14688 bytes, about 7.6ms per frame). Halving the resolution would let us get down to 8344 word writes, and if you changed the framebuffer layout a bit, we could write them in chunks of two writes for 38 cycles, or 4.75 cycles per pixel. That would be a HUGE savings of around 22ms per frame!
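The 7.6ms figure follows from simple cycle arithmetic. A minimal sketch, assuming the NTSC 68000 clock of roughly 7.67 MHz; the constant and helper name are illustrative and not part of MegaDOOM.

```c
#include <stdint.h>

#define M68K_CLOCK_HZ 7670454u   /* assumed NTSC 68000 clock, about 7.67 MHz */

/* Microseconds taken by `writes` operations that each cost (or save)
 * `cycles_each` CPU cycles. */
static inline uint32_t cycles_to_us(uint32_t writes, uint32_t cycles_each)
{
    uint64_t cycles = (uint64_t)writes * cycles_each;
    return (uint32_t)(cycles * 1000000u / M68K_CLOCK_HZ);
}
```

For example, `cycles_to_us(14688, 4)` comes out at a little under 7700 microseconds, which matches the "about 7.6ms per frame" quoted above.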

maxxoccupancy commented 7 months ago

Wait, 22ms? Most of the game is running around 4-6 fps (I count 50 frames in 10 seconds) right now, so, 110ms? A 10% speedup from that one change? Definitely worth it.

To be fair, my background is RISC Assembly and mostly higher level languages than C, so I'd feel more comfortable if you made the changes. I can propose different code, but you obviously understand the DOOM engine better than I do. I will do some recoding in VSC or Code::Blocks in the near future just so I can see if performance improvements materialize.

IIRC, DOOM on the SNES with the FX chip used a trick on the planes giving a simple, fixed background. For the Genesis/MD, we'd have to stick with the outdoor image for the background, then use the fixed image for the foreground, allowing us to cut out window and outdoor areas using the transparent color. That's easier said than done, but it's certainly been done here: https://youtu.be/JqP3ZzWiul0?si=M3xab39QFvwTZQN2

It looks like we're already running in 256x224 (or "32 column") mode. These systems are not hurting for bandwidth, which is why SEGA still beat SNES with older hardware. So I won't go down the DMA rabbit hole unless you think that there are a few more milliseconds to be saved.

maxxoccupancy commented 7 months ago

And I'm still not seeing why we need to use the keyword volatile when inlining Assembly. In my experience, we've always wanted to just include a group of simple Assembly instructions, but then let the linker and optimizer shuffle those instructions around to find peephole optimizations that we often miss in the code.

"5.4 Volatile ...? If you are familiar with kernel sources or some beautiful code like that, you must have seen many functions declared as volatile or __volatile__ which follows an asm or __asm__. I mentioned earlier about the keywords asm and __asm__. So what is this volatile? If our assembly statement must execute where we put it, (i.e. must not be moved out of a loop as an optimization), put the keyword volatile after asm and before the ()’s. So to keep it from moving, deleting and all, we declare it as asm volatile ( ... : ... : ... : ...); Use volatile when we have to be very much careful. If our assembly is just for doing some calculations and doesn’t have any side effects, it’s better not to use the keyword volatile. Avoiding it helps gcc in optimizing the code and making it more beautiful. In the section Some Useful Recipes, I have provided many examples for inline asm functions. There we can see the clobber-list in detail." https://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html#ss5.4

Modern compilers are also very good at register allocation and preventing spills. If the compiler has an idea to save a few cycles, I say let it do as much as it can.

My preference is to give the compiler a little assist and sprinkle in inline and assume where we can, cutting out a few unnecessary tests and branches in the final code. [[assume(x > 0)]]; Modern C compilers also let us use constexpr, and I've seen some instructions cut out at Godbolt.org when this is done.
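As a rough illustration of the "assume" idea in plain C: `[[assume]]` is a C++23 attribute, so the sketch below falls back on GCC/Clang's `__builtin_unreachable()`, which is a commonly used equivalent. The `ASSUME` macro and the `column_step` helper are hypothetical names made up for this example, not part of MegaDOOM or SGDK.

```c
#include <stdint.h>

/* Tell the optimizer that `cond` always holds. If it does not, behavior is
 * undefined, so this is only safe for invariants the caller really guarantees. */
#if defined(__GNUC__) || defined(__clang__)
#define ASSUME(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)
#else
#define ASSUME(cond) ((void)0)
#endif

/* Compile-time check (C11): catches a wrong type width before anything runs. */
_Static_assert(sizeof(int16_t) == 2, "int16_t must be 16 bits wide");

/* Hypothetical helper: per-column step over a span of `count` columns.
 * The caller guarantees count > 0, so the compiler may drop any handling
 * it would otherwise emit for zero or negative divisors. */
static inline int16_t column_step(int16_t span, int16_t count)
{
    ASSUME(count > 0);
    return (int16_t)(span / count);
}
```

Whether this actually removes instructions on m68k depends on the compiler version and flags, so the output should be compared with and without the hint.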

ehaliewicz commented 7 months ago

> Wait, 22ms? Most of the game is running around 4-6 fps (I count 50 frames in 10 seconds) right now, so, 110ms? A 10% speedup from that one change? Definitely worth it.

It's worth it, but not a trivial change and this isn't my project. I have my own projects to work on, so it's not likely I'll do it myself.

> And I'm still not seeing why we need to use the keyword volatile when inlining Assembly. In my experience, we've always wanted to just include a group of simple Assembly instructions, but then let the linker and optimizer shuffle those instructions around to find peephole optimizations that we often miss in the code.

gcc cannot reason about inline assembly blocks, aside from the registers you've marked as outputs or clobbers. I don't trust it to be able to move my inline assembly blocks around without causing bugs, and since all of the inline asm blocks I wrote for the PR have side effects, volatile is appropriate.
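The distinction can be sketched with GCC/Clang extended asm. The example uses empty asm templates so it is not tied to the 68k; the function names are hypothetical and the port is just an ordinary variable here, not a real VDP address.

```c
#include <stdint.h>

/* No volatile: the asm only "computes" a value through its output operand.
 * If the result is unused, the compiler is free to delete or move it. */
static inline uint16_t pass_through(uint16_t x)
{
    __asm__ ("" : "+r"(x));
    return x;
}

/* volatile plus a "memory" clobber: the statement is kept, and the compiler
 * will not reorder memory accesses across it. */
static inline void compiler_barrier(void)
{
    __asm__ volatile ("" ::: "memory");
}

/* Hypothetical port write: the side effect (the store) must happen exactly
 * here, which is the situation described above. */
static inline void write_port(volatile uint16_t *port, uint16_t value)
{
    *port = value;
    compiler_barrier();
}
```

In short, dropping volatile is reasonable for pure calculations, while asm that touches hardware registers needs it to stay where it was written.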

maxxoccupancy commented 7 months ago

I am so deep into this project now that it's distracting me from my coursework (which, ironically, does cover BSPs), so I have no problem delving a bit deeper into the code. As I've said before, I've done RISC Assembly, but not performance 68k Assembly, and only a little bit of performance C coding.

I'm seeing about a dozen potential optimizations that could get us (fingers crossed) into the 10+ fps range before adding sprites--which should be free on the VDP since we've got VRAM to spare.

If you'd be willing to continue following along on this project and guide it to the end, I'm fairly confident that we can completely change the way that the world looks at retro hardware and may even see a renaissance in retro projects by CS students.

If we could block out the movement of textures from the ROM cart to the CPU to RAM/VRAM, I could start counting cycles and figure out how to cut down on overhead.

I.e., this idea I've had of simply getting the DMA to access textures on ROM directly and load them into a frame buffer in VRAM. The Video Display Processor and DMA in tandem work like a 13MHz 16-bit RISC processor. While we don't have DOOM's minimum system requirements (386 & 4MB) in the 68k and Z80, the total processing power and throughput are theoretically there.

If you're willing to tweak the code, I'm willing to test the hell out of your code and try my own tweaks--to see if it works attempting every little performance optimization.

maxxoccupancy commented 7 months ago

Jason Turner of C++ fame also pored over the DOOM code looking for potential improvements while porting DOOM to C++. They lost a few fps, IIRC, but he did find a set of functions that could be constexpr'ed. I'll be taking a look at this section of code to see what I can do, myself. https://www.youtube.com/live/1zIvfw-Zv4s?si=aSW8ScoTs52ec4kE&t=301

ehaliewicz commented 7 months ago

> I.e., this idea I've had of simply getting the DMA to access textures on ROM directly and load them into a frame buffer in VRAM. The Video Display Processor and DMA in tandem work like a 13MHz 16-bit RISC processor. While we don't have DOOM's minimum system requirements (386 & 4MB) in the 68k and Z80, the total processing power and throughput are theoretically there.

How do you plan on scaling textures with this approach?

> https://www.youtube.com/live/1zIvfw-Zv4s?si=aSW8ScoTs52ec4kE&t=301

I haven't watched the video, but if you're interested in "real" optimizations and improvements that could be made to the base DOOM engine for old, uncached CPUs like the 68000, there are plenty. I have been working on a 2.5D renderer like this for the Megadrive for a couple of years, and have thought of or heard of many tricks. All of these ideas are used in KK's engine Dread (https://www.youtube.com/@KKAltair , https://www.youtube.com/watch?v=doD7hmlKun8 ), except the framebuffer layout trick, since he has to deal with Amiga bitplane graphics instead, so you can get an idea of what's possible on this level of hardware.

Starting from high level and going to low level:

1. Calculate the static potentially visible set for each sector. This is like Quake, but since we're "2.5D" we can go even further. The Build engine sorts bunches of walls at runtime, because 2D lines do not have cyclic overlap issues. It turns out you can actually sort walls statically at map build time (there are a couple of rare edge cases that need extra sector splitting). This means each sector can have a pre-calculated list of which walls to draw, in a pre-sorted order. All the renderer needs to do is loop over them, transform, clip, and rasterize.
2. The rasterization loops can get quite heavy, and in C there will be a lot of excess register movement and spilling to stack, let alone a lot of branching.
   - First of all, you can use a chain of addXs to add fixed point numbers together, and use the integer portions without any shifting or swapping required (I can explain this further if necessary).
   - There are several possible wall configurations (opaque wall, portal w/ lower step visible, portal w/ upper step visible, portal w/ upper and lower step visible, etc), each of them needing a variable number of y-coordinates and dXdY variables to track, but if you pre-compile an asm routine for each variant, you can remove all excess branching and strip it down to the bare essentials. You can even use a binary search branch tree to test for clipping with the minimum amount of branches possible. If you have written asm column drawing routines, you can also remove all parameter passing overhead when calling them.
3. To strip the column drawing routines down to the minimum number of cycles, you can divide the framebuffer into two halves, each half storing word columns for the even or odd columns of the framebuffer. The first half stores the even columns, the second stores the odd columns. Each column represents 4 pixels at a time with a word, and the next word is the next 4 pixels below. Now, you can either write 4 pixels at a time with contiguous word writes, move.w XXX, (a0)+, or use the extra stack pointer, which is word aligned, to write bytes, skipping every other byte. move.b XXX, (usp)+ ; actually increments the user stack pointer by 2, skipping the other byte.
4. Unrolling loops heavily (like in my PR) helps low-level drawing massively; taken branches take 10 cycles each! We only want to spend an instruction or 2 per 2-4 pixels.
5. Finally, if you don't care about having perfect texture mapping quality, you can pre-calculate texture mapping routines for each possible wall height. This introduces artifacts when textures are clipped from below, uses a lot of ROM space, and limits the max wall height. If you were to use a large RAM buffer on the cartridge, the artifacts could be avoided. But emulators don't really support this in my experience :)
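The split framebuffer layout described above can be sketched in C as an index calculation. This is one reading of the description, not code from the PR or from Dread; the dimensions and all names (`FB_WORD_COLS`, `fb_index`, `fill_column`) are assumptions chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define FB_WORD_COLS  36    /* hypothetical: 144 pixels wide / 4 pixels per word */
#define FB_ROWS       100   /* hypothetical framebuffer height in rows */
#define FB_HALF_WORDS ((FB_WORD_COLS / 2) * FB_ROWS)

static uint16_t framebuffer[2 * FB_HALF_WORDS];

/* Word index of word-column `c` at row `y`: even word-columns live in the
 * first half, odd ones in the second, and each column's rows are contiguous. */
static inline size_t fb_index(unsigned c, unsigned y)
{
    return (size_t)(c & 1u) * FB_HALF_WORDS + (size_t)(c >> 1) * FB_ROWS + y;
}

/* Fill rows [y0, y1) of one 4-pixel-wide column with a packed pixel word.
 * The post-incremented pointer is the C analogue of move.w dN,(a0)+. */
static inline void fill_column(unsigned c, unsigned y0, unsigned y1, uint16_t pixels)
{
    uint16_t *dst = &framebuffer[fb_index(c, y0)];
    for (unsigned y = y0; y < y1; ++y)
        *dst++ = pixels;
}
```

Because moving one row down is always the next word in memory, a column routine never needs to add a stride, which is where the per-write saving comes from.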

Dread essentially implements all of these techniques, which is why it's so fast and good looking. It also restricts ceilings and floors to a limited set of options, although there is a variable height renderer that nearly works.

maxxoccupancy commented 7 months ago

I've got to read your posts 2-3 times to get everything, so you're a good level above me in coding for the 68k.

The Genesis VDP lets you change the placement of textures/sprites on each scanline. Also, the autoincrement can be set to a value other than one. I expect to do a bit more research on this tonight and get right back with you. I believe we could also use MOVEM for these moves, since we can get a low cost increment with each move.

maxxoccupancy commented 7 months ago

Here's an example of some 68k Assembly code setting up the DMA, so that we're all on the same page:

.wait:
        move.w VDP_ctrl,d7
        and.w #%0000000000001000,d7     ;See if vblank is running
        bne .wait                       ;wait until it is

        MOVE.W #($8100|%01110100),(VDP_CTRL)    ;ENABLE DMA
        move.w #$8F01,(vdp_ctrl)                ;set auto-inc to 1  <<<<<<<<<->>
        MOVE.W #$9780,(vdp_ctrl)                ;enable dma vram fill
        ; HALT
        MOVE.W D3,(vdp_ctrl)                    ;set dma length low byte
        MOVE.W D4,(vdp_ctrl)                    ;set dma length high byte
        MOVE.L D2,(vdp_ctrl)                    ;set destination address

        MOVE.W D0,(vdp_data)                    ;write the data, dma begins here.
        ;do I need to wait for DMA to finish before continuing?
; .waitDma:
        ; MOVE.W (vdp_ctrl),d6
        ; btst #1,d6
        ; bne .waitDma

        move.w #($8100|%01100100),(VDP_CTRL)    ;DISABLE DMA
        move.w #$8F02,(vdp_ctrl)                ;set auto-inc back to 2   <<<<<<<<->>
    popf        ;restore flags and interrupt level
    popRegs D3-D7
    RTS

And this is the critical bit of code that lets us change the autoincrement amount, measured in bytes. Had Sato-san added a fixed point adder to the VDP, we would've gotten free scaling, rotation, and warping ala Mode 7 years before NINTENDO had this in their SNES, and the Megadrive would've gotten at least another year or two on market without the need for expensive addons.

move.w #$8F01,(vdp_ctrl) ;set auto-inc to 1
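For anyone following along in C rather than assembly, the control words in the listing are VDP register writes of the form $8000 | (register << 8) | value, so $8F01 is register 15 (auto-increment) set to 1. A small sketch; the helper and macro names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#define VDP_REG_MODE2    1u    /* mode register 2 (display/DMA enable bits) */
#define VDP_REG_AUTOINC  15u   /* auto-increment, in bytes */
#define VDP_REG_DMA_SRC3 23u   /* DMA source high bits / DMA mode */

/* Build the 16-bit word that sets VDP register `reg` to `value`. */
static inline uint16_t vdp_reg_word(uint8_t reg, uint8_t value)
{
    assert(reg < 32);
    return (uint16_t)(0x8000u | ((uint16_t)reg << 8) | value);
}
```

`vdp_reg_word(VDP_REG_AUTOINC, 1)` gives 0x8F01 and `vdp_reg_word(VDP_REG_AUTOINC, 2)` gives 0x8F02, the two values used in the listing.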

Remember that I'm suggesting using the 68k to order the DMA to make a simple vertical line move from ROM to the foreground bitmap in VRAM. In theory, we can actually keep the 68k running as long as it's monopolizing the 64KB of system RAM and not touching the DMA move. So we could have the Z80 handling all the audio, the 68k just performing the complex calculations via LUTs in RAM, and the DMA would be drawing lines to the foreground layer.

Foreseeable problem is that the 68k cannot then be updating the Sprite Attribute Table line by line while the DMA is working. It's only one or the other writing to VRAM.

The Genesis is also capable of drawing 80 32x32 sprites (flipped to get 64x64 textures). Using the background layer to draw the sky, then draw the floor, we would have just enough sprites to cover the walls and draw bad guys. The downside is that I've completely forgotten how to stretch big sprites at draw time. The upside is that we would get full screen, 320x224 at 60fps DOOM that makes 486 PCs look like the crap cans that they are.

maxxoccupancy commented 7 months ago

Damn, I'm smaht.

So, those 32x32 sprites don't need to be squares. They are sprites, after all. They can be any quadrilateral.

Imagine being close to a wall (on your left side). The steepest angle on the screen would be about 45 degrees, so the computer must draw two vertically flipped sprites for that section of wall. However, they are not square. The left side is 32 pixels high, the bottom is 32 pixels wide, and the right side is just 24 pixels high. So there's a 32x8 triangle of transparent pixels carved into the top of this otherwise 32x32 texture.

Now we place the upper lefthand corner of this sprite in the upper lefthand corner of the screen. However, to get a variable amount of stretching done, we change the sprite placement scanline by scanline. Because the VDP of the Genesis allows us to change the location of each sprite on each scanline, we can stretch or shrink it vertically, and we can do this individually.

To get the rest of the wall, we flip the sprite and draw it in reverse order.

For the rest of the walls, we use the same trick. Again, the upper lefthand corner of the sprite must be placed at the edge of the previous wall, then drawn in 32-pixel wide strips. (Unlike the DOOM engine, we could even have angled walls or even wrap textures around objects using this technique.)

We then draw 10-12 of these sprites across, ranging from 2-8 high, depending on which walls we're drawing. That would put us close to the maximum of 80 32x32 sprites onscreen. 1280 8x8 tiles only allows 320x256 coverage with no monsters, but this is just barely enough to draw the walls, using the background layer for the sky and floor, then the foreground layer to draw the ceiling, including openings (using transparent pixels).

And we could use the existing DOOM engine to calculate the starting point for each vertical line using a simple ADDQ.W or even ADDQ.B instruction, which is a quick 4 cycles per vertical line, or 8 cycles with a store. Since we've got about 488 clock cycles per horizontal line at 60fps, we could probably get 40-50 updates per scanline using the CPU. If we had LUTs set up, we could just get the 68k to figure out which LUT to select and the DMA to overwrite the correct part of the Sprite Attribute Table on each scanline.

Because the VDP draws horizontal lines at 60fps, we get perfectly smooth motion to rival a Pentium MMX 133--the first PCs capable of drawing full screen with full detail at 60Hz. Such a machine was simply unavailable in the PC market in 1993 when DOOM came out.

Finally, we use Shadow Highlight Mode to give us free lighting and shadows in the game with none of those pesky calculations eating up clock cycles from our precious 68k.

ehaliewicz commented 7 months ago

> The Genesis is also capable of drawing 80 32x32 sprites (flipped to get 64x64 textures). Using the background layer to draw the sky, then draw the floor, we would have just enough sprites to cover the walls and draw bad guys. The downside is that I've completely forgotten how to stretch big sprites at draw time. The upside is that we would get full screen, 320x224 at 60fps DOOM that makes 486 PCs look like the crap cans that they are.

I'm still not sure how you plan on scaling textures. Are you going to have pre-scaled textures in ROM?

And those 486s are not crap; they are way faster than the 68k in the Genesis. Cached, pipelined, fewer cycles per instruction, and clocked, what, 5x faster? That's why DOOM ran OK on them and didn't need crazy low-level optimizations (the only asm in DOOM is in the fixed point math routines and the column/row drawing routines). I'm suggesting writing nearly the entire renderer in asm and maybe generating over 100KB of unrolled loops. We have to work much harder to accomplish a much worse result here.

Edit: just saw and read your second post. It's a neat trick, you can dynamically shape the size of sprites via horizontal interrupts, but there are two issues:

- This won't provide uniform scaling; you'll get stretching or cutoff effects near the top or bottom when shrinking or stretching is done.
- Horizontal interrupts are quite heavy, and you have a limited number of cycles per scanline. Managing to change these parameters for more than a couple of sprites per line is going to be tricky. You'd have to limit the number of walls to a maximum to make sure you manipulate all the wall sprites in time.

maxxoccupancy commented 7 months ago

I know that the method that I came up with is a bit of a stretch (no pun intended), but it's a cool way of getting the VDP to draw, warp, stretch, or even vertically scale textures.

Horizontal interrupts are CPU-heavy, but the NES didn't even have them, forcing devs to count cycles. This has to be done on limit pushers and many 3D games. The SMS ran 228 cycles per scanline, I think. The Genesis executes almost exactly 488 cycles per scanline, or 100-122 simple register-register instructions. If we're stretching ten sprites per scanline, then we have 8-10 instructions per sprite. That is why I step down from 32 pixels high on the left side down to 24-25 on the right. The CPU only needs to change the vertical position by one bit (using the fast ADDQ instruction) 8 times, or once every four pixels.
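The per-scanline budget can be worked through with a few lines of C. This assumes NTSC timing of 3420 master clocks per line with the 68000 running at one seventh of the master clock, and 4 cycles per simple register-to-register instruction; the function names and the overhead parameter are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

#define MASTER_CLOCKS_PER_LINE 3420u   /* assumed NTSC line length in master clocks */
#define M68K_CLOCK_DIVIDER     7u      /* 68000 clock = master clock / 7 */

/* Whole 68000 cycles available on one scanline (fraction dropped). */
static inline uint32_t cycles_per_scanline(void)
{
    return MASTER_CLOCKS_PER_LINE / M68K_CLOCK_DIVIDER;
}

/* Upper bound on instructions per sprite per scanline, after subtracting a
 * fixed per-line overhead (for example interrupt entry and exit). */
static inline uint32_t instr_per_sprite(uint32_t cycles_per_instr,
                                        uint32_t overhead_cycles,
                                        uint32_t sprites)
{
    assert(cycles_per_instr > 0 && sprites > 0);
    uint32_t line = cycles_per_scanline();
    if (overhead_cycles >= line)
        return 0;
    return (line - overhead_cycles) / cycles_per_instr / sprites;
}
```

With no overhead and ten sprites this gives at most 12 four-cycle instructions per sprite; memory accesses and VDP writes cost more than 4 cycles each, so the practical figure is lower.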

If this turns out to be too much work in 1/60th of a second, we can drop down to a still smooth 30 fps, leaving one odd numbered frame for just the walls and the even numbered frame for all of the other game logic. Or H&VBlank for updating sprites and the active screen for doing the complex math to make this all possible.

If we were not doing the DOOM project, this 32x32x24 texture would work well on another Genesis/MD project that needed 80 x 60 = 4800 textured, shaded polygons per second. Something along the lines of Virtua Fighter 2, Virtua Racing, Daytona USA, or even Panzer Dragoon.

Obviously, I would be realistic and do Daytona USA, so that the track can be done with color cycling and other well-known tricks, while the canyon walls are done using the technique described above. To make any racing game work, I'd use the 320x448i mode to smooth out the motion and let Shadow Highlight Mode take care of all of the lighting issues.

The Genesis/MD VDP has a few unique goodies inside to make superscalar, bumps, vibration, smooth movement, banking, and other goodies that made the game possible on that machine. So it just seems strange that SEGA chose not to mint a few million back in March of 1994 when they were dominating the home console market.

maxxoccupancy commented 7 months ago

I just realized that I might not have been clear on my explanation for the Abramson Quad, since I couldn't find another name for it. To cut off the top triangle, you can select a time--say line number 100--and update the Sprite Attribute Table to cut off the top 8 lines of the AbeQuad. That gives you a 24x32 sprite that you can stretch or shrink vertically as the VDP draws each scanline. You could even move the shape left or right and get the shapes and textures that work for that part of the object. https://www.copetti.org/writings/consoles/mega-drive-genesis/

Gaming Secrets also showed a lot of cool tricks for getting 3D effects on side scrollers, from parallax scrolling on the floor (can do this with walls, also) to moving objects in the background slightly slower, and even using animated sprites (for example, to draw the side of the car as it turns left and right). https://youtu.be/nXKs1ZSgMic?si=s7vDEROMBTR_Vx8M

I'd thought of using some of these tricks on DOOM, but they're better suited for a racing game. If you've ever seen DC Racers or Street Racer on the Genesis, then you can see how hard they could push the hardware without even spending a ton on big game carts. This might be doable for building the floor on those DOOM levels, since they're basically just scrolling textures and can be shifted from side to side using calculations already being performed in DOOM right now. https://youtu.be/c6d8Gu9ItHg?si=3c_tvSuH4RVIl767

ehaliewicz commented 7 months ago

I'm not really interested in spamming up this project any more since I'm not working on it, but again, shrinking a sprite will not provide proper texture mapping. The VDP does not track subtexels, which is necessary in order to do this properly.

You could technically shift a sprite up or down on each scanline necessary to display the right texel at that pixel, but that would be tons of overhead. Might as well just do it in software like everybody is already doing. Then you get right back to how slow the 68000 is.

If I'm still misunderstanding your idea, perhaps implementing it will make it clear how it would work.

Also, there is something like 40-50 cycles of overhead when the 68k responds to an interrupt, and writing to VRAM during active display is slowed down quite a bit.

maxxoccupancy commented 7 months ago

The NES doesn't even provide that interrupt service, so programmers there had to learn how to count cycles. Some on the SMS demo scene are counting cycles to get the SMS drawing 3D effects and polygons like a much more powerful system. I'll just set that idea aside for the time being.

What I need to understand, though, is exactly the path that these textures are moving. I keep thinking that the Genesis/MD is better suited for using the 68k to set up the start points and lengths of each line, then having the DMA read in texture data from the ROM cart and draw those lines to a frame buffer in VRAM.

Since these lines would just be drawn over the foreground layer, the VDP could draw monster sprites over the top of this with no loss in performance. That seems like the most efficient approach, rather than having the 68k perform double duty, then having the DMA move that finished image from RAM->VRAM, cutting our frame rate in half.

Am I correct that our engine has been doing this the long, roundabout way so far? Am I misreading r_draw.c?

maxxoccupancy commented 7 months ago

Another reason for doing this is that the VDP has built-in hardware for vertical and horizontal flipping of sprites/tiles. That means that we could just draw the texture from the upper half of a wall and vertically flip everything with no overhead. Most walls could be simplified for a near doubling of our frame rate, albeit at the expense of some wall textures not looking quite the same.

Again, by removing the lighting functions and using Shadow/Highlight Mode, each wall tile gets assigned Shadow or Highlight for lighting. Again, the VDP does this at run time with no loss.

I'm watching these two tutorials again to look through the draw functions to see where we can skip a step or use the more powerful DMA and VDP to get some of the work of the engine done more quickly: https://youtu.be/huMO4VQEwPc?si=HFFwrCfjalwj_xkC

maxxoccupancy commented 5 months ago

Is working on a similar project? I just had an idea. The DMA can draw vertical lines straight from the ROM cartridge. To cut down on render time (albeit with a loss in accuracy), we use a 320x192 screen and 70 VBlank lines of DMA transfer. Instead of proper scaling of textures, we just draw the inside of the texture.

So if a texture is 64 pixels high, we scale it up to 64, 128, and 192 pixel versions (prescaled and pre-angled) then only read the inner part of the texture, centered on line 96.
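One way to read the "inner part of the texture" idea is as a crop: pick the smallest prescaled version that covers the wall's on-screen height, then read a window centered in that version and draw it centered on line 96. The sketch below is only that interpretation; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SCREEN_CENTER_Y 96   /* vertical center of the 192-line view */

typedef struct {
    uint8_t version_height;  /* which prescaled copy to read: 64, 128 or 192 */
    uint8_t src_row;         /* first row to read inside that copy */
    int16_t screen_y;        /* first screen line to draw */
} column_pick_t;

/* Choose a prescaled copy and crop window for a wall column that should
 * cover `wall_height` screen lines (1..192). No true scaling is done. */
static inline column_pick_t pick_prescaled(uint8_t wall_height)
{
    assert(wall_height >= 1 && wall_height <= 192);
    column_pick_t p;
    p.version_height = wall_height <= 64 ? 64 : (wall_height <= 128 ? 128 : 192);
    p.src_row  = (uint8_t)((p.version_height - wall_height) / 2);
    p.screen_y = (int16_t)(SCREEN_CENTER_Y - wall_height / 2);
    return p;
}
```

For a wall 100 lines tall this selects the 128-line copy, skips its first 14 rows, and starts drawing at screen line 46.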

70 lines of VBlank should give us 14,350 bytes, or 28,700 pixels (in pairs). That would let us draw an average of 45 pixels of height, but we would likely average 60-80 pixels of height, then flip most of those walls vertically and use the remaining cycles to draw stairs.
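The budget above can be checked with a short calculation. The 205 bytes per blanking line is simply the rate implied by the post's own numbers (14,350 / 70), and 4 bits per pixel is assumed; the helper names are illustrative.

```c
#include <stdint.h>

#define DMA_BYTES_PER_BLANK_LINE 205u   /* implied by the post: 14,350 / 70 */

/* Total bytes transferable over `blank_lines` lines of blanking. */
static inline uint32_t dma_bytes(uint32_t blank_lines)
{
    return blank_lines * DMA_BYTES_PER_BLANK_LINE;
}

/* Pixels carried by `bytes` at 4 bits per pixel (two pixels per byte). */
static inline uint32_t pixels_4bpp(uint32_t bytes)
{
    return bytes * 2u;
}

/* Average bytes available per screen column (integer division). */
static inline uint32_t bytes_per_column(uint32_t bytes, uint32_t columns)
{
    return bytes / columns;
}
```

`dma_bytes(70)` is 14,350 and `pixels_4bpp(14350)` is 28,700, matching the post. Spread over 320 columns that is 44 bytes per column, which appears to be where the "average of 45" figure comes from.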

It would look pretty rough but display at 60fps.

maxxoccupancy commented 5 months ago

Setting the DMA for 128K mode. Any feedback from anyone (even if you know nothing of the Genesis/MD internals or DMA) would be greatly appreciated. This DOOM engine would simply keep about 24KB of textures in VRAM and draw from them with VRAM Copy (a very fast move that can occur even during Active Display). I believe that most of the wall textures would have to be downgraded such that 128x128 would have to be reduced to 64x64, while others would have to be reduced to a finite number of tiles using lossy compression schemes, though that software isn't available to us.

The main use is updating 2-pixel-wide columns of pixel graphics, but other kinds of data such as tilemaps and sprite tables can also be updated in meaningful ways by only updating either the upper or lower byte of each entry; all this is faster than the usual way of transferring all 16 bits in 64k mode (unless just a few bytes are transferred in one go) and for pixel graphics this also saves VRAM and/or additional CPU resources by not having to deal with the usual workarounds for updating single bytes.