hcs64 / neon64v2

NES Emulation On the N64, 2.0 WIP
https://hcs64.com/neon64.html
ISC License
79 stars 11 forks source link
emulator mips-assembly n64 nes-emulator

Neon64 2.0 source notes

Running

To run Neon64, check the home page or GitHub for releases, these include a manual in README.txt (from pkg/README.txt).

Building

This project uses ARM9's fork of the bass assembler, which is linked as a git submodule for easy setup. On a system with git, Make, and a C compiler, you can build Neon64 by running:

make

This will fetch bass, build the tools (bass and chksum64), and build neon64bu.rom. make pkg builds the release ZIP.

Overview

The scheduler runs the show after all the inits in loader.asm and neon64.asm. Profile bars show time on the RSP (top) and CPU (bottom) relative to a 60 FPS frame.

Scheduling

When it runs a task, the scheduler sets two variables: target_cycle is the emulated cycle when the next task should run, and cycle_balance is the current time relative to target_cycle, starting negative and counting up to expiration at 0. Global shared state is up to date as of target_cycle + cycle_balance, and the task can assume that the shared state won't change or be observed by other tasks until target_cycle. The task updates cycle_balance via dadd/daddi, and yields with bgezal when expired. The scheduler records the final target_cycle + cycle_balance time for this task, and runs the task which is now earliest.

cycle_balance can go positive if the task runs past the target cycle, when not dealing with global shared state. (Unfortunately the 6502 doesn't get to do this often, it is always on the lookout for interrupts. A better design would schedule these separately.)

There are currently 5 tasks: CPU (6502), PPU, APU frame, APU DMC, and interrupt callback (intcb).

The RSP has a simple round-robin cooperative scheduler in ucode, the tasks do some chunk of work and yield, possibly running a completion vector in a break interrupt on the CPU. Tasks can indicate if they have no work; the RSP will not resume automatically if all tasks are idle. There is a priority signal from the CPU to run a specific task next, used for the APU flush, but I don't know if this does much good. PPU and APU each have an RSP task, I'd like to add a task that generates the dlist and manages the RDP, as well.

In the profile bars, CPU scheduler time is blue; RSP scheduler time isn't tracked.

CPU (6502)

The CPU executes out of an opcode table, which has two MIPS instructions for all 6502 opcodes, see opcodes.asm. This is often enough to specify an instruction: An addressing mode is the target of a j, and the address of the execution stage is loaded in the delay slot.

The CPU address space is mapped by read and write pointers for every 256 byte page. If bit 31 is clear, this is memory, which can be accessed directly through the TLB. If bit 31 is set it points to a side effect handler.

The most common thing the 6502 does is read bytes at the current PC, so PC is a native pointer, in register cpu_mpc. It gets converted back (by subtracting cpu_mpc_base) as needed. When the PC increments normally, the pointer can just be incremented, unless it passes into another bank. To trap this, after each TLB page there is unmapped guard space, a TLB miss exception will remap the PC.

The N, Z, and C flags are used somewhat less often than they are set, so I defer updating them; operations that would change then save the relevant value in cpu_{n,z,c}_byte. I'm not sure if this is a great idea in its current form, I haven't profiled without it.

In the profile bars, CPU task time is bright green. Currently this includes any immediate catchup before a CPU write to the APU or PPU, since the task doesn't switch.

PPU

The CPU-side PPU task is FrameLoop in ppu_task.asm, broadly: Fetch sprites, fetch background, repeat. The PPU yields for long stretches after blocks of work, these happen instantaneously with respect to the CPU, except the background fetch (see ppu_catchup_cb below).

Profiling in an earlier version showed that the hottest loop was checking all 64 sprites on each of 240 lines, so instead sprite evaluation scans OAM once per frame. Each line has a list of up to 8 sprites that occur on that line. Tile fetches are done before the background fetches for each line. Sprite 0 hit uses a cycle timer, checked on PPU status read.

The background fetch is a Duff's device-ish loop, each iteration deals with all tiles using the same attribute byte, flow enters the loop depending on the alignment of the X scroll. The PPU sets up the loop variables and installs a hook (ppu_catchup_cb), then the PPU yields. If the 6502 makes a write to PPU registers, it uses the hook to catch up tile fetching; this allows for mid-scanline changes. (Currently this reinitializes most things when catching up, on every register write during non-vblank scanlines. It should be possible to reduce this time substantially, I suspect it is particularly bad for writes to 2007 when rendering is disabled.)

The CPU fetches the pattern and attribute bytes, then passes these off to the RSP. The bytes collect in 64-bit shift registers, written out uncached when full; the VR4300's write buffer can help absorb the latency of a few big uncached writes, without wasting DCache on something the CPU never reads again. When the CPU has finished 8 lines, it writes out some additional data and then updates a write cursor in RSP DMEM, letting the ucode know it can DMA in the block. This isn't a fully functional FIFO; the CPU won't start writing again until the ucode has finished the frame. The structure of this "conv" buffer is in mem.asm.

ucode converts background tiles from separate bitplane 2bpp to interleaved 4bpp, and has a loop to shift up to 7 pixels for fine X scroll. This also copies the palette for each frame, and sets the fill color in the dlist to the true background color. There is only support for one palette per frame, currently.

ucode converts sprite tiles to 8bpp and blits them into a single texture for each line in reverse index order, this handles sprite occlusion. A bit in each pixel indicates priority relative to the background, and by using half-empty palettes the RDP can render both BG and FG sprites from the same texture.

The PPU ucode relies heavily on RSP vector ops: Multiply and accumulate to "shift" and "OR" 8 pixels at once, sfv and transpose to interleave, and unaligned 8 byte load/store to blit into the sprite texture.

When the RSP has DMAd the textures out to RDRAM, it issues a break, which interrupts the CPU and schedules a frame task. The frame task waits for a free framebuffer, waits for the RDP to finish, and executes the dlist. Ideally this would instead be run by the RSP, using the XBUS so a dlist needs never touch RDRAM. The frame task does things like controller polling and profile reporting, as well, which don't fit nicely elsewhere.

The RDP interrupt handles round-robin scheduling of three dlists, currently: the main PPU frame output, text rendering, and profile bars. When the main frame dlist ends with a Full Sync, the RDP interrupt stores the framebuffer pointer for use at the next VI interrupt.

The PPU has mappings for each 1k page in the PPU address space. CHR ROM access uses a write-protected TLB page to avoid checking for permission at write time, the exception handler attempts to skip the offending instruction.

In the profile bars, PPU task time is red. The dark green bar starting from the right is vblank wait time, this is a good measure of idle time for the frame. Yellow is time the frame task wasn't waiting on vblank, this is mostly when drawing the menus. Define PROFILE_RDP to show RDP instead of RSP time.

APU

The APU has two main drivers: First, the NES has a frame counter, which changes APU state automatically about 4 times per frame. Second, the 6502 can write to APU registers. Before allowing either of these to change the APU state, the CPU writes out an alist with the previous state and how long it lasted, this is Render in apu.asm. Also, the DMC reads samples on its own timer, these are buffered until the next alist.

Audio synthesis runs on the RSP, where the APU ucode deals with constant frequency and amplitude waves. Because most of the complex behavior is clocked by the slow frame counter, most of the state and logic stays on the CPU, sort of a sequencer/synthesizer division of labor. The ucode tracks phase for the waves across alists and buffers, which can be reset for some of the channels.

Every 44160Hz PCM sample is the simple sum of two point samples at twice that rate, an amount of "oversampling" configured at build time. At a fairly high sample rate this sounds mostly OK. Mixing uses lookup tables, following blargg.

The CPU sends alists to the RSP through a FIFO, with read and write cursors in DMEM; the ucode reads in an alist when it detects the FIFO is not empty. The RSP also has a FIFO of PCM audio buffers (abufs), it breaks to interrupt the CPU when DMA completes, to play the abuf if the AI isn't already full. The alist and abuf FIFOs can't be completely full, so that read == write always means empty, and to avoid overwriting the abufs currently managed by the AI.

There is also pull from the AI side. When the AI-not-full interrupt hits, the handler gets an abuf from the FIFO. If the FIFO is empty, the handler requests a flush, which tells the ucode to finish an abuf without waiting for an alist which covers those samples. This allows the audio to stretch if the emulation falls behind, commonly because a lot of timings are rounded down; this is preferable to the alternative where the buffer fills up and we have to figure out what samples to drop. Dropping samples isn't implemented (except for DMC), instead everything will slow down to the audio rate, though I haven't tested this in the steady state.

In the profile bars, APU task time is grey. This usually doesn't show up much on the CPU, in part because writing to registers is charged to the 6502 CPU.

File-by-file

Scheduling

Hardware interfacing

Emulation

CPU

PPU

APU

Cartridge

Tools

Etc