Research and implement better PPU IO fetch timings

The PPU IO fetch timing is roughly correct but definitely off by a couple of cycles here and there. This issue exists to track progress towards full accuracy and to take notes of my findings so far.

[ ] Mode 0 rendering really seems to begin four cycles earlier: at cycle 28 - (BGHOFS % 8) * 4 instead of 32 - (BGHOFS % 8) * 4. VRAM access is likely pipelined, with each VRAM access happening four cycles after its address calculation.
- TODO: does the same also apply for Modes 2 - 5?
[ ] Either WIN[x]H reads happen one cycle earlier than currently emulated, or WIN[x]H writes take a cycle to apply.
- The former would be weird because we already do the first WIN[x]H read on the first cycle of the scanline.
- The latter would make sense if there is a general one-cycle IO write delay. This has not been proven to exist yet; however, at least a couple of IO registers (IE, IF, IME and TM[x]CNT) seem to have such a delay.
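
For reference, a minimal C++ sketch of the two hypotheses above. All names here (`Mode0FirstFetchCycle`, `kCurrentBase`, `kObservedBase`, `DelayedRegister`) are hypothetical and not taken from the emulator's source. The delay model is only one possible way a one-cycle IO write delay could behave, not a confirmed description of the hardware.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical constants: cycle offset of the first Mode 0 fetch when BGHOFS % 8 == 0.
constexpr int kCurrentBase  = 32; // what is emulated right now
constexpr int kObservedBase = 28; // what the measurements suggest

// Cycle (relative to the start of the scanline) of the first Mode 0 fetch.
// Each pixel of fine scroll moves the fetch four cycles earlier.
constexpr int Mode0FirstFetchCycle(std::uint16_t bghofs, int base) {
  return base - (bghofs % 8) * 4;
}

// Hypothetical model of an IO register whose writes become visible one cycle late.
class DelayedRegister {
public:
  // Latch the written value; it is not visible to Read() yet.
  void Write(std::uint16_t value) { pending_ = value; }

  // Advance one cycle: commit a write latched during the previous cycle.
  void Tick() {
    if (pending_) {
      value_ = *pending_;
      pending_.reset();
    }
  }

  // Value as seen by the PPU on the current cycle.
  std::uint16_t Read() const { return value_; }

private:
  std::uint16_t value_ = 0;
  std::optional<std::uint16_t> pending_;
};
```

With this model, a WIN[x]H write on cycle N would only be seen by a PPU read on cycle N + 1, which is indistinguishable from the read happening one cycle earlier.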

nba-emu / NanoBoyAdvance #324