some timer tests and more

notaz commented 1 year ago

https://notaz.gp2x.de/md/rel/testpico-438648d-for-pd.bin

"time z80 loop" (fail) - just idle loops z80 for ~1 frame worth of cycles and reads the vcounter. In PD that results 1 more than the hardware.
"time vcnt loops" (pass) - loops on each vcount value and counts iterations. Shows that all lines are equal and the rumored half-line doesn't exist (in non-interlaced mode at least).
"time timer a/b z80" (pass) - waits for some long timer period and compares to vcounter. Couldn't get this stable on real hardware so the test accepts rather a large range.
"timer b stop" (fail) - demonstrates that the timer is "active" even when all enable bits are cleared. More on that below.
"timer ab sync" (fail) - with this I tried to show that 16 ticks of timer A is equal to 1 tick of timer B, and when synchronized they will trigger simultaneously. In practice this is almost true, but very occasionally only timer A bit is latched. I've added one re-read and it became stable (always reads as 3). The only unexpected thing is that I had to set timer A to 15 and not the expected 16.
"32x disable" (fail) - this shows that despite what the docs say, 32X can actually be disabled. The 68k "BIOS" gets unmapped and ROM shows up again. Haven't checked if the graphics layers get also disabled. I now use this for easy reset (no more need to write a special 32X reset handler at 800200). Shouldn't be hard to implement in PD.

So from testing the timers it seems they "run" regardless of register settings, what the settings control is what to do on each timer tick, i.e. to increment the counter or not, and to set status bit (that the CPUs can read) on wraparound or not. I also suspect that whenever there is 0->1 transition for the enable bit, it doesn't actually do anything instantly, only sets some internal flag. Only when the tick happens with that internal flag set, it actually reloads the count value programmed by the CPU. That's why in the "timer ab sync" I had to set A to 15 (and not 16) to get it to sync with timer B.

irixxxx commented 1 year ago

"time z80 loop" needs that obnoxious 3823 value changed to pass this test, for cz80 to 3824 and for drz80 to 3825. That's somehow suspicious, to say the least. This cannot just be a rounding error when converting 68k and z80 cycles, there's something off in the z80 cores.

"timer a/b sync" fails due to timer start/stop not being aligned to the internal ~53000Hz clock of the ym2612. It indeed passes when this is taken into account for loading next_oflow in ym_sync_timers on timer enable. Very roughly estimating the internal ym clock with 64 z80 clocks is good enough here, so replacing xcycleswith (((xcycles>>14)+1)<<14) is working.

irixxxx commented 1 year ago

"timer b stop" confuses me. From what I see in the asmtools source (unimportant stuff excluded):

timer b starts on scanline 0, configured for the shortest possible time. It roughly runs for 5 scanlines.
after status bit signals completion, save current scanline, which is 5.
about 1900 cycles are burned away. That's roughly another 4 scanlines.
timer b is started again. It should run the same time aa before, so about 5 scanlines.
after completion, save current scanline, which is now 14.

The difference between the 2 saved scanline counters is 9, but the test expects it to be between 4 and 5. There must be an error in my understanding what you intended to do. Could you point me at my error please?

notaz commented 1 year ago

I tried to explain this in my previous post. Basically it's impossible to stop the timers, they are always ticking. When disabled they just take no action when they tick. So in our example, assuming initial tick happens on scanline 0, the next ones will be roughly on scanlines 5 and 10 regardless of when/how we program reg 0x27 (unlike what PD wrongly emulates since like forever).

irixxxx commented 1 year ago

Are you sure about this? Is this indeed passing on real hardware? I just had a peek at both GX+ and Blastem. They both implement timer ticking only if the timer is enabled by the corresponding bit (either bit 0 or 1) of the control register. And as far as I can tell from a quick glance at Nuked OPN2, I'd say that also only increments the timer if the corresponding bit is set.

notaz commented 1 year ago

Yes this is always passing on real hardware. On my MD1 with discrete ym2612 at least. Blastem nightly too.

ticking only if the timer is enabled by the corresponding bit (either bit 0 or 1) of the control register. And as far as I can tell from a quick glance at Nuked OPN2, I'd say that also only increments the timer if the corresponding bit is set.

I'm not trying to deny that. What I'm calling a "tick" is a possibly no-op hardware event that is happening regardless of any software settings and can't be influenced by anything (except reset perhaps). On this event it may do nothing when the control register is 0, or increment the counter according to bits 0 and 1, like you just said. However the software can't reset or otherwise influence the interval by messing with the control register. If the timer B triggered on (second part of) scanline 0, next time it will trigger is on scanline 5, regardless if you re-enable it on scanline 0, 1, 2, 3 or 4. Is that more clear?

notaz commented 1 year ago

https://notaz.gp2x.de/md/rel/testpico-635f245-for-pd.bin

I've updated "timer ab sync" to also show that writing count values to 0x24/0x25 or 0x26 doesn't take effect until you either toggle control bits 0/1 from 0->1 in 0x27 or until previous timer expires.

I'm unsure if it's worth to emulate all these details but I do remember having some grief from games breaking because of timers. Latest Blastem seems to have all these details right.

irixxxx commented 1 year ago

Sorry for me being a bit dogged here. Please correct me if I'm misinterpreting you:

You say timers are running in the background all the time. Besides, you say timers are not reloaded on setting bit 0/1 of the control register, but only on overflow - at least that's what the test seems to imply.

I've implemented this behaviour, but then the "time timer" tests measuring if the timer runs for the correct period are failing. That's only logical, since the next overflow has possibly no connection to the timer value written - that value would only be used after reload on overflow.

So either my understanding is still off, or it sounds like there must be other conditions influencing this, since the test runs on real hardware. Is this maybe coupled to writing a new timer value? Say, when not writing a new value before setting bit 0/1 of timer control, the timer isn't loaded?

irixxxx commented 1 year ago

BTW, implementing the last sentence of my previous message passes the timer tests in the emu. So, that might actually be what's happening. To support this, could you add a check starting a timer, then reset the bit 0/1, but keep bit 2/3 set? If the theory is correct, it should still be reporting the timer overflow, although bit 0/1 is off.

notaz commented 1 year ago

timers are not reloaded on setting bit 0/1 of the control register, but only on overflow

Hmm no, setting the bit also reloads, but only if it was 0 before (that's what I meant by 0->1 transition), and only at the time of the next timer tick, not instantly.

not writing a new value before setting bit 0/1 of timer control, the timer isn't loaded?

It think it should reload (if the control bits were 0), I'll try to add a test for that.

could you add a check starting a timer, then reset the bit 0/1, but keep bit 2/3 set?

I think I tried this and it never set the overflow flag. But I'll add such test to be sure when I have time.

irixxxx commented 1 year ago

timers are not reloaded on setting bit 0/1 of the control register, but only on overflow Hmm no, setting the bit also reloads, but only if it was 0 before (that's what I meant by 0->1 transition), and only at the time of the next timer tick, not instantly.

Another idea. There must be something in the timer b hardware which counts to 16, as a divider of the tick. What if that isn't reset on timer b load? In that case, the first overflow of timer b would happen 0 to 15 ticks too early. If timer b is restarted while this divider is in the same cycle of counting to 16 and the value to load is 0xff, it would indeed look as if the timer was not being loaded but running in the background. That would be easy enough to test for though, just loading the timer with 0xfe instead of 0xff would quickly show this.

could you add a check starting a timer, then reset the bit 0/1, but keep bit 2/3 set? I think I tied this and it never set the overflow flag. But I'll add such test to be sure when I have time.

If status is set it would confirm timers running in the background. The opposite isn't necessarily true though - e.g. setting the status bit might be masked if the load bit is not set.

irixxxx commented 1 year ago

BTW, it would also explain this, at least for timer b:

"time timer a/b z80" (pass) - waits for some long timer period and compares to vcounter. Couldn't get this stable on real hardware so the test accepts rather a large range.

And while I'm at it, I dug into my memory:

"time vcnt loops" (pass) - loops on each vcount value and counts iterations. Shows that all lines are equal and the rumored half-line doesn't exist (in non-interlaced mode at least).

IIRC those half lines are actually only needed for interlacing. That half line just delays displaying the 1st line a tiny bit, which leads to the lines displayed just a bit down between the lines of the first field. So, with interlace we might actually see those half lines.

notaz commented 1 year ago

https://notaz.gp2x.de/md/rel/testpico-3d80f94-for-pd.bin

Shows that not writing count values will still reload and control bits 0,1 disable setting of status bits.

There must be something in the timer b hardware which counts to 16, as a divider of the tick. What if that isn't reset on timer b load?

Doesn't "timer b stop" already prove this is the case? Didn't quite understand what 0xfe for timer B would show, note that "timer ab sync" already uses 0xfc.

irixxxx commented 1 year ago

Sorry no, that wasn't obvious to me. It could also have meant that under certain conditions no reload is taking place. I was so persistent because I couldn't believe this. It would have meant that decades ago some hardware engineer had committed valuable resources into tracking a state where it wasn't necessary.

Anyway, meanwhile I have implemented this and it seems to work with this. I reckon by Occam's razor it's also the most probable solution.

That leaves "32x disable". I have a simple hack for this which seems to work, but for the RV bit stuff. RV hasn't been supported before.

notaz commented 1 year ago

I have a simple hack for this which seems to work, but for the RV bit stuff. RV hasn't been supported before.

The test doesn't really care if RV works though, it just checks if it can set/clear that bit, and keeps it set for convenience, to be able to execute from the usual ROM area. It can't really test much anyway as IIRC with RV=0 the ROM area is just open bus (except 68k BIOS) and with RV=1 attempting to read 0x800000 hangs the machine.

irixxxx commented 1 year ago

Implemented "32x disable". The RV bit is indeed just stored, the mappings were already there.

irixxxx commented 1 year ago

I think I know now why "time z80 loop" fails:

A scanline has 3420 mclocks. That amounts to ~488.57 68k cycles and exactly 228 z80 cycles. Picodrive however uses 488 68k cycles throughout the board. The calculation of z80 cycles (*7/15) only makes for ~227.73 for 488 68k cycles. Though that doesn't sound much, over the 2 frames of measurement time it accumulates to ~140 z80 cycles less - more than half a scanline, which is why the end of the test lands some 40 cycles in scanline 2, instead of 1 as it should.

I can think of 2 ways to remedy this:

change the computation of z80 cycles from 68k cycles so that 488 68k cycles is closer to 228 z80 cycles. That would mean to change 3823 back to 3827.
or else, give up the 488 cycles per scanline for 68k and make that closer to the real 488.57. That would mean changes all over the emulator core though, with potential consequences in some places like e.g. VDP timing.

With the former, the test lands nearly 100 clocks before the end of scanline 1, and no other test is failing. OD2 is still landing on spot with the end of the sound, so all seems OK for now. @notaz if you don't have any better idea, going back to 3827 seems to be the easy route for now. Going away from 488 looks like a bigger project than I'm willing to take ATM - if ever since it's so ingrained in the core.

notaz commented 1 year ago

I think the 2nd option is the way to go eventually though. I've tried patching all CPUS_RUN(CYCLES_M68K_LINE) to CPUS_RUN(CYCLES_M68K_LINE + (y & 1)) which predictably desynced OD2's audio, but then I noticed removing the - 64 timer hacks puts it back in sync. That is a good accuracy win I think. Other things like hcounter shouldn't be hurt too much as they are relative to Pico.t.m68c_line_start from what I see. I don't know what else could cause problems, some new stuff from videoport.c I guess?

irixxxx commented 1 year ago

Just adding one cycle every other line is basically ok. I was thinking along that line myself, especially since this wouldn't introduce any problems with older save files.

I had a look at where the value 488 is occurring in the core and there aren't as many as I've expected. So, it might indeed not be such a big project after all. Places I see were this can produce inaccuracies:

computing the Hint scheduling time for 32x (uses a fixed multiplier of 488)
computing the conversion factors for mega cd clock (also assumes scanline length as 488, easy to fix)
computing the line length in memory.c:get_scanline (is Q8 anyway so shouldn't be much of a problem)
some timing stuff in videoport.c. There are some slightly wrong assumptions in that anyway which needs revisiting
maybe some clock conversion factors in sound.c. For FM this might be very slightly ajar anyway if using the resampler, since the up/downsampling filter setup tries to keep the filter kernel size small and accepts a small error instead.

Another way would maybe use the double value for the 68k clock. The additional accuracy would probably do away with any problems in videoport.c, though most of the calculations in there are already using only half the clock as resolution since the long 68k bus cycles make it coarse anyway, besides not knowing the exact offset of the access from the start of the instruction (can be something between 4 and 36 IIRC).

Moreover, since a lot of timing is fine tuned to critical stuff in some games, I reckon a good amount of testing would be needed.

irixxxx commented 1 year ago

BTW, I have a question about the z80_test.s80. Before I'm going to write some test code to verify it on my hardware: Are you sure about the 9 cycles you assume for ld ixl, and ld ixh? Drz80 assumes 8 here, Sean Youngs undocumented z80 doc also assumes 8, and from https://floooh.github.io/2021/12/06/z80-instruction-timing.html and the github project behind the netlist simulation I also think it should be 8.

However, cz80 is assuming 9 as well. Changing all places involving ixl/ixh to 8 (and fixing a small bug in Drz80's main loop related to testing the cycle count) makes them behave identical in your tests.

notaz commented 1 year ago

I was just being lazy and took it from cz80, as it's not in official docs. It's likely cz80 bug.

irixxxx commented 1 year ago

Just adding a cycle every other line is breaking some timing-critical things. For example OutRunners crazy dots are back, MCD1 US BIOS isn't working, and Fhey area hangs randomly in the intro. Since the 32X internal timing is based on the 68k clock, 32X Hint will have jitter of 1 68k clock. At least the Mars Check isn't working with that anymore. That at least can't be easily fixed.

Not sure if I'm currently willing to see this through. It surely seems like a lot of work.

irixxxx commented 1 year ago

@notaz I've pushed a branch "timing" for testing purposes with 488.5 cycles. Have a look, please.

notaz commented 1 year ago

Looks good, but what's with 312 in pico/cd/mcd.c? Believe it or not MD has 313 lines, see the line counter in PicoFrameHints(), also testpico tests for it.

Other than that random stuff I tried seems to run well, which is cool.

irixxxx commented 1 year ago

I've originally planned to change pcd_set_cycle_mult to get the reference to clocks per line out of it, but reverted it back wrongly. Now changed to compute this by the oscillator clocks. After checking the results we are talking about less than 0.01% difference.

Things I think are suspicious:

I needed to up the stolen refresh cycles for cart games a tiny bit. Basically this means the added cycle is partially stolen away again to get things back running. It's very sensitive too, with the edge cases being OutRunners and testpico and the range between them is quite small.
I introduced refresh cycle stealing to MCD as well, but that is even more sensitive. Fhey area needs a very low value of about 1 refresh every 256 cycles here. This would basically mean there's no refresh delay when running from RAM, but according to articles on spritesmind that is apparently not true.
I needed to move VDP HBLANK up a slot for testpico. It basically means that HBLANK isn't asserted at the start of slot 0x93, but at the end.
I had to move HINT on 32X a bit earlier (1.5 68k clocks, ~4SH2), else Mars Check now fails. This wasn't there before.

irixxxx commented 11 months ago

@notaz I think that timing branch is now good enough for merging. All my timing-critical examples seem to run fine with it, and some random testing showed no new problems. Any objections?

notaz commented 11 months ago

LGTM too.

irixxxx commented 11 months ago

With the exception of "irq ack h-v" which we already discussed in #90 all other tests now pass for both ntsc and pal. Fixed mainly by e0c4dac1, c3d70d13, 0e2e188e, 7263343d, 70216221 (though some of other commits might play a role in this as well).

notaz commented 11 months ago

Note that this -for-pd build has a relaxed version of "irq ack h-v", and it's not failing for me? (x86-64 Linux PD build)

What's failing occasionally is "time hblank h32" (can be occasionally reproduced by hitting reset (tab) repeatedly, somewhat more frequently on PAL).

irixxxx commented 11 months ago

What does "relaxed" mean?

The timing for hblank is indeed tight. Adding a value between 4 and 12 to "hp" in VideoSr will fix it.

Maybe another case of not knowing how many cycles in the instruction execution the memory access is away? In that case I still think this can only be improved by changing the instruction cycle billing from the end to the start of the execution.

notaz commented 11 months ago

What does "relaxed" mean?

Ok maybe not relaxed, just different: https://github.com/notaz/megadrive/blob/3d80f9400569f545d9e651a68aa0382f7ca6db40/testpico/asmtools.S#L81

this can only be improved by changing the instruction cycle billing from the end to the start of the execution.

I wonder if that would be enough, as it would still be inaccurate for read-modify-write instructions. It shouldn't hurt performance too much if the cycle counter would be properly updated before each memhandler call by cpu cores. Naturally it'd be tons of work updating them...

irixxxx commented 11 months ago

Ok maybe not relaxed, just different: https://github.com/notaz/megadrive/blob/3d80f9400569f545d9e651a68aa0382f7ca6db40/testpico/asmtools.S#L81

Ah yes, that's indeed working. ~~Instead of 2 bus cycles it uses 5, about 12 cycles more~~ Hmm no, it's an additional 4 cycles for the insn writing the IRQ enable to the VDP, which moves the status reading by these 4 cycles due to the late cycle billing.

this can only be improved by changing the instruction cycle billing from the end to the start of the execution. I wonder if that would be enough, as it would still be inaccurate for read-modify-write instructions. It shouldn't hurt performance too much if the cycle counter would be properly updated before each memhandler call by cpu cores. Naturally it'd be tons of work updating them...

Yes, but I reckon a faster way would probably be to pass the relative cycle in the instruction as an additional argument to the memhandlers. It's a constant which is regularly passed in a register, and memhandlers not needing this info could simply ignore it.

irixxxx / picodrive

some timer tests and more #95