hoglet67 / PiTubeDirect

Bare-metal Raspberry Pi project that attaches to the Acorn TUBE interface and emulates many BBC Micro Co Processors
GNU General Public License v3.0

JIT 6502: Reset issues on PiZero/Model B #141

Closed hoglet67 closed 2 years ago

hoglet67 commented 3 years ago

Pressing BREAK (and esp. Ctrl-BREAK) on a Pi Zero/Model B is unreliable - about 50% of the time the tube detection fails.

(Originally reported by KenLowe on stardot, but reproduced by myself)

This happens very frequently on beta1, and very very rarely (if ever) on alpha4.

I've bisected all the commits between alpha4 and beta1 and this is the one where the behaviour changes significantly: 4b6710d6.

However, looking at that commit, the changes don't relate in any way to the JIT-6502, and with the default value of vdu=0 in cmdline.txt, that code is never even executed.

As far as I can tell, this new issue:

The Model B Tube Detect code is pretty simple:

LDA #&81
STA &FEE0    ; write &81 to the Tube control register (should set bit 0)
LDA &FEE0    ; read it straight back
ROR A        ; rotate bit 0 into the carry
BCC &DB4D    ; carry clear => bit 0 not set => no Tube detected
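
In C terms the check boils down to the following minimal sketch (read_tube_reg0/write_tube_reg0 are hypothetical helpers standing in for the 6502 accesses to &FEE0, not functions in this repo):

#include <stdbool.h>
#include <stdint.h>

// Hypothetical helpers standing in for the host's accesses to &FEE0.
extern uint8_t read_tube_reg0(void);
extern void    write_tube_reg0(uint8_t value);

// Model B style detect: write &81 (which should set bit 0 of the status
// register), read it back immediately, and test bit 0.
static bool tube_present(void)
{
    write_tube_reg0(0x81);                  // LDA #&81 : STA &FEE0
    return (read_tube_reg0() & 0x01) != 0;  // LDA &FEE0 : ROR A : BCC ...
}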

I've instrumented this using the ICE-6502.

What seems to be happening is the read of &FEE0 is returning &CE, whereas it should be reading &CF.

If I set a breakpoint and manually read &FEE0 a second time, then the correct value &CF is returned. So I think what's happening is that when the JIT-6502 is "cold" after a reset, the effect of the write of &81 to &FEE0 (which is meant to set bit 0) is delayed, and happens too late.

What I'm not clear on is why the Master seems to be immune. I can think of two possibilities:

hoglet67 commented 3 years ago

There is a chance there could be some improvement. This little setup loop ends up touching 512 Kbytes of memory.

loopdejitsetup:
subs  r3,r3,#1
str   temp2,[r0],#NEXTJITLET
sub   temp2,temp2,#NEXTJITLET>>2
str   temp,[r4],#4   // JITTEDTABLE16
bne   loopdejitsetup

Breaking it up into two loops and writing four words at once would improve its performance and memory utilisation on the Pi Zero, where the cache is very small.

Only a guess, but the timing is just on the edge. I think breaking up the JIT init setup loop into two loops will give a significant performance improvement, which should help if we are too close timing-wise.

You mean something like this:

// setup table JITLET ( 64K x bl JITLET)
// setup JITTEDTABLE16 with 64K x mov pc,r14
   ldr   temp2,=dojit-JITLET-8
   mov   r0,#JITLET

   mov   r3,#0x4000 // 64K/4
   mov   temp2,temp2,LSR#2
   add   temp2,temp2,#BLINSTRUCTION

loopdejitsetup1:
   subs  r3,r3,#1
   str   temp2,[r0],#NEXTJITLET
   sub   temp2,temp2,#NEXTJITLET>>2
   str   temp2,[r0],#NEXTJITLET
   sub   temp2,temp2,#NEXTJITLET>>2
   str   temp2,[r0],#NEXTJITLET
   sub   temp2,temp2,#NEXTJITLET>>2
   str   temp2,[r0],#NEXTJITLET
   sub   temp2,temp2,#NEXTJITLET>>2
   bne   loopdejitsetup1

   mov   r3,#0x4000 // 64K/4
   ldr   temp,=MOVPCR14INSTRUCTION
   mov   r4,#JITTEDTABLE16

loopdejitsetup2:
   subs  r3,r3,#1
   str   temp,[r4],#4   // JITTEDTABLE16
   str   temp,[r4],#4   // JITTEDTABLE16
   str   temp,[r4],#4   // JITTEDTABLE16
   str   temp,[r4],#4   // JITTEDTABLE16
   bne   loopdejitsetup2

// flush caches

I just tried this and it made no difference....

dp111 commented 3 years ago

As splitting up the loop didn't work, this isn't going to work either:

// setup table JITLET ( 64K x bl JITLET)
// setup JITTEDTABLE16 with 64K x mov pc,r14
   ldr   temp2,=dojit-JITLET-8
   mov   r0,#JITLET

   mov   r3,#0x4000 // 64K/4
   mov   temp2,temp2,LSR#2
   add   temp2,temp2,#BLINSTRUCTION

loopdejitsetup1:
   subs  r3,r3,#1
   str   temp2,[r0],#NEXTJITLET
   sub   temp2,temp2,#NEXTJITLET>>2
   str   temp2,[r0],#NEXTJITLET
   sub   temp2,temp2,#NEXTJITLET>>2
   str   temp2,[r0],#NEXTJITLET
   sub   temp2,temp2,#NEXTJITLET>>2
   str   temp2,[r0],#NEXTJITLET
   sub   temp2,temp2,#NEXTJITLET>>2
   bne   loopdejitsetup1

   mov   r5,#0x4000 // 64K/4
   ldr   r0,=MOVPCR14INSTRUCTION
   mov   r4,#JITTEDTABLE16
   mov   r1,r0
   mov   r2,r0
   mov   r3,r0

loopdejitsetup2:
   subs  r5,r5,#1
   stmia r4!,{r0,r1,r2,r3}
   bne   loopdejitsetup2

hoglet67 commented 3 years ago

I tried another experiment with the ICE...

I set a breakpoint at D9CD (the start of the Beeb's reset handler). This allows me to wait there for a few seconds before allowing the Beeb to boot and probe the tube. Even if I wait for a few seconds, the bug still happens. This suggests to me it's not the loopdejitsetup code being slow, as that code runs as soon as BREAK is released.

What has to run for the STA &FEE0/LDA &FEE0 code to work?

As far as I remember, you'll get:

  • The GPU sees the tube write caused by STA &FEE0 and sends a message via the doorbell, which causes an ARM FIQ
  • arm_fiq_handler_jit (assembler) runs and calls
  • tube_io_handler (C), which in turn calls
  • tube_host_write (C), which updates tube_regs[0] in the block of 8 IO words seen by the GPU

All of that has to happen before LDA &FEE0 gets to the tube read cycle, so within about 3x 2MHz cycles, or 1.5us!

But none of this involved the JIT Co Pro.

So is this all down to that code path being slowed down because the JIT in reset has evicted everything from the cache?

But then why doesn't this show up on the Master?

Maybe it's just that the Master code is less time critical, so let's look:

.LE375
LDX #&01
STX LFEE0    ; write &01 (clear bit 0)
LDA LFEE0    ; first read - may still be stale
EOR #&01     ; invert bit 0
LDX #&81
STX LFEE0    ; write &81 (set bit 0)
AND LFEE0    ; second read, ANDed with the inverted first read
ROR A        ; bit 0 -> carry = detect result
RTS

That's probably less time critical because it doesn't matter if the first LDA &FEE0 reads stale data: bit 0 in the stale data will be zero (because it's cleared on reset). But it does serve to warm up the cache. Does that seem plausible to you, Dominic?
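
A minimal C sketch of that sequence, just to make the stale-data argument concrete (read_tube_reg0/write_tube_reg0 are hypothetical helpers standing in for the 6502 accesses to &FEE0):

#include <stdbool.h>
#include <stdint.h>

// Hypothetical helpers standing in for the host's accesses to &FEE0.
extern uint8_t read_tube_reg0(void);
extern void    write_tube_reg0(uint8_t value);

// Master style detect: even if the first read returns stale data, bit 0 of the
// stale value is 0 after reset, so (first ^ 1) still has bit 0 set and the
// result is decided by the second read, by which time the write of &81 has had
// longer to take effect (and the first read has warmed up the cache).
static bool master_tube_present(void)
{
    write_tube_reg0(0x01);                // LDX #&01 : STX LFEE0
    uint8_t a = read_tube_reg0() ^ 0x01;  // LDA LFEE0 : EOR #&01
    write_tube_reg0(0x81);                // LDX #&81 : STX LFEE0
    a &= read_tube_reg0();                // AND LFEE0
    return (a & 0x01) != 0;               // ROR A : carry = detect result
}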

hoglet67 commented 3 years ago

Adding this after the caches have been flushed seems to help significantly:

// Warm up the cache...
   mov  r0,#0x000F
   BL   tube_io_handler

// Continue with existing code...
   mov   r0, #0x10000
   ldrh  temp0, [r0, #-4]        // Fetch the vector address

dp111 commented 3 years ago

Can you possibly test presenting the tube_reg with &CF?

hoglet67 commented 3 years ago

With or without the above fix?

dp111 commented 3 years ago

Emails collided. It might be interesting to try either. Warming up the cache is interesting and I'm slightly surprised we haven't seen this sort of problem before.

hoglet67 commented 3 years ago

Can you possibly test presenting the tube_reg with &CF?

Not sure exactly what this means....

dp111 commented 3 years ago

Earlier on, before the first access, do tube_reg[0] = 0xCF;

hoglet67 commented 3 years ago

Why 0xCF in particular?

hoglet67 commented 3 years ago

Oh, I get it... let me give that a try....

hoglet67 commented 3 years ago

That's also an effective workaround... (it works independently of the code to warm up the cache)

hoglet67 commented 3 years ago

static void tube_reset()
-   HSTAT1 |= HBIT_3 | HBIT_2 | HBIT_1;
+   HSTAT1 |= HBIT_3 | HBIT_2 | HBIT_1 | HBIT_0;

hoglet67 commented 3 years ago

I suspect a side effect of initializing tube_reg[0] to &CF rather than &CE is that now the Master code will fail, so let's just test that....

... and yes, that moved the problem from the Model B to the Master

Which rules out any RST switch bounce nonsense....

I need to finish now, but I'll come back to this tomorrow....

hoglet67 commented 3 years ago

Warming up the cache is interesting and I'm slightly surprised we haven't seen this sort of problem before.

I've done a bit more testing, and it looks like the Native ARM Co Pro suffers the same issue on the Model B

hoglet67 commented 3 years ago

Dominic,

Rather than pre-load the cache, I wondered if it might be better to try to make tube_io_handler a bit more efficient at dealing with the tube detection case.

This is what I've ended up with:

int tube_io_handler(uint32_t mail)
{
   // 23..16 -> D7..D0
   // 12     -> RST (active high)
   // 11     -> RnW
   // 10..8  -> A2..A0
   //
   // Shortcut writes of 0x81 or 0x01 to the tube control register bit 0
   // This is used for tube detection by the MOS
   // Writes need to be visible within three 6502 bus cycles (1.5us)
   //
   //         D D D D  D D D D  - - - R  R A A A  - - - -  - - - -
   //  mask:  0 1 1 1  1 1 1 1  0 0 0 1  1 1 1 1  0 0 0 0  0 0 0 0 = 0x7F1F00
   // value:  0 0 0 0  0 0 0 1  0 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0 = 0x010000
   if ((mail & 0x7F1F00) == 0x010000) {
      if (tube_irq & TUBE_ENABLE_BIT) {
         HSTAT1 = (HSTAT1 & ~HBIT_0) ^ ((mail & (1 << 23)) ? HBIT_0 : 0);
         // then process the request as normal
      }
   }

   // Check for Reset
   if (mail & (1 << 12)) {
      tube_irq |= RESET_BIT;
   } else {
      unsigned int addr = (mail >> 8) & 7;
      // Check read write flag
      if (mail & (1 << 11)) {
         tube_host_read(addr);
      } else {
         tube_host_write(addr, (mail >> 16) & 0xFF);
      }
   }
   return tube_irq ;
}
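
As a quick sanity check of the mask/value pair (a standalone illustration, not part of the firmware; is_detect_write just mirrors the condition above):

#include <assert.h>
#include <stdint.h>

// Mirrors the fast-path condition in tube_io_handler.
static int is_detect_write(uint32_t mail)
{
    return (mail & 0x7F1F00) == 0x010000;
}

int main(void)
{
    assert(is_detect_write(0x810000));   // write of &81 to register 0
    assert(is_detect_write(0x010000));   // write of &01 to register 0
    assert(!is_detect_write(0x010800));  // read of register 0 (RnW set)
    assert(!is_detect_write(0x011000));  // RST asserted
    assert(!is_detect_write(0x810100));  // write of &81 to register 1
    return 0;
}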

I also got rid of all the non-GPU code that was #ifdeffed out.

The code that is generated for the fast path is:

0e40f02c <tube_io_handler>:
 e40f02c:   e59f3750    ldr r3, [pc, #1872] ; 0x7f1f00
 e40f030:   e92d4070    push    {r4, r5, r6, lr}
 e40f034:   e0033000    and r3, r3, r0
 e40f038:   e3530801    cmp r3, #65536  ; 0x10000
 e40f03c:   e59f4744    ldr r4, [pc, #1860] ; address of tube_irq
 e40f040:   0a000015    beq e40f09c <tube_io_handler+0x70>

<non-fast-path code omitted>

 e40f09c:   e5943000    ldr r3, [r4]
 e40f0a0:   e3130008    tst r3, #8
 e40f0a4:   0affffe6    beq e40f044 <tube_io_handler+0x18>
 e40f0a8:   e3a01202    mov r1, #536870912  ; 0x20000000
 e40f0ac:   e59120a0    ldr r2, [r1, #160]  ; 0xa0 == read tube_regs[0]
 e40f0b0:   e1a037a0    lsr r3, r0, #15
 e40f0b4:   e2033c01    and r3, r3, #256    ; 0x100
 e40f0b8:   e3c22c01    bic r2, r2, #256    ; 0x100
 e40f0bc:   e1833002    orr r3, r3, r2
 e40f0c0:   e58130a0    str r3, [r1, #160]  ; 0xa0 == write tube_regs[0]
 e40f0c4:   eaffffde    b   e40f044 <tube_io_handler+0x18>

I'm in two minds about whether this is the right fix...

On the one hand, it does make the tube detection case considerably faster.

On the other hand, it makes all the other cases a few instructions longer.

Dave

dp111 commented 3 years ago

This sounds much better than warming the cache. I wonder if it should actually be in the FIQ handler, so there are fewer cache lines to load before returning the result.

hoglet67 commented 3 years ago

I wonder if it should actually be in the FIQ handler, so there are fewer cache lines to load before returning the result.

Possibly, but we have three separate FIQ handlers (the normal one, the JIT 6502 one and the Native ARM one).

tube_io_handler is actually called from quite a few places:

copro-armnativeasm.S:        bl      tube_io_handler        // Update the Tube ULA emulation
copro-armnativeasm.S:        bl      tube_io_handler        // Update the Tube ULA emulation
jit13k.S:   BL   tube_io_handler
jit.S:   BL   tube_io_handler
tube-isr.c:      intr = tube_io_handler(mailbox);
tube-isr.c:      intr = tube_io_handler(mailbox);
tube-isr.c:      intr = tube_io_handler(mailbox);
tube-isr.c:      intr = tube_io_handler(mailbox);
tube-isr.c:      intr = tube_io_handler(mailbox);
tube-isr.c:      intr = tube_io_handler(mailbox);
tube.S:      bl       tube_io_handler
BigEd commented 3 years ago

Why not do both the IRQ improvement and the cache-warming? It feels reasonable to me and should add some margin. Unless warming the cache is too inelegant or unmaintainable?

hoglet67 commented 3 years ago

I would rather keep the cache warming in reserve, as it needs to be added separately to each Co Pro that blats the cache in some way on reset.

dp111 commented 3 years ago

Just a little update: all the uses of tube_io_handler in tube-isr.c are #ifdef'd out at the moment. jit13k.S was a commit that shouldn't have been made. copro-armnativeasm.S actually only has one use, and the other is behind a #define build option. So there are in fact only three uses. The beginning of each of the FIQ handlers is very similar, so it should perhaps actually be a macro.

hoglet67 commented 2 years ago

Closing this for now, until we have some evidence the current approach needs improving.