There is a chance there could be some improvement. This little setup loop ends up touching 512 Kbytes of memory.
loopdejitsetup:
subs r3,r3,#1
str temp2,[r0],#NEXTJITLET
sub temp2,temp2,#NEXTJITLET>>2
str temp,[r4],#4 // JITTEDTABLE16
bne loopdejitsetup
Breaking it up into two loops and writing four words at once would improve its performance and memory utilisation on the Pi Zero, where the cache is very small.
Only a guess, as the timing is just on the edge, but I think breaking up the JIT init setup loop into two loops will give a significant performance improvement, which should help if we are too close timing-wise.
You mean something like this:
// setup table JITLET ( 64K x bl JITLET)
// setup JITTEDTABLE16 with 64K x mov pc,r14
ldr temp2,=dojit-JITLET-8
mov r0,#JITLET
mov r3,#0x4000 // 64K/4
mov temp2,temp2,LSR#2
add temp2,temp2,#BLINSTRUCTION
loopdejitsetup1:
subs r3,r3,#1
str temp2,[r0],#NEXTJITLET
sub temp2,temp2,#NEXTJITLET>>2
str temp2,[r0],#NEXTJITLET
sub temp2,temp2,#NEXTJITLET>>2
str temp2,[r0],#NEXTJITLET
sub temp2,temp2,#NEXTJITLET>>2
str temp2,[r0],#NEXTJITLET
sub temp2,temp2,#NEXTJITLET>>2
bne loopdejitsetup1
mov r3,#0x4000 // 64K/4
ldr temp,=MOVPCR14INSTRUCTION
mov r4,#JITTEDTABLE16
loopdejitsetup2:
subs r3,r3,#1
str temp,[r4],#4 // JITTEDTABLE16
str temp,[r4],#4 // JITTEDTABLE16
str temp,[r4],#4 // JITTEDTABLE16
str temp,[r4],#4 // JITTEDTABLE16
bne loopdejitsetup2
// flush caches
I just tried this and it made no difference....
As splitting up the loop doesn't work, then this isn't going to work either:
// setup table JITLET ( 64K x bl JITLET)
// setup JITTEDTABLE16 with 64K x mov pc,r14
ldr temp2,=dojit-JITLET-8
mov r0,#JITLET
mov r3,#0x4000 // 64K/4
mov temp2,temp2,LSR#2
add temp2,temp2,#BLINSTRUCTION
loopdejitsetup1:
subs r3,r3,#1
str temp2,[r0],#NEXTJITLET
sub temp2,temp2,#NEXTJITLET>>2
str temp2,[r0],#NEXTJITLET
sub temp2,temp2,#NEXTJITLET>>2
str temp2,[r0],#NEXTJITLET
sub temp2,temp2,#NEXTJITLET>>2
str temp2,[r0],#NEXTJITLET
sub temp2,temp2,#NEXTJITLET>>2
bne loopdejitsetup1
mov r5,#0x4000 // 64K/4
ldr r0,=MOVPCR14INSTRUCTION
mov r4,#JITTEDTABLE16
mov r1, r0
mov r2, r0
mov r3, r0
loopdejitsetup2:
subs r5,r5,#1
stmia r4!,{r0,r1,r2,r3}
bne loopdejitsetup2
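For reference, here is a C sketch of what the two split loops compute (illustrative only: the table pointers, instruction words and NEXTJITLET stride stand in for the assembler symbols above):

#include <stdint.h>

#define NUM_ENTRIES 0x10000u // 64K entries per table

// Loop 1: fill the jitlet table with "bl dojit". The BL offset field counts
// in words, so as each jitlet sits NEXTJITLET bytes further from dojit, the
// encoded offset shrinks by NEXTJITLET/4 per entry.
static void fill_jitlets(uint32_t *jitlet, uint32_t bl_insn, unsigned next_jitlet_bytes) {
    for (unsigned i = 0; i < NUM_ENTRIES; i++) {
        *jitlet = bl_insn;
        jitlet += next_jitlet_bytes / 4;  // advance one jitlet (word pointer)
        bl_insn -= next_jitlet_bytes / 4; // branch offset shrinks accordingly
    }
}

// Loop 2: fill JITTEDTABLE16 with 64K copies of "mov pc,r14", four words
// per iteration as the stmia above does.
static void fill_jittedtable16(uint32_t *table, uint32_t mov_pc_r14) {
    for (unsigned i = 0; i < NUM_ENTRIES; i += 4) {
        table[i]     = mov_pc_r14;
        table[i + 1] = mov_pc_r14;
        table[i + 2] = mov_pc_r14;
        table[i + 3] = mov_pc_r14;
    }
}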
I tried another experiment with the ICE...
I set a breakpoint at D9CD (the start of the Beeb's reset handler). This allows me to wait there for a few seconds before allowing the Beeb to boot and probe the tube. Even if I wait for a few seconds, the bug still happens. This suggests to me it's not the loopdejitsetup code being slow, as that still runs as soon as BREAK is released.
What has to run for the STA &FEE0/LDA &FEE0 code to work?
As far as I remember, you'll get:
- the GPU sees the tube write caused by STA &FEE0 and sends a message via the doorbell, which causes an ARM FIQ
- arm_fiq_handler_jit (assembler) runs and calls
- tube_io_handler (C), which in turn calls
- tube_host_write (C), which updates tube_regs[0] in the block of 8 IO words which is seen by the GPU
All of that has to happen before LDA &FEE0 gets to the tube read cycle, i.e. within about three 2MHz bus cycles, or 1.5us!
But none of this involves the JIT Co Pro.
So is this all down to that code path being slowed down because the JIT in reset has evicted everything from the cache? (1.5us is only on the order of a thousand ARM cycles, so a handful of cold cache-line misses to SDRAM could plausibly eat the whole budget.)
But then why doesn't this show up on the Master?
Maybe it's just that the Master code is less time critical, so let's look:
.LE375
LDX #&01
STX LFEE0 ; control write, bit 7 clear: clear status bit 0
LDA LFEE0 ; read back: bit 0 should now be 0
EOR #&01  ; invert bit 0, giving 1 if the clear was seen
LDX #&81
STX LFEE0 ; control write, bit 7 set: set status bit 0
AND LFEE0 ; bit 0 stays 1 only if the set was seen
ROR A     ; move bit 0 into the carry: C=1 means tube present
RTS
That's probably less time critical because it doesn't matter if the first LDA &FEE0 reads stale data, because bit 0 in the stale data will be zero (it's cleared on reset). But it serves to warm up the cache. Does that seem plausible to you, Dominic?
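To check that reasoning, here is a minimal C model of the Master sequence (illustrative, not PiTubeDirect code):

#include <stdint.h>
#include <stdio.h>

// Minimal model of the Master's .LE375 detect sequence. first_read and
// second_read are the values the two reads of &FEE0 return; the result
// is 1 if the tube would be detected.
static int master_detect(uint8_t first_read, uint8_t second_read) {
    uint8_t a = first_read;   // LDA LFEE0, after STX #&01 cleared bit 0
    a ^= 0x01;                // EOR #&01: bit 0 -> 1 if the read saw it clear
    a &= second_read;         // AND LFEE0, after STX #&81 set bit 0
    return a & 0x01;          // ROR A: bit 0 into carry, 1 = tube present
}

int main(void) {
    // A stale first read (&CE) is harmless: bit 0 is zero either way,
    // because it is cleared on reset.
    printf("%d\n", master_detect(0xCE, 0xCF)); // 1: detected
    // Only a stale *second* read would break detection, but by then the
    // first access has warmed the cache.
    printf("%d\n", master_detect(0xCE, 0xCE)); // 0: would fail
    return 0;
}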
Adding this after the caches have been flushed seems to help significantly:
// Warm up the cache...
mov r0,#0x000F
BL tube_io_handler
// Continue with existing code...
mov r0, #0x10000
ldrh temp0, [r0, #-4] // Fetch the vector address
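In C terms the warm-up amounts to something like this (a minimal sketch; 0x000F is taken from the assembler above and assumed here to be a benign dummy mail value):

#include <stdint.h>

extern int tube_io_handler(uint32_t mail);

// One throwaway call pulls tube_io_handler's code path (and the data it
// touches) into the caches before the host's first real access.
static inline void warm_tube_io_path(void) {
    (void) tube_io_handler(0x000F);
}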
Can you possibly test presenting the tube_reg with &CF?
With or without the above fix?
Emails collided. It might be interesting to try either. Warming up the cache is interesting and I'm slightly surprised we haven't seen this sort of problem before.
Can you possibly test presenting the tube_reg with &CF?
Not sure exactly what this means....
Earlier on, before the first access, do tube_reg[0] = 0xCF;
Why 0xCF in particular?
Oh, I get it... let me give that a try....
That's also an effective workaround... (it works independently of the code to warm up the cache, presumably because a stale read of &FEE0 now already returns &CF)
static void tube_reset()
- HSTAT1 |= HBIT_3 | HBIT_2 | HBIT_1;
+ HSTAT1 |= HBIT_3 | HBIT_2 | HBIT_1 | HBIT_0;
I suspect a side effect of initializing tube_reg[0] to &CF rather than &CE is that the Master code will now fail, so let's just test that....
... and yes, that moved the problem from the Model B to the Master
Which rules out any RST switch bounce nonsense....
I need to finish now, but I'll come back to this tomorrow....
Warming up the cache is interesting and I'm slightly surprised we haven't seen this sort of problem before.
I've done a bit more testing, and it looks like the Native ARM Co Pro suffers the same issue on the Model B
Dominic,
Rather than pre-load the cache, I wondered if it might be better to try to make tube_io_handler a bit more efficient at dealing with the tube detection case.
This is what I've ended up with:
int tube_io_handler(uint32_t mail)
{
   // 23..16 -> D7..D0
   // 12     -> RST (active high)
   // 11     -> RnW
   // 10..8  -> A2..A0
   //
   // Shortcut writes of 0x81 or 0x01 to the tube control register bit 0.
   // This is used for tube detection by the MOS.
   // Writes need to be visible within three 6502 bus cycles (1.5us).
   //
   //        D D D D D D D D - - - R R A A A - - - - - - - -
   // mask:  0 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 = 0x7F1F00
   // value: 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 = 0x010000
   if ((mail & 0x7F1F00) == 0x010000) {
      if (tube_irq & TUBE_ENABLE_BIT) {
         HSTAT1 = (HSTAT1 & ~HBIT_0) ^ ((mail & (1 << 23)) ? HBIT_0 : 0);
         // then process the request as normal
      }
   }
   // Check for Reset
   if (mail & (1 << 12)) {
      tube_irq |= RESET_BIT;
   } else {
      unsigned int addr = (mail >> 8) & 7;
      // Check read write flag
      if (mail & (1 << 11)) {
         tube_host_read(addr);
      } else {
         tube_host_write(addr, (mail >> 16) & 0xFF);
      }
   }
   return tube_irq;
}
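As a quick sanity check of the mask/value pair, here is a small illustrative test (not part of the patch):

#include <assert.h>
#include <stdint.h>

// (mail & 0x7F1F00) == 0x010000 should match exactly the two detect
// writes (&81 or &01 to register 0) and nothing else.
static int is_detect_write(uint32_t mail) {
    return (mail & 0x7F1F00) == 0x010000;
}

int main(void) {
    assert(is_detect_write(0x81 << 16));                // write &81 to &FEE0
    assert(is_detect_write(0x01 << 16));                // write &01 to &FEE0
    assert(!is_detect_write(0x3F << 16));               // some other control write
    assert(!is_detect_write((0x81 << 16) | (1 << 11))); // RnW=1: a read, not a write
    assert(!is_detect_write((0x81 << 16) | (1 << 8)));  // A0=1: not the control register
    assert(!is_detect_write((0x81 << 16) | (1 << 12))); // RST asserted
    return 0;
}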
I also got rid of all the non-GPU code that was #ifdeffed out.
The code that is generated for the fast path is:
0e40f02c <tube_io_handler>:
e40f02c: e59f3750 ldr r3, [pc, #1872] ; 0x7f1f00
e40f030: e92d4070 push {r4, r5, r6, lr}
e40f034: e0033000 and r3, r3, r0
e40f038: e3530801 cmp r3, #65536 ; 0x10000
e40f03c: e59f4744 ldr r4, [pc, #1860] ; address of tube_irq
e40f040: 0a000015 beq e40f09c <tube_io_handler+0x70>
<non-fast-path code omitted>
e40f09c: e5943000 ldr r3, [r4]
e40f0a0: e3130008 tst r3, #8
e40f0a4: 0affffe6 beq e40f044 <tube_io_handler+0x18>
e40f0a8: e3a01202 mov r1, #536870912 ; 0x20000000
e40f0ac: e59120a0 ldr r2, [r1, #160] ; 0xa0 == read tube_regs[0]
e40f0b0: e1a037a0 lsr r3, r0, #15
e40f0b4: e2033c01 and r3, r3, #256 ; 0x100
e40f0b8: e3c22c01 bic r2, r2, #256 ; 0x100
e40f0bc: e1833002 orr r3, r3, r2
e40f0c0: e58130a0 str r3, [r1, #160] ; 0xa0 == write tube_regs[0]
e40f0c4: eaffffde b e40f044 <tube_io_handler+0x18>
I'm in two minds about whether this is the right fix...
On the one hand, it does make the tube detection case considerably faster.
On the other hand, it makes all the other cases a few instructions longer.
Dave
This sounds much better than warming the cache. I wonder if it should actually be in the FIQ handler so there are fewer cache lines to load before returning the result.
I wonder if it should actually be in the fiq handler so there is less cache lines to load before returning the result.
Possibly, but we have three separate FIQ handlers (the normal one, the JIT 6502 one and the Native ARM one).
tube_io_handler is actually called from quite a few places:
copro-armnativeasm.S: bl tube_io_handler // Update the Tube ULA emulation
copro-armnativeasm.S: bl tube_io_handler // Update the Tube ULA emulation
jit13k.S: BL tube_io_handler
jit.S: BL tube_io_handler
tube-isr.c: intr = tube_io_handler(mailbox);
tube-isr.c: intr = tube_io_handler(mailbox);
tube-isr.c: intr = tube_io_handler(mailbox);
tube-isr.c: intr = tube_io_handler(mailbox);
tube-isr.c: intr = tube_io_handler(mailbox);
tube-isr.c: intr = tube_io_handler(mailbox);
tube.S: bl tube_io_handler
Why not do both the IRQ improvement and the cache-warming? It feels reasonable to me and should add some margin. Unless warming the cache is too inelegant or unmaintainable?
I would rather keep the cache warming in reserve, as it needs to be added separately to each Co Pro that blats the cache in some way on reset.
Just a little update: all the uses of tube_io_handler in tube-isr.c are #ifdef'd out at the moment, jit13k.S was a commit that shouldn't have been made, and copro-armnativeasm.S actually only has one use (the other is a #def build option). So there are in fact only three uses. The beginning of each of the FIQ handlers is very similar and so should perhaps actually be a macro.
Closing this for now, until we have some evidence the current approach needs improving.
Pressing BREAK (and esp. Ctrl-BREAK) on a Pi Zero/Model B is unreliable - about 50% of the time the tube detection fails.
(Originally reported by KenLowe on stardot, but reproduced by myself)
This happens very frequently on beta1, and very very rarely (if ever) on alpha4.
I've bisected all the commits between alpha4 and beta1 and this is the one where the behaviour changes significantly: 4b6710d6.
However, looking at that commit, the changes don't relate in any way to the JIT-6502, and with the default value of vdu=0 in cmdline.txt, this code is never even executed.
As far as I can tell, this new issue:
The Model B Tube Detect code is pretty simple:
I've instrumented this using the ICE-6502.
What seems to be happening is that the read of &FEE0 is returning &CE, whereas it should be reading &CF.
If I set a breakpoint and manually read &FEE0 a second time, then the correct value &CF is returned. So I think what's happening is that when the JIT-6502 is "cold" after a reset, the effect of the write of &81 to &FEE0 (which is meant to set bit 0) is being delayed, and is happening too late.
What I'm not clear on is why the Master seems to be immune. I can think of two possibilities: