Unreliable Co Pro Detection on BREAK

hoglet67 commented 2 years ago

We are seeing this with the latest hognose-dev: 110708dd

Several things need to be in place to trigger the bug:

It only occurs on the Master
It only occurs when switching Co Pros (even just when switching between 0 and 1)
It only occurs on a normal BREAK, never on Ctrl-BREAK (even if there is no language)
It only occurs when vdu=1
It happens on a Pi Zero and a Pi Zero 2 (not tested other models)
If you comment out the refreshing of the VDU splash screen it doesn't happen (and it seems it was introduced with 799cd479 when we started to refresh the splash screen when switching Co Pros).

Initially I thought it was the same as #141, where the host write to &FEE0 was delayed.

However, the below ICE trace indicates it is different.

This is what you see with the ICE:

>> watchr fee0
>> watchw fee0
>> c
CPU free running...
00.026139 : Mem Wr Watch hit at E377 writing FEE0:01  .
00.026143 : Mem Rd Watch hit at E37A reading FEE0:4E  N
00.026151 : Mem Wr Watch hit at E381 writing FEE0:81  .
00.026155 : Mem Rd Watch hit at E384 reading FEE0:4E  N
Ex Brkpt hit at 82B9
Interrupted
00.026169 : 82B9 : AD 34 FE : LDA $FE34
>> rd fee0
Rd:  FEE0:CE  .

The second read of &FEE0 should see 4F, so at first sight the write appears to be delayed.

However, a manual read of FEE0 in a breakpoint much later still sees 4E, so the write was actually lost, not just delayed. This makes it different to #141.
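To make the expected 4E/4F readings concrete, here is a minimal C sketch of the detection sequence. It is a toy model, not PiTubeDirect code: the helper names are hypothetical, and it assumes the usual Tube ULA convention that a write to the status register with bit 7 set sets the selected flag bits and a write with bit 7 clear clears them, which matches the trace above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the flag bits behind &FEE0 (hypothetical names).
   0x0E matches the low bits of the 4E readings in the trace. */
static uint8_t status_flags = 0x0E;

/* Host write to the status register. write_lost simulates a dropped message. */
static void host_write_status(uint8_t val, bool write_lost) {
    if (write_lost) {
        return;                                   /* the write never takes effect */
    }
    if (val & 0x80) {
        status_flags |= (uint8_t)(val & 0x3F);    /* bit 7 set: set selected flags */
    } else {
        status_flags &= (uint8_t)~(val & 0x3F);   /* bit 7 clear: clear selected flags */
    }
}

/* Host read of the status register; 0x40 is the bit seen set in the 4E readings. */
static uint8_t host_read_status(void) {
    return (uint8_t)(0x40 | status_flags);
}

/* Same steps as the host's detection code: write &01, read, write &81, read. */
static bool detect_tube(bool lose_second_write) {
    host_write_status(0x01, false);
    uint8_t a = (uint8_t)(host_read_status() ^ 0x01);  /* bit 0 = 1 if flag read as clear */
    host_write_status(0x81, lose_second_write);
    a &= host_read_status();                           /* bit 0 = 1 only if flag now set */
    return (a & 0x01) != 0;
}
```

In this model detect_tube(false) reads 4E then 4F and succeeds, while detect_tube(true) reads 4E twice and fails, which is the pattern in the ICE trace.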

What might cause a lost tube message?

If the tube code running on GPU core 1 is somehow blocked by the firmware blob running on GPU core 0. But then I would expect the read of &FEE0 to return garbage.
If the doorbell message is overwritten by a subsequent message before it has time to be read. I think that might be what's happening here. That could happen if the ARM FIQ interrupts were blocked. Or if the ARM FIQ handler was completely evicted from cache.
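The second possibility can be illustrated with a single-slot doorbell. This is only a sketch of the failure mode being hypothesised, with made-up type and function names; it does not reflect how the real GPU-to-ARM doorbell is implemented.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical single-slot doorbell: one message, one pending flag. */
typedef struct {
    uint32_t msg;
    bool     pending;
} doorbell_t;

/* Producer side. Returns true if an unread message was overwritten. */
static bool doorbell_post(doorbell_t *d, uint32_t msg) {
    bool overwritten = d->pending;
    d->msg = msg;
    d->pending = true;
    return overwritten;
}

/* Consumer side. Returns true and stores the message if one is pending. */
static bool doorbell_take(doorbell_t *d, uint32_t *out) {
    if (!d->pending) {
        return false;
    }
    *out = d->msg;
    d->pending = false;
    return true;
}
```

If the consumer is held off (for example because the FIQ is blocked) while two messages are posted, only the second is ever taken; the first is silently lost.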

Only the following tube messages are seen by the ARM:

host writes to all addresses
host reads to odd addresses

So in the above tube detection code, only the two writes are seen by the ARM, and somehow the second write is getting lost!
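As a quick check of that rule, the sketch below counts how many of the four accesses in the detection sequence would reach the ARM. The function names are hypothetical; the rule itself (all writes, reads of odd addresses only) is taken from the list above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: does a host access to a tube register
   result in a message being seen by the ARM? */
static bool arm_sees_access(uint16_t addr, bool is_write) {
    if (is_write) {
        return true;            /* host writes to all addresses */
    }
    return (addr & 1) != 0;     /* host reads of odd addresses only */
}

/* Count ARM-visible accesses in the detection sequence:
   write &FEE0, read &FEE0, write &FEE0, read &FEE0. */
static int count_visible_in_detection(void) {
    const bool is_write[4] = { true, false, true, false };
    int visible = 0;
    for (int i = 0; i < 4; i++) {
        if (arm_sees_access(0xFEE0, is_write[i])) {
            visible++;
        }
    }
    return visible;
}
```

Only the two writes are counted, so the two reads of &FEE0 are answered without the ARM being involved.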

A few more things I tried:

If you move the call to fb_show_splash_screen() to the beginning of wait_for_rst_release(), the splash screen is updated when BREAK is depressed. This doesn't help.
If you call tube_io_handler() after fb_show_splash_screen() to make sure that tube_io_handler() is in the cache, that also doesn't help.

One way to debug this is to add "telltales" to the GPU code and to the tube_io_handler() code. Then just look at the timings.
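A telltale here just means driving a spare output high while a piece of code runs, so its timing shows up on a logic analyser. The sketch below shows the idea against a simulated output latch; the pin number, the names and the sim_gpio variable are all assumptions for illustration, and real code would write to the actual GPIO registers instead.

```c
#include <stdint.h>

#define TELLTALE_PIN 7u                 /* hypothetical pin number */

static volatile uint32_t sim_gpio;      /* simulated output latch, stands in for real GPIO */
static uint32_t level_seen_in_handler;  /* what the sample handler observed */

static void telltale_set(void)   { sim_gpio |=  (1u << TELLTALE_PIN); }
static void telltale_clear(void) { sim_gpio &= ~(1u << TELLTALE_PIN); }

/* Wrap a handler so the telltale is high exactly while it runs. */
static void run_with_telltale(void (*handler)(void)) {
    telltale_set();
    handler();
    telltale_clear();
}

/* Sample handler that records the pin state it sees while running. */
static void sample_handler(void) {
    level_seen_in_handler = sim_gpio;
}
```

The width of the pulse on the analyser is then the execution time of the wrapped code, which makes an unusually short run easy to spot.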

hoglet67 commented 2 years ago

Here's a trace of tube detection working (without switching Co Pros): (image: P1060292s)

D0 - Phi2
D1 - nTUBE (the trigger)
D2 - RnW
D3 - D0
D6 - GPU code telltale
D7 - tube_io_handler() telltale

You are seeing the results of this fragment of code:

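.LE375
LDX #&01
STX LFEE0      ; write &01: bit 7 clear, so clear status flag bit 0
LDA LFEE0      ; read the status back (bit 0 expected clear)
EOR #&01       ; invert bit 0: now 1 if the flag read back as clear
LDX #&81
STX LFEE0      ; write &81: bit 7 set, so set status flag bit 0
AND LFEE0      ; bit 0 stays 1 only if the flag now reads back as set
ROR A          ; move bit 0 into carry: C=1 when both writes took effect
RTS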

And here's the failure: (image: P1060293s)

The second call to tube_io_handler() completes in an unusually short time.

hoglet67 commented 2 years ago

It looks like the problem is that TUBE_ENABLE_BIT in tube_irq is somehow getting unexpectedly cleared.

That's why tube_io_handler() completes in an unexpectedly short time.

Just need to work out why now....

hoglet67 commented 2 years ago

This is the reason: https://github.com/hoglet67/PiTubeDirect/blob/hognose-dev/src/tube-client.c#L77

This code predates the addition of the TUBE_ENABLE_BIT to the tube_irq flags (to support the Null Co Pro).

When changing Co Processors, the tube is initially (wrongly) disabled. This happens as soon as BREAK is depressed.

It's only re-enabled at the end of wait_for_rst_release() and because of the debounce delay, this is actually about 1ms after nRST has gone high. Maybe longer, I need to check....

This blocks writes to the tube registers.
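The effect can be sketched with a toy handler that ignores host writes while the enable bit is clear. This is an illustration only: toy_io_handler() and status_flags are hypothetical, the numeric value of TUBE_ENABLE_BIT is assumed, and the real tube_io_handler() does considerably more.

```c
#include <stdbool.h>
#include <stdint.h>

#define TUBE_ENABLE_BIT 8u      /* value assumed for illustration */

static uint32_t tube_irq;       /* flags word, including the enable bit */
static uint8_t  status_flags;   /* toy stand-in for the tube status flags */

/* Toy handler: applies a host write to the status flags only when the
   tube is enabled. Returns false if the write was dropped. */
static bool toy_io_handler(uint8_t val) {
    if (!(tube_irq & TUBE_ENABLE_BIT)) {
        return false;                             /* disabled: return early, write lost */
    }
    if (val & 0x80) {
        status_flags |= (uint8_t)(val & 0x3F);    /* bit 7 set: set selected flags */
    } else {
        status_flags &= (uint8_t)~(val & 0x3F);   /* bit 7 clear: clear selected flags */
    }
    return true;
}
```

With tube_irq cleared to 0 the handler returns almost immediately and the host's write of &81 has no effect, which fits both the short handler time in the failing trace and the 4E read back by the host.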

Changing it to the following resolves the issue:

  // Clear all old interrupts, and set tube_enable appropriately
  if (copro_def->type == TYPE_HIDDEN) {
     tube_irq = 0;
  } else {
     tube_irq = TUBE_ENABLE_BIT;
  }

(but possibly we also have an issue when switching to the null co processor)

hoglet67 commented 2 years ago

Working reliably now.... ;-)

BigEd commented 2 years ago

Just for the record, my Master booted to MODE 134 which slowed it down enough not to have this problem.

hoglet67 / PiTubeDirect

Unreliable Co Pro Detection on BREAK #156