keirf / amiga-stuff

[ATK] Crash during Audio Test on TF accelerators #33

Closed SukkoPera closed 3 years ago

SukkoPera commented 3 years ago

It seems that the Audio Test crashes on accelerated machines.

The time before the crash happens is quite random. Most times it will happen within 5-10 seconds after the mod starts playing, but sometimes the song might get to the end and loop once before it crashes.

I first noticed this on my A500+ with TF530@50MHz. I dismantled the machine piece by piece to find the cause, but I never managed to eliminate it. Then I noticed the machine is perfectly stable in WB and games, and the ATK RAM test is also OK.

Then I spoke to a few friends and found out that they also experience the same problem with TF534 and TF536.

Tested on v1.12. It does NOT seem to happen with the original 68000 and I haven't been able to reproduce it in UAE either.

I guess this might also be a bug in the TF line of cards, I can't rule that out as I know no-one with a different accelerator.

I am attaching a crash log from my machine and one from a friend's, who has a TF536@50MHz.

Attached crash logs: TF536@50MHz; A500+ with TF530@50MHz

keirf commented 3 years ago

Interesting. Is it only on the mod-playing test (eg. not on the tone tests)?

Exception 0x1f is the level-7 autovector interrupt. That's NMI (Non-Maskable Interrupt). Not really good news. I would say it sounds like a board bug, as this is an externally triggered unmaskable interrupt caused by CPU lines IPL[2:0] all being driven low.
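
A minimal sketch of the decoding described above, in C. The helper names `ipl_pins_to_level` and `autovector_number` are hypothetical and do not come from ATK. The sketch only assumes the 68000-family conventions that IPL[2:0] are active low and that interrupt level N autovectors through vector 24 + N.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: decode the requested interrupt priority level from
 * the state of the active-low IPL[2:0] pins (bit 2 = IPL2 ... bit 0 = IPL0).
 * A bit value of 1 means the pin is high (inactive), 0 means driven low. */
unsigned ipl_pins_to_level(uint8_t pins)
{
    assert(pins <= 7);
    return (~pins) & 7u;   /* all pins low (0b000) decodes to level 7 */
}

/* Hypothetical helper: 68000-family autovector number for levels 1..7. */
unsigned autovector_number(unsigned level)
{
    assert(level >= 1 && level <= 7);
    return 24u + level;    /* level 7 gives vector 31 = 0x1f */
}
```

With all three pins low, `ipl_pins_to_level(0)` yields level 7, and `autovector_number(7)` yields 0x1f, the exception number seen in the crash logs.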

Now in both cases this seems to be happening around the time of the vertical blank, as vbl irq handling is happening at the time. Maybe multiple outstanding interrupts are somehow a problem on these boards... I acknowledge/clear the interrupt at the end of the handler, so INT_VBL is for example still pending in the chipset during VBL processing. Then mod-player CIA IRQ comes in... maybe that could be a problem?

Guess we'll never know since TF rage-quit the scene a week ago!

SukkoPera commented 3 years ago

It does not seem to happen on the tone tests, which I guess is why this never happened in older versions which did not have the mod play thing.

What I don't get, if this is due to some TF bug, is why it never seems to occur outside of ATK. I've been using this machine almost daily for months, playing games and doing stuff in WB and I never had a similar crash. It must be triggered by something pretty specific and uncommon.

I will do some tests on an accelerated A1200 in the afternoon, let's see what happens.

keirf commented 3 years ago

It seems to crash pretty easily, and a lot of people use ATK, and there are no reports outside of TF boards.

I could pull out my 1200/ACA1233/40 and have a go...

keirf commented 3 years ago

Tested now on ACA1233/40. No crash.

EDIT: And that's not surprising. The NMI must be triggered by the board. So it has to be board specific.

EDIT2: It's also not surprising it's extremely hard to provoke on TF boards. Most software will crash if it gets unexpected NMI. This would have been picked up in testing if it were in any way likely to happen on almost any existing software.

keirf commented 3 years ago

I think the only question here now is: Try and work around it, or leave it? Obviously in most cases you'd want to work around, but is it conceptually right to avoid crashing bad hardware in a hardware test program?

Probably ATK should have a wiki these days, and probably this should simply be doc'ed there.

SukkoPera commented 3 years ago

Also just tested on my A1200 with GVP Jaws, no crash.

As we know this crash is triggered on all TF boards, I think there should be a workaround, otherwise the Audio test would be totally unusable by whoever has one.

Maybe the user should have the ability to turn the workaround on or off at will or it could only be applied if a TF card is detected (given that there's a way to detect it).

Or maybe you could just try to reach out to Mr. SL, as he says he still wants to support his former users and I'm sure he would pay attention to you. And I'm just as sure he doesn't want this bug to sneak into the TF1260 ;).

keirf commented 3 years ago

How do you detect a TerribleFire? I could just disable the mod player in that case, with a message. Or possibly a dummy NMI handler would suffice...

SukkoPera commented 3 years ago

No idea, but if you need me to test something, just ask ;).

keirf commented 3 years ago

Yeah, I dunno, I'm not that interested if I'm honest. If TF boards are 'supported' this bug should be tracked and fixed at source. If they're not, then why should I care about them more than their author, and spend my time on them? It's not like they're vintage hardware.

SukkoPera commented 3 years ago

I'm not really happy about this decision, but I'll respect it.

Let's see if @terriblefire wants to join in.

terriblefire commented 3 years ago

I no longer have anything to do with Amiga, so I don't care either. If this happens on TF534 or TF330 then all the sources are out there and someone else can do the work.

I agree this is nothing to do with keirf and he shouldn't be investigating this.

keirf commented 3 years ago

Okay, the only thing I note in Frank Wille's PT player code is that it clears the pending interrupt early, rather than at the end of the handler. It's better to do it at the end, else the interrupt tends to double fire. I will fix that and, given the unlikelihood of this bug, you might even find it fixes it.

keirf commented 3 years ago

By the way, I had a discussion some months ago with another dev who was getting unexpected NMIs in a music player using CIA IRQs. That was an A1200 with GVP EC030. Weird, eh? The circumstances are oddly similar and that's not a TF board. Also not the exact same ProTracker music player. Doesn't mean it's not a board bug though ;)

EDIT: Also that was his code (a game) not ATK.

keirf commented 3 years ago

Here's a reorganised pt player. It is nicer to the IRQ hardware (in my eyes). See if it helps.

atk_33_1.zip

SukkoPera commented 3 years ago

It doesn't seem to help, unfortunately :(.

[image: IMG_20200927_150033]

terriblefire commented 3 years ago

The only way an NMI can happen is if the Amiga system asserts all three IPL lines. It's not possible to make this happen with software.

It is vaguely possible that the A500 has glitchy IPL lines that a fast-running 030 could "see". Honestly, just make NMI do nothing. It's not a crash condition anyway. (Or show something on the UI that says it's happening, but don't actually stop program operation.)

Pretty sure that's what most people needed to add tooltypes for in WHDLoad with fast 030 accelerators.

SukkoPera commented 3 years ago

I can try and record some data of the IPL lines with my LA, if that can help understanding what is going on.

terriblefire commented 3 years ago

It's not really the fault of AmigaTestKit though. Try installing an Action Replay or something like that and playing some ProTracker tunes. If the debugger starts, you know what's happening.

SukkoPera commented 3 years ago

This is what has been happening right before the crash:

[image: logic analyzer capture of the IPL lines right before the crash]

It looks like @terriblefire was right: the system is transitioning from a level 3 interrupt to a level 6 (if I interpret correctly). During the transition there's a brief moment where all lines are low and the faster CPU manages to "see" that and interprets it as an NMI (level 7).
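
To illustrate why this particular transition is risky, here is a small C sketch. It is a simplified digital model, not a description of the real analog behaviour: it assumes the CPU may observe the pins changing one at a time, in any order. The function name `transition_can_glitch_to_nmi` is hypothetical. With active-low pins, level 3 is 0b100 on IPL[2:0] and level 6 is 0b001, so if IPL2 is seen falling before IPL0 is seen rising, the pins briefly read 0b000, which is level 7.

```c
#include <assert.h>

/* Hypothetical model: returns 1 if, while the active-low IPL[2:0] pins move
 * from encoding from_level to encoding to_level, some intermediate state
 * (a proper, non-empty subset of the changing pins already flipped) reads
 * as all pins low, i.e. level 7 (NMI). Returns 0 otherwise. */
int transition_can_glitch_to_nmi(unsigned from_level, unsigned to_level)
{
    assert(from_level <= 7 && to_level <= 7);

    unsigned from_pins = ~from_level & 7u;   /* active-low encoding */
    unsigned to_pins   = ~to_level & 7u;
    unsigned changed   = from_pins ^ to_pins;

    for (unsigned subset = 1; subset < 8; subset++) {
        if (subset & ~changed)
            continue;                 /* flips a pin that does not change */
        if (subset == changed)
            continue;                 /* that is the final state itself */
        if ((from_pins ^ subset) == 0)
            return 1;                 /* all pins low: looks like level 7 */
    }
    return 0;
}
```

In this model a 3-to-6 transition can glitch to level 7, while a 2-to-6 transition cannot, because only one pin changes.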

Zooming in:

[image: zoomed-in view of the same capture]

Maybe stiffer pull-ups on the IPL lines could avoid that? I think stock ones are 4.7k.

keirf commented 3 years ago

Wow that's a slow transition time. Like 1V/us slew, maybe 10-90% transition time of 4us ish? I don't think the TF board actually drives/buffers/regenerates these signals and I've never looked at them on a scope before, so perhaps they're always like this on the Amiga mainboard? It's not a pull-up issue as the transition time is symmetrically horrible.

EDIT: Possibly capacitance of your probes makes it worse? Doesn't seem to kill the 7MHz clock signal though.

terriblefire commented 3 years ago

Yes, the Amiga 500 chipset takes a long time to change values. Even the 7MHz clock needs buffers to prevent jitter.

My boards don't buffer these signals as they never caused OS issues.

Probably the easiest solution is to do something different with NMI.

-- Stephen Leary

SukkoPera commented 3 years ago

Yes, rise/fall time 10-90% is about 4us.

terriblefire commented 3 years ago

4us... that's slooooooooooooooooooooooooooooooooow. I'm actually surprised the plain 68000s don't complain about that.

keirf commented 3 years ago

I have a few options here... I think perhaps allowing N unexpected IRQs/NMIs per VBL and then crashing if over that threshold is a good idea for ATK. For bonus points I could report the unexpected IRQs/NMIs in an informative and non-crashing fashion...

SukkoPera commented 3 years ago

Forget about the 4us thing, it was probably due to the slow analog sampling rate of my LA. I will measure with a scope later.

It's probably more like < 1us.

keirf commented 3 years ago

Will be interested to see for sure. Of course IPL unsynchronised to the CPU clock (if that's even possible? I think it probably is but not totally sure, since I think the IPL signals are sampled on certain clock edges only) is always going to have racey transitions, it's just a question of probabilities.

keirf commented 3 years ago

Ok here's a new potential fix. It handles the previously-unhandled autovector priority levels (4, 5, and 7). It allows approximately 16 of each per vblank: A higher rate will crash exactly as before. Within the permitted rate, the IRQs/NMIs are reported on the main menu page.
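
The rate-limiting idea can be sketched roughly as follows in C. This is not the actual ATK code: the struct, the function names, and the exact threshold handling are assumptions for illustration, with one counter instance imagined per unexpected level (4, 5, and 7). The counter is cleared on every vertical blank, and the handler reports whether the interrupt is still within budget.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SPURIOUS_PER_VBL 16u   /* assumed budget, per level, per vblank */

/* Hypothetical bookkeeping for one unexpected interrupt level. */
struct spurious_counter {
    unsigned in_this_vbl;    /* occurrences since the last vertical blank */
    unsigned long total;     /* running total, for reporting in the UI */
};

/* Call once per vertical blank to start a fresh budget. */
void spurious_vblank_tick(struct spurious_counter *c)
{
    assert(c != NULL);
    c->in_this_vbl = 0;
}

/* Call from the fallback handler. Returns 1 if the interrupt is within the
 * per-vblank budget and can be ignored, 0 if the caller should treat it as
 * fatal (crash as before). */
int spurious_irq_tolerated(struct spurious_counter *c)
{
    assert(c != NULL);
    c->total++;
    c->in_this_vbl++;
    return c->in_this_vbl <= MAX_SPURIOUS_PER_VBL;
}
```

The design choice here is that an occasional glitch is counted and reported, while a sustained storm of unexpected interrupts still stops the program, since that would point to a more serious fault.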

atk_33_2.zip

SukkoPera commented 3 years ago

This one seems to work perfectly! I've let the mod loop 3 times (never managed to get this far before) and after that I had 44 unexpected NMIs reported in the main menu.

Well done!

keirf commented 3 years ago

Okay, I'll have another for you shortly. We may be able to cut that unexpected NMI count significantly.

keirf commented 3 years ago

How about this. See how many unexpected NMIs you get this time.

atk_33_3.zip

SukkoPera commented 3 years ago

Just 4 over 3 loops.

keirf commented 3 years ago

Ok thanks. I think overall these spurious IRQs are probably indeed a fact of life, they've also been seen on ACA500 and on GVP accelerators. I also note that Kickstart 3.1 (just one I happened to take a look at) is fully set up to deal with spurious autovectored irqs. For example it has a dummy NMI (just RTE) and all interrupt processing is gated on INTENAR and INTREQR bits. For example, for vblank work:

if ((INTENAR & 0x4000) && (INTENAR & INTREQR & 0x20)) { do vblank }
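
The same gating, written out as a small C function operating on register snapshots. The bit values 0x4000 (master enable) and 0x20 (vertical blank) are taken from the line above; the names `INT_MASTER`, `INT_VERTB`, and `irq_is_genuine` are hypothetical, and a real handler would read INTENAR and INTREQR from the chipset rather than take them as parameters.

```c
#include <assert.h>
#include <stdint.h>

#define INT_MASTER 0x4000u   /* master interrupt enable bit in INTENAR */
#define INT_VERTB  0x0020u   /* vertical blank interrupt source bit */

/* Hypothetical helper: returns 1 only if interrupts are globally enabled
 * and the given source is both enabled and pending; returns 0 otherwise,
 * in which case the interrupt should be treated as spurious. */
int irq_is_genuine(uint16_t intenar, uint16_t intreqr, uint16_t source_mask)
{
    /* The source mask must not include the master-enable or SET/CLR bits. */
    assert((source_mask & 0xC000u) == 0);
    return (intenar & INT_MASTER) && (intenar & intreqr & source_mask);
}
```

For example, `irq_is_genuine(0x4020, 0x0020, INT_VERTB)` is true, whereas the same request with the master enable bit clear in INTENAR is rejected and no vblank work is done.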

My conclusion is I will toughen all interrupt handlers, and have a spurious-irq fallback for all of them. And all of them will have a threshold beyond which they crash.

Furthermore I may remove the bold warning on spurious irqs from the main menu, and place it under a sub-option instead (eg. CIA & Chipset).

Once all this is done I will attach a final release candidate here for you to give a final test. That can then be released as v1.13.

SukkoPera commented 3 years ago

That sounds great! :)

keirf commented 3 years ago

Ok, here is the release candidate. You will find the spurious IRQ count under the CIA/Chipset menu now. Check it out. I reverted the spurious irq avoidance in 33_3, so you can expect to see a double-figure count after a few loops of the MOD.

atk_33_4.zip

SukkoPera commented 3 years ago

It's working like a charm! Got 31 spurious interrupts over 3 mod loops.

keirf commented 3 years ago

Thanks, this is released as Amiga Test Kit v1.13