sinara-hw / sinara

Sayma AMC/RTM issue tracker

Sayma AMC 3.3V failure #567

Closed sbourdeauducq closed 5 years ago

sbourdeauducq commented 6 years ago

One of the Sayma AMC cards (F) has an intermittent 3.3V problem. The 3.3V rail fails anywhere from <1s to hours after power-up. Exar dump from one incident where it failed shortly after startup:

------------ Exar Dump ----------
GET_HOST_STS 0x2 0x4 0x3
GET_FAULT_STS 0x5 0x0 0x20
PWR_GET_STATUS 0x9 0xd 0x2
PWR_READ_VOLTAGE_CH1 0x10 66 66 V
PWR_READ_VOLTAGE_CH2 0x11 56 56 V
PWR_READ_VOLTAGE_CH3 0x12 120 120 V
PWR_READ_VOLTAGE_CH4 0x13 100 100 V
PWR_READ_VOLTAGE_IN 0x14 160 160 V

sbourdeauducq commented 6 years ago

A third AMC I had and the last board you sent me (without the SDRAM bug) also don't have the 3.3V bug when powered in the same way.

gkasprow commented 6 years ago

After several power cycles and a few hours running ARTIQ (>30W of power) I still cannot reproduce the issue on any of the boards. @sbourdeauducq What kind of ATX supply do you use? Older designs did not stabilize the 12V rail, only the 5V rail, so the 12V voltage may be significantly lower and unstable. Newer, Gold+ class designs use a resonant converter that delivers a stable +12V and generate the remaining rails from 12V using buck converters.

hartytp commented 6 years ago

What does the Exar register dump tell you about the cause of the 3V3 failure on @sbourdeauducq's board?

jbqubit commented 6 years ago

@sbourdeauducq Do you have a laboratory power supply available (instead of ATX)?

gkasprow commented 6 years ago

It says:

GET_HOST_STS 0x2 0x4 0x3
GET_FAULT_STS 0x5 0x0 0x20
PWR_GET_STATUS 0x9 0xd 0x2

which means a channel 2 OVP event, i.e. a CH2 fault. So there could have been, e.g., a sudden voltage burst at the input. Anyway, I will check whether such a case can trigger an OVP event in the Exar chip. Of course I can switch off the OVP protection, but I'd prefer to understand why it gets activated.
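
For comparing dumps from different incidents, the dump text quoted in this thread can be split into fields. The following is only a minimal sketch: `DumpEntry` and `parse_exar_dump` are hypothetical names, not part of the MMC firmware, and treating the second column as a command code is an assumption based on the printed layout. It does not decode the status bits, since that needs the Exar datasheet.

```python
from typing import List, NamedTuple


class DumpEntry(NamedTuple):
    name: str          # command name as printed in the dump
    code: int          # second column, ASSUMED to be the command code
    fields: List[str]  # remaining columns, kept as raw strings


def parse_exar_dump(text: str) -> List[DumpEntry]:
    """Split each dump line into name, code and raw fields."""
    entries = []
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines and the "--- Exar Dump ---" banner.
        if len(parts) < 2 or not parts[1].lower().startswith("0x"):
            continue
        entries.append(DumpEntry(parts[0], int(parts[1], 16), parts[2:]))
    return entries


DUMP = """\
------------ Exar Dump ----------
GET_HOST_STS 0x2 0x4 0x3
GET_FAULT_STS 0x5 0x0 0x20
PWR_GET_STATUS 0x9 0xd 0x2
PWR_READ_VOLTAGE_CH1 0x10 66 66 V
"""

for entry in parse_exar_dump(DUMP):
    print(entry.name, hex(entry.code), entry.fields)
```

Keeping the fields as raw strings avoids guessing at units or bit meanings; two parsed dumps can then be compared entry by entry.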

hartytp commented 6 years ago

@gkasprow it seems unlikely that this could be due to @sbourdeauducq's power supply. Doesn't the Exar chip also monitor the input supply, and wouldn't it register an input over voltage event separately? If there is a burst of noise at the input, why does it only affect this one output?

Can you do a step response measurement on the Exar output to see if there are any more issues with loop stability (e.g. if a capacitor has been damaged/degraded)?

gkasprow commented 6 years ago

The Exar chip that we use accepts input voltages up to 40V, so it's hard to trigger input OVP. But a sudden peak can trigger output OVP because the regulation loop responds with a certain delay. Old-type ATX supplies that regulate only the 5V rail work in discontinuous mode when only the 12V rail is loaded. I had issues with such a supply and had to apply a load to the 5V rail to get stable operation. So this may be the reason.

gkasprow commented 6 years ago

Anyway, I will try both approaches: add voltage bursts to the input and run a step response test on the output, then see what happens.

hartytp commented 6 years ago

@gkasprow true. I remember having to stick some pretty chunky load resistors on 5V rails before to get the PSUs to behave themselves. Still though, I'm surprised that only a single channel is tripping. Suggests that maybe the loop isn't as stable for that channel as the others.

hartytp commented 6 years ago

:+1:

gkasprow commented 6 years ago

The 3.3V rail is the most vulnerable. The OVP is set to react at a certain threshold above the nominal value, let's say 100mV, and the higher the output voltage, the easier it is to trigger the OVP. Let's assume that there is a 5V burst, so instead of 12V the Exar chip gets 17V. For the 3.3V rail the PWM duty cycle is 27.5%, so for a short moment before the loop reacts, the output capacitor is charged to 4.675V via the inductor. For the 1.5V rail the duty cycle is 12.5%, so the output cap is charged to 2.125V. The worst-case overvoltage is therefore 1.375V for the 3.3V rail and 0.625V for the 1.5V rail, so the 3.3V rail will be the first to trigger the protection. This affects all DC/DC converters, but the Exar is particularly sensitive because of its numerous protection circuits. It's better to keep them on, at least during the development phase.
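
The arithmetic in this comment can be reproduced with a short sketch. It uses the same idealisation as the comment: the duty cycle stays frozen at its steady-state value (V_out / V_in) while the input steps from 12V to 17V, and the output is taken to head towards duty × V_in before the loop reacts. The function name `transient_overshoot` is hypothetical, and real behaviour depends on the inductor, output capacitance and loop speed.

```python
def transient_overshoot(v_out: float, v_in_nominal: float, v_in_burst: float):
    """Idealised overshoot of a buck output if the duty cycle stays fixed
    while the input voltage jumps from v_in_nominal to v_in_burst."""
    duty = v_out / v_in_nominal   # steady-state buck duty cycle
    v_peak = duty * v_in_burst    # level the output heads towards
    return duty, v_peak, v_peak - v_out


# Figures from the comment: 12V nominal input, 5V burst on top of it.
for rail in (3.3, 1.5):
    duty, v_peak, over = transient_overshoot(rail, 12.0, 17.0)
    print(f"{rail}V rail: duty {duty:.1%}, peak {v_peak:.3f}V, overshoot {over:.3f}V")
```

In this model the overshoot scales with the rail voltage (overshoot = V_out × ΔV_in / V_in), which is why a fixed absolute OVP margin is exceeded first on the highest-voltage rail.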

hartytp commented 6 years ago

@gkasprow okay, point taken!

@sbourdeauducq do you have any other boards currently displaying this issue? If so, can you stick a scope on the 12V input and set it to trigger if the voltage goes above, say, 13V? That would allow us to rule out the PSU.

sbourdeauducq commented 6 years ago

do you have any other boards currently displaying this issue?

No, but I can still double-check the ATX power supply (I already had a quick look with a multimeter and didn't see an issue).

gkasprow commented 6 years ago

A multimeter won't show short peaks.


sbourdeauducq commented 6 years ago

Nothing too crazy on ATX 12V, but there is still a quite high 320mVpp of ripple (with one Sayma connected).

gkasprow commented 6 years ago

I would look at the ATX output with a scope to search for short bursts under certain load conditions. Anyway, I will investigate it myself.

hartytp commented 6 years ago

@sbourdeauducq Do you have any boards left where you can reproduce this issue?

I can see a few options for tracking this issue down:

sbourdeauducq commented 6 years ago

Do you have any boards left where you can reproduce this issue?

No.

gkasprow commented 6 years ago

This issue could be related to the missing soft-start circuit. If you use long leads or your supply has a poor pulse response, the missing FMC or RTM soft-start circuit can trigger an Exar fault. I examined the boards I got from you yesterday, mating them with various RTMs and leaving them running ARTIQ for hours, and so far have not observed a single failure.

hartytp commented 6 years ago

Some comments/suggestions:

hartytp commented 6 years ago

Is there any plan to do anything else about this issue, or should we close it?

gkasprow commented 6 years ago

I was not able to reproduce it. About the AMC+RTM box: I still haven't had time to update the mechanical documentation. I have to produce a batch this year for the AFCZ project. 10 PCBs have been produced and assembled; only the mechanics are missing. It's not the highest priority for me at the moment.

hartytp commented 6 years ago

Well, unless @sbourdeauducq objects, let's assume this is an issue with noise spikes on the PSU he was using and close this issue.

Rather than trying to debug Sayma for every PSU that anyone has, let's focus on getting it to work in the uTCA racks; if that works robustly then I'm happy.

sbourdeauducq commented 5 years ago

This is happening in the µTCA crate (of course). @gkasprow @hartytp please reopen this issue.

sbourdeauducq commented 5 years ago

And with the new MMC firmware (required to get µTCA to work at all) I cannot get an Exar dump...

sbourdeauducq commented 5 years ago

And now the board just won't power up at all.

sbourdeauducq commented 5 years ago

The faulty board powers up from the ATX connector. I swapped another AMC board into the µTCA crate, and it powers up, so this does not seem to be a problem with the crate.

gkasprow commented 5 years ago

What does the MMC say on the debug port when you plug the board into the crate?

sbourdeauducq commented 5 years ago

The other board failed as well, so this doesn't seem to be an isolated problem...

gkasprow commented 5 years ago

Is only 3.3V missing, or does the other board simply not boot? Please post the MMC log here.

sbourdeauducq commented 5 years ago

The other board also has 3.3V failures. And I found the reason for the full shutdowns - I was using an old version of flterm that caused the MMC to enter the bootloader after I had looked at the MMC log and power-cycled the board. So, the only issue remaining here is why 3.3V fails after the board has been running for some time.

sbourdeauducq commented 5 years ago

@gkasprow What MMC log do you want exactly? At what time? Note that I cannot get Exar debug info with the new OpenMMC-based firmware. The boards now exhibit the 3.3V bug after a few minutes, which is very annoying.

gkasprow commented 5 years ago

OK, @wizath can you check if the Exar register dump works with the newest MMC firmware? @sbourdeauducq do you use FMC-LVDS?

sbourdeauducq commented 5 years ago

@sbourdeauducq do you use FMC-LVDS ?

No.

sbourdeauducq commented 5 years ago

@gkasprow Can you set up a µTCA crate and leave it running for a few days until Sayma breaks?

gkasprow commented 5 years ago

@marmeladapk Can we do this?

marmeladapk commented 5 years ago

@gkasprow I'm not sure if we have a set that boots, we have this one AMC with broken SDRAM.

gkasprow commented 5 years ago

@sbourdeauducq can you add a 47uF ceramic capacitor to the 3V3 converter output? It's quite possible that the capacitors have lost capacitance over time, and this could be a quick fix.

sbourdeauducq commented 5 years ago

There is at least one "broken SDRAM" issue that might be fixed by upgrading to Vivado 2018.3. Anyway you do not need SDRAM to test power supplies.

sbourdeauducq commented 5 years ago

I don't know if this is related or not, but it seems the Exar power supply broke completely on a board that I was using in the µTCA crate. I had used the "activate/deactivate FRU" command in NATview to restart the board many times to work around the 3.3V failures. Now only the 0.9V LED is on, both in the µTCA crate and when powered from ATX.

marmeladapk commented 5 years ago

There is at least one "broken SDRAM" issue that might be fixed by upgrading to Vivado 2018.3. Anyway you do not need SDRAM to test power supplies.

It won't fix a dead chip. And I wanted to put a higher load on this Sayma.

sbourdeauducq commented 5 years ago

The board has been powered up for over 24 hours without a failure, which is unusual. Things that have changed:

Could this be temperature-dependent?

gkasprow commented 5 years ago

Yes, capacitors have high tempco. And this influences converter stability.
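
As a rough illustration of why capacitance drift matters: the inductor and output capacitor of a buck converter form an LC filter with corner frequency 1/(2π√(LC)), so losing capacitance moves that corner upwards. The sketch below uses hypothetical component values (1 µH, 100 µF), not values from the Sayma schematic, and an ideal LC model without ESR.

```python
import math


def lc_corner_hz(inductance_h: float, capacitance_f: float) -> float:
    """Corner (resonant) frequency of an ideal LC output filter."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance_h * capacitance_f))


# HYPOTHETICAL values for illustration only; not from the Sayma design.
L_OUT = 1.0e-6       # 1 uH output inductor
C_NOMINAL = 100e-6   # 100 uF nominal output capacitance

for remaining in (1.0, 0.7, 0.5):
    f0 = lc_corner_hz(L_OUT, C_NOMINAL * remaining)
    print(f"{remaining:.0%} of nominal capacitance -> corner at {f0 / 1e3:.1f} kHz")
```

Halving the capacitance raises the corner by a factor of √2. With a compensator tuned for the nominal values, a shift like that can change the loop crossover and reduce phase margin, which would be consistent with a temperature-dependent failure.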

hartytp commented 5 years ago

Turn down the fans and see if it craps out?

@gkasprow clearly there have been some issues with these converters. What will we do to ensure that this does not happen in Sayma v2.0? Higher quality capacitors? More careful tuning of the SMPS control loops? @marmeladapk can you add this to the list of things to discuss in the next call please?

gkasprow commented 5 years ago

I use Exar chips in several designs and never had such issues before. The configuration we use in the AFCZ is identical to the one in Sayma. Only the MOSFETs are different, because the original ones were EOL. The good thing is that it is fully programmable, so we can update the settings. There is certainly something I overlooked.

jbqubit commented 5 years ago

https://github.com/sinara-hw/Urukul/issues/20#issuecomment-458026113

sbourdeauducq commented 5 years ago

Board has still been running continuously without issue, so there is definitely something here.

Turn down the fans and see if it craps out?

After I'm done using the board for more productive work. It's good that there is at least a workaround so I'm not wasting my time power-cycling the board in the middle of something else and starting over again. And I'd prefer @gkasprow et al. to reproduce the issue and investigate it.

hartytp commented 5 years ago

[@gkasprow via email] It seems that I managed to tweak the Exar settings in AFCZ and it works reliably on all boards.

Great, thanks for the update. What was the issue?

Are you confident that this will fix the problems with Sayma going forwards? Why do you think this problem didn't show up immediately? Was it due to component ageing? If so, are we sure that further component tolerances/ageing won't cause similar issues later on? Is it worth using higher quality caps with better stability to prevent such issues from occurring?

gkasprow commented 5 years ago

One problem was a too steep V/t coefficient, which triggered overcurrent during the power-on cycle. That's why one channel failed to start from time to time.

Another issue was overvoltage protection being triggered by too low a phase margin combined with sudden load current changes.

Yet another issue was a too low current limit on the 1.2V rail, which was triggered by the 3 SDRAM controllers that from time to time consumed a higher current than usual. 3A is simply not sufficient even though they consume 1A during normal operation. When multiple write/read bursts overlap, the peak current is much higher.

I didn't want to set all current limits to their maximum values, so as not to damage the SoC, and I enabled all available protection mechanisms. Once we make sure that the design works as expected, some protection features can simply be disabled or their settings relaxed.

jbqubit commented 5 years ago

@gkasprow Did you apply your AFCZ insights to Sayma v1 and confirm that it works without dropouts or shutdowns?