sinara-hw / Kasli

Kasli is a powerful FPGA carrier, capable of controlling 12 Eurocard extension modules.

Power chip failures on v1.1 #92

Open dtcallcock opened 3 years ago

dtcallcock commented 3 years ago

Looks like IC6 pretty much exploded.

IMG_4490

The crate had been running for months and was working fine at the end of the day on Friday. It was found in this state on Monday morning, so it seems to have been spontaneous rather than user-inflicted. It was powered using the XP PSU it shipped with (via the backplane adaptor barrel connector), which was plugged into a surge protector strip. Kasli was located in a 19" subrack with forced air cooling. Connected were 2x Urukul, 1x DIO_SMA, 1x Sampler, 1x DIO_RJ45.

Any ideas on why it might have done this?

pathfinder49 commented 3 years ago

I've also seen an exploding SMPS on Kasli 1.1 👀

gkasprow commented 3 years ago

Such explosions happen when the input voltage is exceeded. Usually, all rails get fried. The chip has thermal protection, so overheating should not cause such issues. The black residue around L16 and the P3V3 test point looks suspicious. It looks like some fluid was flowing and causing the short circuit. When chips explode, they usually don't leave such traces.

dtcallcock commented 3 years ago

> I've also seen an exploding SMPS on Kasli 1.1

> Such explosions happen when the input voltage is exceeded.

Is there an issue with the quality of PSUs then? To be clear, I'm using an XP Power AKM65US12. I just checked it and it puts out 12.38V unloaded and 11.95V into a 10R resistive load (14W).
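A rough sanity check of those bench figures, as a minimal sketch. It assumes an ideal 10 Ω resistor and crudely models the supply plus cable as a voltage source with a series resistance, which a regulated switching supply only approximates; the function names are hypothetical.

```python
def load_power_w(v_load: float, r_load_ohm: float) -> float:
    """Power dissipated in a resistive load: P = V^2 / R."""
    return v_load ** 2 / r_load_ohm


def source_resistance_ohm(v_open: float, v_load: float, r_load_ohm: float) -> float:
    """Effective series resistance implied by the droop under load.

    Assumption: the supply (plus cabling) behaves like an ideal source
    with one series resistor. This is only a first-order estimate.
    """
    i_load = v_load / r_load_ohm
    return (v_open - v_load) / i_load


# Values quoted in the comment above.
V_OPEN = 12.38   # V, unloaded
V_LOAD = 11.95   # V, into the resistive load
R_LOAD = 10.0    # ohm

power = load_power_w(V_LOAD, R_LOAD)
current = V_LOAD / R_LOAD
r_src = source_resistance_ohm(V_OPEN, V_LOAD, R_LOAD)

print(f"load current: {current:.3f} A")
print(f"load power:   {power:.2f} W")
print(f"implied series resistance: {r_src:.2f} ohm")
```

The load power comes out at roughly 14 W, consistent with the figure quoted. Both voltages sit well inside the 4.5-15V input range, so this test on its own does not point to a steady-state overvoltage; it says nothing about transients.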

> The black residue around L16 and the P3V3 test point looks suspicious. It looks like some fluid was flowing and causing the short circuit.

It does, but 'to the right' on this photo would have been 'up' when mounted in the rack, so I'm not sure fluid would flow like this. Also, there are no traces of fluid on the fan tray or the other electronics that sit above it in the rack.

dtcallcock commented 3 years ago

Obviously v1.1 is somewhat historical, but AFAICT this part of the board is very similar on v2.

> Such explosions happen when the input voltage is exceeded.

For reference, the ADP5052 Vin is specced as 4.5-15V with an abs. max of 18V, so there is quite a bit of safety margin.

> I'm using an XP Power AKM65US12

Btw, I note that the schematic specifies the Mean Well GSM90B12-P1M. That PSU has overvoltage protection that would shut it down at <135% = 16.2V (i.e. below abs. max). The XP doesn't, so perhaps that's a clue. Any idea which PSUs are generally out there in the wild?

image
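The margin argument above can be laid out numerically. This is only a sketch using the figures quoted in this thread (12 V nominal, a <135% overvoltage trip, and the ADP5052 input limits as stated above); the variable names are hypothetical.

```python
# Figures as quoted in this thread, not re-checked against the datasheets.
V_NOMINAL = 12.0      # V, nominal PSU output
OVP_FRACTION = 1.35   # upper bound of the overvoltage trip point
VIN_OP_MAX = 15.0     # V, top of the specified input range
VIN_ABS_MAX = 18.0    # V, absolute maximum input voltage

ovp_trip_max = V_NOMINAL * OVP_FRACTION           # worst-case trip voltage
margin_to_abs_max = VIN_ABS_MAX - ovp_trip_max    # headroom before abs. max
above_operating_range = ovp_trip_max > VIN_OP_MAX

print(f"worst-case OVP trip: {ovp_trip_max:.1f} V")
print(f"headroom to abs. max: {margin_to_abs_max:.1f} V")
print(f"trip point above operating range: {above_operating_range}")
```

On these numbers the worst-case trip point stays under the absolute maximum, although it is above the 15V top of the specified operating range.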

sbourdeauducq commented 3 years ago

We've shipped plenty of systems with the AKM65US12, and have not seen this issue.

dtcallcock commented 3 years ago

> We've shipped plenty of systems with the AKM65US12

Are all systems shipped with the AKM65US12? Does anyone know why the schematic is not followed (or what even motivated the schematic choice in the first place)?

marmeladapk commented 3 years ago

The schematic choice was based on a recommendation from Robert Jördens in some old thread. (Or maybe it was just a statement that he had used this supply successfully? I can't remember right now.)

dtcallcock commented 1 year ago

The replacement board blew up in exactly the same way. The vapourized silicon and metal even exited the chip package in the same location! Again, it did this spontaneously when nobody was in the lab, after months of being happy. When we switched this board in, we changed the PSU for the Mean Well one with overvoltage protection discussed above. This probably rules out a rogue PSU and the overvoltage theory.

IMG_20230109_124957589_HDR IMG_5797

gkasprow commented 1 year ago

It looks like mid-layer 1, where 12V is routed, got really hot, as did the GND return path on the top layer. Your power supply must be delivering a lot of current. It looks like the DC/DC converter shorted the 12V rail. Moreover, it looks like the 3V3 converter was affected first, probably killing all 3V3 circuits like SFPs.

dtcallcock commented 1 year ago

> Your power supply must be delivering a lot of current.

It's a 6.67A PSU.

> Moreover, it looks like the 3V3 converter was affected first, probably killing all 3V3 circuits like SFPs.

Any chance a bad SFP caused this? I believe we reused the SFP off the first fried board. It's hard to imagine it fried the board without frying itself though!

Do you want these fried boards for a post-mortem? I guess it's not interesting unless/until you start seeing v2 boards fail in the field (which it sounds like isn't happening, even though this part of the board is basically the same).

dtcallcock commented 1 year ago

Is it worth thinking about putting a fuse on Kasli? Especially given how many random things pull power via the EEMs. Presumably it wouldn't have prevented this failure, but at least there would be a better chance of tracking down the issue (and even repairing the board - they aren't cheap!) rather than being left with smoking wreckage.