sinara-hw / sinara

Sayma AMC/RTM issue tracker

Sayma AMC 3.3V failure #567

Closed sbourdeauducq closed 5 years ago

sbourdeauducq commented 6 years ago

One of the Sayma AMC cards (F) has an intermittent 3.3V problem. The 3.3V rail fails anywhere from under 1s to several hours after power-up. Exar dump from one incident, when it failed shortly after startup:

------------ Exar Dump ----------
GET_HOST_STS 0x2 0x4 0x3
GET_FAULT_STS 0x5 0x0 0x20
PWR_GET_STATUS 0x9 0xd 0x2
PWR_READ_VOLTAGE_CH1 0x10 66 66 V
PWR_READ_VOLTAGE_CH2 0x11 56 56 V
PWR_READ_VOLTAGE_CH3 0x12 120 120 V
PWR_READ_VOLTAGE_CH4 0x13 100 100 V
PWR_READ_VOLTAGE_IN 0x14 160 160 V
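
For anyone who wants to script against dumps like this one, here is a minimal sketch of a parser. It assumes each line has the shape `NAME REGISTER VALUE... [UNIT]`, which is only inferred from the dump above and not taken from the MMC firmware source; `ExarEntry` and `parse_exar_dump` are hypothetical names. The voltage readings are deliberately kept as raw numbers, because the dump does not state how they are scaled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ExarEntry:
    name: str            # command name as printed, e.g. "GET_FAULT_STS"
    register: int        # second column, ASSUMED to be the register/command address
    fields: List[int]    # remaining numeric columns, kept as raw values
    unit: Optional[str]  # trailing non-numeric token such as "V", if present


def parse_exar_dump(text: str) -> List[ExarEntry]:
    """Parse an Exar dump in the (assumed) 'NAME REG VALUE... [UNIT]' shape."""
    entries = []
    for line in text.splitlines():
        tokens = line.split()
        # Skip blank lines and the "------------ Exar Dump ----------" banner.
        if len(tokens) < 2 or tokens[0].startswith("-"):
            continue
        name, rest = tokens[0], tokens[1:]
        unit = None
        # A trailing token that is not a number is treated as a unit label.
        try:
            int(rest[-1], 0)
        except ValueError:
            unit = rest.pop()
        try:
            values = [int(tok, 0) for tok in rest]  # base 0 accepts "0x10" and "66"
        except ValueError:
            continue  # not a dump line in the assumed shape
        if not values:
            continue
        entries.append(ExarEntry(name, values[0], values[1:], unit))
    return entries
```

With the dump above, `parse_exar_dump` yields eight entries; interpreting the status and fault bytes still requires the Exar register documentation.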

sbourdeauducq commented 6 years ago

@gkasprow Should I send you the board or is there something I can try to fix myself? This is happening very frequently now, and blocking the development of inter-board synchronization.

gkasprow commented 6 years ago

sure, send it to Technosystem

sbourdeauducq commented 6 years ago

The other AMC also begins to exhibit this bug now :(

sbourdeauducq commented 6 years ago

@gkasprow I'm still waiting for your detailed shipping instructions.

hartytp commented 6 years ago

@sbourdeauducq send it to TechnoSystems via any respectable carrier (FedEx/DHL/etc). I've never had an issue doing that.

sbourdeauducq commented 6 years ago

Let's see after Brexit :)

sbourdeauducq commented 6 years ago

Anyway, I received the information and will send it with DHL tomorrow.

jordens commented 6 years ago

NB for anyone who wants to debug this: looking at Exar's "UnivPMIC", the XR77129 register layout might be more or less the same as that of the XRP7724, which is also documented and has some examples.

gkasprow commented 6 years ago

The XR77129 is the same as the 7725 but has a higher voltage rating. The 7725 supports some Intel management protocol that the 7724 does not.

jbqubit commented 6 years ago

While waiting for Sayma to ship to HK, can M-Labs do testing via remote login to WUT system?

gkasprow commented 6 years ago

We can make it quickly tomorrow using TeamViewer.

sbourdeauducq commented 6 years ago

> using TeamViewer

That won't be useful, we need SSH or (better) Mosh. Anyway I don't think it's worth it to access the WUT boards remotely, DHL between Poland and HK is rather fast.

gkasprow commented 6 years ago

I have 2 VPN accounts so we can do it.

jbqubit commented 6 years ago

@gkasprow How would @sbourdeauducq disable on-board 3.3V supply and supply 3.3V from bench top PSU? Is this even a good idea?

gkasprow commented 6 years ago

One doesn't have to disable it. He can connect an external power module or a bench supply in parallel. The Exar should simply disable the channel.

sbourdeauducq commented 6 years ago

This sounds like a bad idea, e.g. if the exar chip disabled the channel due to overcurrent.

jbqubit commented 6 years ago

Set current limit on your benchtop supply.

sbourdeauducq commented 6 years ago

All the lab power supplies that I have turn into current sources when the max current is reached. And we are typically leaving the board on all the time. So, unless I change the behavior of the PSU to make it work like a circuit breaker, the board might keep receiving its maximum current for days, which does not sound safe. And we do not currently have time or funding for this sort of Sayma hardware debugging.

As I mentioned in another issue, the board that is left in HK isn't strongly affected by this bug yet; it typically behaves itself for 30 min to 1 day after being turned on. So it's not a huge impediment to development (except for inter-board synchronization, since this is the only board we have left), but since this bug will likely get worse, please investigate it quickly after receiving the board @gkasprow.

jbqubit commented 6 years ago

> the board might keep receiving its maximum current for days, which does not sound safe.

There's no need for a fuse/breaker when using a current-limited supply. If the board's nominal current draw is X, set the current limit to (1+eps)*X for whatever eps is safe CW. But it sounds like your board's supply isn't bad enough yet to warrant using a benchtop power supply.
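
To make the (1+eps)*X arithmetic concrete, here is a minimal sketch. The 2.0 A nominal draw and the 10% margin are made-up illustration values, not measurements from a Sayma board, and `current_limit` and `max_fault_power` are hypothetical helper names. The second helper gives the upper bound on what a supply stuck in constant-current mode could keep delivering into a fault, which is the quantity behind the "maximum current for days" concern.

```python
def current_limit(nominal_amps: float, eps: float) -> float:
    """Bench-supply current limit (1 + eps) * X for a nominal draw X."""
    if nominal_amps <= 0 or eps < 0:
        raise ValueError("nominal current must be positive and eps non-negative")
    return (1.0 + eps) * nominal_amps


def max_fault_power(rail_volts: float, limit_amps: float) -> float:
    """Upper bound on continuous power in constant-current mode.

    In current limiting the output voltage can only drop below its set
    point, so the delivered power never exceeds V_set * I_limit.
    """
    return rail_volts * limit_amps


if __name__ == "__main__":
    # ASSUMED example values, for illustration only.
    limit = current_limit(nominal_amps=2.0, eps=0.10)
    print(f"current limit: {limit:.2f} A")
    print(f"worst-case continuous power: {max_fault_power(3.3, limit):.2f} W")
```

Whether that worst-case power is acceptable for days of unattended operation is a judgment about the board, not something this calculation can settle.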

sbourdeauducq commented 6 years ago

@gkasprow Even though I used DHL as you instructed, the package is again stuck in Polish customs. Can you handle the import this time please? "Clearance will proceed after receiving instructions from the importer. Customer should contact DHL Customer Service if not reached by DHL"

sbourdeauducq commented 6 years ago

Still stuck in customs...

hartytp commented 6 years ago

:(

Do you use cocaine for padding when you ship them? Maybe EU customs are just suspicious of HK

gkasprow commented 6 years ago

This time it got stuck at DHL, not the Polish post, and this is a huge difference :)

sbourdeauducq commented 6 years ago

Still stuck despite new paperwork sent yesterday.

sbourdeauducq commented 6 years ago

And in my experience, EU customs are suspicious of many things coming from small organizations outside EU; it is also a problem to receive most items from e.g. small US organizations into Germany or France. "Respectable carriers", as you call them, also exploit the customs mess to make money in pretty shady ways, see the end of http://www.minimachines.net/a-la-une/la-livraison-depuis-lasie-delais-prix-transporteurs-57204.

sbourdeauducq commented 6 years ago

Customs released it, hallelujah!

sbourdeauducq commented 6 years ago

@gkasprow Have you received the board?

gkasprow commented 6 years ago

@sbourdeauducq Yes, I received it a few minutes ago. I will investigate it tomorrow.

gkasprow commented 6 years ago

@sbourdeauducq the AMC has been running ARTIQ stand-alone together with the RTM for 2 hours already, and nothing has happened to the supply...

gkasprow commented 6 years ago

When I received it, the Exar chip started after a few seconds. I burned a recent MMC firmware and so far it works.

gkasprow commented 6 years ago

The configuration of both Exar chips is also fine.

sbourdeauducq commented 6 years ago

Try the other board I sent you, which Technosystem received today. It also has this 3.3V bug. Is there a new MMC firmware? I thought I had flashed the latest one already. Why did it take a few seconds to start the Exar chip?

Sometimes (but more rarely) the 1.5V supply also fails.

gkasprow commented 6 years ago

It looks like there was an old version of the firmware, which was waiting for the MCH response.

gkasprow commented 6 years ago

I will leave it running overnight and check tomorrow morning.

gkasprow commented 6 years ago

@sbourdeauducq I took the RTM you shipped back with the annotation that the HMC does not lock:

Booting from flash...
Starting firmware.
[ 0.000004s] INFO(satman): ARTIQ satellite manager starting...
[ 0.005668s] INFO(satman): software version 4.0.dev+1219.g4eb26c00
[ 0.011930s] INFO(satman): gateware version 4.0.dev+1214.g729ce58f
[ 0.018172s] INFO(board_artiq::slave_fpga): Loading slave FPGA gateware...
[ 0.025120s] INFO(board_artiq::slave_fpga): magic: 0x5352544d, length: 0x000c15b4
[ 1.038593s] INFO(board_artiq::slave_fpga): ...done
[ 1.042339s] INFO(board_artiq::serwb): waiting for AMC/RTM serwb bridge to be ready...
[ 1.077067s] INFO(board_artiq::serwb): ...done.
[ 1.080377s] INFO(board_artiq::serwb): RTM to AMC link test...
[ 2.562678s] INFO(board_artiq::serwb): ...passed
[ 2.566161s] INFO(board_artiq::serwb): AMC to RTM link test...
[ 4.048469s] INFO(board_artiq::serwb): ...passed
[ 4.051961s] INFO(board_artiq::serwb): Wishbone test...
[ 5.983985s] INFO(board_artiq::serwb): ...passed
[ 5.987468s] DEBUG(board_artiq::serwb): AMC serwb settings:
[ 5.993025s] DEBUG(board_artiq::serwb): bitslip: 39
[ 5.998062s] DEBUG(board_artiq::serwb): ready: 1
[ 6.002837s] DEBUG(board_artiq::serwb): error: 0
[ 6.007613s] DEBUG(board_artiq::serwb): RTM serwb settings:
[ 6.013178s] DEBUG(board_artiq::serwb): bitslip: 6
[ 6.018128s] DEBUG(board_artiq::serwb): ready: 1
[ 6.022904s] DEBUG(board_artiq::serwb): error: 0
[ 6.027912s] INFO(board_artiq::serwb): RTM gateware version 4.0.dev+1214.g729ce58f
[ 6.295633s] INFO(board_artiq::si5324): waiting for Si5324 lock...
[ 8.726003s] INFO(board_artiq::si5324): ...locked
[ 8.729686s] INFO(board_artiq::hmc830_7043::hmc830): HMC830 found
[ 8.735754s] INFO(board_artiq::hmc830_7043::hmc830): loading HMC830 configuration...
[ 8.743797s] INFO(board_artiq::hmc830_7043::hmc830): ...done
[ 8.749458s] INFO(board_artiq::hmc830_7043::hmc830): setting HMC830 dividers...
[ 8.756919s] INFO(board_artiq::hmc830_7043::hmc830): ...done
[ 8.762743s] INFO(board_artiq::hmc830_7043::hmc830): waiting for HMC830 lock...
[ 8.770161s] INFO(board_artiq::hmc830_7043::hmc830): ...locked
[ 8.776390s] INFO(board_artiq::hmc830_7043::hmc7043): enabling HMC7043
[ 8.793075s] INFO(board_artiq::hmc830_7043::hmc7043): HMC7043 found
[ 8.798051s] INFO(board_artiq::hmc830_7043::hmc7043): loading configuration...
[ 8.806895s] INFO(board_artiq::hmc830_7043::hmc7043): status=10
[ 8.811508s] INFO(board_artiq::hmc830_7043::hmc7043): ...done
[ 8.817502s] INFO(board_artiq::hmc542): card 0 channel 0 set to 4 dB
[ 8.826008s] INFO(board_artiq::hmc542): card 0 channel 1 set to 4 dB
[ 8.833138s] INFO(board_artiq::hmc542): card 1 channel 0 set to 4 dB
[ 8.840267s] INFO(board_artiq::hmc542): card 1 channel 1 set to 4 dB
[ 8.847398s] INFO(board_artiq::hmc542): card 2 channel 0 set to 4 dB
[ 8.854527s] INFO(board_artiq::hmc542): card 2 channel 1 set to 4 dB
[ 8.861658s] INFO(board_artiq::hmc542): card 3 channel 0 set to 4 dB
[ 8.868787s] INFO(board_artiq::hmc542): card 3 channel 1 set to 4 dB
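
When comparing boot logs like this across boards, it can help to extract the lock waits mechanically. Below is a minimal sketch that assumes the `[ time] LEVEL(module): message` shape seen in the log above; `LogEntry`, `parse_boot_log` and `lock_wait_times` are hypothetical names and not part of ARTIQ. It splits on the timestamps rather than on newlines, so it also works when the log has been pasted as a single line.

```python
import re
from typing import Dict, List, NamedTuple


class LogEntry(NamedTuple):
    time_s: float
    level: str
    module: str
    message: str


_ENTRY = re.compile(
    r"\[\s*(\d+\.\d+)s\]\s+(\w+)\(([\w:]+)\):\s*"  # "[ 8.770161s] INFO(a::b):"
    r"(.*?)(?=\s*\[\s*\d+\.\d+s\]|\s*$)",           # message, up to next timestamp
    re.S,
)


def parse_boot_log(text: str) -> List[LogEntry]:
    """Split a boot log into entries, whether or not it has line breaks."""
    return [
        LogEntry(float(t), level, module, message)
        for t, level, module, message in _ENTRY.findall(text)
    ]


def lock_wait_times(entries: List[LogEntry]) -> Dict[str, float]:
    """Seconds from each 'waiting for ... lock...' line to the next
    '...locked' line logged by the same module."""
    pending: Dict[str, float] = {}
    waits: Dict[str, float] = {}
    for e in entries:
        if e.message.startswith("waiting for") and "lock" in e.message:
            pending[e.module] = e.time_s
        elif e.message == "...locked" and e.module in pending:
            waits[e.module] = e.time_s - pending.pop(e.module)
    return waits
```

A module that prints a "waiting for ... lock..." line but never reaches "...locked" simply does not appear in the result, which makes a missing lock easy to spot.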

gkasprow commented 6 years ago

What power level do you use for the 100MHz clock?

gkasprow commented 6 years ago

And both AMC boards you shipped to me (the ones covered with a thick layer of dust :) ) don't have the Ethernet clock line modification. So it's no surprise that Ethernet does not work. One of them has a sticker saying that there is a 1.5V bug; the other has a 3.3V bug sticker. The one with the 1.5V bug has PRBS issues.

But from power supply point of view, they are working fine. What power supply do you use?

gkasprow commented 6 years ago

Anyway, I will focus on GTP2 clock on this particular board.

gkasprow commented 6 years ago

On GTP CLK2 the DC component is 0.7V, while on GTP CLK1 it is 0.4V. The DC value is set by the FPGA due to the capacitive coupling.

hartytp commented 6 years ago

is this some configuration issue with ARTIQ, or a hardware issue? Can we reproduce that observation with a simple design based on Xilinx IP?

gkasprow commented 6 years ago

The datasheet, p. 33, says: "The reference clock input structure is illustrated in Figure 2-1. The input is terminated internally with 50Ω on each leg to 4/5 MGTAVCC. The reference clock is instantiated in software with the IBUFDS_GTE2 software primitive. The ports and attributes controlling the reference clock input are tied to the IBUFDS_GTE2 software primitive."

gkasprow commented 6 years ago

So neither of the values I measured makes sense. I see only one option: modify my design that tests the gigabit transceivers, instantiate the clock inputs with the Wizard, and repeat the measurements.
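
For reference, a small sketch of the comparison behind that conclusion. The 4/5 MGTAVCC termination comes from the datasheet text quoted above and the 0.4V/0.7V values are the measurements reported in this thread; the 1.0 V MGTAVCC is an assumption for illustration and should be replaced with the rail voltage measured on the board. `expected_refclk_dc` and `dc_deviation` are hypothetical helper names.

```python
from typing import Dict

# Internal termination level from the quoted datasheet text: 4/5 of MGTAVCC.
TERMINATION_FRACTION = 4 / 5

# ASSUMPTION: nominal MGTAVCC of 1.0 V. Use the measured rail voltage instead.
ASSUMED_MGTAVCC_V = 1.0

# DC components reported in this thread.
MEASURED_DC_V = {"GTP CLK1": 0.4, "GTP CLK2": 0.7}


def expected_refclk_dc(mgtavcc_v: float) -> float:
    """DC level an AC-coupled reference clock input should settle to."""
    return TERMINATION_FRACTION * mgtavcc_v


def dc_deviation(measured: Dict[str, float], mgtavcc_v: float) -> Dict[str, float]:
    """Measured minus expected DC level for each clock input, in volts."""
    expected = expected_refclk_dc(mgtavcc_v)
    return {name: value - expected for name, value in measured.items()}


if __name__ == "__main__":
    for name, delta in dc_deviation(MEASURED_DC_V, ASSUMED_MGTAVCC_V).items():
        print(f"{name}: {delta:+.2f} V relative to 4/5 MGTAVCC")
```

Under that assumption both inputs sit below the expected 0.8 V, CLK1 by about 0.4 V and CLK2 by about 0.1 V.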

hartytp commented 6 years ago

The data sheet you link to isn't for UltraScale. That's https://www.xilinx.com/support/documentation/user_guides/ug576-ultrascale-gth-transceivers.pdf (see Figure 2-1).

hartytp commented 6 years ago

Are you sure that all the MGTAVCC pins are connected correctly, that MGTAVCC has the right voltage and that there are no nasty transients on it during startup?

hartytp commented 6 years ago

@gkasprow if you disconnect the HMC7043 what DC voltages do you measure on these clock inputs?

hartytp commented 6 years ago

Might also be interesting to stick a scope on those inputs and look at the DC voltage as Sayma boots up

gkasprow commented 6 years ago

It is still 4/5 AVCC. MGTAVCC is filtered VCCINT which I observed several times. Plan for today:

sbourdeauducq commented 6 years ago

@gkasprow @hartytp Please keep this issue on the 3.3V power supply failure topic.

sbourdeauducq commented 6 years ago

> But from power supply point of view, they are working fine. What power supply do you use?

ATX.

sbourdeauducq commented 6 years ago

@gkasprow This is off topic here.