sinara-hw / sinara

Sayma AMC/RTM issue tracker

Sayma loses 1.8V supply #358

Closed sbourdeauducq closed 6 years ago

sbourdeauducq commented 6 years ago

After hours to days (depending on the board) the Sayma loses its 1.8V power supply until 12V is cycled. Happens on all 3 boards we have at m-labs, less frequently on sayma3 than on the other two. Seems unaffected by RTM presence.

gkasprow commented 6 years ago

@sbourdeauducq The Exar chip has quite good diagnostics of what happened. I keep a Sayma AMC + RTM powered and nothing happens. But I had such an issue in the past; I simply changed the Exar settings and it helped. There are at least 3 options to read the status:

1. Do you have an Arduino board and a Windows machine? There is ready-to-use Exar programmer software that can be burned to the Arduino, and with Exar Power Architect one can read the diagnostics.
2. You can also read the chip status from the FPGA side, but with the 1.8V rail missing it might be difficult.
3. You can also connect any I2C master (e.g. an FTDI chip) to the Exar programmer connector and read the chip status.

I'm just curious whether the missing voltage was caused by overload or overvoltage.

Did you plug the FMC-VHDCI board?

sbourdeauducq commented 6 years ago

Did you plug the FMC-VHDCI board?

No. It is plugged on our KC705 however, where it does not cause problems.

jbqubit commented 6 years ago

Any update on this? @sbourdeauducq Do you still see this problem?

gkasprow commented 6 years ago

It seems I have to add Exar chip status reporting over UART. That way we would know the reason for the failure: undervoltage, overvoltage, overcurrent, etc.

jbqubit commented 6 years ago

@gkasprow That sounds like a good idea.

gkasprow commented 6 years ago

I already added a feature that dumps Ethernet PHY content via UART, so will add Exar support as well. Register dump is initiated by front panel button.

gkasprow commented 6 years ago

I had such an issue in the past, where 1V8 did not always start. I noticed that once 3.3V had started, and before the 1.8V rail was enabled, there was already 2V on the 1V8 rail, so the Exar raised an over-voltage error and disabled that channel. Something supplied the 1V8 rail before the FPGA was configured. I changed the sequencing so that 1.8V wakes up first and the issue disappeared. It's worth looking in detail at what could cause such a condition. 1V8 is used mainly for the FPGA HP bank supply (banks 47, 48, 66). It could be some 3V3 logic signal that feeds the rail via an IO pin. The FMC was not present, so that was not the cause. I tried to trace this issue but could not find any leak. It just looks like the FPGA leaks current between the 3.3V rail and the 1.8V rail. A few mA is sufficient.

sbourdeauducq commented 6 years ago

Register dump is initiated by front panel button.

It would be better if this can be triggered by some computer-controlled event, e.g. something being received on the UART.

gkasprow commented 6 years ago

sure, I will do that.

sbourdeauducq commented 6 years ago

@gkasprow Any progress on this issue? We constantly have to power-cycle Sayma boards here during debugging sessions (which are long enough already due to many other reasons) and this is causing quite a bit of a problem. Also quite often the 1V8 supply is not there upon power-up.

gkasprow commented 6 years ago

@sbourdeauducq I cannot recreate this problem on my boards. I have two Sayma AMCs. I changed the power sequencing back but still no progress. It seems I have to ship you my board and you ship yours back.

sbourdeauducq commented 6 years ago

AFAIK @jbqubit is also seeing it. Joe, do you confirm? Not just one board is affected, this is happening on all 3 boards we have here.

hartytp commented 6 years ago

Maybe some difference between the two setups, like: the PSU SB is using; cooling/airflow (thermal overload of a regulator); something plugged into the board; etc?

gkasprow commented 6 years ago

Could you observe the 12V power entry with a scope during startup, triggering the scope at 10V? I experienced such issues when the current limit was not sufficient. Some supplies may react too slowly even though the current limit is set high.

hartytp commented 6 years ago

Given how much of @sbourdeauducq's time this is wasting, dealing with it is a very high priority.

sbourdeauducq commented 6 years ago

Tried another 12V supply, problem is still there.

jbqubit commented 6 years ago

I don't see this particular Issue. But I'm not using the hardware for extended periods so might not catch it. @gkasprow What's the status of reporting Exar chip status over UART?

gkasprow commented 6 years ago

@jbqubit I'm working on it but don't have access to the HW. The university and my lab are still closed. They open tomorrow.

jbqubit commented 6 years ago

@gkasprow Sounds good. Happy New Year everyone!

sbourdeauducq commented 6 years ago

I don't see this particular Issue. But I'm not using the hardware for extended periods so might not catch it.

@jbqubit Can you power up a Sayma board, leave it alone for a few days, and then check if the 1V8 LED is still on? It seems the problem is more frequent when the RTM is present (but I have not collected hard numbers on that, this is just my impression), so connect the RTM for this test.

sbourdeauducq commented 6 years ago

cooling/airflow (thermal overload of a regulator);

That's a plausible theory, my boards are simply lying horizontally with short standoffs on a table without a fan, and they do get noticeably hot. Maybe I can try adding a small heatsink to the regulator. @gkasprow what is your cooling situation?

something plugged into the board

I don't have anything special plugged into them. If that helps, my 3x AMC + 2x RTM + 2x Allaki are drawing 5-6A at 12V (before any 1V8 failures).

gkasprow commented 6 years ago

@sbourdeauducq I never tried running these boards without a fan. The RTM has hardwired thermal protection that switches the power off when the board exceeds ~70 degrees. They dissipate a lot of power, so operation without a fan is risky. With an overheating FPGA you can easily get a thermal shutdown.

sbourdeauducq commented 6 years ago

With just the heatsink and no active cooling, the FPGA itself gets just warm (at least with the bitstreams we're loading right now). And this is a problem on the AMC side, not RTM - the bug is still present without the RTM connected. Anyway, an overheating issue in my setup sounds pretty likely. I will have a look.

hartytp commented 6 years ago

@sbourdeauducq These boards were designed to run in chassis with pretty aggressive forced air cooling, so if you're operating them without any fans then I'd be more surprised if they did work. Look back over all the photos greg posted of his test setup. IIRC they all had large fans visible in the photos.

Hopefully this is just user error then (not supplying adequate cooling) and not actually an issue with Sayma.

sbourdeauducq commented 6 years ago

Adding fans did not significantly help. I will try with the uTCA chassis when I receive it, which has a lot of cooling power.

gkasprow commented 6 years ago

In my case the fans did the job. Do you have an Arduino board and a Windows machine? This is needed to update the Exar chips. We will add Exar upgrade via MMC, but this feature has not been tested yet.

hartytp commented 6 years ago

Greg, is the Exar update ready? If so, can you let me know where it is and how to apply it? Thanks! I'd like to have a go at reproducing this issue on our setup next week, and would like the better Exar diagnostics for that.

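Also, can I check that I've understood how to power Sayma (sorry if this is written up somewhere and I've just missed it). As far as I understand it:

  • Sayma RTM takes a single 12V supply from the AMC <-> RTM connector. (AFAICT we don't use any of the "low noise analog power supplies" provided by the RF BP).
  • Sayma AMC also takes a single 12V supply.
  • How many amps does this need to supply?
  • This can be supplied from J4 (P12V0_Molex) which is a standard 4-pin Atx power connector, right?
  • It can also be supplied from the AMC connector, J2 (PAYLOAD_PWR)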

Is that all correct? Is there anything else I should know about powering Sayma? e.g. I remember there being some issues to do with inrush current surges when Sayma is first powered on. Was that fixed? Or are there any special requirements from the PSU I use for it? How are you currently powering it?

Thanks!

Edit: Okay, RTFM (which is starting to look really nice btw). Max 12V current for AMC + RTM is given as 3A. So, presumably a 5A (60W) PSU will be fine, right? Will probably use the same one that we're using for Kasli/EEMs (unless you can see any reason not to Greg?).

gkasprow commented 6 years ago

Greg, is the Exar update ready? If so, can you let me know where it is and how to apply it? Thanks! I'd like to have a go at reproducing this issue on our setup next week, and would like the better Exar diagnostics for that.

We didn't manage to do it yet - all the team was on holidays. Will do it this week.

Also, can I check that I've understood how to power Sayma (sorry if this is written up somewhere and I've just missed it). As far as I understand it:

  • Sayma RTM takes a single 12V supply from the AMC <-> RTM connector. (AFAICT we don't use any of the "low noise analog power supplies" provided by the RF BP).

Yes

  • Sayma AMC also takes a single 12V supply.

Yes

  • How many amps does this need to supply?

The power supply must be rated at least 4A.

  • This can be supplied from J4 (P12V0_Molex) which is a standard 4-pin Atx power connector, right?

Yes, but be careful, there are two 4-pin plugs in some ATX power supplies. One is for PCIe, 4+2 or 6 pin, another is split 4+2 or 4+4 for mainboard. We need to use the one that goes to the mainboard. Anyway the AMC board is protected against wrong polarity.

  • It can also be supplied from the AMC connector, J2 (PAYLOAD_PWR)

Yes, but then it is not protected against inverted polarity.

Is that all correct? Is there anything else I should know about powering Sayma? e.g. I remember there being some issues to do with inrush current surges when Sayma is first powered on. Was that fixed? Or are there any special requirements from the PSU I use for it? How are you currently powering it?

Yes, this was fixed. There is an RC circuit that eliminates the inrush current of the RTM board.

Thanks!

Edit: Okay, RTFM (which is starting to look really nice btw). Max 12V current for AMC + RTM is given as 3A. So, presumably a 5A (60W) PSU will be fine, right? Will probably use the same one that we're using for Kasli/EEMs (unless you can see any reason not to Greg?).

This should work

hartytp commented 6 years ago

Thanks Greg!

gkasprow commented 6 years ago

@sbourdeauducq I modified the MMC code. It dumps the interesting Exar registers and the Ethernet registers on request. The Ethernet PHY dump is initiated by the 'E' character; the Exar register readout is triggered by the 'P' character. Here is an example dump: [image] The binary file is here: lpc1776_ethernet_I2C_Exar_dump.zip

The components for media converter arrived, the PCBs were shipped today so should get them in 2 days.

sbourdeauducq commented 6 years ago

I'm not familiar with the Ethernet PHY registers. How does this firmware set up the Ethernet PHY? MII?

hartytp commented 6 years ago

Great, thanks for the update Greg!

gkasprow commented 6 years ago

It uses MDIO for this purpose. It connects the PHY to the MMC processor, talks via MDIO, and connects it back to the FPGA. Initial configuration is done via pin strapping, then I modify one register to set the correct MII mode.

sbourdeauducq commented 6 years ago

Here's a dump from a 1.8V fault that appeared at startup:

------------ Exar Dump ----------
GET_HOST_STS 0x2 0x4 0x3
GET_FAULT_STS 0x5 0x0 0x40
PWR_GET_STATUS 0x9 0xb 0x4
PWR_READ_VOLTAGE_CH1 0x10 0.99 V
PWR_READ_VOLTAGE_CH2 0x11 3.36 V
PWR_READ_VOLTAGE_CH3 0x12 0.42 V
PWR_READ_VOLTAGE_CH4 0x13 1.50 V
PWR_READ_VOLTAGE_IN 0x14 11.20 V

sbourdeauducq commented 6 years ago

Same board when 1.8V starts up correctly:

------------ Exar Dump ----------
GET_HOST_STS 0x2 0x4 0x1
GET_FAULT_STS 0x5 0x0 0x0
PWR_GET_STATUS 0x9 0xf 0x0
PWR_READ_VOLTAGE_CH1 0x10 0.99 V
PWR_READ_VOLTAGE_CH2 0x11 3.36 V
PWR_READ_VOLTAGE_CH3 0x12 1.80 V
PWR_READ_VOLTAGE_CH4 0x13 1.50 V
PWR_READ_VOLTAGE_IN 0x14 11.20 V

Now waiting for it to fail...
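
For comparing dumps like the two above, a small script can pull out the voltage readings and flag the rail that differs. This is only a sketch: the function names and the 5% tolerance are assumptions made for illustration and are not part of the MMC firmware. The dump text is copied from the comments above.

```python
import re

# Dump text copied verbatim from the two comments above.
FAULT_DUMP = """\
------------ Exar Dump ----------
GET_HOST_STS 0x2 0x4 0x3
GET_FAULT_STS 0x5 0x0 0x40
PWR_GET_STATUS 0x9 0xb 0x4
PWR_READ_VOLTAGE_CH1 0x10 0.99 V
PWR_READ_VOLTAGE_CH2 0x11 3.36 V
PWR_READ_VOLTAGE_CH3 0x12 0.42 V
PWR_READ_VOLTAGE_CH4 0x13 1.50 V
PWR_READ_VOLTAGE_IN 0x14 11.20 V
"""

GOOD_DUMP = """\
------------ Exar Dump ----------
GET_HOST_STS 0x2 0x4 0x1
GET_FAULT_STS 0x5 0x0 0x0
PWR_GET_STATUS 0x9 0xf 0x0
PWR_READ_VOLTAGE_CH1 0x10 0.99 V
PWR_READ_VOLTAGE_CH2 0x11 3.36 V
PWR_READ_VOLTAGE_CH3 0x12 1.80 V
PWR_READ_VOLTAGE_CH4 0x13 1.50 V
PWR_READ_VOLTAGE_IN 0x14 11.20 V
"""

# Matches lines of the form "<name> <register address> <value> V".
VOLTAGE_RE = re.compile(
    r"^(PWR_READ_VOLTAGE_\w+)[ \t]+0x[0-9a-fA-F]+[ \t]+([0-9.]+)[ \t]+V[ \t]*$",
    re.MULTILINE,
)


def parse_voltages(dump):
    """Return {register name: voltage in volts} for the voltage lines of a dump."""
    return {name: float(value) for name, value in VOLTAGE_RE.findall(dump)}


def differing_rails(reference, observed, tolerance=0.05):
    """Return rails whose reading deviates from the reference by more than
    `tolerance` (a fraction of the reference value; 5% is an arbitrary choice)."""
    return {
        name: (reference[name], observed[name])
        for name in reference
        if abs(observed[name] - reference[name]) > tolerance * reference[name]
    }


print(differing_rails(parse_voltages(GOOD_DUMP), parse_voltages(FAULT_DUMP)))
# -> {'PWR_READ_VOLTAGE_CH3': (1.8, 0.42)}
```

Using the healthy dump from the same board as the reference avoids having to guess nominal values for the other channels.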

gkasprow commented 6 years ago

We got SUPPLY_FAULT_EVENT and a CH3 fault, which is obvious. But GET_FAULT_STS is 0x0 0x40, which means over-voltage protection (OVP). Does this supply die after FPGA booting? Does it work when FPGA is not loaded? Give me bit file which causes such effect, I will try to recreate this issue. This is somewhat consistent with my previous observation, where I got 2V on the not-yet-supplied 1.8V rail and the Exar tripped OVP. But your case is different because you get 0.42V on this rail.
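
The reading of the dump described above can be written down as a small decoder. The bit layout used here is an assumption inferred only from the two dumps in this thread (0xb 0x4 with CH3 down, 0xf 0x0 when healthy): the first PWR_GET_STATUS byte is treated as a per-channel "in regulation" mask and the second as a per-channel fault mask. It has not been checked against the Exar datasheet, and the function and table names are hypothetical.

```python
# ASSUMPTION: bit (n - 1) of each PWR_GET_STATUS byte refers to channel n.
# This matches the two dumps in this thread but is not taken from the datasheet.
def decode_pwr_status(regulation_mask, fault_mask, channels=4):
    """Split the two PWR_GET_STATUS bytes into lists of channel numbers."""
    numbers = range(1, channels + 1)
    return {
        "in_regulation": [ch for ch in numbers if regulation_mask & (1 << (ch - 1))],
        "faulted": [ch for ch in numbers if fault_mask & (1 << (ch - 1))],
    }


# Only the GET_FAULT_STS values that appear in this thread. The OVP meaning of
# 0x0 0x40 is as stated in the comment above; "no fault reported" for 0x0 0x0
# is an assumption based on the healthy dump.
KNOWN_FAULT_STS = {
    (0x00, 0x00): "no fault reported",
    (0x00, 0x40): "over-voltage protection (OVP)",
}


def describe_fault_sts(first, second):
    """Look up a GET_FAULT_STS byte pair; anything else is left undecoded."""
    return KNOWN_FAULT_STS.get((first, second), "unknown (not seen in this thread)")


print(decode_pwr_status(0x0B, 0x04))
# -> {'in_regulation': [1, 2, 4], 'faulted': [3]}
print(describe_fault_sts(0x00, 0x40))
# -> over-voltage protection (OVP)
```

Values outside the lookup table are deliberately reported as unknown rather than guessed.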

gkasprow commented 6 years ago

As a quick fix you can add a 10R / 0.5W resistor between the 1.8V rail and GND (e.g. at C523). It seems that the power supply needs to be loaded. Alternatively you can enable parallel termination on some unused IOs supplied from this rail. If you enable it for 20 IOs, the result will be similar.

gkasprow commented 6 years ago

C523 is here: [image]

sbourdeauducq commented 6 years ago

Does this supply die after FPGA booting?

No, in this case it died immediately after startup. But sometimes it dies later (minutes/hours/days).

Does it work when FPGA is not loaded?

Whether the FPGA is loaded or not does not make a significant difference AFAICT.

Give me bit file which causes such effect,

See here: https://anaconda.org/m-labs/artiq-sayma_amc-standalone

By the way: sometimes I cannot dump the registers. I only get gibberish characters back on the UART. The transmission parameters are supposed to be 115200 8-N-1, correct?
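
A dump can also be requested from a script rather than a terminal. The sketch below assumes the 115200 8-N-1 settings asked about above (not confirmed in this thread), the 'E' and 'P' command characters of the modified MMC firmware, and a placeholder device path; `open_mmc_uart` and `request_dump` are illustrative names. It relies on pyserial, imported lazily so the read loop can be exercised without hardware.

```python
def open_mmc_uart(device="/dev/ttyUSB0", timeout=1.0):
    """Open the MMC UART with pyserial.

    The device path is a placeholder and 115200 8-N-1 is an assumption.
    """
    import serial  # third-party: pyserial

    return serial.Serial(
        device,
        baudrate=115200,
        bytesize=serial.EIGHTBITS,
        parity=serial.PARITY_NONE,
        stopbits=serial.STOPBITS_ONE,
        timeout=timeout,
    )


def request_dump(port, command=b"P", idle_reads=3):
    """Send one command byte (b"P" for Exar, b"E" for the Ethernet PHY) and
    collect the reply until `idle_reads` consecutive reads return nothing."""
    port.write(command)
    chunks = []
    idle = 0
    while idle < idle_reads:
        data = port.read(256)
        if data:
            chunks.append(data)
            idle = 0
        else:
            idle += 1
    # errors="replace" keeps non-ASCII garbage visible instead of raising.
    return b"".join(chunks).decode("ascii", errors="replace")


# Typical use with real hardware (not run here):
# with open_mmc_uart() as port:
#     print(request_dump(port, b"P"))
```

Because the loop gives up after a few empty reads, a frozen MMC shows up as an empty string rather than a hung script.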

sbourdeauducq commented 6 years ago

By the way: sometimes I cannot dump the registers. I only get gibberish characters back on the UART.

Potentially just another FTDI bug, which are plentiful. I'll try unplugging and replugging the USB cable...

gkasprow commented 6 years ago

Try with 10R resistor first.

hartytp commented 6 years ago

@sbourdeauducq I'm currently setting up a test setup here to see if I can reproduce this. In the mean time, if there is any way you could borrow a scope and look at the supply rails, that would be a massive help!

sbourdeauducq commented 6 years ago

Potentially just another FTDI bug, which are plentiful. I'll try unplugging and replugging the USB cable...

That was an MMC problem, not an FTDI one. Something on USB had put all three boards in LPC bootloader mode, and they would not leave it until the USB cables were physically unplugged and replugged. The gibberish characters I was getting were the bootloader replies.

The 1.8V bug manifested itself again today (I had not installed the 10R resistor yet), after the board had been running for ~15min. I immediately pressed E on the console and the MMC dumped the Ethernet registers. So the MMC was still working. Then I pressed P and the MMC froze without printing anything, and it became also unresponsive to E.

sbourdeauducq commented 6 years ago

9R resistor (2x18R in parallel) now installed on Sayma2.
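
As a quick check of what this load does, Ohm's law gives the current and dissipation on the rail. The sketch assumes the rail sits at its nominal 1.8V and uses the resistor values from this thread; the helper name is illustrative.

```python
def parallel(*resistors):
    """Equivalent resistance of resistors connected in parallel (ohms)."""
    return 1.0 / sum(1.0 / r for r in resistors)


RAIL_V = 1.8  # nominal rail voltage, assumed to be in regulation

load = parallel(18.0, 18.0)      # two 18R resistors in parallel
current = RAIL_V / load          # I = V / R
power_total = RAIL_V**2 / load   # P = V^2 / R for the combined load
power_each = RAIL_V**2 / 18.0    # each 18R resistor sees the full rail voltage

print(f"{load:.1f} ohm, {current * 1000:.0f} mA, "
      f"{power_total:.2f} W total, {power_each:.2f} W per resistor")
# -> 9.0 ohm, 200 mA, 0.36 W total, 0.18 W per resistor
```

For the single 10R part suggested earlier, the same formulas give 180 mA and about 0.32 W, which is within the 0.5W rating mentioned there.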

sbourdeauducq commented 6 years ago

1.8V failed again on startup despite the resistor.

------------ Exar Dump ----------
GET_HOST_STS 0x2 0x4 0x3
GET_FAULT_STS 0x5 0x0 0x40
PWR_GET_STATUS 0x9 0xb 0x4
PWR_READ_VOLTAGE_CH1 0x10 0.99 V
PWR_READ_VOLTAGE_CH2 0x11 3.36 V
PWR_READ_VOLTAGE_CH3 0x12 0.03 V
PWR_READ_VOLTAGE_CH4 0x13 1.50 V
PWR_READ_VOLTAGE_IN 0x14 11.20 V

sbourdeauducq commented 6 years ago

@gkasprow Have you noticed that the 12V input is a bit low on my setup? Is that likely to cause issues?

sbourdeauducq commented 6 years ago

1.8V failed again after the board had been powered for a while, the Exar dump was the same as above: https://github.com/m-labs/sinara/issues/358#issuecomment-356840335

gkasprow commented 6 years ago

Such a voltage difference does not really matter.

hartytp commented 6 years ago

@sbourdeauducq So far I still can't reproduce this in my setup. Unless that changes, it's going to be extremely hard to make progress on this unless you can give us scope traces of the power rails on your failing Saymas. Is there any chance you could borrow a scope from someone and do this?

gkasprow commented 6 years ago

The board at @sbourdeauducq's place was operated for a long time without any cooling, so it could have overheated and something could be damaged. So I suggest using other boards for development and shipping the problematic boards for inspection. On the boards I shipped to HK I did not mount the heatsink, to give access to some components around the FPGA. The heatsinks were delivered with the boards. Were they assembled?