sinara-hw / sinara

Sayma AMC/RTM issue tracker

µTCA: no power to Sayma when RTM is plugged #571

Closed sbourdeauducq closed 5 years ago

sbourdeauducq commented 6 years ago

Blue and red LEDs on Sayma AMC blink rapidly, and all voltage LEDs on Sayma AMC are off.

jbqubit commented 6 years ago

The fix is easy - change initialisation sequence in MMC so it initialises GPIOs after MMC negotiates power.

Please link to updated MMC firmware.

gkasprow commented 6 years ago

@wizath please publish the update once you implement the fix.

sbourdeauducq commented 6 years ago

@gkasprow @wizath ping

wizath commented 6 years ago

files.zip

Sorry for the delay. I now measure about 100 mA current consumption on 3.3MP. Unfortunately, I don't have an RTM to test the whole set.

hartytp commented 6 years ago

Nice, thanks!

sbourdeauducq commented 6 years ago

Nope.

ipmiMsgSender(6): RSP(I2C=0x74) failed on bus 2 result -8
ipmiMsgSender(91): REQ(I2C=0x74) failed on bus 2 - no ACK
R(91,2,2)ipmiMsgSender(6): RSP(I2C=0x74) failed on bus 2 result -8
ipmiMsgSender(6): RSP(I2C=0x74) failed on bus 2 result -8
ipmiMsgSender(6): RSP(I2C=0x74) failed on bus 2 result -8
ipmiMsgSender(6): RSP(I2C=0x74) failed on bus 2 result -8
ipmi_SendFru(91): timeout - no response for REQ: 0x20->0x74, Seq=61 GET_DEVICE_ID_REQ
ipmiMsgSender(6): RSP(I2C=0x74) failed on bus 2 result -8
ipmiMsgSender(6): RSP(I2C=0x74) failed on bus 2 result -8
ipmiMsgSender(6): RSP(I2C=0x74) failed on bus 2 result -8
RTM2(91): Communication regained !
ipmiMsgSender(6): RSP(I2C=0x74) failed on bus 2 result -8
ipmiMsgSender(6): RSP(I2C=0x74) failed on bus 2 result -8
ipmiMsgSender(6): RSP(I2C=0x74) failed on bus 2 result -8
ipmiMsgSender(6): REQ(I2C=0x74) failed on bus 2 - no ACK
ipmiMsgSender(6): RSP(I2C=0x74) failed on bus 2 result -8

etc. etc.

As before, errors disappear and AMC gets powered when RTM is removed.
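
A stream like the one above is easier to compare between firmware versions when it is tallied. The following is a minimal sketch, assuming the log was captured to text; the regular expression is inferred only from the lines quoted here, and the real MCH output may contain other variants.

```python
import re
from collections import Counter

# Pattern inferred from the log lines quoted above; other MCH firmware
# versions may format these messages differently (assumption).
FAIL_RE = re.compile(
    r"ipmiMsgSender\(\d+\): (?P<kind>REQ|RSP)\(I2C=(?P<addr>0x[0-9a-fA-F]+)\) "
    r"failed on bus (?P<bus>\d+)"
)

def count_ipmi_failures(log_text):
    """Count failed IPMI messages per (I2C address, bus, message kind)."""
    counts = Counter()
    # finditer also catches messages glued to a prefix such as "R(91,2,2)".
    for match in FAIL_RE.finditer(log_text):
        counts[(match["addr"], int(match["bus"]), match["kind"])] += 1
    return counts

sample = """\
ipmiMsgSender(6): RSP(I2C=0x74) failed on bus 2 result -8
ipmiMsgSender(91): REQ(I2C=0x74) failed on bus 2 - no ACK
R(91,2,2)ipmiMsgSender(6): RSP(I2C=0x74) failed on bus 2 result -8
RTM2(91): Communication regained !
"""

for (addr, bus, kind), n in sorted(count_ipmi_failures(sample).items()):
    print(f"{addr} bus {bus} {kind}: {n}")
```

Running it on captures taken with and without the RTM plugged in would show whether the failures are confined to one I2C address.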

sbourdeauducq commented 6 years ago

@wizath Where are your RTMs? Should I send you one immediately?

gkasprow commented 6 years ago

@sbourdeauducq do all the LEDs on the AMC panel blink quickly?

sbourdeauducq commented 6 years ago

No, only the green one at the bottom is blinking.

gkasprow commented 6 years ago

So it is not related to the issue that we solved recently.

gkasprow commented 6 years ago

@marmeladapk can you please try the recent firmware on the 6-slot chassis you have in Maryland?

gkasprow commented 6 years ago

Since it does not seem to be related to overcurrent, you can try to enable the power with a script. You have to connect Ethernet, open the MCH config page and load the script that Joe posted some time ago. This enables 12V continuously. Do not try live insertion because this may kill the AMC. I can give you detailed instructions tomorrow once I get to my lab.

sbourdeauducq commented 6 years ago

I tried adding the following to the MCH settings file:

amc_pwr_on = 5, 100, 0
amc_pwr_on = 6, 100, 0
amc_pwr_on = 7, 100, 0
amc_pwr_on = 8, 100, 0

This has stopped the stream of error messages on the telnet interface (what a horrible user interface - you cannot type any command while the errors are scrolling, and a constant stream of errors seems to be a common occurrence with this IPMI trash). But, of course, the Sayma is still not powered - no matter if the RTM is plugged in or not.

gkasprow commented 5 years ago

Are you sure that the slot numbering in your chassis is as above?

sbourdeauducq commented 5 years ago

It's a NATIVE-R5 and I put the Sayma cards close to the MCH. From what I understand from the documentation, this is correct? The disappearance of the errors also suggests that I am targeting the correct slot numbers.

marmeladapk commented 5 years ago

@jbqubit has the same crate in his lab (and we encountered the same problems as @sbourdeauducq).

gkasprow commented 5 years ago

@marmeladapk did you update the firmware? The known overcurrent issue was fixed.

marmeladapk commented 5 years ago

Yes, the update that @wizath sent me didn't help.

wizath commented 5 years ago

Did you try the latest firmware? It is in https://github.com/sinara-hw/sinara/issues/571#issuecomment-426252827

marmeladapk commented 5 years ago

@wizath No, I tried the one you sent me on 26th Sept.

sbourdeauducq commented 5 years ago

@gkasprow Can you prioritize fixing this? When you advocated for µTCA you said it would be trouble-free, but the reality is very different; 2 years later the board still doesn't get power.

gkasprow commented 5 years ago

@jbqubit can you upgrade the MMC firmware and see if the problem persists? I cannot recreate this issue with my setup.

jbqubit commented 5 years ago

I updated the MMC firmware to the version recommended by @wizath. Now I see the following.

$ artiq_flash -t sayma start
Open On-Chip Debugger 0.10.0-00013-gbb7beda (2018-02-13-15:56)
Licensed under GNU GPL v2
For bug reports, read
    http://openocd.org/doc/doxygen/bugs.html
none separate
adapter speed: 5000 kHz
Info : clock speed 5000 kHz
Error: JTAG scan chain interrogation failed: all ones
Error: Check JTAG interface, timings, target power, etc.
Error: Trying to use configured scan chain anyway...
Error: xc7.tap: IR capture error; saw 0x3f not 0x01
Warn : Bypassing JTAG setup events due to errors
Info : gdb server disabled
RTM FPGA XADC:
TEMP 33028232.44 C
VCCINT 196608.000 V
VCCAUX 196608.000 V
VCCBRAM 196608.000 V
VPVN 196608.000 V
VREFP 196608.000 V
VREFN 196608.000 V
VCCPINT 196608.000 V
VCCPAUX 196608.000 V
VCCODDR 196608.000 V
AMC FPGA XADC:
TEMP 33028232.44 C
VCCINT 196608.000 V
VCCAUX 196608.000 V
VCCBRAM 196608.000 V
VPVN 196608.000 V
VREFP 196608.000 V
VREFN 196608.000 V
VCCPINT 196608.000 V
VCCPAUX 196608.000 V
VCCODDR 196608.000 V
|  \/  (_) ___|  ___  / ___|
| |\/| | \___ \ / _ \| |    
| |  | | |___) | (_) | |___ 
|_|  |_|_|____/ \___/ \____|

MiSoC Bootloader
Copyright (c) 2017-2018 M-Labs Limited

Bootloader CRC passed
Gateware ident 4.0.dev+1401.g20cddb6a.dirty;masterdac
Initializing SDRAM...
DQS initial delay: 111 taps
Write leveling scan:
Module 3:
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000101111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111110100000000000000000000000000000000000000000000
Module 2:
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001011111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111010000000000000000000000000000000000
Module 1:
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111101111101000000000000000000000000000000000000000000000000000000000000000000
Module 0:
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100011111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111101101101000000000000000000000000000000000000000000000000000000000000000000000
DQS initial delay: 111 taps
Write leveling: 94 99 120 116 done
Read leveling scan:
Module 3:
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000101111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111011100010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Module 2:
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111001100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Module 1:
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111110000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Module 0:
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000011111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111110100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Read leveling: 246+-89 237+-96 213+-86 195+-93 done
SDRAM initialized
Memory test passed

Booting from flash...
Starting firmware.
[     0.000005s]  INFO(runtime): ARTIQ runtime starting...
[     0.003869s]  INFO(runtime): software ident 4.0.dev+1401.g20cddb6a.dirty;masterdac
[     0.011436s]  INFO(runtime): gateware ident 4.0.dev+1401.g20cddb6a.dirty;masterdac
[     0.019034s]  INFO(runtime): log level set to INFO by default
[     0.024732s]  INFO(runtime): UART log level set to INFO by default
[     0.030873s]  INFO(board_artiq::slave_fpga): Loading slave FPGA gateware...
[     0.037823s]  INFO(board_artiq::slave_fpga):   magic: 0x5352544d, length: 0x000bd5d0
[     0.045549s]  INFO(board_artiq::slave_fpga):   DONE before loading
[     1.036920s]  INFO(board_artiq::slave_fpga):   ...done
[     1.040731s]  INFO(board_artiq::serwb): waiting for AMC/RTM serwb bridge to be ready...
[     1.075461s]  INFO(board_artiq::serwb):  ...done.
[     1.078830s]  INFO(board_artiq::serwb): RTM to AMC link test...
[     2.561133s]  INFO(board_artiq::serwb):   ...passed
[     2.564678s]  INFO(board_artiq::serwb): AMC to RTM link test...
[     4.046986s]  INFO(board_artiq::serwb):   ...passed
[     4.050539s]  INFO(board_artiq::serwb): Wishbone test...
[     5.982573s]  INFO(board_artiq::serwb):   ...passed
[     5.986428s]  INFO(board_artiq::serwb): RTM gateware version 4.0.dev+1401.g20cddb6a.dirty
[     5.994281s]  INFO(runtime): press 'e' to erase startup and idle kernels...
[     6.994006s]  INFO(runtime): continuing boot
[     7.256352s]  INFO(board_artiq::si5324): waiting for Si5324 lock...
[    10.997300s]  INFO(board_artiq::si5324):   ...locked
[    11.001064s]  INFO(board_artiq::hmc830_7043::hmc830): loading HMC830 configuration...
[    11.008998s]  INFO(board_artiq::hmc830_7043::hmc830):   ...done
[    11.014651s]  INFO(board_artiq::hmc830_7043::hmc830): setting HMC830 dividers...
[    11.022115s]  INFO(board_artiq::hmc830_7043::hmc830):   ...done
[    11.027937s]  INFO(board_artiq::hmc830_7043::hmc830): waiting for HMC830 lock...
[    11.035356s]  INFO(board_artiq::hmc830_7043::hmc830):   ...locked
[    11.041586s]  INFO(board_artiq::hmc830_7043::hmc7043): enabling HMC7043
[    11.058288s]  INFO(board_artiq::hmc830_7043::hmc7043): loading configuration...
[    11.075772s]  INFO(board_artiq::hmc830_7043::hmc7043):   ...done
[    11.080454s]  INFO(board_artiq::hmc830_7043::hmc7043): testing GPO...
[    11.087567s]  INFO(board_artiq::hmc830_7043::hmc7043):   ...passed
[    11.103781s]  INFO(board_artiq::ad9154): AD9154-0 initializing...
[    11.115584s]  INFO(board_artiq::ad9154):   ...done
[    11.189460s]  INFO(board_artiq::ad9154): AD9154-0 running PRBS test...
[    12.195823s]  INFO(board_artiq::ad9154):   ...passed
[    12.199471s]  INFO(board_artiq::ad9154): AD9154-0 running STPL test...
[    12.206259s]  INFO(board_artiq::ad9154):   c0 errors: 0
[    12.211463s]  INFO(board_artiq::ad9154):   c1 errors: 0
[    12.216673s]  INFO(board_artiq::ad9154):   c2 errors: 0
[    12.221882s]  INFO(board_artiq::ad9154):   c3 errors: 0
[    12.226807s]  INFO(board_artiq::ad9154):   ...passed
[    12.241790s]  INFO(board_artiq::ad9154): AD9154-0 initializing...
[    12.249204s]  INFO(board_artiq::ad9154):   ...done
[    12.333687s]  INFO(board_artiq::ad9154): AD9154-1 initializing...
[    12.345446s]  INFO(board_artiq::ad9154):   ...done
[    12.419310s]  INFO(board_artiq::ad9154): AD9154-1 running PRBS test...
[    13.425663s]  INFO(board_artiq::ad9154):   ...passed
[    13.429309s]  INFO(board_artiq::ad9154): AD9154-1 running STPL test...
[    13.436094s]  INFO(board_artiq::ad9154):   c0 errors: 0
[    13.441304s]  INFO(board_artiq::ad9154):   c1 errors: 0
[    13.446513s]  INFO(board_artiq::ad9154):   c2 errors: 0
[    13.451723s]  INFO(board_artiq::ad9154):   c3 errors: 0
[    13.456645s]  INFO(board_artiq::ad9154):   ...passed
[    13.471631s]  INFO(board_artiq::ad9154): AD9154-1 initializing...
[    13.479046s]  INFO(board_artiq::ad9154):   ...done
[    13.552951s]  INFO(board_artiq::jesd204sync): verifying SYSREF margins at DAC-0...
[    13.649231s]  INFO(board_artiq::jesd204sync):   margins: -36 +33
[    13.655153s]  INFO(board_artiq::jesd204sync): verifying SYSREF margins at DAC-1...
[    13.750257s]  INFO(board_artiq::jesd204sync):   margins: -2 +66
[    13.754856s] ERROR(runtime): failed to align SYSREF at DAC: SYSREF margins at DAC are too small, board needs recalibration
[    13.765888s]  INFO(board_artiq::hmc542): card 0 channel 0 set to 4 dB
[    13.774395s]  INFO(board_artiq::hmc542): card 0 channel 1 set to 4 dB
[    13.781611s]  INFO(board_artiq::hmc542): card 1 channel 0 set to 4 dB
[    13.788827s]  INFO(board_artiq::hmc542): card 1 channel 1 set to 4 dB
[    13.796044s]  INFO(board_artiq::hmc542): card 2 channel 0 set to 4 dB
[    13.803260s]  INFO(board_artiq::hmc542): card 2 channel 1 set to 4 dB
[    13.810477s]  INFO(board_artiq::hmc542): card 3 channel 0 set to 4 dB
[    13.817693s]  INFO(board_artiq::hmc542): card 3 channel 1 set to 4 dB
[    13.824957s]  INFO(runtime): using MAC address 34-45-32-12-ff-20
[    13.829723s]  INFO(runtime): using IP address 192.168.1.129
[    13.837058s]  INFO(runtime::mgmt): management interface active
[    13.850193s]  INFO(runtime::session): accepting network sessions
[    13.855698s]  INFO(runtime::session): running startup kernel
[    13.860645s]  INFO(runtime::session): no startup kernel found
[    13.866348s]  INFO(runtime::session): no connection, starting idle kernel
[    13.873219s]  INFO(runtime::session): no idle kernel found
[   262.074998s]  INFO(runtime::session): new connection from 192.168.1.145:41988
[   262.108619s]  INFO(runtime::kern_hwreq): resetting RTIO
[   349.543381s]  INFO(runtime::session): no connection, starting idle kernel
[   349.549240s]  INFO(runtime::session): no idle kernel found
[   356.142295s]  INFO(runtime::session): new connection from 192.168.1.145:41990
[   356.177971s]  INFO(runtime::kern_hwreq): resetting RTIO
[   356.182161s] ERROR(runtime::rtio_mgt): RTIO sequence error involving channel 46
[   356.189405s] ERROR(runtime::rtio_mgt): RTIO collision involving channel 6
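
The XADC values at the top of this log (196608.000 V, 33028232.44 C) are clearly not real measurements, and they appear right after the "JTAG scan chain interrogation failed: all ones" error, so they are presumably a side effect of the failed JTAG reads rather than a board fault. A minimal sketch for flagging such readings automatically is shown below; the plausibility bounds are illustrative assumptions, not datasheet limits.

```python
import re

# Illustrative plausibility bounds (assumptions, not datasheet values).
MAX_VOLTS = 5.0
TEMP_RANGE_C = (-40.0, 125.0)

# Matches lines such as "VCCINT 196608.000 V" or "TEMP 33028232.44 C".
READING_RE = re.compile(
    r"^(?P<name>[A-Z]+)\s+(?P<value>-?\d+(?:\.\d+)?)\s+(?P<unit>[CV])$"
)

def implausible_xadc_readings(log_text):
    """Return (name, value, unit) for readings outside the assumed bounds."""
    bad = []
    for line in log_text.splitlines():
        m = READING_RE.match(line.strip())
        if m is None:
            continue  # not an XADC reading line
        value = float(m["value"])
        if m["unit"] == "V":
            ok = 0.0 <= value <= MAX_VOLTS
        else:
            ok = TEMP_RANGE_C[0] <= value <= TEMP_RANGE_C[1]
        if not ok:
            bad.append((m["name"], value, m["unit"]))
    return bad

sample = """\
AMC FPGA XADC:
TEMP 33028232.44 C
VCCINT 196608.000 V
VCCAUX 196608.000 V
"""
print(implausible_xadc_readings(sample))
```

If every reading is flagged, as here, the JTAG connection is the more likely suspect than the supplies themselves.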

hartytp commented 5 years ago

Okay, good, we're getting there...

can /not/ boot AMC+RTM in NATIVE-R9 with only one PSU in crate

@gkasprow any ideas what the cause of this could be?

Am I right in thinking that @sbourdeauducq only has one PSU in his crate, so this is consistent with the issues he was having?

sbourdeauducq commented 5 years ago

@sbourdeauducq only has one PSU in his crate

That is correct.

gkasprow commented 5 years ago

@jbqubit can you connect a terminal to the MMC and see what it says? Did you try to run the Natview software and connect to the MCH via Ethernet? We have an identical configuration and it works, so it looks like a PSU or backplane settings issue.

gkasprow commented 5 years ago

@jbqubit can you make a video showing the behaviour of the Sayma front panel LEDs while booting? Do they blink in any way?

gkasprow commented 5 years ago

@sbourdeauducq I have some idea what could go wrong in your crate. Please do one thing - short pin 1 and pin 8 of T3 and short pin 1 and pin 8 of T14. I marked them with a blue line on the PCB drawing. The transistors reside close to the RTM connector. [image]

Then check whether the AMC and RTM get power. Do not insert the RTM board into a powered crate, because both the AMC and the RTM can be damaged. If only the AMC gets power, please also short pin 1 and pin 2 of T5. [image]

If the RTM gets power, please remove the short on T3 (the one closer to the board edge) and check whether it still gets power.

sbourdeauducq commented 5 years ago

OK, will do when I come back from travels (in 3-4 weeks).

gkasprow commented 5 years ago

@jbqubit could you do this test as well? Just short the transistors and let me know if it boots from a single supply.

marmeladapk commented 5 years ago

@bradbqc shorted those transistors earlier, and now his Sayma is restarting every ~60 seconds, as he wrote in https://github.com/m-labs/artiq/issues/1064#issuecomment-442926743 . This doesn't depend on those shorts, as they have since been removed. He's using the latest MMC firmware.

gkasprow commented 5 years ago

Did he measure the current consumed by the RTM from 3V3MP? Does the same problem occur with the RTM plugged in?

marmeladapk commented 5 years ago

Without the RTM connected, Sayma doesn't reboot. @bradbqc, could you post your measurements?

sbourdeauducq commented 5 years ago

@gkasprow AFAICT this is still not resolved and will continue to be a problem with Sayma v2. Do you have a crate to reproduce it?

gkasprow commented 5 years ago

Did you try to short these transistors?

sbourdeauducq commented 5 years ago

No. Should I expect a different result than https://github.com/sinara-hw/sinara/issues/571#issuecomment-442969236 ?

gkasprow commented 5 years ago

It's quite possible that the board is damaged in some way. The boards suffered a lot during transport.

gkasprow commented 5 years ago

The same board was working correctly before and got power from uTCA.

sbourdeauducq commented 5 years ago

Please do one thing - short pin 1 and pin 8 of T3 and short pin 1 and pin 8 of T14.

RTM now gets power, AMC front panel LEDs are on but not power LEDs.

sbourdeauducq commented 5 years ago

If RTM gets power, please remove short on T3 - the one closer to the board edge and check if it still gets power.

No effect.

sbourdeauducq commented 5 years ago

Well, those results probably don't mean much, because I can't get any AMC to power up anymore, even without the shorts and without the RTM, and on two different AMCs. This µTCA stuff is really obnoxious.

sbourdeauducq commented 5 years ago

Okay, it turns out that I had forgotten amc_pwr_on lines in the MCH configuration, and those interfere with Sayma power. (It's another prime example of µTCA design that a feature supposed to force board power on actually turns it off...)

With T14 shorted both AMC and RTM now turn on! Now trying to short both T14 and T3...
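
Since leftover amc_pwr_on lines were enough to keep the boards unpowered here, it may be worth scanning an MCH settings file for them before debugging hardware. A minimal sketch follows, assuming the file is plain text with one `key = value` entry per line and that `#` starts a comment; both are assumptions about the file format, and `some_other_option` is a made-up placeholder.

```python
def find_forced_power_lines(config_text, key="amc_pwr_on"):
    """Return (line_number, text) for active lines that set `key`."""
    hits = []
    for number, raw in enumerate(config_text.splitlines(), start=1):
        line = raw.strip()
        # Assumption: '#' starts a comment in the MCH configuration file.
        if not line or line.startswith("#"):
            continue
        name = line.split("=", 1)[0].strip()
        if name == key:
            hits.append((number, line))
    return hits

# Hypothetical excerpt; only the amc_pwr_on lines come from this thread.
sample = """\
# hypothetical excerpt of an MCH settings file
amc_pwr_on = 5, 100, 0
amc_pwr_on = 6, 100, 0
some_other_option = 1
"""

for number, line in find_forced_power_lines(sample):
    print(f"line {number}: {line}")
```

Any line it reports is a candidate for removal before concluding that the AMC or RTM hardware is at fault.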

sbourdeauducq commented 5 years ago

Also works (AMC+RTM) with both T14 and T3 transistors shorted.

sbourdeauducq commented 5 years ago

Tried another AMC. Also working with both transistors shorted.

sbourdeauducq commented 5 years ago

@gkasprow Are there further tests that you want me to do, or is this sufficient for you to determine what the problem is and ensure that Sayma v2 power will work properly without any reworks?

gkasprow commented 5 years ago

That's all. Thanks

hartytp commented 5 years ago

Nice! Great work all.

Will someone take care of documenting this all so that it's easy for people to set the racks up in future?

sbourdeauducq commented 5 years ago

@gkasprow Also, I suppose hotplug should be carefully tested if we claim it is supported.

jbqubit commented 5 years ago

HT3 calls for extensive power system testing related to uTCA, MCH, AMC and RTM cards. I added a new item to call explicitly for publishing the MCH configuration file.

https://github.com/sinara-hw/sinara/issues/601#issuecomment-449977797

jbqubit commented 5 years ago

For comparison this is the configuration file I'm using.

https://github.com/jbqubit/sinara-testing/blob/master/sayma/tools/nat_mch_startup_cfg_sinara.txt

@sbourdeauducq If your boards are now working in the crate, please close.