Closed sbourdeauducq closed 5 years ago
The fix is easy: change the initialisation sequence in the MMC so that it initialises the GPIOs after the MMC negotiates power.
Please link to updated MMC firmware.
@wizath please publish update once you implement the fix.
@gkasprow @wizath ping
Sorry for the delay. I now measure about 100 mA current consumption on 3V3MP. Unfortunately, I don't have an RTM to test the whole set.
Nice, thanks!
Nope.
ipmiMsgSender(6): RSP(I2C=0x74) failed on bus 2 result -8
ipmiMsgSender(91): REQ(I2C=0x74) failed on bus 2 - no ACK
R(91,2,2)ipmiMsgSender(6): RSP(I2C=0x74) failed on bus 2 result -8
ipmiMsgSender(6): RSP(I2C=0x74) failed on bus 2 result -8
ipmiMsgSender(6): RSP(I2C=0x74) failed on bus 2 result -8
ipmiMsgSender(6): RSP(I2C=0x74) failed on bus 2 result -8
ipmi_SendFru(91): timeout - no response for REQ: 0x20->0x74, Seq=61 GET_DEVICE_ID_REQ
ipmiMsgSender(6): RSP(I2C=0x74) failed on bus 2 result -8
ipmiMsgSender(6): RSP(I2C=0x74) failed on bus 2 result -8
ipmiMsgSender(6): RSP(I2C=0x74) failed on bus 2 result -8
RTM2(91): Communication regained !
ipmiMsgSender(6): RSP(I2C=0x74) failed on bus 2 result -8
ipmiMsgSender(6): RSP(I2C=0x74) failed on bus 2 result -8
ipmiMsgSender(6): RSP(I2C=0x74) failed on bus 2 result -8
ipmiMsgSender(6): REQ(I2C=0x74) failed on bus 2 - no ACK
ipmiMsgSender(6): RSP(I2C=0x74) failed on bus 2 result -8
etc. etc.
As before, errors disappear and AMC gets powered when RTM is removed.
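To compare runs with and without the RTM, it can help to tally the failures instead of reading the scrolling telnet output. The following is a minimal sketch, assuming the log has been captured to a list of text lines; the function name `tally_failures` and the sample list are illustrative and not part of any MCH tooling.

```python
import re
from collections import Counter

# Pattern for the ipmiMsgSender failure lines shown in the log above.
# search() is used instead of match() because some lines carry a prefix
# such as "R(91,2,2)" in front of "ipmiMsgSender".
FAILURE = re.compile(
    r"ipmiMsgSender\(\d+\): (?P<kind>REQ|RSP)\(I2C=(?P<addr>0x[0-9a-fA-F]+)\) "
    r"failed on bus (?P<bus>\d+)"
)

def tally_failures(lines):
    """Count failures per (I2C address, bus, message kind)."""
    counts = Counter()
    for line in lines:
        m = FAILURE.search(line)
        if m:
            counts[(m["addr"], int(m["bus"]), m["kind"])] += 1
    return counts

# A few lines copied from the log above; a real capture would be read from a file.
sample = [
    "ipmiMsgSender(6): RSP(I2C=0x74) failed on bus 2 result -8",
    "ipmiMsgSender(91): REQ(I2C=0x74) failed on bus 2 - no ACK",
    "R(91,2,2)ipmiMsgSender(6): RSP(I2C=0x74) failed on bus 2 result -8",
    "RTM2(91): Communication regained !",
]

for key, n in sorted(tally_failures(sample).items()):
    print(key, n)
```

Lines that do not match the failure pattern, such as the "Communication regained" message, are ignored.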
@wizath Where are your RTMs? Should I send you one immediately?
@sbourdeauducq do all the LEDs on the AMC panel blink quickly?
No, only the green one at the bottom is blinking.
So it is not related to the issue that we solved recently.
@marmeladapk can you please try the recent firmware on the 6-slot chassis you have in Maryland?
Since it does not seem to be related to overcurrent, you can try to enable the power with a script. You have to connect Ethernet, open the MCH config page and load the script that Joe posted some time ago. This enables 12V continuously. Do not try live insertion because this may kill the AMC. I can give you detailed instructions tomorrow once I get to my lab.
I tried adding the following to the MCH settings file:
amc_pwr_on = 5, 100, 0
amc_pwr_on = 6, 100, 0
amc_pwr_on = 7, 100, 0
amc_pwr_on = 8, 100, 0
This has stopped the stream of error messages on the telnet interface (what a horrible user interface - you cannot type any command while the errors are scrolling, and a constant stream of errors seems to be a common occurrence with this IPMI trash). But, of course, the Sayma is still not powered - no matter if the RTM is plugged in or not.
are you sure that slot numbering in your chassis is as above?
It's a NATIVE-R5 and I put the Sayma cards close to the MCH. From what I understand from the documentation, this is correct? The disappearance of the errors also suggests that I am targeting the correct slot numbers.
@jbqubit has the same crate in his lab (and we encountered the same problems as @sbourdeauducq).
@marmeladapk did you update the firmware? The known overcurrent issue was fixed.
Yes, an update that @wizath sent me didn't help.
You tried latest firmware? From https://github.com/sinara-hw/sinara/issues/571#issuecomment-426252827
@wizath No, I tried the one you sent me on 26th Sept.
@gkasprow Can you prioritize fixing this? When you advocated for µTCA you said it would be trouble-free, but the reality is very different; 2 years later the board still doesn't get power.
@jbqubit can you upgrade the MMC firmware and see if the problem persists? I cannot recreate this issue with my setup.
I updated MMC firmware to version recommended by @wizath. Now I see the following.
$ artiq_flash -t sayma start
Open On-Chip Debugger 0.10.0-00013-gbb7beda (2018-02-13-15:56)
Licensed under GNU GPL v2
For bug reports, read
http://openocd.org/doc/doxygen/bugs.html
none separate
adapter speed: 5000 kHz
Info : clock speed 5000 kHz
Error: JTAG scan chain interrogation failed: all ones
Error: Check JTAG interface, timings, target power, etc.
Error: Trying to use configured scan chain anyway...
Error: xc7.tap: IR capture error; saw 0x3f not 0x01
Warn : Bypassing JTAG setup events due to errors
Info : gdb server disabled
RTM FPGA XADC:
TEMP 33028232.44 C
VCCINT 196608.000 V
VCCAUX 196608.000 V
VCCBRAM 196608.000 V
VPVN 196608.000 V
VREFP 196608.000 V
VREFN 196608.000 V
VCCPINT 196608.000 V
VCCPAUX 196608.000 V
VCCODDR 196608.000 V
AMC FPGA XADC:
TEMP 33028232.44 C
VCCINT 196608.000 V
VCCAUX 196608.000 V
VCCBRAM 196608.000 V
VPVN 196608.000 V
VREFP 196608.000 V
VREFN 196608.000 V
VCCPINT 196608.000 V
VCCPAUX 196608.000 V
VCCODDR 196608.000 V
| \/ (_) ___| ___ / ___|
| |\/| | \___ \ / _ \| |
| | | | |___) | (_) | |___
|_| |_|_|____/ \___/ \____|
MiSoC Bootloader
Copyright (c) 2017-2018 M-Labs Limited
Bootloader CRC passed
Gateware ident 4.0.dev+1401.g20cddb6a.dirty;masterdac
Initializing SDRAM...
DQS initial delay: 111 taps
Write leveling scan:
Module 3:
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000101111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111110100000000000000000000000000000000000000000000
Module 2:
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001011111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111010000000000000000000000000000000000
Module 1:
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111101111101000000000000000000000000000000000000000000000000000000000000000000
Module 0:
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100011111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111101101101000000000000000000000000000000000000000000000000000000000000000000000
DQS initial delay: 111 taps
Write leveling: 94 99 120 116 done
Read leveling scan:
Module 3:
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000101111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111011100010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Module 2:
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111001100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Module 1:
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111110000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Module 0:
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000011111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111110100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Read leveling: 246+-89 237+-96 213+-86 195+-93 done
SDRAM initialized
Memory test passed
Booting from flash...
Starting firmware.
[ 0.000005s] INFO(runtime): ARTIQ runtime starting...
[ 0.003869s] INFO(runtime): software ident 4.0.dev+1401.g20cddb6a.dirty;masterdac
[ 0.011436s] INFO(runtime): gateware ident 4.0.dev+1401.g20cddb6a.dirty;masterdac
[ 0.019034s] INFO(runtime): log level set to INFO by default
[ 0.024732s] INFO(runtime): UART log level set to INFO by default
[ 0.030873s] INFO(board_artiq::slave_fpga): Loading slave FPGA gateware...
[ 0.037823s] INFO(board_artiq::slave_fpga): magic: 0x5352544d, length: 0x000bd5d0
[ 0.045549s] INFO(board_artiq::slave_fpga): DONE before loading
[ 1.036920s] INFO(board_artiq::slave_fpga): ...done
[ 1.040731s] INFO(board_artiq::serwb): waiting for AMC/RTM serwb bridge to be ready...
[ 1.075461s] INFO(board_artiq::serwb): ...done.
[ 1.078830s] INFO(board_artiq::serwb): RTM to AMC link test...
[ 2.561133s] INFO(board_artiq::serwb): ...passed
[ 2.564678s] INFO(board_artiq::serwb): AMC to RTM link test...
[ 4.046986s] INFO(board_artiq::serwb): ...passed
[ 4.050539s] INFO(board_artiq::serwb): Wishbone test...
[ 5.982573s] INFO(board_artiq::serwb): ...passed
[ 5.986428s] INFO(board_artiq::serwb): RTM gateware version 4.0.dev+1401.g20cddb6a.dirty
[ 5.994281s] INFO(runtime): press 'e' to erase startup and idle kernels...
[ 6.994006s] INFO(runtime): continuing boot
[ 7.256352s] INFO(board_artiq::si5324): waiting for Si5324 lock...
[ 10.997300s] INFO(board_artiq::si5324): ...locked
[ 11.001064s] INFO(board_artiq::hmc830_7043::hmc830): loading HMC830 configuration...
[ 11.008998s] INFO(board_artiq::hmc830_7043::hmc830): ...done
[ 11.014651s] INFO(board_artiq::hmc830_7043::hmc830): setting HMC830 dividers...
[ 11.022115s] INFO(board_artiq::hmc830_7043::hmc830): ...done
[ 11.027937s] INFO(board_artiq::hmc830_7043::hmc830): waiting for HMC830 lock...
[ 11.035356s] INFO(board_artiq::hmc830_7043::hmc830): ...locked
[ 11.041586s] INFO(board_artiq::hmc830_7043::hmc7043): enabling HMC7043
[ 11.058288s] INFO(board_artiq::hmc830_7043::hmc7043): loading configuration...
[ 11.075772s] INFO(board_artiq::hmc830_7043::hmc7043): ...done
[ 11.080454s] INFO(board_artiq::hmc830_7043::hmc7043): testing GPO...
[ 11.087567s] INFO(board_artiq::hmc830_7043::hmc7043): ...passed
[ 11.103781s] INFO(board_artiq::ad9154): AD9154-0 initializing...
[ 11.115584s] INFO(board_artiq::ad9154): ...done
[ 11.189460s] INFO(board_artiq::ad9154): AD9154-0 running PRBS test...
[ 12.195823s] INFO(board_artiq::ad9154): ...passed
[ 12.199471s] INFO(board_artiq::ad9154): AD9154-0 running STPL test...
[ 12.206259s] INFO(board_artiq::ad9154): c0 errors: 0
[ 12.211463s] INFO(board_artiq::ad9154): c1 errors: 0
[ 12.216673s] INFO(board_artiq::ad9154): c2 errors: 0
[ 12.221882s] INFO(board_artiq::ad9154): c3 errors: 0
[ 12.226807s] INFO(board_artiq::ad9154): ...passed
[ 12.241790s] INFO(board_artiq::ad9154): AD9154-0 initializing...
[ 12.249204s] INFO(board_artiq::ad9154): ...done
[ 12.333687s] INFO(board_artiq::ad9154): AD9154-1 initializing...
[ 12.345446s] INFO(board_artiq::ad9154): ...done
[ 12.419310s] INFO(board_artiq::ad9154): AD9154-1 running PRBS test...
[ 13.425663s] INFO(board_artiq::ad9154): ...passed
[ 13.429309s] INFO(board_artiq::ad9154): AD9154-1 running STPL test...
[ 13.436094s] INFO(board_artiq::ad9154): c0 errors: 0
[ 13.441304s] INFO(board_artiq::ad9154): c1 errors: 0
[ 13.446513s] INFO(board_artiq::ad9154): c2 errors: 0
[ 13.451723s] INFO(board_artiq::ad9154): c3 errors: 0
[ 13.456645s] INFO(board_artiq::ad9154): ...passed
[ 13.471631s] INFO(board_artiq::ad9154): AD9154-1 initializing...
[ 13.479046s] INFO(board_artiq::ad9154): ...done
[ 13.552951s] INFO(board_artiq::jesd204sync): verifying SYSREF margins at DAC-0...
[ 13.649231s] INFO(board_artiq::jesd204sync): margins: -36 +33
[ 13.655153s] INFO(board_artiq::jesd204sync): verifying SYSREF margins at DAC-1...
[ 13.750257s] INFO(board_artiq::jesd204sync): margins: -2 +66
[ 13.754856s] ERROR(runtime): failed to align SYSREF at DAC: SYSREF margins at DAC are too small, board needs recalibration
[ 13.765888s] INFO(board_artiq::hmc542): card 0 channel 0 set to 4 dB
[ 13.774395s] INFO(board_artiq::hmc542): card 0 channel 1 set to 4 dB
[ 13.781611s] INFO(board_artiq::hmc542): card 1 channel 0 set to 4 dB
[ 13.788827s] INFO(board_artiq::hmc542): card 1 channel 1 set to 4 dB
[ 13.796044s] INFO(board_artiq::hmc542): card 2 channel 0 set to 4 dB
[ 13.803260s] INFO(board_artiq::hmc542): card 2 channel 1 set to 4 dB
[ 13.810477s] INFO(board_artiq::hmc542): card 3 channel 0 set to 4 dB
[ 13.817693s] INFO(board_artiq::hmc542): card 3 channel 1 set to 4 dB
[ 13.824957s] INFO(runtime): using MAC address 34-45-32-12-ff-20
[ 13.829723s] INFO(runtime): using IP address 192.168.1.129
[ 13.837058s] INFO(runtime::mgmt): management interface active
[ 13.850193s] INFO(runtime::session): accepting network sessions
[ 13.855698s] INFO(runtime::session): running startup kernel
[ 13.860645s] INFO(runtime::session): no startup kernel found
[ 13.866348s] INFO(runtime::session): no connection, starting idle kernel
[ 13.873219s] INFO(runtime::session): no idle kernel found
[ 262.074998s] INFO(runtime::session): new connection from 192.168.1.145:41988
[ 262.108619s] INFO(runtime::kern_hwreq): resetting RTIO
[ 349.543381s] INFO(runtime::session): no connection, starting idle kernel
[ 349.549240s] INFO(runtime::session): no idle kernel found
[ 356.142295s] INFO(runtime::session): new connection from 192.168.1.145:41990
[ 356.177971s] INFO(runtime::kern_hwreq): resetting RTIO
[ 356.182161s] ERROR(runtime::rtio_mgt): RTIO sequence error involving channel 46
[ 356.189405s] ERROR(runtime::rtio_mgt): RTIO collision involving channel 6
Okay, good, we're getting there...
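The log above is long, and two things in it are easy to miss: the XADC readings printed right after the JTAG scan failure are far outside any physical range, and the runtime reports three ERROR lines. The following is a rough sketch for pulling both out of a captured log. The function names and the plausibility thresholds (5 V, 150 C) are assumptions chosen for illustration, not values from ARTIQ or the XADC documentation.

```python
import re

# Runtime log lines look like "[ 13.754856s] ERROR(runtime): message".
LOG_LINE = re.compile(
    r"\[\s*(?P<t>\d+\.\d+)s\]\s+(?P<level>[A-Z]+)\((?P<module>[^)]+)\): (?P<msg>.*)"
)
# XADC lines look like "VCCINT 196608.000 V" or "TEMP 33028232.44 C".
XADC_LINE = re.compile(
    r"(?P<name>[A-Z]+)\s+(?P<value>-?\d+(?:\.\d+)?)\s+(?P<unit>[CV])$"
)

def runtime_errors(lines):
    """Return (timestamp, module, message) for every ERROR line."""
    found = []
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if m and m["level"] == "ERROR":
            found.append((float(m["t"]), m["module"], m["msg"]))
    return found

def implausible_xadc(lines, max_volts=5.0, max_temp_c=150.0):
    """Return (name, value) for XADC readings beyond the assumed limits."""
    bad = []
    for line in lines:
        m = XADC_LINE.match(line.strip())
        if not m:
            continue
        value = float(m["value"])
        limit = max_volts if m["unit"] == "V" else max_temp_c
        if abs(value) > limit:
            bad.append((m["name"], value))
    return bad

# Lines copied from the log above.
sample = [
    "TEMP 33028232.44 C",
    "VCCINT 196608.000 V",
    "[ 13.750257s] INFO(board_artiq::jesd204sync): margins: -2 +66",
    "[ 13.754856s] ERROR(runtime): failed to align SYSREF at DAC: "
    "SYSREF margins at DAC are too small, board needs recalibration",
    "[ 356.182161s] ERROR(runtime::rtio_mgt): RTIO sequence error involving channel 46",
]

for entry in runtime_errors(sample):
    print(entry)
for entry in implausible_xadc(sample):
    print(entry)
```

Identical out-of-range values on every rail, as seen here, suggest the readout itself failed rather than the rails, which would be consistent with the JTAG errors printed just before them.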
can /not/ boot AMC+RTM in NATIVE-R9 with only one PSU in crate
@gkasprow any ideas what the cause of this should be?
Am I right in thinking that @sbourdeauducq only has one PSU in his crate, so this is consistent with the issues he was having?
@sbourdeauducq only has one PSU in his crate
That is correct.
@jbqubit can you connect a terminal to the MMC and see what it says? Did you try to run the Natview software and connect to the MCH via Ethernet? We have an identical configuration and it works, so it looks like a PSU or backplane settings issue.
@jbqubit can you make a video showing behaviour of Sayma front panel LEDs while booting? Do they blink in any way?
@sbourdeauducq I have some idea what could go wrong in your crate. Please do one thing - short pin 1 and pin 8 of T3 and short pin 1 and pin 8 of T14. I marked it with blue line on PCB drawing. The transistors reside close to the RTM connector.
Then check whether the AMC and RTM get power. Do not insert the RTM board into a powered crate, because either the AMC or the RTM can be damaged. If only the AMC gets power, please also short pin 1 and pin 2 of T5.
If RTM gets power, please remove short on T3 - the one closer to the board edge and check if it still gets power.
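The follow-up steps after shorting pins 1 and 8 of T3 and T14 can be written down as a small decision function. This is only a restatement of the instructions above as a sketch; the function name `next_step` is made up for illustration, and the case where neither board gets power is not covered by the instructions.

```python
def next_step(amc_powered: bool, rtm_powered: bool) -> str:
    """Follow-up after shorting pins 1 and 8 of both T3 and T14."""
    if rtm_powered:
        # The T3 in question is the one closer to the board edge.
        return "remove the short on T3 and check whether the RTM still gets power"
    if amc_powered:
        return "also short pin 1 and pin 2 of T5"
    # The instructions above do not say what to do in this case.
    return "no follow-up given in the thread; report the result"

for amc, rtm in [(True, True), (True, False), (False, False)]:
    print(amc, rtm, "->", next_step(amc, rtm))
```

In every case the RTM must not be inserted into a powered crate while testing.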
OK, will do when I come back from travels (in 3/4 weeks)
@jbqubit could you do this test as well? Just short the transistors and let me know if it boots from a single supply.
@bradbqc shorted those transistors earlier and now his Sayma is restarting every ~60 seconds as he wrote in https://github.com/m-labs/artiq/issues/1064#issuecomment-442926743 . This doesn't depend on those shorts as they're removed right now. He's using the latest MMC firmware.
Did he measure the current drawn by the RTM from 3V3MP? Does the same problem occur with the RTM plugged in?
Without RTM connected Sayma doesn't reboot. @bradbqc, could you post your measurements?
@gkasprow AFAICT this is still not resolved and will continue to be a problem with Sayma v2. Do you have a crate to reproduce it?
Did you try to short these transistors?
No. Should I expect a different result than https://github.com/sinara-hw/sinara/issues/571#issuecomment-442969236 ?
It's quite possible that the board is damaged in some way. The boards suffered a lot during transport.
The same board was working correctly before and got power from uTCA.
Please do one thing - short pin 1 and pin 8 of T3 and short pin 1 and pin 8 of T14.
RTM now gets power, AMC front panel LEDs are on but not power LEDs.
If RTM gets power, please remove short on T3 - the one closer to the board edge and check if it still gets power.
No effect.
Well, those results probably don't mean much, because I can't get any AMC to power up anymore, even without the shorts and without the RTM, and on two different AMCs. This µTCA stuff is really obnoxious.
Okay, it turns out that I had forgotten amc_pwr_on lines in the MCH configuration, and those interfere with Sayma power (It's another prime example of µTCA design that a feature supposed to force board power on actually turns them off...)
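Since leftover amc_pwr_on lines were the cause here, a quick check of the MCH settings file before further debugging may save time. The following is a minimal sketch, assuming the settings file is available as plain text with one `key = value, value, ...` entry per line, as in the lines posted earlier in this thread. The function name `find_amc_pwr_on` and the `some_other_key` entry in the sample are hypothetical, and the meaning of the individual fields is not interpreted.

```python
import re

# Matches only lines that begin with the key, so a line that has been
# disabled with a leading comment character is not reported.
PWR_ON = re.compile(r"^\s*amc_pwr_on\s*=\s*(?P<args>.+?)\s*$")

def find_amc_pwr_on(config_text):
    """Return (line_number, [fields]) for every active amc_pwr_on line."""
    hits = []
    for lineno, line in enumerate(config_text.splitlines(), start=1):
        m = PWR_ON.match(line)
        if m:
            fields = [f.strip() for f in m["args"].split(",")]
            hits.append((lineno, fields))
    return hits

# Sample text; "some_other_key" is a placeholder, not a real MCH option.
sample = """\
amc_pwr_on = 5, 100, 0
amc_pwr_on = 6, 100, 0
some_other_key = 1
"""

for lineno, fields in find_amc_pwr_on(sample):
    print(f"line {lineno}: amc_pwr_on fields {fields}")
```

An empty result means no active amc_pwr_on entries remain in the file.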
With T14 shorted both AMC and RTM now turn on! Now trying to short both T14 and T3...
Also works (AMC+RTM) with both T14 and T3 transistors shorted.
Tried another AMC. Also working with both transistors shorted.
@gkasprow Are there further tests that you want me to do, or is this sufficient for you to determine what the problem is and ensure that Sayma v2 power will work properly without any reworks?
That's all. Thanks
Nice! Great work all.
Will someone take care of documenting this all so that it's easy for people to set the racks up in future?
@gkasprow Also, I suppose hotplug should be carefully tested if we claim it is supported.
HT3 calls for extensive power system testing related to uTCA, MCH, AMC and RTM cards. I added a new item to call explicitly for publishing the MCH configuration file.
https://github.com/sinara-hw/sinara/issues/601#issuecomment-449977797
For comparison this is the configuration file I'm using.
https://github.com/jbqubit/sinara-testing/blob/master/sayma/tools/nat_mch_startup_cfg_sinara.txt
@sbourdeauducq If your boards are now working in the crate please close.
Blue and red LEDs on Sayma AMC blink rapidly, and all voltage LEDs on Sayma AMC are off.