openbmc / skeleton

Area for initial design & reference code. Will be replaced with proper implementation and design docs.
Apache License 2.0
10 stars 58 forks source link

Host watchdog timeout when RMC communicates with BMC continually #127

Closed johnhcwang closed 8 years ago

johnhcwang commented 8 years ago

Sometimes BMC needs more than 30s to finish the jobs when RMC communicates with BMC continually. Per OpenPower Host IPMI Spec (here), it specify the timer should be 30s. We need to improve the performance or change the spec to extend the poll interval. Any advice?

Step 1: Login to BMC by curl command

$ curl -c cjar -b cjar -k -X POST -H "Content-Type: application/json" -d '{"data": [ "root", "0penBmc" ] }' https://10.32.9.38/login
{
  "data": "User 'root' logged in", 
  "message": "200 OK", 
  "status": "ok"
}
$ curl -c cjar -b cjar -k https://10.32.9.38/org/openbmc/inventory/system/chassis/motherboard/bmc
{
  "data": {
    "fault": "False", 
    "fru_type": "BMC", 
    "is_fru": 0, 
    "manufacturer": "ASPEED", 
    "present": "False", 
    "version": "v1.0.3-dirty"
  }, 
  "message": "200 OK", 
  "status": "ok"

Step 2: Create 5 sessions to get inventory continually and make sure that all of them can get the information correctly

$for x in `seq 1 100`;do curl -b cjar -k https://10.32.9.38/org/openbmc/inventory/system/chassis/motherboard/enumerate ;done &
$for x in `seq 1 100`;do curl -b cjar -k https://10.32.9.38/org/openbmc/inventory/system/chassis/motherboard/enumerate ;done &
$for x in `seq 1 100`;do curl -b cjar -k https://10.32.9.38/org/openbmc/inventory/system/chassis/motherboard/enumerate ;done &
$for x in `seq 1 100`;do curl -b cjar -k https://10.32.9.38/org/openbmc/inventory/system/chassis/motherboard/enumerate ;done &
$for x in `seq 1 100`;do curl -b cjar -k https://10.32.9.38/org/openbmc/inventory/system/chassis/motherboard/enumerate ;done &

Step 3: Power on the system and then watchdog timeout sometimes

Jul 21 00:51:31 barreleye system_manager.py[982]: Turn on power and boot
Jul 21 00:51:31 barreleye system_manager.py[982]: Running System State: HOST_POWERING_ON
Jul 21 00:51:31 barreleye system_manager.py[982]: Setting watchdog poll interval to: 30000
Jul 21 00:51:31 barreleye system_manager.py[982]: Starting watchdog with poll interval: 30000
 = None
ftgmac100: NCSI interface down
Jul 21 00:51:33 barreleye kernel: ftgmac100: NCSI interface down
Jul 21 00:51:33 barreleye kernel[972]: ftgmac100: NCSI interface down
Jul 21 00:51:33 barreleye system_manager.py[982]: Running System State: HOST_POWERED_ON
Jul 21 00:51:33 barreleye system_manager.py[982]: Starting process: /usr/sbin/goto_system_state.py INVENTORY_UPLOADED inventory_upload.py: inventory_upload
ftgmac100: NCSI interface up
Jul 21 00:51:34 barreleye kernel: ftgmac100: NCSI interface up
Jul 21 00:51:34 barreleye kernel[972]: ftgmac100: NCSI interface up
^[
Jul 21 00:51:55 barreleye system_manager.py[982]: ERROR (name_properties): "name" property in /bmc/inventory/hdd_backplane is incorrect ("BARRELEYE Exander" instead of base node name)
Jul 21 00:51:55 barreleye system_manager.py[982]: ERROR (name_properties): "name" property in /bmc/inventory/sas_expander is incorrect ("BARRELEYE BPx15" instead of base node name)
Jul 21 00:51:55 barreleye system_manager.py[982]: ERROR (name_properties): "name" property in /bmc/inventory/misc is incorrect ("BARRELEYE MB" instead of base node name)
Jul 21 00:51:55 barreleye system_manager.py[982]: ERROR: Input tree has errors, aborting (use -f to force output)
Jul 21 00:51:56 barreleye system_manager.py[982]: Partition 'BMC_INV' not found !
Jul 21 00:51:56 barreleye system_manager.py[982]: Uploading inventory to PNOR in dtb format...
Jul 21 00:51:56 barreleye system_manager.py[982]: Running System State: INVENTORY_UPLOADED
Jul 21 00:51:56 barreleye system_manager.py[982]: Starting process: /usr/sbin/pcie_slot_present.exe: pcie_present
Jul 21 00:51:56 barreleye system_manager.py[982]: Setting fans to max
Jul 21 00:51:56 barreleye system_manager.py[982]: Booting host
Jul 21 00:51:57 barreleye system_manager.py[982]: Checking Presence: SLOT0_RISER_PRESENT
Jul 21 00:51:57 barreleye system_manager.py[982]: GPIO Lookup:  SLOT0_RISER_PRESENT = 424,in
Jul 21 00:51:57 barreleye system_manager.py[982]: Checking Presence: SLOT1_RISER_PRESENT
Jul 21 00:51:57 barreleye system_manager.py[982]: GPIO Lookup:  SLOT1_RISER_PRESENT = 425,in
Jul 21 00:51:57 barreleye system_manager.py[982]: Checking Presence: SLOT2_RISER_PRESENT
Jul 21 00:51:57 barreleye system_manager.py[982]: GPIO Lookup:  SLOT2_RISER_PRESENT = 426,in
Jul 21 00:51:58 barreleye system_manager.py[982]: Checking Presence: SLOT0_PRESENT
Jul 21 00:51:58 barreleye system_manager.py[982]: GPIO Lookup:  SLOT0_PRESENT = 427,in
Jul 21 00:51:58 barreleye system_manager.py[982]: Using primary side of the bios flash
Jul 21 00:51:58 barreleye system_manager.py[982]: Checking Presence: SLOT1_PRESENT
Jul 21 00:51:58 barreleye system_manager.py[982]: GPIO Lookup:  SLOT1_PRESENT = 428,in
Jul 21 00:51:58 barreleye system_manager.py[982]: Checking Presence: SLOT2_PRESENT
Jul 21 00:51:58 barreleye system_manager.py[982]: Running System State: HOST_BOOTING
Jul 21 00:51:59 barreleye system_manager.py[982]: GPIO Lookup:  SLOT2_PRESENT = 429,in
Jul 21 00:51:59 barreleye system_manager.py[982]: Checking Presence: MEZZ0_PRESENT
Jul 21 00:51:59 barreleye system_manager.py[982]: GPIO Lookup:  MEZZ0_PRESENT = 432,in
Jul 21 00:51:59 barreleye system_manager.py[982]: Checking Presence: MEZZ1_PRESENT
Jul 21 00:51:59 barreleye system_manager.py[982]: GPIO Lookup:  MEZZ1_PRESENT = 433,in
Jul 21 00:52:01 barreleye system_manager.py[982]: Polling watchdog: 0
Jul 21 00:52:01 barreleye system_manager.py[982]: Watchdog Error, Hard Rebooting
Jul 21 00:52:01 barreleye system_manager.py[982]: Turn off power
Jul 21 00:52:01 barreleye system_manager.py[982]: Stopping watchdog
Jul 21 00:52:01 barreleye system_manager.py[982]: Running System State: HOST_POWERING_OFF
ftgmac100: NCSI interface up
Jul 21 00:52:03 barreleye kernel: ftgmac100: NCSI interface up
Jul 21 00:52:03 barreleye kernel[972]: ftgmac100: NCSI interface up
Jul 21 00:52:03 barreleye system_manager.py[982]: Running System State: HOST_POWERED_OFF
Jul 21 00:52:03 barreleye system_manager.py[982]: Deleting OCC device
Jul 21 00:52:03 barreleye system_manager.py[982]: Turn on power and boot
Jul 21 00:52:03 barreleye system_manager.py[982]: Deleting OCC device
Jul 21 00:52:03 barreleye system_manager.py[982]: Running System State: HOST_POWERING_ON

Step 4: Extend the watchdog timer to 300 seconds for debugging. We can see that BMC takes more than 30s to finish the jobs.

Jul 21 01:29:44 barreleye system_manager.py[918]: Turn on power and boot
Jul 21 01:29:44 barreleye system_manager.py[918]: Running System State: HOST_POWERING_ON
Jul 21 01:29:44 barreleye system_manager.py[918]: Setting watchdog poll interval to: 300000
 = None
Jul 21 01:29:44 barreleye system_manager.py[918]: Starting watchdog with poll interval: 300000
Jul 21 01:29:46 barreleye system_manager.py[918]: Running System State: HOST_POWERED_ON
ftgmac100: NCSI interface down
Jul 21 01:29:46 barreleye system_manager.py[918]: Starting process: /usr/sbin/goto_system_state.py INVENTORY_UPLOADED inventory_upload.py: inventory_upload
Jul 21 01:29:46 barreleye kernel: ftgmac100: NCSI interface down
Jul 21 01:29:46 barreleye kernel[923]: ftgmac100: NCSI interface down
ftgmac100: NCSI interface up
Jul 21 01:29:48 barreleye kernel: ftgmac100: NCSI interface up
Jul 21 01:29:48 barreleye kernel[923]: ftgmac100: NCSI interface up
Jul 21 01:30:08 barreleye system_manager.py[918]: ERROR (name_properties): "name" property in /bmc/inventory/hdd_backplane is incorrect ("BARRELEYE Exander" instead of base node name)
Jul 21 01:30:08 barreleye system_manager.py[918]: ERROR (name_properties): "name" property in /bmc/inventory/sas_expander is incorrect ("BARRELEYE BPx15" instead of base node name)
Jul 21 01:30:08 barreleye system_manager.py[918]: ERROR (name_properties): "name" property in /bmc/inventory/misc is incorrect ("BARRELEYE MB" instead of base node name)
Jul 21 01:30:08 barreleye system_manager.py[918]: ERROR: Input tree has errors, aborting (use -f to force output)
Jul 21 01:30:09 barreleye system_manager.py[918]: Partition 'BMC_INV' not found !
Jul 21 01:30:09 barreleye system_manager.py[918]: Uploading inventory to PNOR in dtb format...
Jul 21 01:30:09 barreleye system_manager.py[918]: Running System State: INVENTORY_UPLOADED
Jul 21 01:30:09 barreleye system_manager.py[918]: Starting process: /usr/sbin/pcie_slot_present.exe: pcie_present
Jul 21 01:30:09 barreleye system_manager.py[918]: Setting fans to max
Jul 21 01:30:10 barreleye system_manager.py[918]: Booting host
Jul 21 01:30:10 barreleye system_manager.py[918]: Checking Presence: SLOT0_RISER_PRESENT
Jul 21 01:30:11 barreleye system_manager.py[918]: GPIO Lookup:  SLOT0_RISER_PRESENT = 424,in
Jul 21 01:30:11 barreleye system_manager.py[918]: Checking Presence: SLOT1_RISER_PRESENT
Jul 21 01:30:11 barreleye system_manager.py[918]: GPIO Lookup:  SLOT1_RISER_PRESENT = 425,in
Jul 21 01:30:11 barreleye system_manager.py[918]: Checking Presence: SLOT2_RISER_PRESENT
Jul 21 01:30:12 barreleye system_manager.py[918]: GPIO Lookup:  SLOT2_RISER_PRESENT = 426,in
Jul 21 01:30:12 barreleye system_manager.py[918]: Using primary side of the bios flash
Jul 21 01:30:12 barreleye system_manager.py[918]: Checking Presence: SLOT0_PRESENT
Jul 21 01:30:13 barreleye system_manager.py[918]: GPIO Lookup:  SLOT0_PRESENT = 427,in
Jul 21 01:30:13 barreleye system_manager.py[918]: Running System State: HOST_BOOTING
Jul 21 01:30:13 barreleye system_manager.py[918]: Checking Presence: SLOT1_PRESENT
Jul 21 01:30:13 barreleye system_manager.py[918]: GPIO Lookup:  SLOT1_PRESENT = 428,in
Jul 21 01:30:14 barreleye system_manager.py[918]: Checking Presence: SLOT2_PRESENT
Jul 21 01:30:14 barreleye system_manager.py[918]: GPIO Lookup:  SLOT2_PRESENT = 429,in
Jul 21 01:30:14 barreleye system_manager.py[918]: Checking Presence: MEZZ0_PRESENT
Jul 21 01:30:14 barreleye system_manager.py[918]: GPIO Lookup:  MEZZ0_PRESENT = 432,in
Jul 21 01:30:14 barreleye system_manager.py[918]: Checking Presence: MEZZ1_PRESENT
Jul 21 01:30:14 barreleye system_manager.py[918]: GPIO Lookup:  MEZZ1_PRESENT = 433,in
Jul 21 01:30:18 barreleye system_manager.py[918]: Setting watchdog poll interval to: 120000
Jul 21 01:30:18 barreleye system_manager.py[918]: Stopping watchdog
Jul 21 01:30:18 barreleye system_manager.py[918]: Starting watchdog with poll interval: 120000
anoo1 commented 8 years ago

From the log, the 'inventory upload' process is taking ~15 seconds to execute, so we'll move the host watchdog to start after the 'inventory upload', right before the host is initialized.

williamspatrick commented 8 years ago

That makes sense. The watchdog is there to monitor for the host hanging, so we shouldn't start the timer until we are right about at the point where host code starts. Should we move it until after the power sequence and the magic sequence also?

williamspatrick commented 8 years ago

@anoo1 - Did this get fixed in 'master' also?

anoo1 commented 8 years ago

I need to do a pull request for it this week.