openbmc / bmcweb

A do everything Redfish, KVM, GUI, and DBus webserver for OpenBMC
Apache License 2.0
154 stars 131 forks source link

firmware update fails when using latest bmcweb #206

Closed geissonator closed 3 years ago

geissonator commented 3 years ago

Describe the bug WIth commit b623d9c1b6605978eb6158619bb43c79a9f543fd, our Hardware CI test on witherspoon was unable to do a code update to a new level of code once that change was on the system. Some discussion has occurred within https://gerrit.openbmc-project.xyz/c/openbmc/openbmc/+/43523.

Environment What OpenBMC platform was this found on: Witherspoon What specific OpenBMC versions did you use: https://gerrit.openbmc-project.xyz/c/openbmc/openbmc/+/43523

To Reproduce Steps to reproduce the behavior:

  1. Flash a build with that bmcweb commit. Then try to code update to a different level of code using the OpenBMC robot test case for code update.

Is this a regression Yes, we test this flow 10-20 times a day on this system and this is a solid failure from this commit.

Other Info @gkeishin did some looking internally at our system and said it looked like there's a momentary blip in time while the test case is polling to check the state of the image, that he gets an error returned on the REST API and that causes the test case to fail.

Next steps, we're trying to find someone on the IBM team to rebuild bmcweb with debug symbols on so we can get a journal of the debug logs when the error occurs.

geissonator commented 3 years ago

I managed to get a recreate with some debug enabled. Looks to me like the robot test code gets the image uploaded and then immediately queries the old REST API for the status of the image. That's where the new code introduced in b623d9c returns this new error, and causes the code update test to fail.

Specifically these traces:

Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [ERROR "openbmc_dbus_rest.hpp":1699] Bad dbus request error: generic:53
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [ERROR "openbmc_dbus_rest.hpp":1699] Bad dbus request error: generic:53
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [ERROR "openbmc_dbus_rest.hpp":1699] Bad dbus request error: generic:53

Full trace of sequence:


Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "http_connection.hpp":259] 0xf7df68 Connection open, total 7
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "timer_queue.hpp":53] timer add inside: 0xe37a78 274
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "http_connection.hpp":753] 0xf1ffa0 timer added: 0xe37a78 274
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "http_connection.hpp":496] 0xf1ffa0 doReadHeaders
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [ERROR "http_connection.hpp":504] 0xf1ffa0 async_read_header 217 Bytes
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "http_connection.hpp":699] 0xf1ffa0 timer cancelled: 0xe37a78 274
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "timer_queue.hpp":53] timer add inside: 0xe37a78 275
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "http_connection.hpp":753] 0xf1ffa0 timer added: 0xe37a78 275
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "http_connection.hpp":590] Starting quick deadline
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "http_connection.hpp":598] 0xf1ffa0 doRead
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "http_connection.hpp":605] 0xf1ffa0 async_read 30 Bytes
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "http_connection.hpp":699] 0xf1ffa0 timer cancelled: 0xe37a78 275
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "timer_queue.hpp":53] timer add inside: 0xe37a78 276
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "http_connection.hpp":753] 0xf1ffa0 timer added: 0xe37a78 276
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "http_connection.hpp":699] 0xf1ffa0 timer cancelled: 0xe37a78 276
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "http_connection.hpp":476] Fetch the client IP address
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [INFO "http_connection.hpp":326] Request:  0xf1ffa0 HTTP/1.1 POST /login ::ffff:<BMC_IP>
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "routing.hpp":1297] Matched rule '/login' 4 / 16
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: pam_succeed_if(webserver:auth): requirement "user ingroup redfish" was met by user "root"
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "http_response.hpp":120] calling completion handler
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "http_response.hpp":123] completion handler was valid
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [INFO "http_connection.hpp":411] Response: 0xf1ffa0 /login 200 keepalive=1
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "timer_queue.hpp":53] timer add inside: 0xe37a78 277
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "http_connection.hpp":753] 0xf1ffa0 timer added: 0xe37a78 277
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "http_connection.hpp":657] 0xf1ffa0 doWrite
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "http_connection.hpp":665] 0xf1ffa0 async_write 775 bytes
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "http_connection.hpp":699] 0xf1ffa0 timer cancelled: 0xe37a78 277
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "http_connection.hpp":683] 0xf1ffa0 Clearing response
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "http_response.hpp":101] 0xf221f0 Clearing response containers
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "http_connection.hpp":496] 0xf1ffa0 doReadHeaders
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [ERROR "http_connection.hpp":504] 0xf1ffa0 async_read_header 332 Bytes
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "authorization.hpp":106] [AuthMiddleware] X-Auth-Token authentication
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "timer_queue.hpp":53] timer add inside: 0xe37a78 278
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "http_connection.hpp":753] 0xf1ffa0 timer added: 0xe37a78 278
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "http_connection.hpp":563] Starting slow deadline
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "http_connection.hpp":573] QueryParams:
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "http_connection.hpp":598] 0xf1ffa0 doRead
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "http_connection.hpp":605] 0xf1ffa0 async_read 0 Bytes
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "http_connection.hpp":699] 0xf1ffa0 timer cancelled: 0xe37a78 278
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "timer_queue.hpp":53] timer add inside: 0xe37a78 279
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "http_connection.hpp":753] 0xf1ffa0 timer added: 0xe37a78 279
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "http_connection.hpp":699] 0xf1ffa0 timer cancelled: 0xe37a78 279
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "http_connection.hpp":476] Fetch the client IP address
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [INFO "http_connection.hpp":326] Request:  0xf1ffa0 HTTP/1.1 GET /xyz/openbmc_project/software/a94991db/attr/Activation ::ffff:<BMC_IP>
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "routing.hpp":1297] Matched rule '/xyz/<path>' 2 / 4
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "routing.hpp":1333] userName = root userRole = priv-admin
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "openbmc_dbus_rest.hpp":1649] handleGet: /xyz/openbmc_project/software/a94991db prop:Activation
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [ERROR "openbmc_dbus_rest.hpp":1699] Bad dbus request error: generic:53
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [ERROR "openbmc_dbus_rest.hpp":1699] Bad dbus request error: generic:53
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [ERROR "openbmc_dbus_rest.hpp":1699] Bad dbus request error: generic:53
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "http_response.hpp":120] calling completion handler
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [DEBUG "http_response.hpp":123] completion handler was valid
Jun 01 19:34:59 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:59) [INFO "http_connection.hpp":411] Response: 0xf1ffa0 /xyz/openbmc_project/software/a94991db/attr/Activation 404 keepalive=1```
geissonator commented 3 years ago

A GET on the same path works fine after the fact, so some sort of race condition, maybe with the new object getting on D-Bus or into mapper?

curl -k -H "X-Auth-Token: $TOKEN" -X GET https://${BMC_IP}/xyz/openbmc_project/software/a94991db/attr/Activation
{
  "data": "xyz.openbmc_project.Software.Activation.Activations.Ready",                                                                                                        
  "message": "200 OK",                                                                                                                                                        
  "status": "ok"                                                                                                                                                              
}
geissonator commented 3 years ago

We must have a really slow DBUS API because the image finished and un-tar'd 4 second earlier:

Jun 01 19:34:55 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:55) [INFO "http_connection.hpp":326] Request:  0xfa9e50 HTTP/1.1 POST /upload/image ::ffff:<BMC_IP>
Jun 01 19:34:55 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:55) [DEBUG "routing.hpp":1297] Matched rule '/upload/image' 4 / 48
Jun 01 19:34:55 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:55) [DEBUG "routing.hpp":1333] userName = root userRole = priv-admin
Jun 01 19:34:55 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:55) [DEBUG "image_upload.hpp":100] Writing file to /tmp/images/65a9b297-977f-418a-8b35-932e263a74fd
Jun 01 19:34:55 witherspoon-YL30UF74T02X phosphor-version-software-manager[492]: Untaring
Jun 01 19:34:55 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:55) [DEBUG "image_upload.hpp":63] Match fired
Jun 01 19:34:55 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:55) [DEBUG "image_upload.hpp":87] ending response
Jun 01 19:34:55 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:55) [DEBUG "http_response.hpp":120] calling completion handler
Jun 01 19:34:55 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:55) [DEBUG "http_response.hpp":123] completion handler was valid
Jun 01 19:34:55 witherspoon-YL30UF74T02X bmcweb[192]: (2021-06-01 19:34:55) [INFO "http_connection.hpp":411] Response: 0xfa9e50 /upload/image 200 keepalive=1
geissonator commented 3 years ago

Some further debug in the bmcweb code shows we're hitting errors when trying to use sdbusplus to "GetAll" properties under the mapper service.

Jun 01 21:26:46 witherspoon-YL30UF74T02X bmcweb[510]: (2021-06-01 21:26:46) [ERROR "openbmc_dbus_rest.hpp":1699] Bad dbus request error: for service xyz.openbmc_project.ObjectMapper for object /xyz/openbmc_project/software/a94991db for interface org.freedesktop.DBus.Introspectable

Jun 01 21:26:46 witherspoon-YL30UF74T02X bmcweb[510]: (2021-06-01 21:26:46) [ERROR "openbmc_dbus_rest.hpp":1699] Bad dbus request error: for service xyz.openbmc_project.ObjectMapper for object /xyz/openbmc_project/software/a94991db for interface org.freedesktop.DBus.Peer

Jun 01 21:26:46 witherspoon-YL30UF74T02X bmcweb[510]: (2021-06-01 21:26:46) [ERROR "openbmc_dbus_rest.hpp":1699] Bad dbus request error: for service xyz.openbmc_project.ObjectMapper for object /xyz/openbmc_project/software/a94991db for interface org.freedesktop.DBus.Properties

I'm not sure why these are failing but I'm wondering if we should really fail this entire call if we get a failure like this? The previous code (prior to b623d9c) would just keep looping. If the property was found, it was returned, if it was not, we'd just get an empty response.

Or maybe we should try and make this code a bit smarter and ignore the mapper service? Here's what we get when we call the busctl that this function is using:

busctl call xyz.openbmc_project.ObjectMapper /xyz/openbmc_project/object_mapper xyz.openbmc_project.ObjectMapper GetObject sas "/xyz/openbmc_project/software/a94991db" 0   

a{sas} 3 

"xyz.openbmc_project.ObjectMapper" 3 
  "org.freedesktop.DBus.Introspectable" "org.freedesktop.DBus.Peer" "org.freedesktop.DBus.Properties" 

"xyz.openbmc_project.Software.BMC.Updater" 9 
   "org.freedesktop.DBus.Introspectable" "org.freedesktop.DBus.Peer" "org.freedesktop.DBus.Properties"
   "xyz.openbmc_project.Association.Definitions" "xyz.openbmc_project.Common.FilePath"
   "xyz.openbmc_project.Object.Delete" "xyz.openbmc_project.Software.Activation"
   "xyz.openbmc_project.Software.ExtendedVersion" "xyz.openbmc_project.Software.Version" 

"xyz.openbmc_project.Software.Version" 7
   "org.freedesktop.DBus.Introspectable" "org.freedesktop.DBus.Peer" "org.freedesktop.DBus.Properties"
   "xyz.openbmc_project.Common.FilePath" "xyz.openbmc_project.Object.Delete" 
   "xyz.openbmc_project.Software.ExtendedVersion" "xyz.openbmc_project.Software.Version" 
geissonator commented 3 years ago

Seems to be a limitation of mapper hosting the objects. When I manually run the command I see the failure as well:

busctl call xyz.openbmc_project.ObjectMapper /xyz/openbmc_project/software/a94991db org.freedesktop.DBus.Properties GetAll s org.freedesktop.DBus.Introspectable
Call failed: Unknown object '/xyz/openbmc_project/software/a94991db'.

Running the same against the xyz.openbmc_project.Software.BMC.Updater works fine (returns 0)

busctl call xyz.openbmc_project.Software.BMC.Updater /xyz/openbmc_project/software/a94991db org.freedesktop.DBus.Properties GetAll s org.freedesktop.DBus.Introspectable
a{sv} 0

So we can just weed out xyz.openbmc_project.ObjectMapper services or probably not call GetAll on org.freedesktop.DBus.* interfaces.

gtmills commented 3 years ago

FYI @timlee66

geissonator commented 3 years ago

Quick fix up at https://gerrit.openbmc-project.xyz/c/openbmc/bmcweb/+/43692

geissonator commented 3 years ago

@timlee66 - maintainers decided to revert via https://gerrit.openbmc-project.xyz/c/openbmc/bmcweb/+/43720 so you'll need to dig into the mapper issue if you'd like to get your change back in.