redhat-performance / badfish

Vendor-agnostic tool for managing bare-metal systems via the Redfish API
https://quads.dev
GNU General Public License v3.0

--check-boot with -i config/idrac_interfaces.yml does not always report correct boot-order entry #56

Closed atheurer closed 3 years ago

atheurer commented 4 years ago

When the current boot order does not match foreman (or any other entry in idrac_interfaces.yml), the output mistakenly reports that it matches "foreman". See below, where the actual boot order is nothing like the foreman boot order.

[root@e25-h21-740xd badfish]# ./badfish.py -u quads -p $password -i config/idrac_interfaces.yml -H mgmt-e25-h25-740xd --check-boot

[root@e25-h21-740xd badfish]# grep foreman_740 config/idrac_interfaces.yml
foreman_740xd_interfaces: NIC.Integrated.1-3-1,HardDisk.List.1-1,NIC.Integrated.1-1-1

[root@e25-h21-740xd badfish]# ./badfish.py -u quads -p $password -H mgmt-e25-h25-740xd --check-boot

sadsfae commented 4 years ago

Hey @atheurer, thanks for filing this. There are two potential causes that we've seen here:

1) Dell device order is sometimes cached

We've seen this before. If it's the same issue, it has less to do with badfish and more to do with Dell. Sometimes we have noticed that the interface order is cached somewhere and reported incorrectly (in fact, this caused us no end of heartache years ago).

Badfish simply reports what Redfish tells it. If you were to run the equivalent racadm command, I think you'd see the same result:

e.g.

ssh quads@mgmt-e25-h21-740xd "racadm get BIOS.BiosBootSettings.Bootseq"
[Key=BIOS.Setup.1-1#BiosBootSettings]
BootSeq=
NIC.Integrated.1-3-1
NIC.Integrated.1-1-1
NIC.Slot.7-1-1
HardDisk.List.1-1

I believe we brought this up with Dell some time ago but never got anywhere. If it's still happening on iDRAC9 (what the 740xd and above run), we should open another request to have them look at it.

2) Badfish only tries to manage the items listed in the YAML key/value pair.

Badfish only knows to set the boot order based on the key/value pairs in idrac_interfaces.yml, so if more or fewer entries need to be managed, that should be changed there. Alternatively, on the BIOS side of the system, fewer interfaces/devices should be members of the BiosBootSettings.Bootseq table.

Perhaps too many interfaces are listed on the BIOS side that may not need to be there. For example, in Scale Lab we only list a minimum at the BIOS level, as only the first and second internal interfaces and the public NIC are really required to have PXE set. Those are the ones used by application deployment requiring PXE on internal networks, and their settings are reflected in idrac_interfaces.yml.

Alternatively, try adding the rest of the interfaces, in the order you want them, to idrac_interfaces.yml as key/value pairs, and it should do the right thing.
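
As a rough illustration of the key/value layout described above, the sketch below turns one line of idrac_interfaces.yml into an ordered list of device names. The function name `parse_interfaces` is hypothetical and the parsing is deliberately simplistic (plain string splitting rather than a YAML library); it is not how badfish itself reads the file.

```python
def parse_interfaces(text):
    """Parse 'key: dev1,dev2,...' lines into {key: [dev1, dev2, ...]}.

    Hypothetical helper for illustration only; a real tool would
    normally use a YAML parser instead of splitting strings.
    """
    entries = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition(":")
        entries[key.strip()] = [d.strip() for d in value.split(",") if d.strip()]
    return entries


# The entry quoted earlier in this thread.
sample = (
    "foreman_740xd_interfaces: "
    "NIC.Integrated.1-3-1,HardDisk.List.1-1,NIC.Integrated.1-1-1"
)
config = parse_interfaces(sample)
print(config["foreman_740xd_interfaces"])
```

The order of the list is what matters: it is the order in which the named devices are meant to appear in the boot sequence.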

atheurer commented 4 years ago

Here is what the system in question reports from racadm:

# ssh quads@mgmt-e25-h21-740xd "racadm get BIOS.BiosBootSettings.Bootseq"
quads@mgmt-e25-h21-740xd's password:
[Key=BIOS.Setup.1-1#BiosBootSettings]
BootSeq=NIC.Integrated.1-3-1,NIC.Integrated.1-1-1,NIC.Slot.7-2-1,NIC.Slot.7-1-1,HardDisk.List.1-1,NIC.Integrated.1-2-1,NIC.Integrated.1-4-1

And badfish's config:

# grep foreman_740 config/idrac_interfaces.yml
foreman_740xd_interfaces: NIC.Integrated.1-3-1,HardDisk.List.1-1,NIC.Integrated.1-1-1

What racadm reports is different from what foreman_740xd_interfaces is defined as, so I don't think it is a caching issue from racadm.

A run of badfish immediately after the previous commands:

# ./badfish.py -u quads -p $password -H mgmt-e25-h25-740xd --check-boot -i config/idrac_interfaces.yml
- INFO     - Systems service: /redfish/v1/Systems/System.Embedded.1.
- INFO     - Managers service: /redfish/v1/Managers/iDRAC.Embedded.1.
- WARNING  - Current boot order is set to: foreman.

sadsfae commented 4 years ago

> Here is what the system in question reports from racadm:
>
> # ssh quads@mgmt-e25-h21-740xd "racadm get BIOS.BiosBootSettings.Bootseq"
> quads@mgmt-e25-h21-740xd's password:
> [Key=BIOS.Setup.1-1#BiosBootSettings]
> BootSeq=NIC.Integrated.1-3-1,NIC.Integrated.1-1-1,NIC.Slot.7-2-1,NIC.Slot.7-1-1,HardDisk.List.1-1,NIC.Integrated.1-2-1,NIC.Integrated.1-4-1
>
> And badfish's config:
>
> # grep foreman_740 config/idrac_interfaces.yml
> foreman_740xd_interfaces: NIC.Integrated.1-3-1,HardDisk.List.1-1,NIC.Integrated.1-1-1
>
> What racadm reports is different from what foreman_740xd_interfaces is defined as, so I don't think it is a caching issue from racadm.
>
> A run of badfish immediately after the previous commands:
>
> # ./badfish.py -u quads -p $password -H mgmt-e25-h25-740xd --check-boot -i config/idrac_interfaces.yml
> - INFO     - Systems service: /redfish/v1/Systems/System.Embedded.1.
> - INFO     - Managers service: /redfish/v1/Managers/iDRAC.Embedded.1.
> - WARNING  - Current boot order is set to: foreman.

Hey @atheurer, try setting the r740xd key/value strings in idrac_interfaces.yml to the exact, full order of interface strings that you want, including any extra interfaces that racadm lists, for both the Foreman and Director lines, and see if this sets it for you.

Badfish won't change what you don't tell it to, and there are a whole lot of extra devices listed there that are enabled at the system/BIOS level, so they'll be in the mix. Badfish would try to sort them as told, but the extra, unlisted interfaces seem like the culprit here.

Possibly not all of those interfaces need to be PXE-eligible on the system side, but let's see if this works. A simple fix in this case is to just add them to idrac_interfaces.yml for badfish to manage, but they should probably be removed from the BIOS table on the system side at some point if they are not needed, to reduce complexity.

@QuantumPosix please check that the interfaces in the BIOS PXE tables exactly match the boot order for both foreman and director modes. I know that new devices have been added to the PXE/BIOS boot table since the upstream idrac_interfaces.yml file was last updated. We can then update this so it lands not only here but also in our auto-built docker images, so they are up to date and match ALIAS. Feel free to throw a pull request at the development branch here or ping us with the changes and we'll be happy to merge them.

atheurer commented 4 years ago

This GitHub issue is not about asking badfish to change anything for me. It is about badfish incorrectly reporting that there is an entry (representing an ordered list of boot devices) in idrac_interfaces.yml that exactly matches the ordered list of boot devices reported by iDRAC. I don't understand why badfish can't simply respond with "no matches found" when that is the case.

Based on your reply, I am getting the impression that badfish only works reliably if all of the enabled boot devices on the managed system are always listed in idrac_interfaces.yml; otherwise there is confusion. If so, then at the very least the documentation should state that users must first inventory all of their managed system's boot devices, then create their own idrac_interfaces.yml to ensure it always includes the same devices.

sadsfae commented 4 years ago

> This GitHub issue is not about asking badfish to change anything for me. It is about badfish incorrectly reporting that there is an entry (representing an ordered list of boot devices) in idrac_interfaces.yml that exactly matches the ordered list of boot devices reported by iDRAC. I don't understand why badfish can't simply respond with "no matches found" when that is the case.
>
> Based on your reply, I am getting the impression that badfish only works reliably if all of the enabled boot devices on the managed system are always listed in idrac_interfaces.yml; otherwise there is confusion. If so, then at the very least the documentation should state that users must first inventory all of their managed system's boot devices, then create their own idrac_interfaces.yml to ensure it always includes the same devices.

This is correct, but we can't anticipate how other labs/infra may change their gear, so we can't always know whether idrac_interfaces.yml needs updating. The shipped file is more of an example built from the sparse knowledge we have of the systems at other places that use badfish, and not something we'd even know to keep updated.

Until now, the entries listed in idrac_interfaces.yml accurately described the interfaces in the BIOS PXE table for both Scale Lab and ALIAS (which we do not manage; we just author and maintain tooling that it consumes), so it did not need updating. Recently, there was a change to enable and add more interfaces to the BIOS PXE table on the ALIAS / Dell system side at the request of the OSP and/or OCS teams.

Keeping this updated is best effort because, for the reasons outlined above, we can't always know what changes people make to their hardware outside of our purview, but we'll happily update idrac_interfaces.yml.

In the meantime we're suggesting that you modify this yourself and see if it achieves what you want.

We'll update the documentation further to make this clearer. In the meantime, please let us know if adding the additional interface strings helps your situation and achieves parity in racadm output and behavior.

grafuls commented 4 years ago

> This GitHub issue is not about asking badfish to change anything for me. It is about badfish incorrectly reporting that there is an entry (representing an ordered list of boot devices) in idrac_interfaces.yml that exactly matches the ordered list of boot devices reported by iDRAC. I don't understand why badfish can't simply respond with "no matches found" when that is the case.

Thanks for raising this. There is clearly an issue here, since in this case badfish's expected behaviour is to return the message "Current boot order does not match any of the given." followed by the list of devices in order, as if the -i parameter hadn't been passed. I think I might have found the culprit for this issue and will post a patch soon.

> Based on your reply, I am getting the impression that badfish only works reliably if all of the enabled boot devices on the managed system are always listed in idrac_interfaces.yml; otherwise there is confusion. If so, then at the very least the documentation should state that users must first inventory all of their managed system's boot devices, then create their own idrac_interfaces.yml to ensure it always includes the same devices.

Will add this to the documentation as well, thanks for the suggestion.
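
The expected behaviour described here can be sketched as an exact, order-sensitive comparison between the reported boot order and each configured entry. This is only a minimal sketch under stated assumptions: `find_matching_entry` and `check_boot_message` are hypothetical names, the entries are keyed by "foreman" for brevity, and none of this is taken from the badfish source. The two message strings are the ones quoted in this thread.

```python
def find_matching_entry(current_order, entries):
    """Return the name of the entry whose device list equals the
    current boot order exactly (same devices, same positions)."""
    for name, devices in entries.items():
        if list(devices) == list(current_order):
            return name
    return None


def check_boot_message(current_order, entries):
    """Build the message a check could print (hypothetical helper)."""
    name = find_matching_entry(current_order, entries)
    if name is None:
        return "Current boot order does not match any of the given."
    return f"Current boot order is set to: {name}."


# Boot order reported by racadm earlier in this thread.
current = [
    "NIC.Integrated.1-3-1", "NIC.Integrated.1-1-1", "NIC.Slot.7-2-1",
    "NIC.Slot.7-1-1", "HardDisk.List.1-1", "NIC.Integrated.1-2-1",
    "NIC.Integrated.1-4-1",
]
# The foreman entry from the reporter's config file.
entries = {
    "foreman": ["NIC.Integrated.1-3-1", "HardDisk.List.1-1", "NIC.Integrated.1-1-1"],
}
print(check_boot_message(current, entries))
```

With the data from this issue the two lists differ in both length and order, so a strict comparison like this one reports no match.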

bengland2 commented 4 years ago

I saw this too. What I object to is that it is unclear whether the -t command succeeded, and when I go into the BIOS I clearly see that it did not. Be really clear: if you can't do something, tell the user; it will save a lot of aggro. When I do this:

[bengland@localhost upi]$ podman run -it --rm quads/badfish -H mgmt-e26-h03-740xd -u quads -p 502328 -i config/idrac_interfaces.yml --check-boot
- INFO     - Systems service: /redfish/v1/Systems/System.Embedded.1.
- INFO     - Managers service: /redfish/v1/Managers/iDRAC.Embedded.1.
- WARNING  - Current boot order does not match any of the given.
- INFO     - Current boot order:
- INFO     - 1: NIC.Integrated.1-1-1
- INFO     - 2: NIC.Integrated.1-3-1
- INFO     - 3: NIC.Slot.7-1-1
- INFO     - 4: NIC.Slot.7-2-1
- INFO     - 5: NIC.Integrated.1-2-1
- INFO     - 6: NIC.Integrated.1-4-1
- INFO     - 7: HardDisk.List.1-1

It is completely wrong, and in fact the BIOS reports what I set it to: C:, NIC7 port 1, and other interfaces after that. The DRAC GUI gets it wrong too! Configuration -> Boot Settings -> Set Boot Order Enable has the wrong stuff.

[Screenshot from 2020-02-11 14-16-13]

QuantumPosix commented 4 years ago

With some testing, it appears the functionality is correct and what we are seeing here are two different use cases. What happened above is that the boot order was changed in the BIOS to something different from what was set in idrac_interfaces.yml. This is why HardDisk is not listed second in the boot order. Ultimately, changing the boot order through idrac_interfaces.yml works as intended, as the following setup/test shows:

Foreman setup in idrac_interfaces.yml

foreman_740xd_interfaces: NIC.Integrated.1-3-1,NIC.Integrated.1-1-1,NIC.Slot.7-2-1,NIC.Slot.7-1-1,HardDisk.List.1-1,NIC.Integrated.1-2-1,NIC.Integrated.1-4-1

--check-boot results:

--check-boot
- INFO     - Systems service: /redfish/v1/Systems/System.Embedded.1.
- INFO     - Managers service: /redfish/v1/Managers/iDRAC.Embedded.1.
- INFO     - Current boot order:
- INFO     - 1: NIC.Integrated.1-3-1
- INFO     - 2: NIC.Slot.7-1-1
- INFO     - 3: NIC.Slot.7-2-1
- INFO     - 4: HardDisk.List.1-1
- INFO     - 5: NIC.Integrated.1-1-1
- INFO     - 6: NIC.Integrated.1-2-1

Modified the foreman boot order in idrac_interfaces.yml:

foreman_740xd_interfaces: HardDisk.List.1-1,NIC.Slot.7-1-1

Ran badfish command on host:

-i config/idrac_interfaces.yml -t foreman

- INFO     - Systems service: /redfish/v1/Systems/System.Embedded.1.
- INFO     - Managers service: /redfish/v1/Managers/iDRAC.Embedded.1.
- INFO     - Job queue for iDRAC mgmt-host successfully cleared.
- WARNING  - Waiting for host to be up.
- INFO     - Polling for host state: On
  Polling: [------------------->] 100% - Host state: On  
- INFO     - PATCH command passed to update boot order.
- INFO     - POST command passed to create target config job.
- INFO     - JID_819694690271 job ID successfully created.
- INFO     - Command passed to check job status, code 200 returned.
- WARNING  - JobStatus not scheduled, current status is: New.
- INFO     - Command passed to check job status, code 200 returned.
- INFO     - Job id JID_819694690271 successfully scheduled.
- INFO     - Command passed to ForceOff server, code return is 204.
- INFO     - Polling for host state: Not Down
  Polling: [------------------->] 100% - Host state: Off
- INFO     - Command passed to On server, code return is 204.

Results from above command:

--check-boot
- INFO     - Systems service: /redfish/v1/Systems/System.Embedded.1.
- INFO     - Managers service: /redfish/v1/Managers/iDRAC.Embedded.1.
- INFO     - Current boot order:
- INFO     - 1: HardDisk.List.1-1
- INFO     - 2: NIC.Slot.7-1-1
- INFO     - 3: NIC.Slot.7-2-1
- INFO     - 4: NIC.Integrated.1-1-1
- INFO     - 5: NIC.Integrated.1-2-1
- INFO     - 6: NIC.Integrated.1-3-1

This properly keeps the boot order to what was set by the user. If we do move forward with adding checks, we should check only as many devices as are listed in idrac_interfaces.yml for that specific type, matching their positions in the boot order, and not fail if there are additional devices. (This is because enabling/disabling PXE on interfaces is a longer task, that functionality is missing from the Redfish API as a whole, and the other boot devices still need to be present in the boot order so that a check-boot passes when the reservation is returned and re-released to a new tenant.)
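
The check proposed above (compare only as many positions as the entry lists, and tolerate extra devices after them) could look roughly like this. It is a sketch of the idea only; `matches_prefix` is a hypothetical name and not part of badfish.

```python
def matches_prefix(current_order, wanted):
    """True if the first len(wanted) devices of the current boot order
    equal the wanted list, position by position. Extra devices that
    follow in the boot order do not cause a failure."""
    wanted = list(wanted)
    if not wanted:
        return False  # an empty entry should not match everything
    return list(current_order[:len(wanted)]) == wanted


# Modified foreman entry from the test above.
wanted = ["HardDisk.List.1-1", "NIC.Slot.7-1-1"]

# Boot order reported before and after running "-t foreman".
before = [
    "NIC.Integrated.1-3-1", "NIC.Slot.7-1-1", "NIC.Slot.7-2-1",
    "HardDisk.List.1-1", "NIC.Integrated.1-1-1", "NIC.Integrated.1-2-1",
]
after = [
    "HardDisk.List.1-1", "NIC.Slot.7-1-1", "NIC.Slot.7-2-1",
    "NIC.Integrated.1-1-1", "NIC.Integrated.1-2-1", "NIC.Integrated.1-3-1",
]
print(matches_prefix(before, wanted), matches_prefix(after, wanted))
```

The trade-off is that a prefix check says nothing about the devices after the listed ones, which matches the intent stated above but differs from the strict "exact match" reading of the original report.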

sadsfae commented 4 years ago

@bengland2 refer to our intention on idrac_interfaces.yml and its purpose - it won't always match every environment because we can't possibly know what every environment looks like.

https://github.com/redhat-performance/badfish#usage

It's a guide and reference; it will only manage what you tell it to. There's documentation on modifying the key/value pairs and running it yourself as needed to match a different environment (container method via -v), or you can just modify the file directly if you are running from a GitHub clone. See https://github.com/redhat-performance/badfish#usage-via-docker

We will do our best to keep the example idrac_interfaces.yml matching our internal R&D labs, but recent changes are not always reflected, so the onus is on the user to modify it to fit their environment.

We'll push a PR to make sure idrac_interfaces.yml matches what is in ALIAS now, as that environment has changed since the file was last updated; however, you have all the means to change the entries yourself as needed. Nobody should be turning PXE on/off or needing to make BIOS changes; this really breaks our automation and causes delays handing off the hardware to other people.

We think there may still be a bug with --check-boot that we're looking into.

sadsfae commented 4 years ago

idrac_interfaces.yml should now match the PXE-enabled interfaces for r740xd. We're not yet able to reproduce any issues with --check-boot; please let us know otherwise.

bengland2 commented 4 years ago

@sadsfae @dblack I spoke with Chris Enright and am trying out his method:

As long as this works then I am fine with not changing things in BIOS, and Chris can always change it back with "-t foreman" when the reservation is returned. I'm trying it today with my UPI install.

bengland2 commented 4 years ago

badfish is setting the boot order as described in comment .-1 . My only suggestion is that in --check-boot, the current boot order should be explicitly listed even when it matches what's in idrac_interfaces.yml, so that you know you got the boot order set correctly. Right now it only prints it out if it doesn't match. For example:

[bengland@localhost badfish]$ for h in 3 5 7 ; do python3 badfish.py -u quads -p 502328 -i config/idrac_interfaces.yml -H mgmt-e26-h0${h}-740xd --check-boot ; done
- INFO     - Systems service: /redfish/v1/Systems/System.Embedded.1.
- INFO     - Managers service: /redfish/v1/Managers/iDRAC.Embedded.1.
- WARNING  - Current boot order is set to: director.

I'd like to see the boot order explicitly listed in all cases in order to be sure that it's what I intended. It lists the current boot order in this example:

[bengland@localhost badfish]$ python3 badfish.py -u quads -p 502328 -i config/idrac_interfaces.yml -H mgmt-e26-h05-740xd --check-boot
- INFO     - Systems service: /redfish/v1/Systems/System.Embedded.1.
- INFO     - Managers service: /redfish/v1/Managers/iDRAC.Embedded.1.
- WARNING  - Current boot order does not match any of the given.
- INFO     - Current boot order:
- INFO     - 1: HardDisk.List.1-1
- INFO     - 2: NIC.Integrated.1-1-1
- INFO     - 3: NIC.Integrated.1-2-1
- INFO     - 4: NIC.Integrated.1-4-1
- INFO     - 5: NIC.Slot.7-2-1
- INFO     - 6: NIC.Slot.7-1-1
- INFO     - 7: NIC.Integrated.1-3-1
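
The suggestion above, printing the boot order whether or not it matches a configured entry, might be sketched as follows. `boot_report` is a hypothetical helper, and the log prefix is only an imitation of the output shown in this thread, not badfish's actual logging code.

```python
def boot_report(current_order, match_name=None):
    """Return log-style lines that always include the numbered boot
    order, preceded by the match / no-match message."""
    def fmt(level, msg):
        # Mimics the "- LEVEL    - message" layout seen in the thread.
        return f"- {level:<8} - {msg}"

    if match_name is None:
        lines = [fmt("WARNING", "Current boot order does not match any of the given.")]
    else:
        lines = [fmt("WARNING", f"Current boot order is set to: {match_name}.")]
    lines.append(fmt("INFO", "Current boot order:"))
    for position, device in enumerate(current_order, start=1):
        lines.append(fmt("INFO", f"{position}: {device}"))
    return lines


lines = boot_report(["HardDisk.List.1-1", "NIC.Slot.7-1-1"], "director")
print("\n".join(lines))
```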

bengland2 commented 4 years ago

@cenright It works for me, I consider it resolved. Just took some getting used to it and how to make use of it. Might want to touch up documentation. For example, it's not obvious that you don't have to fill in all the interfaces in the boot order list, and it will just take the ones you care about and put them at the front of the list, which is good enough for openshift OCP4 install. Also, it's not obvious but you don't want to change "foreman", you want to leave that one alone so that you can revert the machine to how you found it when the reservation was given to you. And it doesn't let you choose anything besides foreman or director, so I edit "director_740xd_interfaces" entry to support openshift install and then use -t director, since I'm not going to switch reservation from openshift to openstack.

bengland2 commented 4 years ago

Clearly the lab team does not like it if you change the boot order, but unfortunately ocp4_upi_baremetal doesn't know how to operate without changing the boot order on Dells. Maybe there is a way in badfish to do a one-shot PXE boot on a particular NIC port to eliminate the need for this; I will look into it.

grafuls commented 3 years ago

> Clearly the lab team does not like it if you change the boot order, but unfortunately ocp4_upi_baremetal doesn't know how to operate without changing the boot order on Dells. Maybe there is a way in badfish to do a one-shot PXE boot on a particular NIC port to eliminate the need for this; I will look into it.

@bengland2 we already have a method for this specific functionality, where you can boot to a specific device [1] or to a specific MAC address [2]. Hope this is still useful for you.

[1] https://github.com/redhat-performance/badfish/#forcing-a-one-time-boot-to-a-specific-device
[2] https://github.com/redhat-performance/badfish/#forcing-a-one-time-boot-to-a-specific-mac-address
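
For background, the generic Redfish mechanism for a one-time boot is a PATCH to the system resource that sets the standard `BootSourceOverrideEnabled` and `BootSourceOverrideTarget` properties. The sketch below only builds that request body; it does not send anything, and it is not a description of how badfish implements the options linked above. In particular, the generic "Pxe" target does not select a specific NIC port or MAC address. `one_time_boot_payload` is a hypothetical name.

```python
import json


def one_time_boot_payload(target="Pxe"):
    """Build the body of a Redfish PATCH requesting a one-time boot
    override. "Once" means the override applies to the next boot only."""
    return {
        "Boot": {
            "BootSourceOverrideEnabled": "Once",
            "BootSourceOverrideTarget": target,
        }
    }


# Systems path as printed in the badfish logs in this thread.
system_path = "/redfish/v1/Systems/System.Embedded.1"
payload = one_time_boot_payload("Pxe")
print("PATCH", system_path)
print(json.dumps(payload, indent=2))
```

Because the override is set to "Once", the persistent boot order is left alone, which is the property the lab team cares about.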

bengland2 commented 3 years ago

I think we're all switching to use JetSki anyway. And I think JetSki auto-sets the boot order using badfish so this should not be an issue going forward.