Open larrymi opened 8 years ago
Looking back in the event log, I don't see any PSU Fault 1 or PSU Fault 2 events, so the exact relevance of these errors is not clear. However, the fact that they occurred between the last good deployment and the first bad one does raise a flag.
Please post /var/log/messages and /var/log/petitboot/pb-discover.log; perhaps Petitboot is not seeing the disks.
@sammj please see the attached logs that you requested. logs.tar.gz
From the logs, not even the kernel appears to think this machine has disks attached. Do you know exactly what should be appearing, and how it is attached (e.g. as part of a RAID array)?
Also please post /sys/firmware/opal/msglog in case there's a more fundamental problem.
@sammj in the logs that I included, there should be an lshw file. That contains the disks that were previously discovered.
@sammj here's the /sys/firmware/opal/msglog sysfirmwareopalmsglog.log.gz
[7279045336,7] PHB3: Timeout waiting for electrical link
[7279047461,7] PHB3: DLP train control: 0x0fd0001101000000
[7279049941,7] PHB3: Slot freset: Retrying
[7279051718,7] PHB3: Slot freset: Asserting PERST
[7330251688,7] PHB4: Timeout waiting for link up
If your disks are behind one of these PHBs this could be the issue - if possible can you update to a more recent Skiboot version and retest?
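For anyone wanting to check their own msglog for similar entries, a rough filter might look like the sketch below. The function names are illustrative assumptions, and the pattern only matches the two timeout messages quoted above; other firmware versions may word or prefix these lines differently.

```shell
#!/bin/sh
# Hypothetical helper: print PHB link timeout lines from an OPAL message log.
# Usage: phb_link_timeouts [logfile]   (defaults to /sys/firmware/opal/msglog)
phb_link_timeouts() {
    log="${1:-/sys/firmware/opal/msglog}"
    # Match only the two timeout messages quoted in this thread.
    grep -E 'PHB[0-9]+: Timeout waiting for (electrical link|link up)' "$log"
}

# Hypothetical helper: list the distinct PHBs that reported a timeout.
phb_with_timeouts() {
    phb_link_timeouts "$1" | sed -E 's/.*(PHB[0-9]+):.*/\1/' | sort -u
}
```

Running `phb_with_timeouts` with no argument reads the live log; comparing its output with the PHB your storage controller sits behind would show whether the timeouts are relevant.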
I have tried to build the latest pflash, and I am getting an error when I try to back up my existing skiboot:
To build it, I cloned the repository with "git clone https://github.com/open-power/skiboot.git". On the target ppc64el system, I then ran make: "cd skiboot/external/pflash" followed by "make".
Then this was from trying to back up the existing skiboot:
ubuntu@entei:~/skiboot/external/pflash$ ./pflash -r zImage.backup -P BOOTKERNEL
Couldn't initialise architecture flash structures
Then I had planned to run the command below to upgrade to v1.12:
./pflash -e -p zImage.epapr -P BOOTKERNEL
With the backup failing, I don't want to risk flashing to the latest until we understand this failure.
@sammj please take a look at the error above. Any idea as to what could be the issue and/or suggestion for how to go about updating?
@larrymi you will need to run pflash as root
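A cautious way to sequence the backup and flash steps is sketched below: refuse to run unless root, take the backup first, and only flash if the backup file exists and is non-empty. The function names and the `PFLASH` variable are hypothetical; the pflash arguments are exactly those quoted earlier in this thread, and this has not been tested on real hardware.

```shell
#!/bin/sh
# Path to the pflash binary; overridable (hypothetical variable).
PFLASH="${PFLASH:-./pflash}"

# Succeeds only when the given uid (default: the current user's) is 0.
is_root() {
    [ "${1:-$(id -u)}" -eq 0 ]
}

# Back up BOOTKERNEL to $1, then flash $2, stopping at the first failure.
backup_then_flash() {
    backup="$1"; image="$2"
    is_root || { echo "pflash must be run as root" >&2; return 1; }
    "$PFLASH" -r "$backup" -P BOOTKERNEL || return 1
    # Do not flash unless the backup is actually on disk and non-empty.
    [ -s "$backup" ] || { echo "backup missing or empty" >&2; return 1; }
    "$PFLASH" -e -p "$image" -P BOOTKERNEL
}
```

From a root shell this would be called as `backup_then_flash zImage.backup zImage.epapr`.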
Hi, I have started to see the same behavior, without a MAAS deployment. On reboot it suddenly doesn't see any bootable partitions (or partitions in general). I am attaching the logs I collected using pb-sos: diag.zip. I would really appreciate any help or suggestions.
Hi Yiannis, Looking at those logs the thing that stands out most is that pb-discover appears to have been run twice:
--- pb-discover ---
lang: en_US.utf8
Detected platform type: powerpc
...
process_read_stdout_once: read failed: Bad file descriptor
event_parse_ad_header: bad header:
--- pb-discover ---
lang: en_US.utf8
Detected platform type: powerpc
There are plenty of errors after that, but that will be because you're running into https://github.com/open-power/petitboot/issues/32. This shouldn't happen without manual intervention; did you stop/start the Petitboot service?
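A quick way to spot the double invocation in a log is to count the start banners. This is a minimal sketch: the function name is made up, and it assumes each pb-discover start writes the `--- pb-discover ---` banner seen in the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper: count how many times pb-discover started in a log.
# Usage: pb_discover_starts [logfile]
pb_discover_starts() {
    log="${1:-/var/log/petitboot/pb-discover.log}"
    # "--" stops grep from treating the leading dashes as options.
    grep -c -- '^--- pb-discover ---' "$log"
}
```

A count above 1 within a single boot would point to the daemon having been restarted.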
That aside, in the first invocation of pb-discover it sees a few disks but either ignores them because they don't have a filesystem or doesn't find anything bootable on them. Which disks are you expecting to see boot options on?
Run nvram --update-config petitboot,debug?=true and reboot to make Petitboot write some more detailed logs, and we'll see if we can track that down.
Hi Samuel,
Thanks for the info. I didn't restart Petitboot manually, I just rebooted the node. I updated the config and these are the logs I am getting now.
Looking at the logs and having had a chance to jump on the machine in question, nothing really jumps out as a problem. Of the 15 disks there are four that udev can recognise as having a filesystem: sdc, sdd, sdf, and sdh. sdd and sdf have been mounted as ext4 but aren't boot partitions. sdc and sdh are both LVM members, which your version of Petitboot (v1.4.4) doesn't support. For everything else, Linux is just refusing to mount it since there doesn't appear to be a filesystem. It would definitely help to know how this system was set up so we know what we're expecting to see a filesystem on; otherwise it's very very tempting to say it's not booting because there's nothing to boot. :)
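The triage described above can be approximated with a small filter over `lsblk` output. This is a sketch that assumes a full Linux userland where `lsblk -n -r -o NAME,FSTYPE` is available (it may not be inside the Petitboot shell); the function name and the category labels are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: classify "NAME FSTYPE" lines read from stdin.
classify_fstypes() {
    awk '
        NF == 0              { next }   # skip blank lines
        NF < 2               { print $1 ": no filesystem detected"; next }
        $2 == "LVM2_member"  { print $1 ": LVM member"; next }
                             { print $1 ": " $2 }
    '
}

# Example (assumes lsblk is present):
#   lsblk -n -r -o NAME,FSTYPE | classify_fstypes
```

Devices reported as LVM members or without a filesystem would be the ones to ask about when working out where the boot partition is expected to live.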
The issue described below has been recreated. Basically, after multiple MAAS deployments, the boot order is gone.
When it first occurred on 10/18, there was a single entry for the disk but no entries for the network devices. This time there is not a single entry.
It should be noted that after the first occurrence, the issue went away following a hard power cycle from the remote PDUs on 10/25. In other words, we did the equivalent of unplugging the system and plugging it back in.
The maas deployments are with MAAS 2.1 but it was originally hit while using MAAS 2.0. A curtin preseed workaround has been in place since 4/13 as noted in https://bugs.launchpad.net/maas/+bug/1558747.
System last deployed successfully at ~00:05 UTC per the MAAS event log:
The subsequent install fails:
Per the event log extracted from the web console at 11:30 UTC, the events below are the last ones shown. Strangely, no other events appear after 7634 and 7633, which could indicate an underlying issue that led to the current state of the system. The timestamps do seem to match those in the MAAS event logs (last successful deployment and first failed deployment).
These critical errors in the system log seem to correspond to the last boot (console output below):
From the audit log:
Below is the console output: