Closed by sharkcz 5 years ago
Thanks for the heads-up. Just to untangle a possible confusion, s390utils-2.7.1-4.fc30 is compiled with GCC9 and s390utils-2.7.1-2.fc30 with GCC8 then? Or is the -4 version generally broken? Are there any code changes between -2 and -4?
I assigned our zipl maintainer and if it turns out to be a GCC issue, I'll add one of our tool chain guys to have a look as well.
no code change between -2 and -4 (see https://src.fedoraproject.org/rpms/s390utils/commits/master), -2 is built with gcc8, -4 (and -3 too) with gcc9
you can find the binary packages at https://koji.fedoraproject.org/koji/packageinfo?packageID=255
Hi, have you already got more details about the culprit? Shall I keep looking into it too? Because it makes Fedora-to-be 30 installations unusable.
We're currently quite low on resources. If you've got the time to look into it, that'd be much appreciated! Thanks!
OK, will focus on it. BTW don't you have a tool that would convert the boot records and the bootmap to human readable format?
And for the record - builds with ubsan and asan sanitizers didn't reveal anything wrong.
So the problem is that stage2 is crashing when clearing the BSS section in memory (https://github.com/ibm-s390-tools/s390-tools/blob/master/zipl/boot/libc.c#L359)
Disassembly of initialize() from the gcc9-compiled zipl:
0000000000002a60 <initialize>:
2a60: eb df f0 88 00 24 stmg %r13,%r15,136(%r15)
2a66: c0 d0 00 00 07 61 larl %r13,3928 <__ex_table_stop+0xc>
2a6c: a7 f1 1f 80 tmll %r15,8064
2a70: a7 84 00 01 je 2a72 <initialize+0x12>
2a74: e3 f0 ff e8 ff 71 lay %r15,-24(%r15)
2a7a: c4 18 00 00 08 1b lgrl %r1,3ab0 <conv_vec.2142+0x178>
2a80: e3 10 01 d0 00 24 stg %r1,464
2a86: c0 10 00 00 07 31 larl %r1,38e8 <pgm_check_handler>
2a8c: e3 10 01 d8 00 24 stg %r1,472
2a92: c4 48 00 00 08 07 lgrl %r4,3aa0 <conv_vec.2142+0x168>
2a98: a7 39 00 00 lghi %r3,0
2a9c: e3 40 d0 00 00 09 sg %r4,0(%r13)
2aa2: c4 28 00 00 08 03 lgrl %r2,3aa8 <conv_vec.2142+0x170>
2aa8: c0 e5 ff ff fe a4 brasl %r14,27f0 <memset>
2aae: eb df f0 a0 00 04 lmg %r13,%r15,160(%r15)
2ab4: c0 f4 ff ff fb 96 jg 21e0 <start>
2aba: 07 07 nopr %r7
2abc: 07 07 nopr %r7
2abe: 07 07 nopr %r7
Further tracing shows that __bss_start (in R2) is 0x0000 for the gcc9 zipl, while it is the expected 0x5200 for the gcc8 zipl. It looks as if the BSS section is not created properly.
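To make the effect of a zero start address concrete, here is a minimal host-side sketch. It is not the zipl code: clear_bss and IMAGE_SIZE are hypothetical names, the "memory" is a plain byte array, and the end offset 0x5400 in the note below is invented for illustration. It only mirrors the memset(start, 0, stop - start) shape visible in the disassembly above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical size of the simulated memory image (not a zipl constant). */
#define IMAGE_SIZE 0x6000

/*
 * Zero the range [bss_start, bss_stop) inside a simulated memory image.
 * The caller must ensure bss_start <= bss_stop <= IMAGE_SIZE.
 * Returns the number of bytes that were cleared.
 */
size_t clear_bss(unsigned char *image, size_t bss_start, size_t bss_stop)
{
    size_t len = bss_stop - bss_start;

    memset(image + bss_start, 0, len);
    return len;
}
```

In this model, clear_bss(image, 0x5200, 0x5400) zeroes only the intended 0x200 bytes. clear_bss(image, 0, 0x5400) also zeroes offsets 464 and 472, which initialize() had written just before the memset call.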
Commands to reproduce the crash:
dd if=/dev/zero of=/boot/vmlinuz bs=1k count=70 && zipl -t /boot -i /boot/vmlinuz && reboot
You won't get to the "Booting default" message that would normally appear.
The root cause is that gcc9 emits a new section, .rodata.cst8, for some literals, and that section isn't listed in the --only-section parameters used when creating the *.bin images with objcopy. PR follows.
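A check for this class of problem can be sketched as follows. This is an illustration only, not part of the s390-tools build: find_dropped_sections is a hypothetical helper, it compares names exactly (no wildcard handling), and the section lists you pass in would have to come from the object file and from the build rule yourself.

```c
#include <stddef.h>
#include <string.h>

/*
 * Collect every name in `present` (sections found in the object file)
 * that has no exact match in `kept` (sections passed via --only-section).
 * Pointers to the unmatched names are stored in `missing`, which must
 * have room for n_present entries. Returns the number of names stored.
 */
size_t find_dropped_sections(const char *const *present, size_t n_present,
                             const char *const *kept, size_t n_kept,
                             const char **missing)
{
    size_t n_missing = 0;

    for (size_t i = 0; i < n_present; i++) {
        int found = 0;

        for (size_t j = 0; j < n_kept; j++) {
            if (strcmp(present[i], kept[j]) == 0) {
                found = 1;
                break;
            }
        }
        if (!found)
            missing[n_missing++] = present[i];
    }
    return n_missing;
}
```

Given an illustrative present list of .text, .rodata, .rodata.cst8 and .data, and a kept list of .text, .rodata and .data, the helper reports .rodata.cst8 as the one section that would be silently left out of the binary image.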
Thanks a lot for debugging this. I will have a look at the PR asap and include it.
Fixed via cdb23f8d2251d865f0a8aa46f8ca886441048e13
This is more of a heads-up than anything else, because I don't have clear evidence yet. But we see a boot failure in Fedora when the boot is prepared with a zipl compiled with gcc9, while using one still compiled with gcc8 works. Specifically, s390utils-2.7.1-4.fc30 is bad and s390utils-2.7.1-2.fc30 is good.
What is clearly different is the /boot/bootmap file: it differs when created by the two zipl versions run on the same /boot content. AFAIK it shouldn't change.
As usual it could be a real gcc9 issue or some problem with zipl code, like relying on undefined behaviour.
The failed boot looks like
which usually means something went wrong very early in the boot, like the "bootmap" content being wrong.