hercules-390 / hyperion

Hercules 390
Other
253 stars 70 forks source link

Ubuntu 17.10.1 installer won't boot under hercules #259

Closed ivan-w closed 6 years ago

ivan-w commented 6 years ago

Attempting to boot an Ubuntu linux installation image (either as a list directed IPL or OMA tape) leads to the following symptom :

Uncompressing Linux...

XZ-compressed data is corrupt

-- System halted HHC00809I Processor CP00: disabled wait state 0002000080000000 000000005EADBEEF

Booting the very same kernel works under qemu-system-s390x (tcg)

Looking at the executed code and instructions used, it seems the kernel (or more specifically in this case the pre-boot decompressor) is using modern instructions (September 2012 version of the PoP).

When comparing the execution of hercules and qemu tcg, the execution path starts to diverge after ~19K+ instructions or so where a compare fails - which succeeds under qemu - but the error could have been caused much earlier !

Extracting the compressed kernel and "xz" uncompressing doesn't succeed either (it fails under different conditions).

I suspect at this time that one or some of the new instructions is not behaving as expected.

Looking into it.

ivan-w commented 6 years ago

Apparently there is a problem with the R[IXON]SBG[N] series of instructions... For example the when using the following code :

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
int main(int ac,char **av)
{
        uint32_t        x;
        x=atol(av[1]);
        printf("%u\n",x/4);
        return 0;
}

This yield the following object code when compiled with gcc 7.3 and flags -O3 -march=arch10 : (comments are mine)

 680:   eb ef f0 70 00 24       stmg    %r14,%r15,112(%r15)     Save regs
 686:   a7 49 00 0a             lghi    %r4,10                  Base 10 for strtol
 68a:   e3 f0 ff 60 ff 71       lay     %r15,-160(%r15)         Update stack pointer
 690:   e3 20 30 08 00 04       lg      %r2,8(%r3)              get argv[1]
 696:   a7 39 00 00             lghi    %r3,0                   set NULL ptr for strtol return
 69a:   c0 e5 ff ff ff e3       brasl   %r14,660 <strtol@plt>   call strtol (atol)
 6a0:   ec 32 22 bf 3e 59       risbgn  %r3,%r2,34,191,62       Divide results by 4 <<<< That's where we have a problem !
 6a6:   c0 20 00 00 00 e9       larl    %r2,878 <_IO_stdin_used+0x4>    Get printf format string
 6ac:   c0 e5 ff ff ff ba       brasl   %r14,620 <printf@plt>   Call printf
 6b2:   a7 29 00 00             lghi    %r2,0                   set return code to 0
 6b6:   e3 40 f1 10 00 04       lg      %r4,272(%r15)           Get return address
 6bc:   eb ef f1 10 00 04       lmg     %r14,%r15,272(%r15)     Restore stack pointer
 6c2:   07 f4                   br      %r4                     return to caller

When executing the program under hercules (running debian sid) I get the following result :

ivan@deb001:~/t1$ ./t1 1000
1000
ivan@deb001:~/t1$

(which is incorrect) The same thing under qemu (tcg) - same OS level :

ivan@deb002:~/t1$ ./t1 1000
250
ivan@deb002:~/t1$

(which is correct)

ivan-w commented 6 years ago

Found the bug... Correction and tests in progress (If I'm right, it's a 1 line fix)

ivan-w commented 6 years ago

Fixed by 96752854594b593749df7955416b98407e322bf8

Peter-J-Jansen commented 6 years ago

Super Ivan ! The bug is most certainly also in Spinhawk, because that's where I backported the code from.

This bug might have been responsible for some other anomalies I encountered. I will test these and let you know.

Cheers,

Peter J.

jphartmann commented 6 years ago

Ivan, can I suggest you also add a default: and reflect a prg001 to avoid a quiet no-op next time an instruction is forwarded to DEF_INST(rotate_then_xxx_selected_bits_long_reg)?

ivan-w commented 6 years ago

Good idea ! But I think this really should trigger a machine check (abort() ?), since it's a machine malfunction (microcode error), rather than a guest programming error... Ideally, it should be a Machine Check / Instruction Processing Damage (can't remember if a CPU abort will generate a System Damage or Instruction Processing Damage in hercules). (edit)... A CPU abort should generate a Instruction Processing Damage alright ;)

jphartmann commented 6 years ago

You have a point there. If one half of a feature is implemented and the facility bit is set, using an unimplemented instruction could validly be categorized as a machine failure. In this case, it depends on whether CLT and CLTG are implemented and the facility bit is off; if not, the facility is simply not installed and PRG001 is correct.

Another matter is that the generated code you are showing for the division by four is incorrect; it drops the two leftmost bits of the unsigned int in addition to the two rightmost. SRLK 3,2,2 is the correct code to emit for this particular case.

ivan-w commented 6 years ago

First, in that case, the instruction WAS implemented (it was just wrong in the implementation ! It wasn't actually a no-op since the bit selection was still applied and the Zero out of the non selected bits was still done on the target reg). Basically invoking rotate_then_xxx_selected_bits_long_reg by an instruction implementation not taken into account is a programming error in the implementation.

SRLK 3,2,2 and RISBGN 3,2,34,128+63,62 will yield the same results (Bits 32-33 set to 0 in the target, target register bits 34 to 63 contain bits 32 to 61 of the source, and source bits 62 & 63 are thrown away), CC unchanged - except RISBGN will also clear bit 0-31 of the 64 bit register while SRLK won't.

jphartmann commented 6 years ago

Right you are on both counts, Ivan.

I missed the fact that the entire second operand is rotated before masking. Sorry.

ivan-w commented 6 years ago

Great ! I'm going to test the effect of an abort() in the function to see if it yields the appropriate results.

(However, you remind me I made a slight booboo in my previous patch - (not really related)... I have set bit 49 in the STFL/STFLE list, but we're still missing the PPA (Perform Processor Assist) instruction (but right now it's going to be a NO-OP, since we don't yet have the Transaction Facility feature implemented - but strangely enough, PPA is not dependent on the Transaction facility).

I'll open a new issue (for historical tracking purpose).

I'm also reopening this issue and close it when the default:/abort() is validated and implemented.

jphartmann commented 6 years ago

Right, that bit 49 sure covers a multitude of sins.

ivan-w commented 6 years ago

Closed again by be52cf87c43ef51b6694d973ddee31faf59e578d

Peter-J-Jansen commented 6 years ago

Ivan, the abort() statement without a semicolon causes an compilation error message under Windows. For the time being in my local copy, I've commented it out to continue testing. Cheers, Peter

ivan-w commented 6 years ago

Thanks for catching this ! I'll fix this right away !

ivan-w commented 6 years ago

Typo fixed by a07606c0ad1dbb42dcb88bd7c1d278db2257b229