hercules-390 / hyperion

MVS wait code 00090064 during IPL under VM with ECPS:VM active #193

Closed wably closed 7 years ago

wably commented 7 years ago

When attempting to IPL MVS 3.8J under VM/370 with ECPS:VM active, a disabled wait code PSW of 00020000 00090064 is sometimes issued by MVS. When it occurs, the wait code appears after responding (just pressing ENTER) to the MVS message IEA101A SPECIFY SYSTEM PARAMETERS.

According to OS/VS2 System Codes, wait 064 is issued because of a program check during nucleus initialization (the x'09' in the code specifically means program check), and the program old PSW points to the instruction that failed. The problem is that the program old PSW contains a wait PSW: 070E0000 00000004. You cannot have a program check while in a wait state, so something is amiss here.

Skipping over the details of hours of research, debugging, single stepping and so forth, I tracked the issue to the DISP2 assist of ECPS:VM. It turns out that DISP2 is dispatching the run user (MVS) even though the user's virtual PSW is in a wait state. DISP2 dutifully builds the dispatch PSW by merging the virtual instruction address into a standard CP dispatch PSW. Since the virtual PSW instruction address is 0, the resulting dispatch PSW is 070D0000 00000000. DISP2 then exits so that control can be given to the run user. MVS immediately program checks because the instruction address is 0. The value 070E0000 ends up in MVS's program old PSW because that's what the virtual PSW was.
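To make the PSW arithmetic above concrete, here is a minimal, self-contained sketch. It is not the Hercules code: the type and function names are hypothetical, the merge is simplified to the instruction address only, and the constants are simply the values quoted in this comment (070E0000 for the virtual PSW, 070D0000 for the dispatch PSW).

```c
#include <stdint.h>

/* Hypothetical names; the values come from the PSWs quoted in this issue. */
#define PSW_WAIT_BIT     0x00020000u  /* wait bit in the first PSW word     */
#define CP_DISPATCH_WORD 0x070D0000u  /* first word of the dispatch PSW     */
#define ADDR_MASK        0x00FFFFFFu  /* 24-bit instruction address         */

typedef struct {
    uint32_t hi;  /* first PSW word: masks, key, mode and wait bits */
    uint32_t lo;  /* second PSW word: instruction address           */
} psw_t;

/* Nonzero when the PSW has the wait bit set. */
static int psw_in_wait(psw_t p)
{
    return (p.hi & PSW_WAIT_BIT) != 0;
}

/* Simplified merge: keep only the virtual instruction address and
 * combine it with the fixed dispatch word. The wait bit of the
 * virtual PSW is lost in the process, which is the problem described. */
static psw_t build_dispatch_psw(psw_t vpsw)
{
    psw_t d;
    d.hi = CP_DISPATCH_WORD;
    d.lo = vpsw.lo & ADDR_MASK;
    return d;
}
```

With a virtual PSW of 070E0000 00000000, psw_in_wait() reports a wait state, yet build_dispatch_psw() yields 070D0000 00000000, a runnable PSW pointing at address 0.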

The bottom line is that DISP2 should not be dispatching a user that is in virtual PSW wait. Moreover, there are dispatchability flags in the VMBLOK that indicate that a user should not be dispatched for a number of reasons, and one of them is VMPSWAIT (in byte VMRSTAT), which means the user is in virtual PSW wait. The assist code in DISP2 is not checking this flag.

But even if DISP2 did check this flag, it would not resolve the problem. It turns out that the flag is not set anyway. I have been unable to determine how the user can have a virtual PSW with the wait bit set and not have the VMPSWAIT bit set.
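The discrepancy described here can be stated as a small classification routine. This is only a sketch for illustration: the struct is a hypothetical two-field view, not the real VMBLOK layout, and the VMPSWAIT value of X'10' is an assumption that should be checked against the VM/370 control block listings. The 0x0002 mask is the wait bit in the first halfword of the virtual PSW, matching 070E.

```c
#include <stdint.h>

#define VMPSWAIT_FLAG 0x10u    /* ASSUMPTION: bit value of VMPSWAIT in VMRSTAT */
#define PSW_HW_WAIT   0x0002u  /* wait bit in the first halfword of the PSW    */

/* Hypothetical view of the two fields of interest (not the VMBLOK layout). */
typedef struct {
    uint8_t  vmrstat;   /* dispatchability flag byte         */
    uint16_t vpsw_hw0;  /* first halfword of the virtual PSW */
} vmblok_view_t;

typedef enum {
    WAIT_NONE,       /* neither indicator set: user is runnable         */
    WAIT_BOTH,       /* flag and PSW agree: user is waiting             */
    WAIT_FLAG_ONLY,  /* flag set, PSW not in wait                       */
    WAIT_PSW_ONLY    /* PSW in wait, flag clear: the case reported here */
} wait_state_t;

static wait_state_t classify_wait(const vmblok_view_t *v)
{
    int flag = (v->vmrstat  & VMPSWAIT_FLAG) != 0;
    int psw  = (v->vpsw_hw0 & PSW_HW_WAIT)   != 0;

    if (flag && psw) return WAIT_BOTH;
    if (flag)        return WAIT_FLAG_ONLY;
    if (psw)         return WAIT_PSW_ONLY;
    return WAIT_NONE;
}
```

A check of VMRSTAT alone treats WAIT_PSW_ONLY as runnable, which is why testing the flag would not have caught this case.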

Regardless, adding a check in the DISP2 code to see if the wait bit is set in the virtual PSW is the likely solution. This will cause the user to be skipped and another runnable user to be selected or the machine idled. The solution is this code snippet:

    if(EVM_LH(vmb+VMPSW) & 0x0002)
    {
        DEBUG_CPASSISTX(DISP2,MSGBUF(buf, "DISP2 : VMB @ %6.6X Not eligible : User in virtual PSW wait",vmb));
        DEBUG_CPASSISTX(DISP2,WRMSG(HHC90000, "D", buf));
        continue;
    }

This new code should be located immediately after this line in ecpsvm_do_disp2( ):

          for(vmb=EVM_L(FW1);vmb!=FW1;vmb=EVM_L(vmb))
          {
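For readers without the Hercules source at hand, the effect of the added `continue` can be modeled in isolation. The sketch below is an assumption-laden stand-in, not ecpsvm_do_disp2( ): it uses ordinary C pointers instead of EVM_L( ) guest-storage fetches, a hypothetical user_t node in place of the VMBLOK, and it omits every other eligibility test the real loop performs.

```c
#include <stdint.h>
#include <stddef.h>

#define PSW_HW_WAIT 0x0002u  /* wait bit in the first halfword of the virtual PSW */

/* Hypothetical stand-in for a VMBLOK on a circular queue. */
typedef struct user {
    struct user *next;      /* next entry; the last one points back to the anchor */
    uint16_t     vpsw_hw0;  /* first halfword of the virtual PSW                  */
} user_t;

/* Walk the queue from the anchor, mirroring
 *   for(vmb=EVM_L(FW1);vmb!=FW1;vmb=EVM_L(vmb))
 * and return the first user not in virtual PSW wait.
 * NULL means no candidate was found; in this model that stands for
 * leaving the decision to CP. */
static const user_t *pick_user(const user_t *anchor)
{
    for (const user_t *u = anchor->next; u != anchor; u = u->next) {
        if (u->vpsw_hw0 & PSW_HW_WAIT)
            continue;       /* not eligible: user in virtual PSW wait */
        return u;
    }
    return NULL;
}
```

With a waiting user ahead of a runnable one, pick_user() skips the first and returns the second; if every queued user is waiting, it returns NULL.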

My justification for this solution is based on these points:

• There is no case where a user in a virtual wait state should be dispatched.
• While I cannot explain the reason for the discrepancy between the VMPSWAIT dispatchability flag and the wait bit in the virtual PSW, the rules throughout the ECPS code logic say: when in doubt about something, let CP handle it. The new code does exactly that. This is a dispatch case that cannot be reconciled, so let CP deal with it.
• I do think there is a problem somewhere that allows this discrepancy to occur but I have been unable to find it. Nevertheless, the solution code does resolve the issue. Thus, the safest course when something isn't right is to turn it over to CP.

The problem of the MVS wait 064 is resolved after implementing the solution above.

PeterCoghlan commented 7 years ago

I appear to have come across this one too. While trying to start an RSCS link driver with TRACE PROG active:

*** 000002 PROG 0001 ==> 0104D8 D 28.8 000028 FF060001 40000002

suggesting that RSCS took a PROG 1 interrupt while in a wait state. This one was quite elusive. Even more elusive was the single one I got at IPL time:

IPL 191
RDR 001 DETACHED
RDR 001 DEFINED
*** 000002 PROG 0001 ==> 000007

(I seem to have mislaid the contents of the old PSW in this case unfortunately.)

It is very hard to be completely sure but so far, there is a very high degree of correlation between disabling DISP2 and the problem not occurring. Re-enabling DISP2 does result in it occurring again.

I am currently testing on V3.12, so I applied a tweaked version of the above change to that version rather than risk failing to reproduce the bug under Hyperion.

I haven't seen the problem since, but it is very hard to be certain that it is gone: tearing down and setting up the environment again to apply the fix makes it hard to know whether I have successfully recreated the conditions under which it used to occur. It looks good so far though.

wably commented 7 years ago

closing; fixed by commit of 3/4/2017