The following bug issue and text is reproduced from Hyperion issue 228 so that this bug may be documented and fixed in Spinhawk.
Several errors have been discovered in the original DISP2 assist code that manifest themselves in strange ways and probably explains why they have not been found until recently.
For starters, this block of code from DISP2:
/* If an Extended VM, Load CRs 3-13 */
/* CR6 Will be overwritten in a second */
if(B_VMESTAT & VMV370R)
{
for(i=4;i<14;i++)
{
regs->CR_L(i)=EVM_L(F_ECBLOK+(3*4)+(i*4));
}
}
There are three things wrong in the code above:
1.The referenced byte B_VMESTAT is incorrect; bit VMV370R is part of byte VMPSTAT.
2.The comment says to load CR 3-13. The code loads CR 4-13. Which is it? The correct answer is CR 4-13, per the corresponding code in DMKDSP.
3.The expression F_ECBLOK+(3 4)+(I 4) is incorrect. The +(3 * 4) term of the expression specifically is the part that is bad. This incorrectly displaces the load point causing the wrong control register contents to be loaded into the corresponding control registers.
The corrected code will be:
/* If an Extended VM, Load CRs 4-13 */
/* CR6 Will be overwritten in a second */
if(B_VMPSTAT & VMV370R)
{
for(i=4;i<14;i++)
{
regs->CR_L(i)=EVM_L(F_ECBLOK+(i*4));
}
}
The next issue refers to the following block of code:
SET_PSW_IA(regs);
/* Dispatch..... */
DEBUG_CPASSISTX(DISP2,logmsg("DISP2 - Next Instruction : %2.2X\n",ARCH_DEP(vfetchb)(regs->psw.IA,USE_PRIMARY_SPACE,regs)));
The reported problem was after turning on ECPSVM debugging messages (via the command 'ecpsvm DEBUG disp2'), then upon IPLing CMS, the following messages were received from CP on the virtual machine console:
OPERAND MISSING OR INVALID
CMS HRC SYSTEM - MAY 2, 2006
DMSITP143T OPERATION EXCEPTION OCCURRED AT 01D280 IN SYSTEM ROUTINE WAITRD, RE-IPL CMS.
CP ENTERED; DISABLED WAIT PSW '00020000 4001B0D6'
It turns out that the DEBUG_CPASSISTX statement in the code block above is the culprit. This statement is only executed when DISP2 debug is active. Within the statement, the ARCH_DEP(vfetchb) is the real source of the problem.
The ARCH_DEP(vfetchb) is attempting to fetch the opcode byte of the next instruction to be executed once the virtual machine is dispatched by DISP2; the DEBUG_CPASSISTX statement simply serves to display the message within and the opcode byte since debugging was active. This becomes an issue when a page fault occurs trying to fetch the opcode byte.
This is difficult to explain, but recall that as this code is executed, the CP assist DISP2 is in progress. The real machine is in supervisor state and CP is in control; from the real machine's point of view all it has done is execute a E611 opcode assist instruction (which happens to invoke DISP2). DISP2 is preparing to dispatch a virtual machine user and just before this code block is executed, the user's register contents have been loaded into the GPRs, and the real PSW has been built and loaded into the PSW (the 'runpsw'). This means that DAT is now on and the machine has been switched to problem state.
Then the flow encounters this DEBUG_CPASSISTX statement containing the ARCH_DEP(vfetchb). If the referenced storage is paged in there is no problem. If it is not, then a page fault occurs. When ARCH_DEP(vfetchb) encounters a page fault, it does not return to the caller (that is, it will not return to the MSGBUF( ) function within DEBUG_CPASSISTX( ). This also means the debug message is not displayed). Instead, ARCH_DEP(vfetchb) presents a page fault to the real machine. CP gets control at this point, at the program check new PSW. CP notes the page fault and believes a user was in control (because the interrupt PSW is in problem state) and dutifully adjusts the instruction address in the virtual PSW based on the ILC and performs the usual page fault handling to bring in the page. Then the dispatcher is called again to dispatch the same user and eventually re-executes the E611 instruction to invoke the DISP2 assist. This time, as the offending DEBUG_CPASSISTX( ) function containing the ARCH_DEP(vfetchb) is encountered, there is no page fault as this has been resolved by CP. The debug message is displayed and DISP2 proceeds to dispatch the virtual machine user.
Except that the virtual PSW had been decremented by the length of the E611 instruction! The actual virtual machine user's opcode was never executed in order to set a proper ILC! The page fault occurred before the actual dispatch completed. The last instruction executed by the real machine was E611 which has an ILC of 6. But because the run PSW had been loaded before the page fault occurred, CP thought it was a valid page fault coming from a virtual machine and CP correctly adjusted the virtual PSW for the ILC. This incorrect backing up of 6 bytes caused the virtual machine to be dispatched with an instruction address pointing in the middle of another instruction. This caused the error messages on the virtual machine console shown above. (Additionally, the DEBUG_CPASSISTX message that was finally displayed instead displays the wrong opcode - six bytes earlier).
There are two possible solutions for this issue. The simplest is to simply remove the offending DEBUG_CPASSISTX( ) statement in its entirety. There is little value in showing the next opcode at dispatch anyway, and we cannot prevent ARCH_DEP(vfetchb) from not returning upon page fault. As long as it might not return, this problem can occur.
The other solution could be to place another statement prior to DEBUG_CPASSISTX( ) to call ARCH_DEP(translate_addr) to determine if the virtual address pointing to the opcode byte is available. This function provides a return code and does not generate a program check. If the return code says translation was not available then simply avoid issuing the DEBUG_CPASSISTX( ). However, given the dubious value of displaying the opcode of the first dispatched instruction as well as the high frequency of calls to DISP2, the added overhead to do the ARCH_DEP(translate_addr) does not seem worth it. Therefore, I will resolve this issue by removing the offending DEBUG_CPASSISTX( ) function.
Many thanks to Peter Coghlan for discovering these issues and making me aware of them. Peter did the vast majority of the legwork to discover what was happening and why these code blocks were causing problems, including the remarkable scenario of the bogus page fault in a CP assist instruction! I was able to recreate the scenarios and verify that the solutions resolved the issues.
The following bug issue and text is reproduced from Hyperion issue 228 so that this bug may be documented and fixed in Spinhawk.
Several errors have been discovered in the original DISP2 assist code that manifest themselves in strange ways and probably explains why they have not been found until recently.
For starters, this block of code from DISP2:
There are three things wrong in the code above: 1.The referenced byte B_VMESTAT is incorrect; bit VMV370R is part of byte VMPSTAT. 2.The comment says to load CR 3-13. The code loads CR 4-13. Which is it? The correct answer is CR 4-13, per the corresponding code in DMKDSP. 3.The expression F_ECBLOK+(3 4)+(I 4) is incorrect. The +(3 * 4) term of the expression specifically is the part that is bad. This incorrectly displaces the load point causing the wrong control register contents to be loaded into the corresponding control registers.
The corrected code will be:
The next issue refers to the following block of code:
The reported problem was after turning on ECPSVM debugging messages (via the command 'ecpsvm DEBUG disp2'), then upon IPLing CMS, the following messages were received from CP on the virtual machine console:
OPERAND MISSING OR INVALID CMS HRC SYSTEM - MAY 2, 2006 DMSITP143T OPERATION EXCEPTION OCCURRED AT 01D280 IN SYSTEM ROUTINE WAITRD, RE-IPL CMS. CP ENTERED; DISABLED WAIT PSW '00020000 4001B0D6'
It turns out that the DEBUG_CPASSISTX statement in the code block above is the culprit. This statement is only executed when DISP2 debug is active. Within the statement, the ARCH_DEP(vfetchb) is the real source of the problem.
The ARCH_DEP(vfetchb) is attempting to fetch the opcode byte of the next instruction to be executed once the virtual machine is dispatched by DISP2; the DEBUG_CPASSISTX statement simply serves to display the message within and the opcode byte since debugging was active. This becomes an issue when a page fault occurs trying to fetch the opcode byte.
This is difficult to explain, but recall that as this code is executed, the CP assist DISP2 is in progress. The real machine is in supervisor state and CP is in control; from the real machine's point of view all it has done is execute a E611 opcode assist instruction (which happens to invoke DISP2). DISP2 is preparing to dispatch a virtual machine user and just before this code block is executed, the user's register contents have been loaded into the GPRs, and the real PSW has been built and loaded into the PSW (the 'runpsw'). This means that DAT is now on and the machine has been switched to problem state.
Then the flow encounters this DEBUG_CPASSISTX statement containing the ARCH_DEP(vfetchb). If the referenced storage is paged in there is no problem. If it is not, then a page fault occurs. When ARCH_DEP(vfetchb) encounters a page fault, it does not return to the caller (that is, it will not return to the MSGBUF( ) function within DEBUG_CPASSISTX( ). This also means the debug message is not displayed). Instead, ARCH_DEP(vfetchb) presents a page fault to the real machine. CP gets control at this point, at the program check new PSW. CP notes the page fault and believes a user was in control (because the interrupt PSW is in problem state) and dutifully adjusts the instruction address in the virtual PSW based on the ILC and performs the usual page fault handling to bring in the page. Then the dispatcher is called again to dispatch the same user and eventually re-executes the E611 instruction to invoke the DISP2 assist. This time, as the offending DEBUG_CPASSISTX( ) function containing the ARCH_DEP(vfetchb) is encountered, there is no page fault as this has been resolved by CP. The debug message is displayed and DISP2 proceeds to dispatch the virtual machine user.
Except that the virtual PSW had been decremented by the length of the E611 instruction! The actual virtual machine user's opcode was never executed in order to set a proper ILC! The page fault occurred before the actual dispatch completed. The last instruction executed by the real machine was E611 which has an ILC of 6. But because the run PSW had been loaded before the page fault occurred, CP thought it was a valid page fault coming from a virtual machine and CP correctly adjusted the virtual PSW for the ILC. This incorrect backing up of 6 bytes caused the virtual machine to be dispatched with an instruction address pointing in the middle of another instruction. This caused the error messages on the virtual machine console shown above. (Additionally, the DEBUG_CPASSISTX message that was finally displayed instead displays the wrong opcode - six bytes earlier).
There are two possible solutions for this issue. The simplest is to simply remove the offending DEBUG_CPASSISTX( ) statement in its entirety. There is little value in showing the next opcode at dispatch anyway, and we cannot prevent ARCH_DEP(vfetchb) from not returning upon page fault. As long as it might not return, this problem can occur.
The other solution could be to place another statement prior to DEBUG_CPASSISTX( ) to call ARCH_DEP(translate_addr) to determine if the virtual address pointing to the opcode byte is available. This function provides a return code and does not generate a program check. If the return code says translation was not available then simply avoid issuing the DEBUG_CPASSISTX( ). However, given the dubious value of displaying the opcode of the first dispatched instruction as well as the high frequency of calls to DISP2, the added overhead to do the ARCH_DEP(translate_addr) does not seem worth it. Therefore, I will resolve this issue by removing the offending DEBUG_CPASSISTX( ) function.
Many thanks to Peter Coghlan for discovering these issues and making me aware of them. Peter did the vast majority of the legwork to discover what was happening and why these code blocks were causing problems, including the remarkable scenario of the bogus page fault in a CP assist instruction! I was able to recreate the scenarios and verify that the solutions resolved the issues.