hercules-390 / hyperion

Hercules 390
Other
248 stars 67 forks source link

Bug in ECPS:VM altering behavior of SPKA instruction #186

Closed wably closed 7 years ago

wably commented 7 years ago

About 4 years ago, I reported a bug to the Hercules Yahoo discussion group about the ECPS:VM feature when running VM under VM because the SPKA instruction was failing to execute (that is, acting as if it were a no-op).

Several days ago I decided to track down this issue because it keeps me from running ECPS and I really want to do so (with ASSIST ON SVC NOTMR of course). I'll spare you all the gory details of how I tracked it down but the bottom line is I found the problem and I have a solution. I'll try to keep it reasonably brief but some explanatory detail is necessary.

It turns out that when VM is IPLed, there is a LCTL 2,3 instruction very early in the IPL. This code is in DMKCKP just past label CKP002. Just after IPL, the LCTL instruction is located at or near absolute address x'DB8'. CP is trying to initialize the channel masks in CR2 and for some reason also loads FFs into CR3.

The ECPSVM support code in ecpsvm.c in function ecpsvm_dolctl( ) has case statements in order to handle the loading of individual control registers. In particular, this brief snippet of code:

case 3: / DAS Control regs (not used under VM/370) / case 4: case 5: case 7: ocrs[j]=crs[j]; rcrs[j]=crs[j]; break;

Basically, nothing special is being done for CR3-7 and the contents requested by the LCTL instruction are copied into the virtual machine control registers (ultimately to the ECBLOK), and also copied to the real machine's (Hercules') control registers.

The flaw here I believe, is the copying of the new control register values into the real control registers. Remember, we're issuing the LCTL in a virtual machine and ECPS wants to handle it because a supported privileged instruction was issued in real problem state. In no case should a virtual machine be allowed to directly alter the real control registers. If CP were simulating the LCTL execution instead of ECPS, CP loads CR3 into the ECBLOK for the virtual control registers for the virtual machine, but does not alter the real CR3. But the ECPS:VM support is allowing the alteration because of the "rcrs[j]=crs[j];" statement above.

Once CR3 is filled with FFs, then the SPKA instruction begins to fail, and it fails in all virtual machines. This is because there is code in the SPKA instruction handling in control.c that checks CR3 and the PSW key mask bits. Since there is now a non-zero value in CR3, this has meaning. More on this in a bit.

I believe the solution to the problem is to reject the LCTL instruction handling by ECPS when CR3-7 are specified and kick it back to CP to simulate it. At first I thought that just removing the second from last line and therefore not update the real CRs would be sufficient, and that certainly works great in my testing. But then it occurred to me that such a solution might not be a good idea because someone might be running VM/SP. As I recall VM/SP supports DAS. But I am not sure what CP's exact handling is for LCTL in VM/SP with regards to CR3-7, so the safest route would be to kick it back and let whatever CP version it is handle it. All other LCTL instructions issued that do not specify these CRs could benefit from ECPS, and this is the vast majority of LCTLs issued.

Therefore, my proposed solution is: case 3: / DAS Control regs (not used under VM/370) / case 4: case 5: case 7: return 1; / let CP deal with it /

Regarding the SPKA instruction handling in control.c and the check it does against CR3: this is questionable behavior in my mind. I think CR3 should only be checked when the DAS feature is available. Otherwise, it should not be checked because CR3-CR7 are not defined in earlier S/370 machines. However, I am also aware that the Principles of Operation says that undefined bits should be set to zero and in this case that more or less takes care of the issue if that guideline is followed. I mention it here for completeness.

jphartmann commented 7 years ago

Control register 3 contains a bitmask of which keys a Problem state program may set. It is not consulted when a supervisor state program does SPKA.

wably commented 7 years ago

That is a true statement. However, when SPKA is issued in a virtual machine, the real machine is always in problem state. Hence the check of the PSW key mask in control register 3.

jphartmann commented 7 years ago

You are confusing first and second level CR3 here. CP runs in a partition in supervisor state.

wably commented 7 years ago

I respectfully disagree. CP runs in supervisor state it is true. However, when CP dispatches a virtual machine, it is always in problem state. When SPKA is issued by the virtual machine's programming, the real control register 3 will be inspected by Hercules.

This only becomes an issue at all because the ECPS:VM support is modifying real control register 3 with a value supplied by a virtual machine, which should not occur.

jphartmann commented 7 years ago

My apologies. I'm by now so used to thinking SIE that I forget what life was before it. However, SPKA had VMA support on a real 145/8, so it isn't quite the way you describe it either.

On 11/01/17 21:03, wably wrote:

I respectfully disagree. CP runs in supervisor state it is true. However, when CP dispatches a virtual machine, it is always in problem state. When SPKA is issued by the virtual machine's programming, the /real/ control register 3 will be inspected by Hercules.

This only becomes an issue at all because the ECPS:VM support is modifying real control register 3 with a value supplied by a virtual machine, which should not occur.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/hercules-390/hyperion/issues/186#issuecomment-271993648, or mute the thread https://github.com/notifications/unsubscribe-auth/ABun04Z2ooI2VY4LUdNVx6QYl-DAhpktks5rRUOzgaJpZM4Lg_nQ.

ivan-w commented 7 years ago

On 1/11/2017 10:39 PM, John P. Hartmann wrote:

My apologies. I'm by now so used to thinking SIE that I forget what life was before it. However, SPKA had VMA support on a real 145/8, so it isn't quite the way you describe it either. I personally think Bob is correct !

In ECPS:VM, when in doubt - leave it to CP (either it be CPA or VMA). The issue here is apparently a wrong handling of VMA LCTL (my bad) - so in doubt : PIC 2 on LCTL (and PIC 1 on when a E4xx instruction can't be handled for CPA)

The performance impact would be minimum.

There is only ONE big issue with CPA/VMA (intrisic issue): there is no way to indicate to CP that the Interval Timer assist is broken/non functional.

(There are other issues with current ECPS:VM especially lack of AP/MP support, missing E4xx instructions... But big performance enhancement occur with the DMKFREEX/DMKFRETX/DMKDSPCH (VMBLOK & CPEXBLOK) assists).

--Ivan

wably commented 7 years ago

Thanks Ivan.

I am aware of the timer issues with the assist support, as you have previously mentioned on the discussion group several times. Thus, I always set ASSIST NOTMR.

I use VM/370 a lot and the ECPS support speeds things up so much that I want to leave it turned on all the time. Eventually I want to work through issues to increase its reliability (except for the timer issues which may be unresolvable) so that the recommendation changes from leaving it turned off to yes please use it.

The only difficulty mostly has been when I think I see something flaky happen in routine use of VM/CMS that might be due to ECPS, I am usually not able to identify exactly what I did that caused the flaky thing and thus I cannot recreate the issue. If I can recreate it, I can fix it.

wably commented 7 years ago

Ivan,

I believe that I have found the cause of the interval timer problems that occur when ECPSVM is YES. I took me about 3 days to track it down but I think I have it. The solution is simple, but I would like to run it by you since you were the author of the ECPS support to see if you agree with my conclusions.

Briefly, a little background. Around three years or so ago, a member of the VM group posted that he could not get STIMER to work correctly in CMS when running ECPS, and of course it worked fine when ECPS was turned off. That issue is easy to recreate and I was able to validate that user's claims. The cause was that with ECPS active, the real machine's interval timer value was being stored in the virtual interval timer for the CMS user. This means that CP's time slice value in the real timer was being imposed upon the virtual machine user, regardless of what value the user wanted to store in his own timer for the purposes of his STIMER application. (In this case, the user wanted to wait for a few seconds). No matter what a CMS application stored in the virtual timer, ECPS overrode it with the real timer's value.

Of course that should not occur. The virtual timer should be decremented at the same rate as the real timer, but the contents of the virtual timer should not be reset to the real value.

The cause appears to be in the Hercules source in clock.c. A small snippet of code is pasted below for reference. The last statement shown is the one in question.

if defined(FEATURE_INTERVAL_TIMER)

static INLINE void ARCH_DEP(_store_int_timer_2) (REGS *regs,int getlock) { S32 itimer; . . itimer=int_timer(regs); STORE_FW(regs->psa->inttimer, itimer);

if defined(FEATURE_ECPSVM)

if(regs->ecps_vtmrpt)
{
    vtimer=ecps_vtimer(regs);
    STORE_FW(regs->ecps_vtmrpt, itimer);

What is happening here is that the current real interval timer value is obtained and stored in PSA, and then if ECPS is active, the current virtual timer value is obtained but the real timer value is stored in the virtual timer location.

The problems are resolved if I change the last statement to STORE_FW(regs->ecps_vtmrpt, vtimer);

It is that simple. All of your timer calculations elsewhere and in ecpsvm.c appear to be done correctly. I can find no errors in processing after this change, and the STIMER example works perfectly and with whatever proper interval I specify. The results are now the same with ECPS on or off.

I'm pretty certain of my facts and findings here, but I always have a bit of doubt because I am not sure of your intentions at the time and there are no comments in the code to help clarify. If you could weigh in and say whether you agree with this change or not it would be very helpful.

Regards, Bob

ivan-w commented 7 years ago

On 1/15/2017 7:26 PM, wably wrote:

Ivan,

I believe that I have found the cause of the interval timer problems that occur when ECPSVM is YES. I took me about 3 days to track it down but I think I have it. The solution is simple, but I would like to run it by you since you were the author of the ECPS support to see if you agree with my conclusions.

The problems are resolved if I change the last statement to

STORE_FW(regs->ecps_vtmrpt, /vtimer/);

It is that simple. All of your timer calculations elsewhere and in ecpsvm.c appear to be done correctly. I can find no errors in processing after this change, and the STIMER example works perfectly and with whatever proper interval I specify. The results are now the same with ECPS on or off.

I'm pretty certain of my facts and findings here, but I always have a bit of doubt because I am not sure of your intentions at the time and there are no comments in the code to help clarify. If you could weigh in and say whether you agree with this change or not it would be very helpful.

Regards, Bob

Bob,

This seems like I had a BIG finger check when I wrote this. Your solution looks to be perfectly sensible to me.

Interval Timer management is a bit complicated in hercules because of the various optimization steps it went through - basically :

Inserting the ECPS:VM logic inside of that was always a bit tricky (hence the convoluted code). Your solution looks sensible and if your tests validate it, I would suggested it be implemented

It's a bit outside the scope of the origin SPKA problem... So maybe open a new issue, and provide the solution you suggest ;)

I'll test it with other VM versions.

--Ivan

wably commented 7 years ago

Ivan, thanks. I reposted my statement of the problem above and your response to issue #187,

wably commented 7 years ago

closing; fixed by commit of 3/4/2017