First level zVM 6.3 install fails - WAIT STATE 9090 on CPU 01

hercules-390 / hyperion

First level zVM 6.3 install fails - WAIT STATE 9090 on CPU 01 #65

Closed fbi-ranger closed 8 years ago

fbi-ranger commented 9 years ago

Hercules / Hyperion on OPENsuse 13.2 Host OS: NONE / zVM 6.3

Description: Trying to install z/VM 6.3 as a first-level install on Hercules results in a disabled wait 9090 on CPU 1 or on other CPUs. In some cases there is a segmentation fault and the Hercules process terminates. In other cases the installer system survives and accepts commands, but complains that Processor XX is stopped (HCPMPG9151E). After the first command, 'dvdprime 3390 (dvd', is entered, the system executes it but remains in running mode. Further commands are no longer executed.

IPL with uniprocessor results in a dispatch of disabled wait state 9090 on CPU 00 and processing is stopped.

Suggestion / Observation: There seems to be a problem in the behaviour of an IPL from DVD when z/VM 6.3 is used. With z/VM 6.2 the function performs correctly. It appears that the wait state is dispatched on the wrong CPU; it should probably be the CPU from which the IPL happened. As the installer system continues, it seems that some state or interrupt may not be handled correctly when the IPL from DVD is finished. This wait state should not happen.

According to help of Hercules start command, start without parameter should start any stopped CPU. However Hercules issues HHC00816W Processor CP00: processor not stopped. It seems not possible to start a stopped CPU other than CPU 00. This appears to be a bug.

Enhancement: Command should be enhanced to allow the specification of a discrete CPU number.

Further info: IPL the installer system from DVD image: Command: ipl CPDVD/630vm.ins

Nucleus successfully loaded but during startup Wait State 9090 is dispatched.

IBM z/VM messages:

HCP9090W A hard abend, soft abend, snapdump abend or unknown SVC has occurred and CP has been IPLed from the DVD. Normal system termination and dump generation is not possible in this environment.

Explanation: An abend occurred during product installation and all CP-owned space is on volatile storage (disk in memory).

System action: The system enters a disabled wait state (wait state code 9090).

Operator response: Take a stand-alone dump. Then contact your system support personnel.

ghost commented 9 years ago

I am not sure of the cause of your primary Hercules segmentation fault problem, but I can guess that it might have something to do with the compiler you used to build Hercules and/or the compiler options that were used (such as -O3, for example; see the recent discussion in the "Oops... Machine check on 3.11" thread on the main Hercules developers list). I did want to comment on something else you wrote in your problem report, though:

According to help of Hercules start command, start without parameter should start any stopped CPU. However Hercules issues HHC00816W Processor CP00: processor not stopped. It seems not possible to start a stopped CPU other than CPU 00. This appears to be a bug.

This is not true. Each processor can be individually started or stopped as long as you use the proper "cpu context" (or set the proper context beforehand):

HHC01603I help cpu
HHC01603I
HHC01602I Command      Description
HHC01602I -------      -------------------------------------------------------
HHC01602I cpu         *Define target cpu for panel display and commands
HHC01603I
HHC01603I Format: "cpu xx" where 'xx' is the hexadecimal cpu address of the cpu
HHC01603I in your multiprocessor configuration which you wish all panel commands
HHC01603I to apply to. If command text follows the cpu address, the command will
HHC01603I execute on cpu xx and the target cpu will not be permanently changed.
HHC01603I For example, entering 'cpu 1F' followed by "gpr" will change the
HHC01603I target cpu for the panel display and commands and then display the
HHC01603I general purpose registers for cpu 31 of your configuration. Entering
HHC01603I 'cpu 14 gpr' will execute the 'gpr' command on cpu 20, but will not
HHC01603I change the target cpu for subsequent panel displays and commands.
HHC01603I

Enhancement: Command should be enhanced to allow the specification of a discrete CPU number.

This is a valid enhancement request IMO.

I admit it would certainly be more user friendly (and probably less confusing as well) if both the 'start' and 'stop' commands supported a "cpu number" argument identifying which cpu should be started or stopped (e.g. start cpu 1f, stop cpu 14, etc).

I'll see what I can do.

fbi-ranger commented 9 years ago

Hi Fish,

Sorry, my statement should not be understood as criticism. I have never used the CPU command before. As I focus more on the operating systems, I admit that I never played much with the hardware functions. In day-to-day operation, and I look back on 30 years as a sysprog, I never saw a single stopped processor. Normally you have an error during IPL that results in a stop with a disabled PSW, so you try to correct a config problem and re-IPL. But having one processor stopped while the program continues on the others is something I have never faced before.

I had the impression that start would loop through the processors and start a stopped one. Maybe a reference to CPU in the help text would be good. E.g. The command starts that processor which was selected by CPU command. The default is processor 00.

Anyway thanks for the advice. I will try tomorrow the CPU command.

Regards, Florian

Fish-Git commented 9 years ago

FYI: there are also the startall and stopall commands, which will do exactly what one would expect them to do.

Fish-Git commented 9 years ago

I had the impression that start would loop through the processors and start a stopped one. Maybe a reference to CPU in the help text would be good. E.g. The command starts that processor which was selected by CPU command. The default is processor 00.

Sounds reasonable. I'll see what I can do.

Regarding your original problem however, have you tried rebuilding using just -O2? (or using a different compiler?) Does that make the segfault go away?

fbi-ranger commented 9 years ago

Hi Fish,

I did some recompiling. I used only -O2 and did not have the segmentation fault again.

However, when I use the latest snapshot, the compile terminates with an error. I already sent you details about that problem.

Currently I have a version that works, but the LCS devices are no longer working. I will investigate that further.

Also, starting the stopped CPU does not help because the installer system has dispatched a disabled PSW. I was definitely on the wrong track with my reflections yesterday. So the processor remains in the stopped state, and even though the other threads of the installer system continue, the installation is dead. Yesterday I did not have much time to analyse this situation in more detail. I will see what I find over the weekend.

Regards, Florian

Fish-Git commented 9 years ago

It appears all issues have now been resolved so I am closing this issue.

fbi-ranger commented 9 years ago

Fish,

The wait on one processor was unfortunately only half of the issue. It should not happen at all. Therefore I reopen the issue.

Fish-Git commented 9 years ago

I thought the wait state was caused by the CPU encountering a Machine Check, which was caused by Hercules experiencing a segmentation fault (segfault), which in turn was caused (presumably) by incorrectly optimized code generated by the version of the gcc compiler you were using to build Hercules. That seemed to be proven by the fact that the Hercules segfault (and thus, presumably, the disabled wait too) no longer occurred once you built Hercules using different gcc optimization flags.

Is that not the case? Am I misunderstanding something?

Because if it is true that the problem was ultimately being caused by your choice of optimization flags used to build Hercules with, then that is not a Hercules problem. That is a gcc problem.

If however, when you build Hercules using different gcc optimization flags and Hercules does not crash -- but your guest's IPL still fails (disabled wait), THEN something is obviously very wrong with Hercules and yes, we would definitely need to look into that!

So which is it?

Fish-Git commented 9 years ago

Some additional questions:

1. How much MAINSIZE are you allocating?
2. Did you specify ARCHLVL Z/ARCH in your configuration file before your ARCHLVL ENABLE ASN_LX_REUSE and ARCHLVL ENABLE BIT44 statements (as explained in our Release Notes document, http://hercules-390.github.io/html/hercrnot.html)? May we see your configuration file?
3. Has this configuration and guest ever successfully IPLed with Hyperion before? Or is this the very first time you have tried it with Hyperion?

I apologize for having mistakenly closed this issue earlier, but I honestly thought your issue was resolved by simply building Hercules with different optimization flags! I apologize for misunderstanding you if that is not the case!

fbi-ranger commented 9 years ago

Ad 1) 8G
Ad 2) YES --> https://gist.github.com/fbi-ranger/9d08fec7f6576106d2aa
Ad 3) No, however the z/VM 6.2 installer IPLed successfully.

No problem.

Kind regards, Florian

ivan-w commented 9 years ago

It's a known issue that GCC 4.9 is having issues with optimizing some of the hercules code. clang/llvm doesn't exhibit the issue (but there are other issues such as passing optimization flags which are unknown to that compiler).

So there are 2 issues there:

1. Why is Hercules encountering issues with gcc 4.9 while there are tens of thousands of other projects that work perfectly with it? (Yet another aliasing issue?)
2. Why isn't configure.ac testing for flag availability (notably in the decimal code directory)?

--Ivan

jphartmann commented 9 years ago

Why is Hercules getting into trouble with -O3 and gcc 4.9 when others don't. I expect that (1) few use -O3, which is advertised as dangerous, and (2) they don't have the code bloat of all the zillions of macros, for example each instruction that accesses does the address calculation over again, leading to an overly complex program which might trip the optimiser. Most other projects try not to duplicate code.

fbi-ranger commented 9 years ago

The compiler options are remnants from a time when my workstation was quite poor, so every option brought a little performance boost. Up to now I also didn't have problems with -O3. Here are the options I used:

--enable-optimization="-fomit-frame-pointer -O3 -fno-strict-aliasing -ggdb3 -D_FORTIFY_SOURCE=0 -march=native -mfpmath=sse -msse4.1 -fexpensive-optimizations"

Some of the options, such as -ggdb3, I picked up from the discussion list. -march=native optimises for the processor on which I run Hercules. I don't know if SSE is beneficial. At least it does no harm.

Florian

ivan-w commented 9 years ago

On 9/28/2015 12:55 PM, John P. Hartmann wrote:

Why is Hercules getting into trouble with -O3 and gcc 4.9 when others don't. I expect that (1) few use -O3, which is advertised as dangerous,

Reference needed !

and (2) they don't have the code bloat of all the zillions of macros, for example each instruction that accesses does the address calculation over again,

That's what is mandated by the Principles of Operation (the load and store on the xIY operations was a glitch; it's now fixed). Access calculation IS complicated when you take into account Real, Absolute, and translated addresses (DAT, DAS, Access register mode, XC mode, TLB acceleration, Instruction Fetch and Data fetch, etc.).

leading to an overly complex program which might trip the optimiser. Most other projects try not to duplicate code.

If the code is correct C code it shouldn't trip the optimizer. If it isn't, it needs to be fixed. Period. And WHAT code is being duplicated? (Most use of macros is simply to inline TLB access, in a form that is compatible with the architecture for which it is compiled.)

Most of the issues are related to erroneous aliasing. The C standard's strict aliasing rule lets the compiler treat pointers to different types as if they were restricted pointers, that is, as never referring to the same storage location. We should use "union"s instead, to indicate that modifying a storage location through one type may/will also modify the same storage location as seen through another type.

for example :

void foo(int *a, float *b) { *a=1; *b=*b+1; }

MAY (depending on optimization) lead to incorrect results if "a" and "b" point to the same location, since the C compiler optimizer WILL assume (due to the C standard strict aliasing rule) that a & b point to different locations.

The correct thing to do would be :

typedef union { int a; float b; } bar;
void foo(bar *x) { x->a=1; x->b=x->b+1; }

--Ivan

Fish-Git commented 8 years ago

@fbi-ranger (Florian):

Upon review of opened issues I happened to notice the following APAR:

VM65186: HCP955W INSUFFICIENT STORAGE WAIT STATE AT IPL

which seems to describe exactly the problem you are experiencing.

For what it's worth, I have also experienced unusual z/VM behavior myself with z/VM 5.3 (crashes, program checks, disabled waits, etc.) when a Hercules MAINSIZE value greater than 2GB is used, just as the APAR describes:

****************************************************************
* USERS AFFECTED: All users of z/VM with more than 2GB of      *
*                 central storage configured.                  *
****************************************************************

(Note: I also noticed same/similar issues when z/VM guests are defined with more than 2GB of storage too)

I am unfamiliar with how to download and install an APAR, so I cannot try this for myself (and besides, I only have z/VM 5.3), but maybe you can do that on your system to see if it helps?

I am going to close this issue but PLEASE RE-OPEN IT AGAIN if the above mentioned APAR does not fix your problem.

Thanks!