hercules-390 / hyperion


Incorrect calls to getrusage() #217

Open PeterCoghlan opened 7 years ago

PeterCoghlan commented 7 years ago

I had a problem with the CPU timer (SPT and STPT instructions). The error turned out to be one I had made myself, and I fixed it. In the course of investigating the problem, I came across code like this in two locations in clock.c:

    rc = getrusage((int)sysblk.cputid[regs->cpuad], &rusage);
    if (unlikely(rc == -1))
        result = host_tod();
    else
        result  = timeval2etod(&rusage.ru_utime);

The first argument appears to be the thread-id of a CPU thread, however, all the documentation I can find for getrusage() suggests that the first argument is supposed to be a small integer constant and that a thread-id is not valid in this position.

I inserted some debugging code to output the value of rc and errno and found that (under Linux on a Raspberry Pi at least), getrusage() fails every time it is called with rc = -1 and errno = EINVAL. This is in accordance with the expected behaviour documented in 'man getrusage' on the same platform. If nothing else, this means the use of unlikely() is inappropriate here.

It seems that there is no getrusage() in Windows and a getrusage() replacement is provided in w32util.c. However, this variant appears to have been coded to cope with a thread-id in the first argument position. (I don't have anything running Windows so I can't test whether this theory is correct.) My feeling is that this is not a valid implementation of getrusage() and if it is to be retained the way it is, it should be renamed to something else to avoid confusion.

There is a similar call to getrusage() in hsccmd.c for the qproc command:

            if (getrusage((int)sysblk.cputid[i], &rusage) == 0)
            {
                ...
            }

As no fallback method is provided in this case, it appears to me that under Linux anyway, getrusage() will always fail here and the block of code will never get executed.

There are two more calls to getrusage() in diagmssf.c like this:

            if (IS_CPU_ONLINE(i))
            {
                /* Get CPU times in microseconds */
                getrusage((int)sysblk.cputid[i], &usage);
                uCPU[i] = timeval2us(&usage.ru_utime);
                tCPU[i] = uCPU[i] + timeval2us(&usage.ru_stime);
            }

There is no check here to see if getrusage() failed, and it looks like it always will (on Linux anyway), so the results are likely to be incorrect.

I think it may be possible to rework the code in clock.c so that getrusage() is called correctly but I don't see how to do this in the other cases.

My suggestion is to remove the calls to getrusage() in clock.c, hsccmd.c and diagmssf.c and replace them with calls to a new function implemented elsewhere which takes a thread-id as an argument and calls platform specific code depending on a macro set in hostopts.h.
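
Something along these lines, perhaps (a rough sketch only, to illustrate the shape of the idea - none of the names below exist in Hercules today):

    /* Hypothetical wrapper: callers pass a CPU thread-id and never call
       getrusage() directly; the body is selected by a macro that would be
       set in hostopts.h.  All names here are illustrative only. */
    TOD thread_cputime_tod( TID tid )
    {
    #if defined( THREAD_CPUTIME_METHOD_LINUX )
        return thread_cputime_linux( tid );     /* per-thread facility, if any      */
    #elif defined( THREAD_CPUTIME_METHOD_WIN32 )
        return thread_cputime_win32( tid );     /* e.g. the w32util.c variant       */
    #else
        return host_tod();                      /* no facility: wall-clock fallback */
    #endif
    }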

jphartmann commented 7 years ago

Originally, getrusage() got the CPU usage of the process or the process and its children.

As of the linux 2.6.26 kernel, you can also obtain thread usage by:

getrusage(RUSAGE_THREAD_SELF, ...

The getrusage you quote above is not likely to give a meaningful answer on Linux. How about checking the return code? EINVAL means that you are asking for something you cannot get.

Thread usage was ported to/from FreeBSD 8.1. Other UNIXen must be investigated; it is not clear to me that they would return EINVAL, but it is likely that RUSAGE_THREAD_SELF is undefined on such a beast.

ivan-w commented 7 years ago

I read RUSAGE_THREAD (not RUSAGE_THREAD_SELF) in my documentation for getrusage() - but it's a detail.
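
For what it's worth, a minimal self-only query on Linux looks like this (on glibc, RUSAGE_THREAD needs _GNU_SOURCE and only ever reports the calling thread, so it cannot ask about another CPU thread):

    #define _GNU_SOURCE              /* exposes RUSAGE_THREAD on glibc */
    #include <sys/resource.h>
    #include <stdio.h>

    int main( void )
    {
        struct rusage ru;

        if (getrusage( RUSAGE_THREAD, &ru ) == 0)
            printf( "user %ld.%06ld  sys %ld.%06ld\n",
                    (long) ru.ru_utime.tv_sec, (long) ru.ru_utime.tv_usec,
                    (long) ru.ru_stime.tv_sec, (long) ru.ru_stime.tv_usec );
        return 0;
    }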

I'm interested in seeing this through because I'm getting some issues running z/VM under z/VM which MAY be related to a CPU Timer issue (but I haven't finished my analysis yet).

dasdman commented 7 years ago

Peter:

Be careful on this path (I was the one who originally put getrusage into Hercules for the CPU Timer usage; the architecturally incorrect call to host_tod was the call that was in use at the time the changes were made to use getrusage).

Historically, getrusage accepts[ed] the thread ID (but one had to read past the first-level *nix documentation). As far as the "int" for /who/ goes, that means 16-bits in a 32-bit compilation, and 32-bits in a 64-bit compilation -- thus actually matching t_id within the system. The reason getrusage is supposed* to still handle this correctly is that one, by original design, is to be able to get the CPU usage of an individual child thread (or any other thread) within the same process. This does not say, however, that the "gods of Linux" haven't forgotten this and in the "name of security" made changes that should have never been made, or something was munged in the Raspberry Pi porting of Linux.

Now the reason why this works[ed] is that thread_ids are positive, while the getrusage /who/ macro definitions in use are (or were) usually -2, -1, 0, and 1, all of which are special cases that don't occur for user thread id's (http://lxr.free-electrons.com/source/include/uapi/linux/resource.h#L18). It is also possible on the Raspberry Pi that there are other definitions in play that are affecting getrusage.

Sidebar: In a quick review of some of the Linux kernel code, it appears that some of the modules have now incorrectly hardcoded some of the thread ids specifically as u32. If this is what is going on with the Raspberry Pi issue you are having, it is specifically an error in Linux, be it either with the getrusage definition or with the kernel. But this also explains why there is other software that fails specifically on the Raspberry Pi. Somebody forgot about certain definitions between the user calls and the kernel calls.

In addition, Hercules is Linux based by design, and Windows support is provided as a "layer" to provide equivalent Linux service calls (should this be called "Line" instead of "Wine" ("WINdows Emulator")?). We ended up with the creation of the layer for Windows emulating the Linux calls due to various issues with Cygwin and other utilities under Windows (and, yes, we tracked down many of the issues to fundamental design and implementation flaws in Cygwin, et al.). Placing the Linux specific calls into a separate layer as was done for Windows at this particular point in time is not recommended as there are more productive areas for improvement and advancement of Hercules.

dasdman commented 7 years ago

Peter and Ivan:

Additional note. If getrusage fails, doublecheck errno for the reason. Also, the definition of /who/ does not preclude thread_id; RUSAGE_THREAD, et al., are actually shorthand indicators such that one does not need the thread_id to obtain the proper results for the indicated thread(s). However, in Hercules, we need to individually get the updates for threads which are not necessarily the running thread.

jphartmann commented 7 years ago

Neither the FreeBSD manpage nor the Linux one mention specifying a thread ID. If it "works", that is by the UNIX weenie's definition of "works". Stay away from that.

dasdman commented 7 years ago

Just verified both the BSD and Linux kernel source code. BOTH have been modified in the existing call sequence to define our usage as EINVAL. This was not the case at the original point in time, which is why we did not "stay away". Since then there has been chatter about adding a "proper" call on the Linux lists, but I have not found the actual implementation of it in my brief search so far. Grrrr.

ivan-w commented 7 years ago

I think there are multiple situations :

If there is no facility to retrieve thread run time information, there is not much that can be done other than to fall back to process times or wall times. For systems that only provide thread time info to self, for a CPU thread it's probably possible to use some exchange mechanism to request the information from the thread (this is already done for some other purposes such as CPU synchronization, Broadcast PTLB, etc.).

I'm personally more concerned about the 1st case (and its sub-cases)... The rest is usually only needed for information purposes, not for anything nanosecond-accurate !

dasdman commented 7 years ago

Following some very early morning research, here is something for someone to work on, try, and adapt.

Sample replacement code:

    TOD
    thread_cputime(const REGS *regs)
    {
        TOD             result;
        clockid_t       cid;
        struct timespec cputime;
        int             rc;

        rc = pthread_getcpuclockid(sysblk.cputid[regs->cpuad], &cid);
        if (likely(rc == 0))
        {
            rc = clock_gettime(cid, &cputime);
        }
        /* NB: cputime is a struct timespec (tv_nsec), so the timeval-based
           conversion below needs a timespec-aware equivalent. */
        result = (likely(rc == 0) ? timeval2etod(&cputime) : host_tod());

        return (result);
    }

This code should readily work; a Windows equivalent will also need to be created. On another note, the use of the register specification for variables, unfortunately, has been deprecated. With -O0 and -O1, this is a significant loss; with -O2 and -O3, not as much, but I've noticed more than a few instances where GCC no longer optimizes when it should.

jphartmann commented 7 years ago

You'll find the standard here:

http://pubs.opengroup.org/onlinepubs/9699919799/functions/getrusage.html

Note that it does not mention RUSAGE_THREAD.

Also note that the contents of the returned structure are not specified.

ivan-w commented 7 years ago

John,

Of course it doesn't. RUSAGE_THREAD is a linuxism (and possibly a GNUism). But I cannot find in posix (overall) or posix threads a capability to retrieve thread specific run time information.

The layout of the "rusage" structure isn't defined per se for getrusage(), but the fields the structure should include are defined in : http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/sys_resource.h.html
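
In other words, only these two members are guaranteed (sketched here under an illustrative tag name so as not to clash with the real header):

    #include <sys/time.h>            /* struct timeval */

    /* The members POSIX requires <sys/resource.h> to provide in struct rusage;
       real implementations add many more fields after these. */
    struct posix_min_rusage {
        struct timeval ru_utime;     /* user CPU time used   */
        struct timeval ru_stime;     /* system CPU time used */
    };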

jphartmann commented 7 years ago

getrusage() has been there as long as I remember. It is both in SVr4 and 4.3BSD and thus AIX, but not the thread support, of course.

PeterCoghlan commented 7 years ago

I have one more suggestion.

It appears to me that at the moment the fallback method in the event that thread specific CPU time information is not available is to use wall clock time for the CPU timer. My suggestion is that the fallback method should be to use wall clock time less whatever time the particular Hercules CPU in question has spent in a wait state. This ought to be a much better approximation than assuming the CPU has been running all the time.

ivan-w commented 7 years ago

Peter, That would be odd since the CPU timer is supposed to keep running when the CPU is in a wait state.

PeterCoghlan commented 7 years ago

Well, now I'm confused!

Do the times returned by getrusage() include the time that a CPU thread spends waiting I wonder? I assumed that they don't and that the reason for having the calls to getrusage() was because the CPU timer only ran when the CPU was executing instructions.

If the CPU timer runs all the time then perhaps using wall clock time is the right thing to do and there is no reason to be calling getrusage() at all?

ivan-w commented 7 years ago

Peter,

No.. Because the wall time may not reflect the time the CPU thread is actually executing instructions. The thread may be preempted by the OS to execute another thread, some paging may occur, the CPU instruction execution code may make system calls which may suspend the thread, or the system running hercules may itself be virtualized.

Using the 'process' getrusage would on the other hand mean you are counting for 1 CPU thread the amount of time running ALL threads in hercules between 2 points in time. The only solution for an accurate CPU Timer maintenance is to be able to get the actual usage time for the thread while it is executing instructions.

PeterCoghlan commented 7 years ago

Ok. It sounds like the time spent executing instructions by the CPU thread would need to have the time spent by the same CPU in a wait state added to it then, to come up with an accurate figure.

ivan-w commented 7 years ago

Peter..

Yeah... We know that ! The issue (as reported) is exactly that : Getting the time spent by the CPU thread executing instructions. The rest is housekeeping.

ivan-w commented 7 years ago

For info... I tried that : (just to see) :


#if defined(RUSAGE_THREAD)
    if(pthread_equal(sysblk.cputid[regs->cpuad],pthread_self()))
    {
        rc = getrusage(RUSAGE_THREAD, &rusage);
    }
    else
#endif
    {
        rc = getrusage((int)sysblk.cputid[regs->cpuad], &rusage);
    }
    if (unlikely(rc == -1))
        result = host_tod();
    else
        result  = timeval2etod(&rusage.ru_utime);

    return (result);

It doesn't work.. AT ALL ! (the CPU Timer gets really crazy values).

(Example : a program running in a loop, disabled...)

    (Wait ~16 seconds)
    17:45:04 HHC02324I CP00: PSW=0008000080042010 INST=B208C42A SPT 1066(12) set_cpu_timer
    17:45:04 HHC02326I CP00: R:00042430:K:06=00000010 00000000 00000000 00000000 ................
    ... (1st STPT)
    17:45:04 HHC02324I CP00: PSW=0008100080042418 INST=B209C432 STPT 1074(12) store_cpu_timer
    17:45:04 HHC02326I CP00: R:00042438:K:06=00000000 00000000 00000010 00000001 ................ (0 ?????)
    .... (2nd STPT)
    17:45:04 HHC02324I CP00: PSW=0008100080042418 INST=B209C432 STPT 1074(12) store_cpu_timer
    17:45:04 HHC02326I CP00: R:00042438:K:06=2D9E184C 009EB600 00000010 00000001 ...<............ (What the ????)
    .... (3rd STPT)
    17:45:04 HHC02324I CP00: PSW=0008100080042418 INST=B209C432 STPT 1074(12) store_cpu_timer
    17:45:04 HHC02326I CP00: R:00042438:K:06=2D9E184B FB0B2700 00000010 00000001 ................ (We're decrementing alright...)

--Ivan

ivan-w commented 7 years ago

Side comment... Apparently Posix Threads have a per thread clocking mechanism

pthread_getcpuclockid() used in conjunction with clock_gettime()... Could you use this ?

--Ivan

Fish-Git commented 7 years ago

(sigh!) Sorry about that. Here are the corrected (non-framed) links:

dasdman commented 7 years ago

From the comments, it appears that folks are repeating the research already done to even post what I already posted (2017-04-12 11:32:40 UTC) -- well BEFORE the current barrage of responses! Would it be possible to actually move the comments and research ahead?

ivan-w commented 7 years ago

True ! My bad - missed your previous comment about pthread_getcpuclockid() et al.

So.. Question 1 : Do we need to invoke the pthread_getcpuclockid() function every time or just at thread creation (and store it in an array - one for each CPU - like we do for the thread id) ? Question 2 (for Fish) : What would it take to implement a fthread__getcpuclockid() and clock_gettime() ?

--Ivan

jphartmann commented 7 years ago

Q1: Yes.

ivan-w commented 7 years ago

John,

Why ? Are there any conditions that would cause the clock id returned by the pthread_getcpuclockid() to change within the lifespan of the thread ?

jphartmann commented 7 years ago

Sorry. Perhaps you could ask one question only?

Q1: No, Yes.

Fish-Git commented 7 years ago

Question 2 (for Fish) : What would it take to implement a fthread__getcpuclockid() and clock_gettime()?

Coding identical functionality in Windows should be trivial, I should think. We're basically almost already doing the same thing with our current getrusage() implementation.
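
Something like the following, I would guess (just a sketch using the documented GetThreadTimes() API; the helper name and conversion are illustrative, not the actual w32util.c code):

    #include <windows.h>

    /* Return the given thread's user+kernel CPU time in microseconds,
       or 0 on failure.  FILETIME values are in 100-nanosecond units. */
    static ULONGLONG win_thread_cpu_us( HANDLE hThread )
    {
        FILETIME ftCreate, ftExit, ftKernel, ftUser;
        ULARGE_INTEGER k, u;

        if (!GetThreadTimes( hThread, &ftCreate, &ftExit, &ftKernel, &ftUser ))
            return 0;

        k.LowPart = ftKernel.dwLowDateTime;  k.HighPart = ftKernel.dwHighDateTime;
        u.LowPart = ftUser.dwLowDateTime;    u.HighPart = ftUser.dwHighDateTime;

        return (k.QuadPart + u.QuadPart) / 10;    /* 100ns units -> microseconds */
    }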

Peter-J-Jansen commented 6 years ago

I think I have fixed this problem.

The issue was first described by Peter Coghlan (as Issue #217 in the official Hyperion GitHub), and has now been solved along the lines initially described by Mark "dasdman" L. Gaubatz, and commented on by Ivan, John, and Fish (all of which can be read up on GitHub).

The solution as described by Mark needed some amendments, and I would like to receive feedback on the choices I made in the process. The extra problems to solve were these :

(1) Mark's original getrusage() produced two statistics : thread user time, and thread system time.  Fish's Windows equivalent implementation for getrusage() supplies the Windows thread user and kernel times for these.  Only when running under Windows does the Hercules qproc command correctly show these CPU thread user and kernel times.  This qproc behavior has been retained as is.

(2) The clock_gettime() replacement for getrusage() has been mentioned on the 'net to produce the sum of both user and system time.  I have assumed this is correct, even though I'd like further confirmation that it is.  The Windows clock_gettime() equivalent for thread CPU time therefore returns the sum of the Windows user and kernel times.

(3) The original "diagmssf.c" code for DIAG 204 delivered the VM virtual CPU time as the getrusage() user time (uCPU); the VM total CPU time (tCPU) was computed as the sum of the getrusage() user+system time; the VM wait time (wCPU) was computed as the difference tCPU-uCPU.  This would imply that the VM wait time equals the getrusage() system time.  I don't think that is correct.  In order to correct this I have added and computed regs->waittime_accumulated in "timer.c" to set wCPU. (This requires OPTION_MIPS_COUNTING to be #defined, but after inspection, I think that is already implicitly required anyway in other places.)  With uCPU set to clock_gettime() thread user+system time, tCPU is now computed as uCPU+wCPU. I believe this delivers the more accurate DIAG 204 data.

(4) The clock_gettime() for thread user+system CPU time needs the thread clockid.  Following the thinking as voiced by Ivan and John, I've added sysblk.cpuclockid[cpu] so that only one pthread_getcpuclockid() call per CPU thread is needed (a small sketch of this pattern follows this list).

(5) The Windows equivalent for thread clockid simulates this as the negative of the thread_id, and the clock_gettime() equivalent then uses this negative to call the existing Windows getrusage() to obtain and sum the Windows thread user and kernel times.  Fish may wish to improve or beautify this simple but straightforward concept.   
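
For reference, a minimal sketch of the caching pattern from point (4) above (names simplified, not taken verbatim from my changes):

    #include <pthread.h>
    #include <time.h>

    #define MAX_CPUS 8                            /* illustrative size only */

    static clockid_t cpuclockid[ MAX_CPUS ];      /* one cached clockid per CPU thread */

    /* Called once, when the CPU thread is created */
    static int cache_cpu_clockid( int cpu, pthread_t tid )
    {
        return pthread_getcpuclockid( tid, &cpuclockid[ cpu ] );
    }

    /* Called whenever the thread's accumulated CPU time is wanted; clock_gettime()
       on a per-thread CPU clock returns the sum of user and system time. */
    static int get_cpu_thread_time( int cpu, struct timespec *ts )
    {
        return clock_gettime( cpuclockid[ cpu ], ts );
    }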

I'd welcome feedback on the above.

Cheers,

Peter J.

superdave commented 6 years ago

There seem to be a few Unix implementations (including OS X 10.11, which I'm running, but also at least Solaris 11) that lack one or both of pthread_getcpuclockid() and clock_gettime().

superdave commented 6 years ago

On earlier OS X, it looks like Mach's thread_info() is the way to go (yuck).

https://github.com/st3fan/osx-10.9/blob/master/xnu-2422.1.72/osfmk/mach/thread_info.h

On Solaris, I'm not quite sure what the answer is; it's surmised that getrusage() with RUSAGE_LWP is probably correct, since a thread is an LWP in Solaris.

I don't think there's a consistent solution for every Unix OS for this since threads arrived well after the various paths of Unix diverged and GNU/Linux does its own thing anyway.
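
For the record, the Mach call looks roughly like this (self-thread only in this minimal sketch; querying another thread would need its Mach port, e.g. via pthread_mach_thread_np()):

    #include <mach/mach.h>
    #include <stdint.h>

    /* Returns the calling thread's user+system CPU time in microseconds,
       or 0 on failure.  Mach reports time_value_t = seconds + microseconds. */
    static uint64_t mach_thread_cpu_us( void )
    {
        thread_basic_info_data_t info;
        mach_msg_type_number_t   count  = THREAD_BASIC_INFO_COUNT;
        thread_t                 thread = mach_thread_self();

        kern_return_t kr = thread_info( thread, THREAD_BASIC_INFO,
                                        (thread_info_t) &info, &count );
        mach_port_deallocate( mach_task_self(), thread );

        if (kr != KERN_SUCCESS)
            return 0;

        return (uint64_t)(info.user_time.seconds + info.system_time.seconds) * 1000000ULL
             + info.user_time.microseconds + info.system_time.microseconds;
    }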

PeterCoghlan commented 6 years ago

Thanks to everyone for their contributions to this issue.

I originally reported this issue only because I found that the calls to getrusage() were failing all the time resulting in inefficiency due to falling back to an alternative method of calculating the CPU timer value.

Having read the many responses and studied the issue some more, I have become convinced that trying to use thread specific CPU usage time to calculate the value of the Hercules CPU timer is not the right way to go and will never yield the correct results.

Given that the CPU timer is to run both when the CPU in question is running and when the CPU in question is waiting, I believe that the correct solution is to use a CPU specific offset from wall clock time to calculate the current value of the CPU timer except when the CPU in question is stopped, in which case its CPU timer is also stopped.

That methods of calculating thread specific CPU usage are variably implemented and system specific is all the more reason not to use them, on top of their not being suitable for the task at hand.
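
To illustrate what I mean (purely a sketch of the idea, not code from Hercules): the CPU timer becomes an offset held against wall clock time, frozen only while the CPU is stopped.

    #include <stdint.h>

    /* Illustrative only: per-CPU timer kept as an offset from wall clock time. */
    typedef struct {
        uint64_t epoch;      /* wall clock (e.g. host_tod()) when the timer was last set */
        int64_t  setval;     /* CPU timer value at that moment                           */
        int64_t  frozen;     /* value captured when the CPU was manually stopped         */
        int      stopped;    /* nonzero while the CPU is stopped                         */
    } CPU_TIMER_SKETCH;

    static int64_t cpu_timer_value( const CPU_TIMER_SKETCH *t, uint64_t now )
    {
        if (t->stopped)
            return t->frozen;                         /* timer does not run while stopped        */
        return t->setval - (int64_t)(now - t->epoch); /* otherwise decrements at wall-clock rate */
    }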

dasdman commented 6 years ago

First off, no implementation of the CPU timer will be "perfect" under emulation. Everything is best effort, and varies according to OS and chipset. The most accurate measure, provided by most operating systems, is the thread time, but that, as you noticed, is not always possible. Thus we must delve into secondary methods when and only when that facility is not available, and that is where your suggestion does indeed have merit. But just maintaining a simple offset value is insufficient. A combination of methods is actually needed.

Failover is occurring as the getrusage form used was correct WHEN it was introduced, but was using "loopholes" in the specification based on reading the underlying code implementing getrusage. This worked well for many years. The form used by failover is the ORIGINAL Hercules implementation for the CPU Timer.

The CPU Timer thread needs a mechanism to establish the amount of time the CPU is in a runnable or wait state, vs. not dispatched/running or stopped. The proposed mechanism deals only with stopped.

The real reason for using getrusage under Linux and systems supporting the facility is that it provides access to a built in timing facility that does not run when a given (CPU) thread is not running. That is key for the CPU Timer, as the CPU Timer does not run when an LPAR or CPU is not dispatched/running on the real hardware, just as a userid's CPU on VM (take a peek at VM's CP source code some time in regards to CPU Timer handling). So under an alternative, what method are you suggesting to deal with this point to achieve somewhat accurate times?

PeterCoghlan commented 6 years ago

The understanding I have come to is that "CPU" in the title "CPU timer" is there purely to indicate that each CPU has one. It is not there to suggest that this timer has anything to do with measuring anything associated with the CPU. It is just a timer. Its purpose is to measure time, like the clock on the wall. It should keep going at the same rate whether the CPU is running or waiting and it only stops when the CPU is manually stopped.

It could be argued that it is a more realistic emulation of hardware by Hercules to slow a particular CPU timer when the associated Hercules CPU emulation thread is not getting much execution time but I disagree. A real hardware CPU timer is not going to diverge from running at wall clock speed as long as the associated CPU is not manually stopped. Attempting to measure CPU usage of CPU threads may lead to more realistic emulation in one respect but it is being traded for a loss of accuracy elsewhere. I feel the smaller gain in accuracy does not justify the larger loss of accuracy.

Regarding CP emulation of CPU timers, the CPU that CP is emulating has the same characteristics, architecture and instruction set as the real CPU running the emulation, so perhaps it may be feasible for CP to adjust its emulated CPU timers based on what its emulated CPUs are doing. However, in Hercules, the CPU(s) running the emulation is/are complete unknown quantities with characteristics, architectures and instruction sets which are almost always completely different to the CPUs being emulated. I feel it is unrealistic and prone to error to equate usage of CPU time by an emulation thread in the Hercules environment with emulated CPU time made available by that thread.

jphartmann commented 6 years ago

In an overcommitted LPAR environment, the CPU timer will not follow the wall clock.

Incidentally, there is no longer a CPU timer register; the value is computed by millicode on demand.

PeterCoghlan commented 6 years ago

Can I put it another way and say that when we are not emulating an LPAR environment, the CPU timer will follow the wall clock?

If it is desirable to be able to very accurately emulate an overcommitted LPAR environment, I think we need a way to measure emulated CPU time made available by Hercules CPU emulation threads. I don't think measuring the CPU usage time of the unknown quantity emulating CPU or CPUs is the way to do this. Perhaps it would be possible to assign a theoretical time taken by each instruction emulated by Hercules, individually sum the times required for all the instructions emulated by each particular CPU thread, add time spent waiting by each emulated CPU (should that be wall clock time waiting or some other definition of time waiting based on what?) and use that figure to drive the CPU timer for each Hercules emulated CPU? Will this produce a more accurate result in the case of an overcommitted LPAR environment? Perhaps, depending on the accuracy of the instruction timings that were assigned, but noting that in real hardware, differing operand data will likely result in differing timings and we have not accounted for that. Will it produce a more accurate result in a non-LPAR environment? If it is more accurate than using the wall clock, I'll eat my hat.

ivan-w commented 6 years ago

In an overcommitted LPAR environment, the CPU timer will not follow the wall clock.

That's not what the POO says...

When the TOD-clock-steering facility is not installed,
and both the CPU timer and the TOD clock are running,
the stepping rates are synchronized such that
both are stepped at the same rate.

So unless someone issued a PTFF instruction (to account for a leap second or an NTP update..)

And no matter what...

How can the CPU timer run SLOWER when executing instructions rather than being in a wait state ?

--Ivan

jphartmann commented 6 years ago

In an LPAR environment, a partition can be suspended; in that state its CPU timer is not decremented.

As there no longer are CPU timer registers in the hardware, it is futile to argue about what something that isn't there does or does not do.

PeterCoghlan commented 6 years ago

There is not necessarily a conflict there. Perhaps the CPU timers (all of them) and TOD clock will all run equally slow in an overcommitted LPAR environment when compared to the wall clock?

Thinking some more, if all the CPU timers in a particular environment - LPAR, Hercules instance or whatever - don't run at the same rate all the time (excluding the stopped CPU case), timing similar events on different (emulated) CPUs using the CPU timer will yield different results. This is unlikely to be the desired effect.

Here's another case to chew on. Consider Hercules emulating two CPUs while running on single CPU hardware. Using the thread CPU usage model, each emulated CPU timer could end up running at less than half wall clock speed. Exactly the same setup running on a four CPU host would likely have its emulated CPU timers running much faster.

ivan-w commented 6 years ago

In an LPAR environment, a partition can be suspended; in that state its CPU timer is not decremented.

That's the equivalent of a manual stopped state... Nothing to do with a CPU overcommit in a PR/SM or z/VM hypervisor...

As there no longer are CPU timer registers in the hardware, it is futile to argue what something that isn't there does or does not.

That's an implementation detail of some three letter vendor - nothing to do with what this implementation does !

Fish-Git commented 6 years ago

For those of us with little or no experience with LPAR hardware, what is "an overcommitted LPAR environment"?

ivan-w commented 6 years ago

For those of us with little or no experience with LPAR hardware, what is "an overcommitted LPAR environment"?

That's a good question. I understand memory overcommit (where more memory is allocated than what is available) - which may entail paging/swapping if the hypervisor allows it - and permanent storage overcommit in the case of thin-provisioned storage resources - but CPU ?

I guess it would mean the sum of the LPAR absolute shares exceeds 100% of the available resources in the assigned resource pool... But that's dubious !

Peter-J-Jansen commented 6 years ago

Am I correct in assuming that when an LPAR partition is in the suspended state, all its CPUs are in the stopped state ? If not, in what state would all its CPUs be otherwise ?

The PoP specifies that a CPU timer is only not being decremented when its CPU is in the stopped or check-stop states, and I believe that this is being catered for in the Hercules implementation.