time: time.Sleep and time.NewTimer: Fix duration parameter error

johnrs commented 7 years ago

What version of Go are you using (`go version`)?

go version go1.7.3 windows/amd64

What operating system and processor architecture are you using (`go env`)?

set GOARCH=amd64 set GOHOSTARCH=amd64 set GOHOSTOS=windows set GOOS=windows

What did you do?

I wrote a test which is basically a loop consisting of two things: Some busy work for about 500us and a call to time.Sleep(1 * time.Millisecond).

Here's a link to the test program: https://play.golang.org/p/VYGvIbTPo3

It can't run on the playground, so you will have to grab it and run it yourself.

What did you expect to see?

  work(~500us) + time.Sleep(1ms): 1.500000ms    ... or slightly more
  work(~500us) + time.After(1ms): 1.500000ms    ... or slightly more

What did you see instead?

  work(~500us) + time.Sleep(1ms): 1.030103ms
  work(~500us) + time.After(1ms): 1.110111ms

Discussion

I believe that there are errors in the doc's for time.Sleep and time.NewTimer. They both make similar statements regarding the duration.

Sleep pauses the current goroutine for at least the duration d.

NewTimer creates a new Timer that will send the current time on its channel after at least duration d.

If these are correct, then the test loops should take ~1.5ms or longer per pass. But they don't.

I believe that at least the duration d was an attempt to convey that overhead might result in a longer duration than requested. But there are two other factors which might result in a shortened duration, sometime MUCH shorter. These factors are the resolution and phase of the time source used for these functions.

In Windows, the resolution is normally 1ms for Go programs. I'll refer to this value as r, short for resolution. I'm not sure what it is on other platforms. If you run the test program it will measure and display it. Please let me know what you find.

I believe that in general the duration can be as short as d-r (rather than d). How short simply depends on what the phase of the time source when you use it. Thus if you ask for a 100ms delay you will end up with something between 99 and 100 (plus overhead). Thus as you use smaller values of d (approaching r) the problem gets worse. By itself, this isn't too bad. But now the other shoe drops.

In the specific case that you ask for a delay of r (or anything between 1 and r), then the delay can be anything between 0 and r. Actually, for a reason I don't quite understand, the lowest delay I've actually seen is about 50us, not 0. Even so, that is 20 times less than the 1ms requested. This was drastic enough to break some of my code. I changed d from 1ms to 2ms. The resulting delay is between 1 and 2ms (plus overhead).

Can this cause a problem? Generally, in a loop only the first pass is affected, because the loop gets in phase with the timing source. So this might or might not be bad for you. If it is, put a sleep 1ms call just before the loop.

But imagine that you are using time.After to set up a timeout, but it occurs too quickly. Perhaps 20 times too quickly. This can cause intermittent, non-deterministic, bogus timeouts, even on a single-threaded program. Ouch! This is what bit me.

Another gotcha: Using d of 10ms now, but Go Windows switches r from 1ms to 15.6ms. Instead of a delay of 9 - 10ms you now get 0 - 15.6ms.

Possible Solutions

The really bad case when d is between 1 and r could be detected causing d to be changed to r+1. Tacky, in my opinion. Alternatively, vet could look for this case.

Most programming languages I've worked in allow this to happen, but they don't document it incorrectly. This seems like the best solution to me. Simply fix the doc's to reflect reality.

Perhaps a paragraph at the top describing what the "resolution of the time source" means, what it is for some typical platforms, and an example of a simple routine to determine what yours is. Maybe even a system call to retrieve it.

Then, change the current time.Sleep description from

Sleep pauses the current goroutine for at least the duration d. A negative or zero duration causes Sleep to return immediately.

to something like

Sleep pauses the current goroutine for at least the duration d - 1. A duration less than or equal to the time source resolution can cause Sleep to return quickly. A negative or zero duration causes Sleep to return immediately.

And similar for time.NewTimer's description.

rakyll commented 7 years ago

Not sure this is a documentation issue. On darwin and linux, we sleep more than duration (at least, i cannot reproduce the other case). The Windows behavior looks broken to me.

johnrs commented 7 years ago

Interesting. Were you able to run the test program? I'd be curious to see your results.

If it is just a WIndows problem then perhaps adding 1 to whatever d is, as long as it isn't 0, might be all it takes to solve it.

rakyll commented 7 years ago

What I see on darwin:

Verifications

  time.Now resolution: 48ns
  time.Sleep(1ms):     1.266082ms
  time.After(1ms):     1.243748ms
  doWork(~500us):      501.122µs    (Adjust as needed)

Tests

  work(~500us) + time.Sleep(1ms): 1.904777ms
  work(~500us) + time.After(1ms): 1.881365ms

rakyll commented 7 years ago

/cc @alexbrainman

johnrs commented 7 years ago

Wow! I was thinking that perhaps the Linux time source would be good to 1us. 48ns!

This is what I see for the verifications on my Windows machine.

  time.Now resolution: 1.0001ms
  time.Sleep(1ms):     1.0001ms
  time.After(1ms):     1.0001ms
  doWork(~500us):      480.048µs    (Adjust as needed)

I see what you mean about the resulting durations being a bit (about 25%) too long. That makes no sense at all to me. It makes me wonder if there is a math error somewhere (using 4 instead of 5, or vice versa). Perhaps it is worth creating an issue for this. 25% is too much to ignore.

rsc commented 7 years ago

@johnrs, what output do you get from your test program on Windows? And which version of Windows are you running?

I tried it on our Windows testing systems and I get:


Verifications

  time.Now resolution: 976.5µs
  time.Sleep(1ms):     1.953ms
  time.After(1ms):     1.953ms
  doWork(~500us):      332.01µs    (Adjust as needed)

Tests

  work(~500us) + time.Sleep(1ms): 1.97253ms
  work(~500us) + time.After(1ms): 1.97253ms

A 976.5µs interval is 1024 Hz. In order to sleep 1ms, we wait for two ticks to elapse (one is not enough), giving 1.953ms.

I'm curious what resolution you are seeing.

The documentation is a correct statement of the semantics we intend to guarantee.

The implementation is that on entry to time.Sleep, the runtime calls runtime.nanotime, adds the duration, records that sum, and puts the goroutine into a list of pending timers. At some point in the future, the timer-handling goroutine calls runtime.nanotime, sees that the current time is now at or beyond the time at which the original goroutine should be woken up, and wakes it up.

There are always going to be rounding errors possible depending on the actual precision of the underlying system clock. I'm not sure what there is to do about that. In general we probably don't want to make all sleeps take longer than they should.

FWIW, on Windows runtime.nanotime is using "Interrupt Time", which is at least recorded in 100ns units. It may only be updated at 1024 Hz. Similarly, time.Now uses "System Time", which again is 100ns precision but maybe only 1024 Hz accuracy.

johnrs commented 7 years ago

When I run the test on my system this is what I get:

Verifications

  time.Now resolution: 1.0001ms
  time.Sleep(1ms):     1.0001ms
  time.After(1ms):     1.0001ms
  doWork(~500us):      480.048µs    (Adjust as needed)

Tests

  work(~500us) + time.Sleep(1ms): 1.030103ms
  work(~500us) + time.After(1ms): 1.110111ms

I'm running Windows 7.

I believe that most Windows systems are 1.001ms (minimum) and 15.600ms (default). This is what I've seen on various versions and systems over the years. It's also what I see most people refer to when discussing it.

The ClockRes program, part of the SysInternals suite (www.sysinternals.com), shows this "system clock resolution" when I'm running a Go program:

Current timer interval: 1.000 ms

So it seems that our systems differ in 2 ways: 976.5us vs 1.0001ms (no big deal), and two ticks vs 1 tick for the Sleep and After timing. Strange!

Using 1 extra clock tick, as you describe, sounds like a good idea to me. At first I thought this might be a tacky solution, but upon further consideration I like it. It makes is "safe" (no 20 times too short intervals) and that's the most important thing. But that's not what seems to be taking place on my system.

btw - If I run the test under Go 1.6.3 I see the same results, but based on 15.6ms instead of 1ms. I first noticed this problem under Go 1.5.2 and posted a note about it in go-nuts on 2015/12/24, so it's nothing new. I just forgot to follow up on it then.

alexbrainman commented 7 years ago

I have nothing to add to @rsc explanation. My Windows 7 outputs:

Verifications

  time.Now resolution: 976.5µs
  time.Sleep(1ms):     2.060415ms
  time.After(1ms):     1.953ms
  doWork(~500us):      361.305µs    (Adjust as needed)

Tests

  work(~500us) + time.Sleep(1ms): 2.021355ms
  work(~500us) + time.After(1ms): 1.97253ms

You are correct that Windows 7 have 2 timer resolutions: 1ms or 16ms. When system starts, it defaults to 16ms. But once any program calls timeBeginPeriod(1), the wide system timer resolution goes to 1ms, and stays with 1ms until the program exits.

Initially every Go program called timeBeginPeriod(1) at the start. But then ab4c9298b8185a056ff1152f2c7bd9b38d3d06f3 stopped that. And then we discovered some problems with our change, and 0435e88a119fd057aa7209591ba3dff122c9f24c started calling timeBeginPeriod(1) again.

Alex

alexbrainman commented 7 years ago

@johnrs you could also debug what is happening on your computer. As far as I remember most code will be in runtime\time.go.

Alex

johnrs commented 7 years ago

I did a safe-boot, just to see if any software on my system might be affecting it. I got the same result. I also tried a 32-bit version. Still the same. I sent a copy of the exe's to a few friends locally and asked them to try it. I'll let you know what they find.

What type of debugging did you have in mind?

johnrs commented 7 years ago

@rsc & @alexbrainman - I have found something interesting. The problem seems to be related somehow to the Windows system's minimum timer interval (which is 0.5ms on newer systems but 1ms on older systems, from what I read). Simple way to tell. Grab this program and run it. Here's what I see.

ClockRes v2.0 - View the system clock resolution
Copyright (C) 2009 Mark Russinovich
SysInternals - www.sysinternals.com

Maximum timer interval: 15.600 ms
Minimum timer interval: 0.500 ms
Current timer interval: 15.600 ms

Go switches you to 1ms. The timeBeginPeriod call is hard coded to use only 1ms. But if you run something else which switches to 0.5ms (using ntdll.dll - NtSetTimerResolution) at the same time then the nature of the problem I'm seeing changes. It goes away!

If anyone wants to duplicate this, here's a handy tool to make the change and to show what the system is current running.

Windows System Timer Tool
TimerToolV3.zip - 24.08.15 [16:46 UTC] by tonfilm
https://vvvv.org/contribution/windows-system-timer-tool
https://github.com/tebjan/TimerTool

I've upgraded my test progam by adding more test periods, and mde them longer for better precision. Here it is: https://play.golang.org/p/7AwpUwZSDa

I'll follow up with the exact info on what I saw today shortly.

johnrs commented 7 years ago

Here are my results using the new version of the test program. The first is the default situation, letting Go switch the system timer to 1ms. The second is with another program running in the background which has set the system to 0.5ms. In the second case the problem goes away!

=====  #1 - System timer at default (set to 1ms by Go)

Maximum timer interval: 15.600 ms
Minimum timer interval: 0.500 ms
Current timer interval: 1.000 ms

--- Test program results

Checks

  time.Now resolution:  1.000ms
  Work(~250us):         246us

Tests
                  0          1        0.5ms       1ms       1.5ms       2ms
               -------    -------    -------    -------    -------    -------
  time.Sleep:  0.000ms    1.000ms    1.000ms    1.004ms    2.000ms    2.000ms
  ... + work:  0.360ms    1.008ms    1.002ms    1.008ms    2.008ms    2.014ms
  time.After:  0.002ms    1.000ms    1.000ms    1.000ms    2.000ms    2.002ms
  ... + work:  0.296ms    1.004ms    1.006ms    1.004ms    2.004ms    2.010ms

=====  #2 - System timer interval set to 0.5ms (which supersedes Go's 1ms request)

Maximum timer interval: 15.600 ms
Minimum timer interval: 0.500 ms
Current timer interval: 0.500 ms

--- Test program results

Checks

  time.Now resolution:  0.500ms
  Work(~250us):         235us

Tests
                  0          1        0.5ms       1ms       1.5ms       2ms
               -------    -------    -------    -------    -------    -------
  time.Sleep:  0.000ms    1.009ms    1.000ms    1.000ms    2.004ms    2.000ms
  ... + work:  0.279ms    1.214ms    1.329ms    1.476ms    2.425ms    2.457ms
  time.After:  0.004ms    1.001ms    1.000ms    1.000ms    2.001ms    2.002ms
  ... + work:  0.310ms    1.321ms    1.007ms    1.472ms    2.383ms    2.453ms

=====

Also, I have some feedback from two friends. One does not see the problem. The other does - and his system timer is running at 0.5ms (which is what seems to solve the problem for me). I've sent them the new version of the test program to try.

Update: 2nd friend just sent me his results from the revised test. His system timer is indeed running at 0.5ms, but he sees the problem regardless - except for the 1.5ms and 2ms tests which do show the extra time due to the work. Strange!

geraldstanje commented 7 years ago

it seems that in windows 10 the following command (windows.h) sets the max timer interval to 1ms: timeBeginPeriod(1) therefore winmm.lib needs to be linked. is that right?

johnrs commented 7 years ago

I believe that Go does link to it.

alexbrainman commented 7 years ago

What type of debugging did you have in mind?

I do not have any particular idea. But when things look odd to me (just like something bothering you here), I try to add fmt.Printf statements around code in questions (you would have to use println, because you will be debugging runtime package), and see what it outputs. Sorry I did not spend enough time looking at this issue, so I cannot give anything particular to investigate.

I believe that Go does link to it.

Here https://github.com/golang/go/blob/master/src/runtime/os_windows.go#L282 runtime calls timeBeginPeriod(1) during startup.

Alex

johnrs commented 7 years ago

Here are the results of some runtime testing I did today.

Note: I edited/compressed the results listings to save a little verticle space. In doing so, I left out the time.After() data since it mirrored the time.Sleep data.

Note: You might ask why I included the Windows 0.5ms tests. Go only sets it to 1ms. Because the fastest current program running wins, so if another program is using 0.5ms then that is what all programs will see. Also, it not having one problem, but having another instead is probably important.

=====  Runtime (go1.7.3 windows/amd64) tests using my machine

I tested with Windows timer running at 1ms and 0.5ms.

nanotime() looked good.  It returned the time in 1ms / 0.5ms steps.

cputicks() looked good.  Two adjacent calls were seperated by a count of about 70, 
with a resolution of 2.

tickspersecond() looked good.  It was ~1.855 * 1e9 (which matches my CPU clock).

usleep() seems to have a problem.  When Windows is set to 1ms there is no evidence 
of the work being done, but at 0.5ms there is (except for the 0.5ms test).  I'm not clear 
on exactly how usleep() is supposed to work ("at least" guarantee and n+1 ticks ???), 
but the results shouldn't be inconsistent.

---  time.Now resolution:  1.000ms
                  0          1        0.5ms       1ms       1.5ms       2ms
               -------    -------    -------    -------    -------    -------
  usleep:      0.002ms    0.000ms    1.000ms    1.006ms    2.000ms    2.000ms
  ... + work:  0.332ms    0.308ms    1.000ms    1.000ms    2.004ms    2.012ms

---  time.Now resolution:  0.500ms
                  0          1        0.5ms       1ms       1.5ms       2ms
               -------    -------    -------    -------    -------    -------
  usleep:      0.000ms    0.001ms    0.500ms    1.006ms    1.500ms    2.010ms
  ... + work:  0.339ms    0.296ms    0.515ms    1.357ms    1.951ms    2.475ms

=====

For reference, using the regular test, here is what I see.

=====  My machine - Windows 7, 64-bit

---  time.Now resolution:  1.000ms
                  0          1        0.5ms       1ms       1.5ms       2ms
               -------    -------    -------    -------    -------    -------
  time.Sleep:  0.000ms    1.008ms    1.000ms    1.002ms    2.012ms    2.000ms
  ... + work:  0.278ms    1.000ms    1.006ms    1.002ms    2.040ms    2.046ms

---  time.Now resolution:  0.500ms
                  0          1        0.5ms       1ms       1.5ms       2ms
               -------    -------    -------    -------    -------    -------
  time.Sleep:  0.000ms    1.007ms    1.000ms    1.003ms    2.000ms    2.003ms
  ... + work:  0.365ms    1.482ms    1.327ms    1.318ms    2.455ms    2.520ms

---  Problems

1) No evidence of n+1 ticks.
2) When Windows is 1ms there is no evidence of the work being done.
3) When Windows is 0.5ms the steps are still 1ms.

=====

Here are some results from a few friends' machines.

=====  Brian's home machine - Windows 10, 64-bit

---  time.Now resolution:  0.500ms
                  0          1        0.5ms       1ms       1.5ms       2ms
               -------    -------    -------    -------    -------    -------
  time.Sleep:  0.000ms    1.144ms    1.155ms    1.183ms    2.045ms    2.133ms
  ... + work:  0.261ms    1.159ms    1.155ms    1.138ms    2.214ms    2.262ms

1) No evidence of n+1 ticks.
2) When Windows is 0.5ms there is no evidence of the work being done.
3) When Windows is 0.5ms the steps are still 1ms.

=====  Don's home machine - Windows 7, 32-bit

---  time.Now resolution:  1.000ms
                  0          1        0.5ms       1ms       1.5ms       2ms
               -------    -------    -------    -------    -------    -------
  time.Sleep:  0.000ms    1.038ms    1.004ms    1.046ms    2.050ms    2.010ms
  ... + work:  0.246ms    1.198ms    1.020ms    1.024ms    2.072ms    2.024ms

---  Problems

1) No evidence of n+1 ticks.
2) When Windows is 1ms there is no evidence of the work being done.

=====  Marc's work machine #1 - Windows XP, 32-bit

---  time.Now resolution:  15.625ms
                 0          1        0.5ms       1ms       1.5ms       2ms
              -------    -------    -------    -------    -------    -------
 time.Sleep:  0.000ms    1.938ms    1.969ms    1.938ms    1.969ms    2.938ms
 ... + work:  0.250ms    1.938ms    1.969ms    1.938ms    1.969ms    2.906ms

---  Problems

None.  This matches @rsc and @alexbrainman's good results, including n+1 ticks.

=====  Marc's work machine #2 - Windows 10, 64-bit

--- time.Now resolution:  0.500ms
                 0          1        0.5ms       1ms       1.5ms       2ms
              -------    -------    -------    -------    -------    -------
 time.Sleep:  0.000ms    1.397ms    1.332ms    1.386ms    2.200ms    2.297ms
 ... + work:  0.275ms    1.316ms    1.373ms    1.314ms    2.205ms    2.379ms

---  Problems

1) No evidence of n+1 ticks.
2) When Windows is 0.5ms the steps are still 1ms.

=====  Marc's work machine #3 - Windows 10, 64-bit

---  time.Now resolution:  0.999ms
                 0          1        0.5ms       1ms       1.5ms       2ms
              -------    -------    -------    -------    -------    -------
 time.Sleep:  0.000ms    1.447ms    1.399ms    1.390ms    2.249ms    2.431ms
 ... + work:  0.219ms    1.502ms    1.522ms    1.480ms    2.271ms    2.495ms

---  Problems

1) No evidence of n+1 ticks.

=====

Summary: I don't see a pattern. Unfortunately I wasn't able to get both 1ms and 0.5ms results for all machines, but even so many of the results contradict each other. Of the 3 identified problems, various machines (depending on the Windows timer setting) have 0, 1, 2, or all 3 problems.

Anyone have any ideas of what to do next?

Thanks,

John

alexbrainman commented 7 years ago

When I run the test on my system this is what I get:

Verifications

time.Now resolution: 1.0001ms time.Sleep(1ms): 1.0001ms time.After(1ms): 1.0001ms doWork(~500us): 480.048µs (Adjust as needed)

Tests

work(~500us) + time.Sleep(1ms): 1.030103ms work(~500us) + time.After(1ms): 1.110111ms

@johnrs I do not know what your problem is. But, I suspect, you expect work(~500us) + time.Sleep(1ms) to be sum of time.Sleep(1ms) and doWork(~500us) (so 1.0001ms + 480.048µs = 1.480148ms, instead of 1.030103ms). But that is not how things work here, I think. Imagine you are waiting for a train at a train station. And the train arrives every 1000µs. If you decide to skip 9 trains and catch the 10th train, it will take you about 10000µs. Imaging you decide to change this exercise to have a smoke after each train arrives (let's say smoking will take you 500µs), and then (after smoking) wait for the next train. After the adjustment, it will still take 10000µs before your 10th train arrives.

Once you start playing with intervals close to your clock resolution, unusual things will happen.

Alex

johnrs commented 7 years ago

@alexbrainman

work(~500us) + time.Sleep(1ms): 1.030103ms

So your system has the problem, too.

I started off thinking that this was just a documentation error. But now we know that some machines yield the correct result (something "at least" 1ms + work of ~500us --- which turns out to be 2ms because Go adds a tick to guarantee the "at least").

Even a simple loop just doing a 1ms wait (when the system timer is also 1ms) should take 2ms per pass - like Russ and others see on their machines.

Please read above where Russ states that the doc's are correct. Read where he says: "In order to sleep 1ms, we wait for two ticks to elapse (one is not enough)".

Your example of waiting for a train is flawed. The interval you have to wait for the first train to arrive can be anything from 0 to 1ms. Only after you are in sync with the timer will the additional trains require waiting 1ms each. But as Russ confirmed, to deliver "at least the duration" Go waits for an additional tick.

Using your train example again, this means that you would ignore the next train - regardless of how long (0 - 1ms) it takes to arrive, then you would count the train after it as the "first" train (1 - 2ms).

I see nothing "unusual" about this. It's a bug. Many programming languages don't bother to ensure the minimum wait time, thus exposing you to a potentially serious problem. After all, if you ask for 1ms and end up with something much, much less this can cause faults. Unfortunately, the "extra tick" (which prevents this problem at the cost of a slightly longer wait than requested) doesn't work on some (yours and mine!) Windows systems.

The usleep function in runtime seems to exhibit the same problem. A quick check shows 7 places where usleep is called in runtime expecting between 1 and 1000us to pass. Any of these can fault to an almost immediate return.

This is not a boundary bug. The error is 1 tick, hence is more apparent at or below the boundary. But that doesn't make it unimportant. The 7 cases in runtime prove this. If you had run the revised test I posted you could also see the results at 2ms.

alexbrainman commented 7 years ago

So your system has the problem, too.

If you are referring to the work(~500us) + time.Sleep(1ms): 1.030103ms, that is not from my "system". I was quoting you from https://github.com/golang/go/issues/17696#issuecomment-258053493

Alex

johnrs commented 7 years ago

@rsc I've been trying to figure out where in the code the extra tick period is added. It seems to be in runtime: time.go, timerproc(), where t.period is tested. The code adds a tick (unless delta is already greater than a tick). But I'm confused about a few things.

Who determines what the period is?
Who puts the period into the timer?
What prevents timerproc from adding a period again on the next loop?

I looked inside a running timer. The period seemed to be 0 all the time (perhaps causing the problem I am seeing). But even if I set it (prior to the "when" time) it didn't seem to change anything. So I must be misunderstanding something.

Side question: I'm curious why wait to add the extra tick at the end of the duration instead of doing it at the beginning? It seems to me that doing it at the beginning is extremely easy and would save a considerable amount of processing. Perhaps I'm missing something?

usleep: From what I gather usleep uses a totally different mechanism. I see 7 places in runtime where it is called for waits of 1ms or less. Is it a problem if the wait is actually much less than requested?

ianlancetaylor commented 7 years ago

@johnrs My apologies if I misunderstand. As far as I know, the Go runtime does not add any extra tick period. Why would it? When t.period > 0, it sets the time that the timer should fire next to t.period adjusted by the time since the timer fired last.

t.period is set by runtime.NewTicker, which builds a *timer and passes it to the runtime package.

johnrs commented 7 years ago

@ianlancetaylor The doc's imply that it does. Russ said that it did. Some systems perform as if it does. I see some code which seems to implement it partially. Why should it? If you do then worst case is your interval is 1 tick too long. If not then worst case is your interval is near 0.

Thanks for the explanation of t.period. That explains why it would continue to keep adding time. I was totally focused on just the +1 tick and I was thinking this was it, but not so. :)

Regardless of whether there should be +1 tick or not, I'm seeing too many inconsistent results. At a bare minimum I think that it's fair to say that the doc's don't agree with about half of what I'm seeing. I can't help but think that this is a problem.

ianlancetaylor commented 7 years ago

Russ didn't say that the time package adds an extra tick. He said that the package lets a timer sleep until the time is past the point where it should wake up, and that the tick granularity of the system may cause that to require an additional tick. The effect is that a timer always sleep too long and never sleeps for too short a time. Nothing explicitly adds a tick. Perhaps that is what you mean.

I agree that the results you are seeing on WIndows look wrong. It seems that nobody else is seeing incorrect results. I don't know why.

johnrs commented 7 years ago

@alexbrainman

If you are referring to the work(~500us) + time.Sleep(1ms): 1.030103ms, that is not from my "system". I was quoting you from #17696 (comment)

Oops! I stand corrected.

johnrs commented 7 years ago

@ianlancetaylor I would disagree. What Russ said was: "In order to sleep 1ms, we wait for two ticks to elapse (one is not enough), ". There's no "may" require an extra tick. How else can you achieve the desired result (without spinning)? All I can think of is by extending "when" by a tick or having a pre/post que which results in a 1 tick delay. The former is much easier than the latter.

I am not the only one seeing incorrect results. 5 out of the 8 Windows machines reported here so far are failing.

ianlancetaylor commented 7 years ago

I don't understand what you are disagreeing with.

I'll repeat: the package lets a timer sleep until the time is past the point where it should wake up. After the first tick, we check the time. We see that 1ms has not yet elapsed. We go back to sleep. After the second tick, we check the time. We see that 1ms has elapsed. We send a value on the channel or call the function. There is no extending, there is no pre/post queue.

It sounds like there is a bug when running on Windows, but I don't know what it is. I would look at the systime function in runtime/os_windows.go. If that is somehow inconsistent then the algorithm will fail. I note in particular that your program is printing time.Now, which is based on systime(_SYSTEM_TIME), but the timer algorithm described above is based on systime(_INTERRUPT_TIME). Any discrepancy between those, such as one somehow advancing out of sync with the other, could produce the results you are seeing.

johnrs commented 7 years ago

I started to reply but in doing so I reconsidered some things and that lead me to few discoveries. I've verified some items and will be working on the rest tomorrow, I hope. There were multiple problems (some in my testing procedure, some not) and clearing some of them has made it much easier to focus on what's left.

The most common Windows "1ms" system tick is actually 1.0001ms. Rsc's and brainman's systems use 976.5us. This difference explains some of the descrepencies in the results. When tested with 1ms, the faster tick (976.5us) systems produced steady, 2x results. A 1.0001ms system produces either 1x, 2x, or a mix of 1x and 2x results.

The mix case is affected by anything which changes the latency of critical paths in either Windows or Go. Switching the system tick to 0.5ms is one of these factors. I expect that there is also some skew between the system resources being updated. Thus affecting the relationship between when the change occurs for nanotime versus when timerproc is run.

Finally, a few time I saw my system tick (averaged over time) at 1.0002ms, instead of 1.0001ms. So perhaps the actual rate is somewhere between, or there is some dithering taking place to make a non-integer division work.

I redid testing using 0.9ms and 1.1ms targets and the results were consistent and much less noisy. I've posted the revised test at https://play.golang.org/p/EItTKKM4SS for anyone who cares to try it.

There is no way to avoid such a boundary condition, but you can move it away from multiples of the system tick. Especially given the 1.0001ms vs 1ms (common usage) situation in Windows, this seems like a good idea to me. The fix is in runtine:time.go, timperproc.

Change this one line from:
    if delta > 0 {
to:
    if delta >= 0 {

Back to your message...

After the first tick, we check the time. We see that 1ms has not yet elapsed.

By definition, after a tick, 1.0001ms (or whatever a given system uses) has elapsed. The same system tick that wakes up timerproc also advances nanotime. It does not matter when during the interval the sleep/after was started since the nanotime is constant for the entire interval, except close to the timerproc boundaries.

So "when" can now equal nanotime, hence "delta" ("when" - nanotime) is 0.

For example: At 900us into a tick you call Sleep(1ms). "When" is set to nanotime + 1ms and you sleep. Then only 100us later the 1ms system tick takes place causing nanotime to advanced by 1ms and timerproc to run. When your "when" is checked the "delta" is 0, hence you timer is finished.

Side note: "When" is not a multiple of the tick interval. It's nanotime (which is) plus whatever duration the user requested (which isn't).

I note in particular that your program is printing time.Now, which is based on systime(_SYSTEM_TIME), but the timer algorithm described above is based on systime(_INTERRUPT_TIME). Any discrepancy between those, such as one somehow advancing out of sync with the other, could produce the results you are seeing.

True, but from what I read the only time that they don't tick together is when a user changes the system time. My NTP is set to adjust the clock once an hour. Regardless, as a sanity check, I used a modified runtime to access nanotime and I saw the same results. Indeed, comparing nanotime to the "when" field and seeing that it wasn't constant (as I was naively expecting) is what made me look into the boundary effects.

Retesting with 0.9ms sleep plus 0.25ms of work still takes place in 1ms. I believe this is proof that sleep intervals are not "at least the duration d". So this problem remains. I realize that the test I have now doesn't show this as clearly as I would like.

I do have another test which does show this more clearly, and doesn't invlove looping to get a measurable (with Go's functions) result. Instead, it uses the media timer which I have found to have 550ns resolution with about 300ns overhead on my system. So I trust it to about 1us - which is plenty good enough to see fractions of a 1ms interval.

Here's what I see.

    cntr    work     sleep     total
    ----   ------    ------    ------
       0 :    1us     983us     984us
     100 :  116us     899us    1015us
     200 :  235us     758us     994us
     300 :  265us     603us     868us
     400 :  398us     695us    1093us
     500 :  480us     533us    1013us
     600 :  552us     552us    1104us
     700 :  624us     472us    1096us
     800 :  861us     145us    1006us
     900 :  774us     125us     900us
    1000 :  827us      58us     885us

There is an outer loop which varies the amount of work, which is followed by a Sleep(900us) call. Each of the lines is a single execution, so they are noisy. Even so, note that the total column is faily constant. The amount of time slept decreases (well below 1ms) as the amount of work done increases. Each line is prefaced by a call to Sleep just to get in sync with the system tick. This is why as more work is done, less time remains in the tick. The Sleep only lasts for the remaining fraction of the tick.

Tomorrow I'll add a simpler test. What I have in mind is when the user hits a key then I'll do a Sleep(900us) and time it with the media timer. According to the doc's and what your and Russ say, this should take longer than 900us (actually, longer than a 1ms tick), right? I'm expecting it won't. This is no better proof that what I posted above, but it is simpler and easier to understand. I'll post some results along with the code, but it will only run on a Windows system because of the use of the media timer.

Question: If I am right about this, might not the same failure apply to other platforms? What is the timer resolution on a typical Linux system, for example? Perhaps it is so small (1us?) that getting a lesser sleep time isn't important. The only reason I ask is that the problem/fix might not be in Windows-specific code. :)

johnrs commented 7 years ago

I've posted another type of test program which shows time.Sleep not waiting "at least the duration of d". This program is Windows-specific due to the use of a Windows high speed timer to directly measure the time that Sleep actually sleeps (plus overhead to block/unblock).

https://play.golang.org/p/VDvUW-A_e5

@rsc @brainmen I hope that you are able to run this test so we can see if the problem is limited to my system or is generic to Windows (and possibly more).

alexbrainman commented 7 years ago

I hope that you are able to run this test

I did. Your program outputs this:

C:\>u:\test

Checks

  Timer resolution:    776ns
  time.Now resolution: 15.6249ms
  time.Sleep(1):       1.937487ms
  time.After(1):       1.968737ms
  Work(~1000us):       1094us

Automatic Tests

  time.Sleep(900us)                 time.After(900us)
     work    sleep    total            work    sleep    total
    ------   ------   ------          ------   ------   ------
       1us    456us    458us             1us    606us    607us
     162us    518us    680us           162us    787us    950us
     277us    510us    787us           274us    461us    736us
     384us    510us    894us           384us    461us    846us
     492us    390us    883us           492us    306us    799us
     599us    315us    915us           599us    328us    927us
     707us    450us   1158us           707us    384us   1092us
     814us    436us   1250us           814us    384us   1198us
     923us    518us   1441us           922us    506us   1428us
    1051us    381us   1432us          1027us    306us   1333us
    1141us    313us   1454us          1134us    526us   1661us

Manual Test

  Hit Enter for a Sleep(900us), Ctrl-C to exit:
C:\>

on one of my computers.

alexbrainman commented 7 years ago

I note in particular that your program is printing time.Now, which is based on systime(_SYSTEM_TIME), but the timer algorithm described above is based on systime(_INTERRUPT_TIME). Any discrepancy between those, such as one somehow advancing out of sync with the other, could produce the results you are seeing.

I exported nanotime and unixnano from runtime. If I run this program https://play.golang.org/p/9fA6SFEtl5 I see this output:

C:\>u:\test
[2135276082031300 1479105200813412200]
[2135276083007900 1479105200813412200]  976600  0
[2135276083984400 1479105200813412200]  976500  0
[2135276084961000 1479105200813412200]  976600  0
[2135276085937500 1479105200813412200]  976500  0
[2135276086914100 1479105200813412200]  976600  0
[2135276087890700 1479105200813412200]  976600  0
[2135276088867200 1479105200813412200]  976500  0
[2135276089843800 1479105200813412200]  976600  0
[2135276090820400 1479105200813412200]  976600  0
[2135276091796900 1479105200813412200]  976500  0
[2135276092773500 1479105200813412200]  976600  0
[2135276093750000 1479105200829037100]  976500  15624900
[2135276094726600 1479105200829037100]  976600  0
[2135276095703200 1479105200829037100]  976600  0
[2135276096679700 1479105200829037100]  976500  0
[2135276097656300 1479105200829037100]  976600  0
[2135276098632900 1479105200829037100]  976600  0
[2135276099609400 1479105200829037100]  976500  0
[2135276100586000 1479105200829037100]  976600  0
[2135276101562500 1479105200829037100]  976500  0
[2135276102539100 1479105200829037100]  976600  0
[2135276103515700 1479105200829037100]  976600  0
[2135276104492200 1479105200829037100]  976500  0
[2135276105468800 1479105200829037100]  976600  0
[2135276106445400 1479105200829037100]  976600  0
[2135276107421900 1479105200829037100]  976500  0
[2135276108398500 1479105200829037100]  976600  0
[2135276108398500 1479105200844662000]  0       15624900
[2135276109375000 1479105200844662000]  976500  0
[2135276110351600 1479105200844662000]  976600  0
[2135276111328200 1479105200844662000]  976600  0
[2135276112304700 1479105200844662000]  976500  0
[2135276113281300 1479105200844662000]  976600  0
[2135276114257900 1479105200844662000]  976600  0
[2135276115234400 1479105200844662000]  976500  0
[2135276116211000 1479105200844662000]  976600  0
[2135276117187500 1479105200844662000]  976500  0
[2135276118164100 1479105200844662000]  976600  0
[2135276119140700 1479105200844662000]  976600  0
[2135276120117200 1479105200844662000]  976500  0
[2135276121093800 1479105200844662000]  976600  0
[2135276122070400 1479105200844662000]  976600  0
[2135276123046900 1479105200844662000]  976500  0
[2135276124023500 1479105200844662000]  976600  0
[2135276125000000 1479105200860286900]  976500  15624900
[2135276125976600 1479105200860286900]  976600  0
[2135276126953200 1479105200860286900]  976600  0
[2135276127929700 1479105200860286900]  976500  0
[2135276128906300 1479105200860286900]  976600  0

C:\>

This is on my Windows XP.

Alex

johnrs commented 7 years ago

I'm guessing that this is a different machine or a different version of Go than the one you used last time? The time.Now resolution changed from 976.5us to 15.6249ms.

I don't understand how you saw ~1.9ms for the "1" (that's 1ns, so result should be shortest possible) checks, but during the tests, which use 900us the result was about 300 - 600us. How can a Sleep(1ns) take longer than a Sleep(900us)?

I note in particular that your program is printing time.Now, which is based on systime(_SYSTEM_TIME), but the timer algorithm described above is based on systime(_INTERRUPT_TIME). Any discrepancy between those, such as one somehow advancing out of sync with the other, could produce the results you are seeing.

I believe that I already addressed this and it applied to the first test program. The second test program uses the Windows media timer for the test times. I revised it to also use it for the checks at the beginning, but these are averages over 500 loops so are less critical.

New version here: https://play.golang.org/p/Gvf_ewUcgJ

I exported nanotime and unixnano from runtime. If I run this program https://play.golang.org/p/9fA6SFEtl5 I see this output:

OK, I did likewise. Here's what I see:

[462556264925700 1479114904698077800]
[462556265925700 1479114904699077900]   1000000 1000100
[462556266925800 1479114904700078000]   1000100 1000100
[462556266925800 1479114904701078100]   0       1000100
[462556267925900 1479114904701078100]   1000100 0
[462556268925900 1479114904702078200]   1000000 1000100
[462556269926000 1479114904703078300]   1000100 1000100
[462556269926000 1479114904704078400]   0       1000100
[462556270926000 1479114904704078400]   1000000 0
[462556271926100 1479114904705078500]   1000100 1000100
[462556271926100 1479114904706078600]   0       1000100
[462556272926100 1479114904706078600]   1000000 0
[462556273926200 1479114904707078700]   1000100 1000100
[462556274926300 1479114904708078800]   1000100 1000100
[462556274926300 1479114904709078900]   0       1000100
[462556275926300 1479114904709078900]   1000000 0
[462556275926300 1479114904710079000]   0       1000100
[462556276926400 1479114904710079000]   1000100 0
[462556277926400 1479114904711079100]   1000000 1000100
[462556278926500 1479114904712079200]   1000100 1000100
[462556279926500 1479114904713079300]   1000000 1000100
[462556280926600 1479114904714079400]   1000100 1000100
[462556280926600 1479114904715079500]   0       1000100
[462556281926700 1479114904715079500]   1000100 0
[462556281926700 1479114904716079600]   0       1000100
[462556282926700 1479114904716079600]   1000000 0
[462556282926700 1479114904717079700]   0       1000100
[462556283926800 1479114904717079700]   1000100 0
[462556283926800 1479114904718079800]   0       1000100
[462556284926800 1479114904718079800]   1000000 0
[462556284926800 1479114904719079900]   0       1000100
[462556285926900 1479114904719079900]   1000100 0
[462556285926900 1479114904720080000]   0       1000100
[462556286926900 1479114904720080000]   1000000 0
[462556286926900 1479114904721080100]   0       1000100
[462556287927000 1479114904721080100]   1000100 0
[462556287927000 1479114904722080200]   0       1000100
[462556288927100 1479114904722080200]   1000100 0
[462556289927100 1479114904723080300]   1000000 1000100
[462556290927200 1479114904724080400]   1000100 1000100
[462556290927200 1479114904725080500]   0       1000100
[462556291927200 1479114904725080500]   1000000 0
[462556291927200 1479114904726080600]   0       1000100
[462556292927300 1479114904726080600]   1000100 0
[462556292927300 1479114904727080700]   0       1000100
[462556293927400 1479114904727080700]   1000100 0
[462556293927400 1479114904728080800]   0       1000100
[462556294927400 1479114904728080800]   1000000 0
[462556295927500 1479114904729080900]   1000100 1000100
[462556296927500 1479114904730081000]   1000000 1000100

This is on the same WIndows 7, 64-bit, I've been using for my tests. Mine tick at the same rate.

While I do find the difference in how our systems work very interesting, I don't think that this invalidates either of my test programs. The first does use Time.Now but each test is 500 loops. So the accuracy is reduced from 1 part in 500 to 16 parts in 500. No big deal. The second test uses the Windows media timer and it's resolution is in the neighborhood of 1us, which is fine for measuring 100 - 1000us events.

John

alexbrainman commented 7 years ago

I'm guessing that this is a different machine or a different version of Go than the one you used last time?

I have few different computers. I do not keep track of what is happening on this issue, because I still do not understand what is the problem that we are solving here. Sorry.

Alex

johnrs commented 7 years ago

An example of the problem: On a Windows machine I do a time.Sleep(900 * time.Microsecond). I expect at least a 900us sleep. Sometimes I get much less, like 100us, instead.

alexbrainman commented 7 years ago

An example of the problem: On a Windows machine I do a time.Sleep(900 * time.Microsecond). I expect at least a 900us sleep. Sometimes I get much less, like 100us, instead.

I could be wrong, but I believe what you describe is possible.

lets say clock changes every 15ms;
lets say clock changed 14.5 ms ago and it was set to 1000ms at the time;
you start your your sleep(900us);
the sleep routine gets the clock (the 1000ms), and sets to wait for 1000.9ms;
sleep routine happened to check time again after 600us, and the clock will be set to 1015ms, so it will return from sleep (only after 600us)

Alex

johnrs commented 7 years ago

I don't entirely agree with description in your example, but yes, the conditions do produce the failure mode.

lets say clock changes every 15ms;

OK.

lets say clock changed 14.5 ms ago and it was set to 1000ms at the time;

OK. And it is still set to the same time since there hasn't been a clock tick.

you start your your sleep(900us);

OK. Note that the problem can happen regardless of the clock speed.

the sleep routine gets the clock (the 1000ms), and sets to wait for 1000.9ms;

OK. It gets the current time (nanotime) and adds the duration (900us) to it. Thus 1000ms + 900us ==> 1000.9ms is stored in the timer's "when" field. Sleep then adds the timer to the timer list and blocks the goroutine. It then waits until the the clock equals or exceeds the "when" time (1000.9ms).

sleep routine happened to check time again after 600us, and the clock will be set to 1015ms, so it will return from sleep (only after 600us)

Actually timerproc will run 500us later when the clock ticks. Yes, it will then see that nanotime is past the "when" time, so it ends the sleep. The goroutine is unblocked.

So now that we agree that it can/does happen, which is incorrect: The code or the doc's?

Russ was clear that the doc's are correct, so then the fault must lie with the code. And since this portion of the code is generic (not Windows specific) I suspect that problem occurs on non-Windows systems too. Perhaps it so small that it's hard to notice. I'd love to see the results of running the current version of the first test program (https://play.golang.org/p/EItTKKM4SS) on some non-Windows systems.

I have a couple of ideas about a code fix, but I'll save those till we are all in agreement that the code does need to be fixed. :)

alexbrainman commented 7 years ago

Perhaps it so small that it's hard to notice.

That is my sentiment exactly. I don't think it is worth changing documentation or the code.

Alex

johnrs commented 7 years ago

No, you took that quote out of context. It referred to non-Windows systems, and I was thinking 1us or less.

On Windows, with a 1ms clock tick, a 1ms sleep can be over 10x too short. It caused a problem in one of my programs - a very nasty problem.

The doc's go out of their way to say this should not happen. Implemented properly, the worst error would be 2x too long. So I have to think that whoever wrote the doc's understood and cared.

ianlancetaylor commented 7 years ago

Given an attempt to sleep for less time than the precision of the clock being used, the only possible option that ensures that we sleep at least that long is to extend the sleep to the precision of the clock. Currently we assume that the clock is sufficiently precise, as is generally true on Unix systems.

Perhaps we could change the Windows code to use QueryPerformanceCounter. That would give us a higher precision clock source in Windows.

johnrs commented 7 years ago

It would be interesting to test a variety of Unix systems to see what the precision is. I believe the only result posted here was by rakyll, for darwin. It appeared to be about 250us, even though the ToD clock was about 48ns.

A fix to the present code wouldn't be all that bad, I think. The first though I had was to add the tick duration to all sleep durations. The only problem is determining the tick duration and allowing for it changing (which seems possible). The other method I thought of is to add a "min" field to the timer struct. It could be used to enforce a 2 tick minimum. This solves the problem at the boundary but does not lengthen longer sleeps (which is fine by me).

Of course using a more precise tick, like QueryPerformanceCounter on Windows, would be a dream! It ticks at 550ns on my system. So it would improve precision by about 2000 times! But can it be done? It doesn't seem to provide an asynchronous call. So it would require some interfacing. Also, the overhead time it takes to read it might be a concern. I once measured time.Now taking 13ns, and QPC at 301ns.

alexbrainman commented 7 years ago

Perhaps we could change the Windows code to use QueryPerformanceCounter. That would give us a higher precision clock source in Windows.

I had trouble using QueryPerformanceCounter in the past. I do not remember details, but, if someone is interested, they can read comments in issue #6007 and https://codereview.appspot.com/108700045.

If someone wants to change runtime to use QueryPerformanceCounter, they are welcome.

Alex

rsc commented 7 years ago

@johnrs, if you can check out the latest master branch of Go and try your original test program again, I suspect you are going to see better results than before. As part of #12914, the computation d = time.Now().Sub(t1) in your program should now be at the same accuracy as the sleep itself, whereas before it might have been at a lower resolution. If that's the case, then the problem in the original program was the measurement, not the sleep. And now perhaps the measurement is fixed.

Worth a try at least.

Thanks. Russ

johnrs commented 7 years ago

@rsc OK, will do.

bradfitz commented 7 years ago

Timed out. Assuming this is fixed with the monotonic time changes.

Let me know if not.

johnrs commented 7 years ago

Hi Brad. I'll take a look at it later today.

johnrs commented 7 years ago

@bradfitz I am still seeing the problem (or at least part of it) on my Windows machine. I should have a more complete report in a few hours.

johnrs commented 7 years ago

@bradfitz - I am still seeing the basic problem on my machine. Here's a sample output. ` time.Now() resolution: 1.0001ms time.Sleep(1) resolution: 1.0041ms time.After(1) resolution: 1.0001ms Work runtime (~0.2510ms): 0.2520ms

 Sync Test         0          1      0.9037ms   1.1045ms
------------   --------   --------   --------   --------
   Sleep       0.0000ms   1.0001ms   1.0001ms   2.0002ms
Work + Sleep   0.2780ms   1.0001ms   1.0021ms   2.0042ms
   After       0.0020ms   1.0001ms   1.0001ms   2.0002ms
Work + After   0.3040ms   1.0021ms   1.0021ms   2.0022ms

` The Sleep and After (without Work) are fine. But when Work is added to the loop both the "1" and the "0.9ms" columns should be showing 2ms, not 1ms. You can't Sleep for 1ms (the minimum Sleep interval) and work for 0.25ms with the result being less than the sum of the two.

I've uploaded the code at https://play.golang.org/p/6CMV59o7QF. There aren't any strange dependencies or anything which is Windows specific.

I also ran tests using the Windows-specific Media timer. This gives me resolution of about 1us. So it's very easy to see exactly how long things take. A non-synchronized (not in a loop) Sleep(1) takes anywhere from about 50us to 1ms. If you then call a second Sleep it will take about 1ms because you are calling just after the system clock ticked. This is why a Sleep of 1.1ms, in a loop, takes 2ms (except perhaps for the very first one). If anyone wants the code I can upload it.

I've made the test periods 90% and 110% of the Sleep(1) time. This should show the problem on Windows systems which use something other than the common 1ms clock. So @rsc and @alexbrainman should be able to see the problem on their Windows systems now, too.

I have no idea whether the problem also exists on Linux or other systems. I would love to see the results of any such test runs. I think that the code should scale down to a 10us clock. Faster than that I don't know.

Even if what I say turns out to be correct, why should it be changed? Because the doc's clearly say otherwise. I'm fine with changing the doc's instead, but I do think that changing the code is the safer way to go. Why? Well, when I last looked I found 7 usleep calls in runtime (mgcwork.go and proc.go) which were 1ms or less. My previous testing showed that usleep had the same problem. Thus they could result in MUCH smaller delays than expected.

A result MUCH shorter than requested can be a serious problem, especially since it probably isn't deterministic (not synchronized to the system clock). Think about a timeout which happens 10 times faster than requested, for example. The fix would, at worst case (which is at the 1ms border) would be a 2ms wait. This would also make port code between platforms safer. The slower time will probably cause fewer problems and will be easier to troubleshoot, rather than a way-too-fast, intermittent timer.

johnrs commented 7 years ago

Side Note - I don't know if this is related or not. When I drop the Windows system clock to 0.5ms then run my test program I get this (top part)L

time.Now() resolution:     0.5001ms
time.Sleep(1) resolution:  1.0002ms
time.After(1) resolution:  1.0102ms

It seems odd to me that Now tracks the system clock but Sleep and After don't.

ianlancetaylor commented 7 years ago

Thanks for the latest test program. You are making an indirect argument based on the precision of the system clock: you are measuring the resolution of time.Sleep and asserting that if you work for some fraction of that resolution and then Sleep for that resolution you must move on to the next time period. But that assumption only holds if the timing of clock ticks is consistent across all threads in the program.

I don't know about Windows, but on GNU/Linux there is no attempt to produce consistent time ticks across all cores and threads. There is no reason for that: the only important goal is that a request to sleep for a given amount of time sleeps for at least that time. And I repeat that your program doesn't show that that request is violated, it only makes an indirect argument suggesting that it is violated.

On GNU/Linux, I see very inconsistent results in the "Work + " lines. If I change the program to add runtime.LockOSThread() as the first statement in main, they become much more consistent. That call makes it more likely to see consistent thread ticks in the program--though even then it is not proven, as a thread may migrate between cores.

I would be interested in hearing your results on Windows if you call runtime.LockOSThread.

johnrs commented 7 years ago

@ianlancetaylor - Thank you for your insightful comments. Yes, using a time source to make observations about it own use is odd. My goal was to demonstrate the problem without relying on external methods. That why I chose to call this the "Synchronous Test". Also, this runs on non-Windows platforms too, I hope.

As of Go v1.7 (I haven't looked at v1.8), on Windows I believe that the runtime supplies the Now/Sleep/After time in an interesting way. It doesn't use a Windows call to get the time. Instead, it finds out where the time is stored in memory, then does a direct read of those memory locations. Since this is a multi-word read without synchronization it uses the Microsoft (they allowed for this) recommended method. I don't know if Windows has separate such locations per thread, but I don't believe that it matters since Go's runtime (for Windows) seems to only use one of them. I might well be wrong about this however, since I do have some trouble following the runtime code in places.

I would love to see your GNU/Linux results! Could you post them or email them to me?

I added the runtime.LockOSThread call as you recommended. I see no change in the results.

time.Now() resolution:     1.0001ms
time.Sleep(1) resolution:  1.0001ms
time.After(1) resolution:  1.0001ms
Work runtime (~0.2500ms):  0.2500ms

 Sync Test         0          1      0.9001ms   0.9901ms   1.0101ms   1.1001ms
------------   --------   --------   --------   --------   --------   --------
   Sleep       0.0000ms   1.0021ms   1.0001ms   1.0021ms   2.0002ms   2.0002ms
Work + Sleep   0.2780ms   1.0161ms   1.0421ms   1.0561ms   2.0262ms   2.0182ms
   After       0.0020ms   1.0461ms   1.0001ms   1.0001ms   2.0002ms   2.0002ms
Work + After   0.2820ms   1.0381ms   1.0401ms   1.0201ms   2.0242ms   2.0582ms

Also, besides the 90% and 110% test points I added 99% and 101%. I wanted to make it more certain that the Work (25%) + Sleep (99%) case clearly exceeded the Sleep(1) (100%) time. Please note that the results are identical regardless of Work being added or not, except for the "0" case since it doesn't align with the system clock.

I should add that this isn't a problem seen on just my machine. I've seen it on about 6 other Windows systems, running different OSs (XP, W7, W10), and on both 32 and 64-bit systems. The only Windows systems which passed the test - initially - had clocks faster than 1ms and the test didn't allow for that at the time.

OK, now here's the direct evidence. Its more straightforward. :)

I have another test program (ASync) which uses the Windows Media Timer. On my system its resolution is 550ns. Infrequently, due to scheduling I suppose, I see twice that. I loosely say that it's good to about 1us. So there's plenty of resolution to look at a 1ms clock. I have added the runtime.LockOSThread call to this test, too.

After the initial Checks, there are two sets of tests (complicated and simple). Here are the Checks and the Automatic Tests (Manual Test to follow) results.

Checks

   Timer tick:     549ns
   time.Now tick:  1.0001ms
  time.Sleep(1):   1.0081ms    1.007926ms
  time.After(1):   1.0001ms    1.000047ms
  Work(~1000us):   830us       831us

Automatic Tests

  time.Sleep(900us)                 time.After(900us)
     work    sleep    total            work    sleep    total
    ------   ------   ------          ------   ------   ------
       0us   1068us   1069us             1us   1005us   1006us
     126us    921us   1048us           194us    779us    973us
     194us    972us   1167us           487us    556us   1043us
     349us    711us   1060us           305us    725us   1031us
     337us    712us   1050us           376us    645us   1022us
     426us    637us   1063us           493us    553us   1046us
    1189us    741us   1930us          1266us    702us   1969us
    1544us    520us   2064us          1460us    549us   2009us
    1682us    250us   1932us          1677us    421us   2099us
    1916us   1107us   3024us          1949us   1016us   2966us
    2159us   1006us   3165us          2023us   1010us   3033us

These tests are one-shot, no looping and averaging. The work time gradually increases. The sleep time is always 900us (which should result in 1ms). The 3 times for each test were measured individually using the Media Timer. Notice that "work + sleep" time almost exactly matches the "total" time.

Now for the Manual Test, the simple one. :)

Manual Test

  Hit Enter for a Sleep(900us), Ctrl-C to exit:
     566us
     413us
      67us
     458us
     952us
     408us
     901us
     470us
     272us
     394us
     562us
     743us
     403us
     903us
     775us
     616us
     404us
     490us
    1051us
     217us
     381us
     121us
     267us
     601us
     639us

There nothing indirect about these results. This shows how long a single Sleep(900us) takes. The measurement is triggered by hitting Enter on the keyboard (not sync'ed to the system clock).

I have checked to ensure that the Media Center timer is operating correctly. All of the tests I ran comparing it to the normal Windows clock gave the correct results. This code will only run on Windows, of course. If you would like a copy I'll be glad to post it (after I clean it up a bit).

There you have it. A Sleep(900us) with a system clock of 1ms should sleep for at least 1ms. From the time doc: "Sleep pauses the current goroutine for at least the duration d." Ditto for After (via NewTimer). Plus/minus some jitter in my system, it doesn't.

alexbrainman commented 7 years ago

But when Work is added to the loop both the "1" and the "0.9ms" columns should be showing 2ms, not 1ms. You can't Sleep for 1ms (the minimum Sleep interval) and work for 0.25ms with the result being less than the sum of the two.

Sure you can.:

1) imaging that your clock cannot display fractions of 1ms; 2) imaging that your clock shows 10ms, but the actuall time is 10.5ms; 3) you call time.Now() and it will return 10ms; 4) you call work(0.25ms); 5) when work(0.25ms) returns, the actual time is 10.75ms; 6) you call time.Sleep(1ms); 7) time.Sleep(1ms) calls time.Now() to work out how long to sleep, time.Now() will return 10ms, time.Sleep will add the 1ms to it, and will go to sleep until 11ms; 8) once time gets to 11ms, time.Sleep(1ms) returns; 9) you call time.Now(), and it returns 11ms; 10) you subtract time.Now() value at 3) from 9), and you get 1ms.

I think we are going around the circle here. We already had this conversation at https://github.com/golang/go/issues/17696#issuecomment-259887709

Alex

golang / go