twisted / twisted

Event-driven networking engine written in Python.
https://twisted.org
Other
5.58k stars 1.17k forks source link

callLater(0) does not guarantee relative ordering of calls #6718

Open twisted-trac opened 18 years ago

twisted-trac commented 18 years ago
jyknight's avatar @jyknight reported
Trac ID trac#1396
Type enhancement
Created 2006-01-02 11:22:35Z
Branch https://github.com/twisted/twisted/tree/callLater-1396-3

Attachments:

Searchable metadata ``` trac-id__1396 1396 type__enhancement enhancement reporter__jknight jknight priority__low low milestone__ branch__branches_callLater_1396_3 branches/callLater-1396-3 branch_author__ status__new new resolution__ component__core core keywords__ time__1136200955000000 1136200955000000 changetime__1423321370009693 1423321370009693 version__ owner__ cc__glyph cc__exarkun cc__jknight cc__zooko ```
twisted-trac commented 16 years ago
therve's avatar @therve set owner to @therve
twisted-trac commented 13 years ago
Automation's avatar Automation removed owner
twisted-trac commented 9 years ago
adiroiban's avatar @adiroiban commented

As as #7776 I am cleaning the code of todo tests

looks like there is not much progress on this

for reference, here is the removed text


    def testCallLaterOrder(self):
        l = []
        l2 = []
        def f(x):
            l.append(x)
        def f2(x):
            l2.append(x)
        def done():
            self.assertEqual(l, range(20))
        def done2():
            self.assertEqual(l2, range(10))

        for n in range(10):
            reactor.callLater(0, f, n)
        for n in range(10):
            reactor.callLater(0, f, n+10)
            reactor.callLater(0.1, f2, n)

        reactor.callLater(0, done)
        reactor.callLater(0.1, done2)
        d = Deferred()
        reactor.callLater(0.2, d.callback, None)
        return d

    testCallLaterOrder.todo = "See bug 1396"
    testCallLaterOrder.skip = "Trial bug, todo doesn't work! See bug 1397"
    def testCallLaterOrder2(self):
        # This time destroy the clock resolution so that it fails reliably
        # even on systems that don't have a crappy clock resolution.

        def seconds():
            return int(time.time())

        base_original = base.seconds
        runtime_original = runtime.seconds
        base.seconds = seconds
        runtime.seconds = seconds

        def cleanup(x):
            runtime.seconds = runtime_original
            base.seconds = base_original
            return x
        return maybeDeferred(self.testCallLaterOrder).addBoth(cleanup)

    testCallLaterOrder2.todo = "See bug 1396"
    testCallLaterOrder2.skip = "Trial bug, todo doesn't work! See bug 1397"
twisted-trac commented 18 years ago
jyknight's avatar @jyknight commented
#!html
<pre>
Ideally, we should use a timing source that isn't adjustable, but I don't think 
we have one. callLater is certainly not callAtTime. It is, however, a call 
queuing function. Ordering should most definitely be deterministic.

Having the user set system time should not cause all timeouts to fire 
immediately. However, that's a separate issue. The only part I'm concerned with 
is that the clock moving backwards should also not violate ordering.
</pre>
twisted-trac commented 18 years ago
jyknight's avatar @jyknight commented
#!html
<pre>
Or more generally, callLater(N, A); callLater(N, B) should *always* execute A 
before B.
</pre>
twisted-trac commented 18 years ago
glyph's avatar @glyph commented
#!html
<pre>
I disagree.  callLater is not callAtExactly - call queueing should be done with
other mechanisms, and perhaps triggered by callLater.  As I mentioned
previously, we don't deal with the user setting the system clock so at worst
calling this a "guarantee" is incorrect.  Also, even providing the naive version
of this guarantee may preclude implementing more a more efficient callLater at
some point.
</pre>
twisted-trac commented 18 years ago
glyph's avatar @glyph commented
#!html
<pre>
See #1398 for more information.
</pre>
twisted-trac commented 18 years ago
glyph's avatar @glyph commented
#!html
<pre>
I think that the bug here is that the documentation does not include the title
of this issue.  Guaranteeing relative ordering of calls is not realistically
possible for a timed-call mechanism.  Doing it more often is probably good, so
I'm not saying that changing the win32er implementation to work around its
super-low-precision timer is a bad idea, but saying we provide a guarantee is a
ton of work.

Consider a system with a lot of clock wobble that syncs the time every hour from
NTP; there are LOTS of opportunities for time() to go backwards in time, and
it's not clear what the right thing to do is, unless time is always measured
from the beginning of the last reactor tick - and that's bad because events
often do take non trivial amounts of wall-clock time to process.

People who need deterministic ordering should NOT use callLater.  That's simply
not what it's for.
</pre>
twisted-trac commented 18 years ago
exarkun's avatar @exarkun commented
#!html
<pre>
clock_gettime(CLOCK_MONOTONIC) is supposed to be non-adjustable.

I don't think we should use it though.  I think the current behavior of
callLater() is perfectly reasonable.  The documentation should be improved, but
the code should be left alone for now.
</pre>
twisted-trac commented 18 years ago
jyknight's avatar @jyknight commented
#!html
<pre>
Your "consider" example is flawed. That is not how NTP is used. It instead adjusts 
a slew factor, causing time to go faster or slower than the hardware clock. That's 
not to say that the clock will never be changed, of course.

Guaranteeing relative ordering *is* realistically possible, unless you see a 
reason why the code on my branch should not work.
</pre>
twisted-trac commented 18 years ago
exarkun's avatar @exarkun commented
#!html
<pre>
It's relatively common to have NTP used incorrectly, where ntpdate is run once
an hour or once a day, causing the system clock to jump.
</pre>
twisted-trac commented 18 years ago
glyph's avatar @glyph commented
#!html
<pre>
That's why I said "a lot of clock wobble".  From the NTP documentation: "If
ntpdate determines the clock is in error more than 0.5 second it will simply
step the time by calling the system settimeofday() routine. If the error is less
than 0.5 seconds, it will slew the time by calling the system adjtime() routine.
The latter technique is less disruptive and more accurate when the error is
small, and works quite well when ntpdate is run by cron every hour or two."

The code on your branch doesn't work because in order to achieve a relatively
subtle change in semantics for callLater(N,...) callLater(N,...) which is of
dubious utility (the issue that spawned this discussion has since been resolved
by other means, and I think, to warner's satisfaction) you want to change the
semantics of Platform.seconds() to mean something almost completely different
than what it currently does.  Currently, if a Twisted process is started and
then the clock is changed, callLater calls started before the time is adjusted
might be delayed for an unreasonable amount of time, or run sooner than they
expected to be.  With your proposed change, if a Twisted process is started and
the clock is changed, some of those behaviors will likely be fixed (although I
note that in the absence of an external point of reference, your fix is only
heuristically correct), but application code has no opportunity to make the time
correct on its own, since platform.seconds now be reporting a random value. (In
other words, "you have no chance to survive, or make your time (correct)."  Hee
hee.)

The application code might be calling time.time() itself in this case, but why
should it?  Twisted's Platform.seconds is advertised as more portable, and is
(was) supposed to return a semantically equivalent value.

Perhaps the right thing to do here is to apply the time-going-backwards
detection from this patch and simply to provide a hook for the reactor to alert
application code that it may have gone backwards in time?  Being set forwards is
something that application code can detect on its own, if it cares about
real-time scheduling.
</pre>
twisted-trac commented 18 years ago
glyph's avatar @glyph commented
#!html
<pre>
Ugh.  Please forget I used the word "real-time" there.  I mean "accurate
wall-clock".
</pre>
twisted-trac commented 18 years ago
jyknight's avatar @jyknight commented
#!html
<pre>
Yes, ntpdate does that, but that generally runs once at system boot time. The 
ntp daemon does not do that. I've never seen a system that installs a cron job 
with ntpdate rather than ntp, but maybe I've not looked hard enough.

Okay, so instead, I'd like to introduce platform.realSeconds(), which returns a 
value which may or may not have anything to do with the epoch, and which is 
guaranteed monotonic and strictly increasing. The reactor timers should use 
this. It should ideally use clock_gettime if available, otherwise use the 
heuristics on top of gettimeofday. 

callLater *should not* be affected by real time changes, if it's possible to 
avoid, as it has nothing to do with real time, but rather elapsed time 
intervals. That it may be affected by real time is an implementation deficiency, 
not a feature.

And finally: before I broke it, callLater *did* guarantee ordering. There's 
really no reason to think it should not behave the way warner expected it to, 
and if both that and the misbehavior upon time-setting can be fixed with one 
change, all the better.
</pre>
twisted-trac commented 18 years ago
glyph's avatar @glyph commented
#!html
<pre>
> I'd like to introduce platform.realSeconds()

Hmm.  OK, that doesn't sound like a bad plan.  Naming's important though, and
I'd like to be careful to avoid putting "real" anywhere near the word "time",
especially since it's got a heuristic that makes it not-real in many cases.  Can
we call it "monotonicSeconds" or something?  (Perhaps we should also rename the
other one 'secondsFromEpoch' and deprecate just 'seconds', to stress the
distinction.)

Re-jigger your branch and I'll re-review?

That reminds me; if we're going to use clock_gettime and add our own time
interface, we might want to add a primitive interface to the reactor that uses
integer milliseconds.  I still occasionally get grief from DirectX developers
that they have to force the FPU back into double precision mode so that Twisted
will work :)
</pre>
twisted-trac commented 18 years ago
exarkun's avatar @exarkun commented
#!html
<pre>
&lt;http://scienceworld.wolfram.com/physics/ProperTime.html>

Regarding milliseconds, clock_gettime returns a timespec, which consists of a
time_t representing seconds and a long representing nanoseconds.  Resolution is
variable, but queryable.  On my system (x86, P4, Linux 2.6.12) it is 0.999848
milliseconds.
</pre>
twisted-trac commented 18 years ago
jyknight's avatar @jyknight commented
#!html
<pre>
Branch seems good, please review. 

I changed the reactor-internal APIs to use integer nanoseconds rather than 
floats, as well. One somewhat suspect bit of API is DelayedCall.getTime(), which 
cannot return a truly accurate trigger time, since DelayedCalls are defined in 
terms of an increment of time, rather than an absolute time.

Of additional note is that I changed distutils.build_extension to allow failures 
to build C modules, and not abort the build process.
</pre>
twisted-trac commented 18 years ago
jyknight's avatar @jyknight commented
#!html
<pre>
Oh, I guess I should mention:

svn diff -r15485:HEAD svn://svn.twistedmatrix.com/svn/Twisted/branches/
callLater-1396
</pre>
twisted-trac commented 18 years ago
exarkun's avatar @exarkun commented
#!html
<pre>
A few trivial points to start off (this does not represent a comprehensive review):

twisted/python/runtime.py: _posix_clock is being imported by a relative name. 
It should be imported as "twisted.python._posix_clock" instead.

staticmethod() was being used on Platform to prevent the creation of bound
methods.  While time.time is a builtin function, and builtin functions aren't
descriptors, I am pretty sure these are both implementation details.  It'd be
nice to at least preserve the level of safety that was in this code before.

1000000000 and 10000000 could stand to be lifted out and replaced with symbolic
constants.  Was 0.01s selected for any particular reason (ie, does some common
operation result in adjustments by increments slightly smaller or greater than
that), or does it just work well in practice?

twisted/internet/{base,task}.py, twisted/test/{test_task,test_internet}.py: 
Again, 10 ** 9 and  should be a symbolic constant, preferably one imported from
someplace else.

twisted/python/_posix_clock.c: We don't have a coding standard for C written
down anyplace, but if we did I imagine it would say things about trailing
whitespace and tabs. ;)  Also, I would hope it would also recommend {} around
all blocks, even where they are not necessary.

Branch mostly looks good, but I doubt my head has fully absorbed the various
implications yet, particularly wrt DelayedCall.
</pre>
twisted-trac commented 18 years ago
jyknight's avatar @jyknight commented
#!html
<pre>
As I understand it, best practice is for C modules to be imported with a 
relative name, so that they can be moved to the toplevel when building a zip or 
what have you without any code changes.

1000000000 could be replaced with a variable like "TENTOTHENINTH" or 
"NANOSECONDSPERSECOND" but that's pretty much just what it means..the interface 
is defined to return nanoseconds, not 'symbolic-constant's-per-second.

0.01s is a value I just made up. The constraints are that it must be larger than 
the most number of times you can call monotonicTicks without the system clock 
increasing, and must be smaller than you "care" to lose if the clock moves 
backwards.
</pre>
twisted-trac commented 17 years ago
zooko's avatar @zooko commented

So what needs to be done to make this branch ready to be reviewed to be merged into trunk?

twisted-trac commented 17 years ago
exarkun's avatar @exarkun commented

Merged this forward to callLater-1396-2 and resolved the copious conflicts (most of which made little sense, someone should write a better merge algorithm).

_posix_clock.c also no longer compiles for me :/ _XOPEN_REALTIME doesn't seem to be defined. The man page suggests _POSIX_TIMERS and _POSIX_MONOTONIC_CLOCK might be more appropriate to check for, but I haven't dug very deeply and I'm sure there are plenty of subtle differences between these (doing a probe is probably a better idea though).

twisted-trac commented 17 years ago
jyknight's avatar @jyknight commented

I think unistd.h might need to be included first.

If not, can you do a: getconf _XOPEN_REALTIME in a shell?

And if that returns 1, grep all your header files for _XOPEN_REALTIME being defined?

twisted-trac commented 17 years ago
exarkun's avatar @exarkun commented

getconf returns 1, _XOPEN_REALTIME is defined as 1 in posix_opt.h, which is included by unistd.h.

twisted-trac commented 16 years ago
therve's avatar @therve commented

(In [22511]) Branching to 'callLater-1396-3'

twisted-trac commented 16 years ago
zooko's avatar @zooko commented

Here's a version of this patch against current trunk.

twisted-trac commented 16 years ago
zooko's avatar @zooko commented

Ah, and here is a version that actually builds on my Mac OS X system.

twisted-trac commented 16 years ago
zooko's avatar @zooko commented

I have a fellow who is running a VirtualBox vm, and his system clock jumps around crazily, so that buildbot "times out" his jobs, thinking that more than 1200 seconds have passed:

http://buildbot.darcs.net/buildbot-darcs/builders/karel%20opensolaris%20OpenSolaris-2008.05%20nevada-86-rc3%20i386/builds/5/steps/compile/logs/stdio

So I asked him to install this patch, and now buildbot no longer does that:

http://buildbot.darcs.net/buildbot-darcs/waterfall

twisted-trac commented 16 years ago
glyph's avatar @glyph commented

Replying to zooko:

I have a fellow who is running a VirtualBox vm, and his system clock jumps around crazily, so that buildbot "times out" his jobs, thinking that more than 1200 seconds have passed:

So I asked him to install this patch, and now buildbot no longer does that:

This sets off alarm bells for me. Why would this patch stop timing out his jobs? This doesn't sound like "hooray we fixed a bug", it sounds like "oh no, we made certain timed events stop firing when they should".

twisted-trac commented 16 years ago
zooko's avatar @zooko commented

The jobs were timing out after only a few seconds, because the gettimeofday() had jumped more than 1200 seconds. That's my theory anyway. Make sense?

I think this might be hooray we fixed the bug.

By the way, along the way I found some discussions on the libevent mailing list about situations which called for CLOCK_MONOTONIC:

http://monkeymail.org/archives/libevent-users/2006-October.txt

http://monkeymail.org/archives/libevent-users/2007-July.txt

twisted-trac commented 16 years ago
zooko's avatar @zooko commented

Oh, my informant informs me that the 1200-second timeouts are still going off spuriously (long before 1200 actual seconds have passed). He is checking to see whether the build of Twisted actually built the posix_clock module...