Timer overflow issues? - Githubissues

orithena commented 6 years ago

I found a reason for lockup after letting sled run for over an hour on a 32-bit linux machine. Looks like a timer/counter overflow issue to me, because sled locks up (a.k.a. sleeps for waay too long) after roughly an hour (running gfx_candyflow with out_sdl2), which is consistent with 2^32 microseconds (= 4292s = 71.5m = 1.19h).

Here is the full gdb backtrace full output. I did not dive into the timer code, so I do not fully understand what is going on here, but tnow, sleeptime and desired_usec looks wrong to me.

#0  0xb7fd9cf9 in __kernel_vsyscall ()
No symbol table info available.
#1  0xb7e5221d in ___newselect_nocancel () at ../sysdeps/unix/syscall-template.S:84
No locals.
#2  0x0804d1d5 in oscore_event_wait_until (ev=0x805c498, desired_usec=4294960122) at src/os/os_unix.c:39
        tnow = 1803
        sleeptime = 4294958319
        oei = 0x805c498
        timeout = {tv_sec = 2845, tv_usec = 623426}
        set = {__fds_bits = {8, 0 <repeats 31 times>}}
#3  0x0804af6a in wait_until_core (desired_usec=4294960122) at src/timers.c:40
No locals.
#4  0xb7fd101b in wait_until (desired_usec=4294960122) at src/modules/out_sdl2.c:121
        tnow = 4294958964
        sleeptimems = 1
        ev = {type = 0, common = {type = 0, timestamp = 0}, window = {type = 0, timestamp = 0, windowID = 134534271, event = 0 '\000', 
            padding1 = 0 '\000', padding2 = 128 '\200', padding3 = 63 '?', data1 = 1065353216, data2 = 0}, key = {type = 0, timestamp = 0, 
            windowID = 134534271, state = 0 '\000', repeat = 0 '\000', padding2 = 128 '\200', padding3 = 63 '?', keysym = {
              scancode = 1065353216, sym = 0, mod = 17472, unused = 0}}, edit = {type = 0, timestamp = 0, windowID = 134534271, 
            text = "\000\000\200?\000\000\200?\000\000\000\000@D\005\b\000\000\000\000\000\000\005\b\370\355\377\277\243\262\004\b", 
            start = 134562880, length = 134545840}, text = {type = 0, timestamp = 0, windowID = 134534271, 
            text = "\000\000\200?\000\000\200?\000\000\000\000@D\005\b\000\000\000\000\000\000\005\b\370\355\377\277\243\262\004\b"}, 
          motion = {type = 0, timestamp = 0, windowID = 134534271, which = 1065353216, state = 1065353216, x = 0, y = 134562880, xrel = 0, 
            yrel = 134545408}, button = {type = 0, timestamp = 0, windowID = 134534271, which = 1065353216, button = 0 '\000', 
            state = 0 '\000', clicks = 128 '\200', padding1 = 63 '?', x = 0, y = 134562880}, wheel = {type = 0, timestamp = 0, 
            windowID = 134534271, which = 1065353216, x = 1065353216, y = 0, direction = 134562880}, jaxis = {type = 0, timestamp = 0, 
            which = 134534271, axis = 0 '\000', padding1 = 0 '\000', padding2 = 128 '\200', padding3 = 63 '?', value = 0, padding4 = 16256}, 
          jball = {type = 0, timestamp = 0, which = 134534271, ball = 0 '\000', padding1 = 0 '\000', padding2 = 128 '\200', 
            padding3 = 63 '?', xrel = 0, yrel = 16256}, jhat = {type = 0, timestamp = 0, which = 134534271, hat = 0 '\000', value = 0 '\000', 
            padding1 = 128 '\200', padding2 = 63 '?'}, jbutton = {type = 0, timestamp = 0, which = 134534271, button = 0 '\000', 
            state = 0 '\000', padding1 = 128 '\200', padding2 = 63 '?'}, jdevice = {type = 0, timestamp = 0, which = 134534271}, caxis = {
            type = 0, timestamp = 0, which = 134534271, axis = 0 '\000', padding1 = 0 '\000', padding2 = 128 '\200', padding3 = 63 '?', 
            value = 0, padding4 = 16256}, cbutton = {type = 0, timestamp = 0, which = 134534271, button = 0 '\000', state = 0 '\000', 
            padding1 = 128 '\200', padding2 = 63 '?'}, cdevice = {type = 0, timestamp = 0, which = 134534271}, adevice = {type = 0, 
            timestamp = 0, which = 134534271, iscapture = 0 '\000', padding1 = 0 '\000', padding2 = 128 '\200', padding3 = 63 '?'}, quit = {
            type = 0, timestamp = 0}, user = {type = 0, timestamp = 0, windowID = 134534271, code = 1065353216, data1 = 0x3f800000, 
            data2 = 0x0}, syswm = {type = 0, timestamp = 0, msg = 0x804d47f <oscore_mutex_unlock+12>}, tfinger = {type = 0, timestamp = 0, 
            touchId = 4575657221542958207, fingerId = 1065353216, x = 4.01034591e-34, y = 0, dx = 4.00232317e-34, dy = -1.99944973, 
            pressure = 3.99322916e-34}, mgesture = {type = 0, timestamp = 0, touchId = 4575657221542958207, dTheta = 1, dDist = 0, 
            x = 4.01034591e-34, y = 0, numFingers = 0, padding = 2053}, dgesture = {type = 0, timestamp = 0, touchId = 4575657221542958207, 
            gestureId = 1065353216, numFingers = 134562880, error = 0, x = 4.00232317e-34, y = -1.99944973}, drop = {type = 0, timestamp = 0, 
            file = 0x804d47f <oscore_mutex_unlock+12> "\005\201+", windowID = 1065353216}, 
          padding = "\000\000\000\000\000\000\000\000\177\324\004\b\000\000\200?\000\000\200?\000\000\000\000@D\005\b\000\000\000\000\000\000\005\b\370\355\377\277\243\262\004\b@D\005\b\260\001\005\b\000\000\000"}
#5  0x0804affe in wait_until (desired_usec=4294960122) at src/timers.c:55
No locals.
#6  0x0804a9a2 in sled_main (argc=0, argv=0xbffff168) at src/main.c:254
        tnext = {moduleno = 1, time = 4294960122, argc = 0, argv = 0x0}
        ch = -1
        outmod = "sdl2", '\000' <repeats 251 times>
        filternames = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x10000004 <error: Cannot access memory at address 0x10000004>, 
          0xb7ff2029 <calloc+73> "\203\304\020[^_\303VS\350\362'", 0x756e694c <error: Cannot access memory at address 0x756e694c>, 
          0x78 <error: Cannot access memory at address 0x78>, 0x0, 0x0, 0x0, 0x0, 0x0, 
          0xb7fe3eb0 <check_match+304> "\203\304\020\205\300\213T$\f\017\205v\377\377\377넍\264&", 0xb7fd9241 "LINUX_2.6", 
          0xb7ecf298 "LINUX_2.6", 0x0, 0x0, 0x0, 0x0, 0x0, 0x3ae75f6 <error: Cannot access memory at address 0x3ae75f6>, 
          0x75717300 <error: Cannot access memory at address 0x75717300>, 0x6b6165 <error: Cannot access memory at address 0x6b6165>, 
          0xb7fe3d89 <check_match+9> "\201\303w\262\001", 0xb7fd9128 "\031\243Cn\213*\306&", 
          0x7 <error: Cannot access memory at address 0x7>, 0xb7fffc08 "", 0x6e43a318 <error: Cannot access memory at address 0x6e43a318>}
        filterargs = {0xb7fe45a9 <do_lookup_x+1689> "\203\304 \205\300\017\205\233\373\377\377\213\003또\215\264&", 0x0, 0x0, 
          0xb7fd91a0 "\001", 0x7 <error: Cannot access memory at address 0x7>, 0xb7fd91c0 "", 0xb7fffc08 "", 
          0xbfffef6c "\230`\375\267.N=\366\251E\376\267\001", 0xbfffef68 "\356\b", 0x1 <error: Cannot access memory at address 0x1>, 0x0, 
          0xb7fff000 "<?\002", 0xb7ecf2a2 "__vdso_clock_gettime", 0xb7fd91a0 "\001", 
          0xb7fe3eb0 <check_match+304> "\203\304\020\205\300\213T$\f\017\205v\377\377\377넍\264&", 0xb7d8334c "GLIBC_2.0", 
          0x8049484 "GLIBC_2.0", 0x3721d18 <error: Cannot access memory at address 0x3721d18>, 0xbfffef68 "\356\b", 0xbfffeff8 "", 
          0x7 <error: Cannot access memory at address 0x7>, 0x0, 0xd696910 <error: Cannot access memory at address 0xd696910>, 0x0, 0x0, 
          0xb7fe3d89 <check_match+9> "\201\303w\262\001", 
          0xb7d73d24 "/N=\366\370\304-\327\317\030L\017$\301\324\361\370ԏӄ\"\233|\205\"\233|8\307\031u\354\373\300=\376\001\304\022\261\"\225\303\311BY\020\334\317쵶w\035\rG\336\315%\265V1\375\307r1\035\a;\372L\b~\222\034\215\t)\020\004\\H\261ԡ\034\240\070\265\357\060\002\352\331\017j\335\371{9\265\357\060\030\034s\354X?\227|T\200\314sقc\002;H\205\033\062v\340բ\230\313\362\250\247K\341\066\rf\375\326\036h\233\275\234#\217\274\350e\235\234\002Y1\v\264\006\337J\032\223\250P\265\250\020()\031΅)%~\016|\030\271\321\070\a\221\222\376\206\357\360I\265$\247:Vӂ\241", <incomplete sequence \344>..., 0x8ee <error: Cannot access memory at address 0x8ee>, 0xb7fd6098 "", 
          0xf63d4e2e <error: Cannot access memory at address 0xf63d4e2e>, 
          0xb7fe45a9 <do_lookup_x+1689> "\203\304 \205\300\017\205\233\373\377\377\213\003또\215\264&", 
          0x1 <error: Cannot access memory at address 0x1>}
        filterno = 0
        outarg = 0x0
        ret = 0
        filters = 0x0
        outmodno = 0
        lastmod = 1
#7  0x0804d094 in main (argc=1, argv=0xbffff164) at src/os/os_unix.c:17
No locals.
(gdb)

The timer was set in the module using

        frame++;
        nexttick += FRAMETIME;
        timer_add(nexttick, modno, 0, NULL);

vifino commented 6 years ago

Hmm. Yeah, sounds like an overflow. unsigned long just isn't big enough..

We could upgrade to unsigned long long or use milliseconds? I am not sure. Probably the former, but the latter would have benefits as well.

orithena commented 6 years ago

@vifino I'm not quite sure, whether it can be done, but I found that timer code can often be written in a way that the overflow is a non-issue because of the way the arithmetics are used. (No time to look into it currently, sorry.)

vifino commented 6 years ago

Hmm. Yeah, that'd be something to do.

It would certainly "fix" it longer-term.

orithena commented 6 years ago

@vifino I just looked into it and the solution is harder than anticipated.

First of all, the nexttick += frametime in each gfx module method creates the problem that nexttick always lags behind udate() if the frame calculation time inside the module exceeds the frametime.

Then, if udate() overflows when it's already past desired_usec while sled is working in the output module, sleeptime becomes waaay too big.

One solution is to check for (tnow < 0x50000000 && desired_usec > 0xA0000000) everytime tnow is compared to desired_usec (i.e. in sled_main, every os module and every output module).

Another would be to change timers into keeping starttime and delta instead of stoptime; having starttime and delta in the comparison allows for wrap-around arithmetics. Still, this does not solve the problem of modules lagging behind with nexttick. We'd need a function for that to be used by modules, so no module keeps it's own timer variable.

The method of changing all timer code to milliseconds would mean that there is a point every 49 days where sled stops working on unix. Also, it does not solve the problem of modules lagging behind with nexttick.

Upgrading timers to uint64_t seems to be the easiest method without having to think about wrap-arounds. The problem is just to make sure that it's being used throughout each and every source file. And, of course, the problem that sled will stop working at some point in the year 586512 would still be there. ^^

orithena commented 6 years ago

oh, wait. Not in the year 586512, because time_t is an int32_t (which means that sled would become susceptible to the year 2038 problem) or int64_t (which would make it a year 240000-something problem).

orithena commented 6 years ago

I created the branch timertest to debug timing. Checking for (tnow < 0x50000000 && desired_usec > 0xA0000000) is implemented there, but only in out_sdl2 and os_unix.

This branch includes a modified oscore_udate() in os_unix so that it starts out on the verge of overflowing. Change the offset in there to accomodate your machine's speed and create the situation where an output of tnow is very small while the desired_usec next to it is very big. This is the situation to debug!

orithena commented 6 years ago

Testing reveals: Current code in branch timertest is not hardened enough iff a module lags behind real time with its timer requests. It blocks in this situation:

.   out_sdl2                                        now:ffff7686    desired:8363012a
    main    tnext.time=8363012a wait_until=ffff7686
.   out_sdl2                                        now:ffff930c    desired:836307ac
    main    tnext.time=836307ac wait_until=ffff930c
.   out_sdl2                                        now:ffffb00a    desired:83630e2e
    main    tnext.time=83630e2e wait_until=ffffb00a
.   out_sdl2                                        now:ffffcc84    desired:836314b0
    main    tnext.time=836314b0 wait_until=ffffcc84
.   out_sdl2                                        now:ffffe90b    desired:83631b32
    main    tnext.time=83631b32 wait_until=ffffe90b
.   out_sdl2                                        now:0000057e    desired:836321b4
    out_sdl2                                        sleeptime:0x83631c36 us = dez: 2204310 ms
    out_sdl2                                        now:2be2a858    desired:836321b4
    out_sdl2                                        sleeptime:0x5780795c us = dez: 1468037 ms
    out_sdl2                                        now:2be2a88d    desired:836321b4
    out_sdl2                                        sleeptime:0x57807927 us = dez: 1468037 ms

The point here is now:0000057e desired:836321b4, i.e. udate() flows over and the module requests a timer to stop at some point x with 0x50000000 < x < 0xA0000000.

A solution would be to make sure that no module wants to request a timer with a stop time < udate() (i.e. this means it needs to update its internal nexttick counter to udate() too!!!)

vifino commented 6 years ago

Replacing it with a udate() + frametime would be okay.

A small fix would be to introduce a time type. unsigned long long or something like that. Would certainly make the time a bit longer before it hangs..

vifino commented 6 years ago

Hey @orithena, you ever got things working better?

orithena commented 6 years ago

Not conclusively. I haven't had the time yet.

I think we should encapsulate all timer related stuff in timers.h so that gfx-modules only need to call a function instead of maintaining the exact instant when they want to be called again. But I don't have a good idea on how to implement that yet.

vifino commented 5 years ago

Hey @orithena, you got any ideas at this point? :)

vifino commented 5 years ago

Some ideas:

Make the type bigger. Postpone fix. ;)
Make the waiting in oscore relative by default, make the timers relative, make a wrapper for absolute things.
Make a wrapper that handles the missed frame gracefully by having a really small stub decide whether to run or not to run for that frame.

vifino commented 5 years ago

Fixed* by #93.

shinyblink / sled

Timer overflow issues? #35