Closed orithena closed 5 years ago
Hmm. Yeah, sounds like an overflow. unsigned long just isn't big enough..
We could upgrade to unsigned long long or use milliseconds?
I am not sure. Probably the former, but the latter would have benefits as well.
@vifino I'm not quite sure whether it can be done, but I've found that timer code can often be written so that the overflow is a non-issue because of the way the arithmetic is used. (No time to look into it currently, sorry.)
Hmm. Yeah, that'd be something to do.
It would certainly "fix" it longer-term.
@vifino I just looked into it and the solution is harder than anticipated.
First of all, the nexttick += frametime in each gfx module method creates the problem that nexttick always lags behind udate() if the frame calculation time inside the module exceeds the frametime.
Then, if udate() overflows when it's already past desired_usec while sled is working in the output module, sleeptime becomes waaay too big.
One solution is to check for (tnow < 0x50000000 && desired_usec > 0xA0000000) every time tnow is compared to desired_usec (i.e. in sled_main, every os module and every output module).
Another would be to change timers into keeping starttime and delta instead of stoptime; having starttime and delta in the comparison allows for wrap-around arithmetic. Still, this does not solve the problem of modules lagging behind with nexttick. We'd need a function for that to be used by modules, so no module keeps its own timer variable.
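A minimal sketch of the starttime-plus-delta comparison, assuming 32-bit microsecond timestamps. The name timer_expired and its parameters are made up for illustration and are not sled's actual API. Unsigned subtraction wraps modulo 2^32, so the elapsed time comes out right even if the clock overflowed after the timer was armed, as long as delta is shorter than one full wrap (about 71 minutes).

```c
#include <stdint.h>

/* Hypothetical helper, not part of sled: returns 1 once at least `delta`
 * microseconds have passed since `start`. `now - start` is taken modulo
 * 2^32, so a clock wrap between `start` and `now` does not matter. */
static int timer_expired(uint32_t now, uint32_t start, uint32_t delta)
{
    uint32_t elapsed = (uint32_t)(now - start);
    return elapsed >= delta;
}
```

For example, with start = 0xffffe90b and now = 0x0000057e (the clock wrapped in between), elapsed is 0x1c73, i.e. 7283 us, so a 1000 us timer counts as expired and a 10000 us timer does not.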
The method of changing all timer code to milliseconds would mean that there is a point every 49 days where sled stops working on unix. Also, it does not solve the problem of modules lagging behind with nexttick.
Upgrading timers to uint64_t seems to be the easiest method without having to think about wrap-arounds. The problem is just to make sure that it's being used throughout each and every source file. And, of course, the problem that sled will stop working at some point in the year 586512 would still be there. ^^
oh, wait. Not in the year 586512, because time_t is an int32_t (which means that sled would become susceptible to the year 2038 problem) or int64_t (which would make it a year 240000-something problem).
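The wrap periods mentioned in this thread can be checked with a few lines of integer arithmetic. This is only a sketch of the counter arithmetic, assuming a 365.25-day year and a counter that starts at the 1970 epoch; it says nothing about the time_t limits above, and the function names are made up for illustration.

```c
#include <stdint.h>

#define USEC_PER_SEC 1000000ULL
#define MSEC_PER_SEC 1000ULL
#define SEC_PER_DAY  86400ULL
#define SEC_PER_YEAR 31557600ULL /* assumption: 365.25 days per year */

/* Whole minutes until a 32-bit microsecond counter wraps. */
static uint64_t wrap_minutes_u32_usec(void)
{
    return UINT32_MAX / USEC_PER_SEC / 60;
}

/* Whole days until a 32-bit millisecond counter wraps. */
static uint64_t wrap_days_u32_msec(void)
{
    return UINT32_MAX / MSEC_PER_SEC / SEC_PER_DAY;
}

/* Calendar year in which a 64-bit microsecond counter started in 1970 wraps. */
static uint64_t wrap_year_u64_usec(void)
{
    return 1970 + UINT64_MAX / USEC_PER_SEC / SEC_PER_YEAR;
}
```

Under these assumptions the three functions return 71, 49 and 586512, which matches the "roughly an hour", "every 49 days" and "year 586512" figures.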
I created the branch timertest to debug timing. Checking for (tnow < 0x50000000 && desired_usec > 0xA0000000) is implemented there, but only in out_sdl2 and os_unix.
This branch includes a modified oscore_udate() in os_unix so that it starts out on the verge of overflowing. Change the offset in there to accommodate your machine's speed and create the situation where an output of tnow is very small while the desired_usec next to it is very big. This is the situation to debug!
Testing reveals: Current code in branch timertest is not hardened enough iff a module lags behind real time with its timer requests. It blocks in this situation:
. out_sdl2 now:ffff7686 desired:8363012a
main tnext.time=8363012a wait_until=ffff7686
. out_sdl2 now:ffff930c desired:836307ac
main tnext.time=836307ac wait_until=ffff930c
. out_sdl2 now:ffffb00a desired:83630e2e
main tnext.time=83630e2e wait_until=ffffb00a
. out_sdl2 now:ffffcc84 desired:836314b0
main tnext.time=836314b0 wait_until=ffffcc84
. out_sdl2 now:ffffe90b desired:83631b32
main tnext.time=83631b32 wait_until=ffffe90b
. out_sdl2 now:0000057e desired:836321b4
out_sdl2 sleeptime:0x83631c36 us = dez: 2204310 ms
out_sdl2 now:2be2a858 desired:836321b4
out_sdl2 sleeptime:0x5780795c us = dez: 1468037 ms
out_sdl2 now:2be2a88d desired:836321b4
out_sdl2 sleeptime:0x57807927 us = dez: 1468037 ms
The point here is now:0000057e desired:836321b4, i.e. udate() overflows and the module requests a timer to stop at some point x with 0x50000000 < x < 0xA0000000.
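To see why the check misses this case, here is a small sketch that replays the logged values. wrap_check is the condition quoted earlier in this thread. sleeptime_usec is an assumption: a plain unsigned subtraction of now from desired, which is what the logged sleeptime values are consistent with, not necessarily the exact code in out_sdl2.

```c
#include <stdint.h>

/* The wrap check as quoted in this thread: the clock has just wrapped
 * (tnow is small) while the target still lies before the wrap (large). */
static int wrap_check(uint32_t tnow, uint32_t desired_usec)
{
    return tnow < 0x50000000u && desired_usec > 0xA0000000u;
}

/* Hypothetical stand-in for the sleep computation: unsigned subtraction,
 * taken modulo 2^32. */
static uint32_t sleeptime_usec(uint32_t tnow, uint32_t desired_usec)
{
    return (uint32_t)(desired_usec - tnow);
}
```

With now = 0x0000057e and desired = 0x836321b4, wrap_check returns 0 because desired is not above 0xA0000000, and sleeptime_usec returns 0x83631c36 (about 2204310 ms), the same value as in the log.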
A solution would be to make sure that no module requests a timer with a stop time < udate() (i.e. it needs to update its internal nexttick counter to udate() too!!!). Replacing it with udate() + frametime would be okay.
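A rough sketch of that resync, assuming 32-bit microsecond timestamps. next_tick is a made-up helper, not something that exists in sled. It advances by one frame as before, and falls back to now + frametime only when the new target is no longer in the future. "No longer in the future" is decided on the wrapped difference, so it only holds while a module lags by less than half the counter range (about 35 minutes).

```c
#include <stdint.h>

/* Hypothetical helper, not part of sled. Advances a module's tick by one
 * frame; if that target is not in the future any more, resyncs it to
 * now + frametime. The target counts as "not in the future" when
 * now - candidate, taken modulo 2^32, lands in the lower half of the
 * range, which stays correct across a clock wrap. */
static uint32_t next_tick(uint32_t nexttick, uint32_t now, uint32_t frametime)
{
    uint32_t candidate = (uint32_t)(nexttick + frametime);
    uint32_t behind_by = (uint32_t)(now - candidate);
    if (behind_by < 0x80000000u)
        candidate = (uint32_t)(now + frametime);
    return candidate;
}
```

Fed with the values from the log (nexttick = 0x83631b32, frametime = 0x682, now = 0x0000057e), this returns 0x00000c00 instead of 0x836321b4, so the resulting sleep is one frame instead of roughly 36 minutes.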
A small fix would be to introduce a time type. unsigned long long or something like that. Would certainly make the time a bit longer before it hangs..
Hey @orithena, did you ever get things working better?
Not conclusively. I haven't had the time yet.
I think we should encapsulate all timer related stuff in timers.h so that gfx-modules only need to call a function instead of maintaining the exact instant when they want to be called again. But I don't have a good idea on how to implement that yet.
Hey @orithena, you got any ideas at this point? :)
Some ideas:
Fixed* by #93.
I found a reason for a lockup after letting sled run for over an hour on a 32-bit linux machine. Looks like a timer/counter overflow issue to me, because sled locks up (a.k.a. sleeps for waay too long) after roughly an hour (running gfx_candyflow with out_sdl2), which is consistent with 2^32 microseconds (= 4295s = 71.6m = 1.19h).
Here is the full gdb backtrace full output. I did not dive into the timer code, so I do not fully understand what is going on here, but tnow, sleeptime and desired_usec look wrong to me.
The timer was set in the module using