tenox7 / ttyplot

a realtime plotting utility for terminal/console with data input from stdin
Apache License 2.0

Document the decision on <=1s timeout with `select` (follow-up to #145) #146

Closed hartwork closed 7 months ago

hartwork commented 7 months ago

Regression from #145

@edgar-bonet this will not be popular, but please hear me out:

Our source, the system clock, updates at least once a second, while the current code sleeps for up to one second. Combined, these two mean the user may see skips of seconds in the clock display, e.g. jumping from :03 straight to :05, or the clock taking longer than one second to update, i.e. running too slowly. Sleeping for a full second will never be right, and sleeping for 999e3 microseconds will not be either, without really knowing how much time drawing takes in the worst case. I should have caught that during review, I'm sorry I didn't. Sleeping for 500ms was not only simple, it also didn't have this problem, and it gives the CPU quite some air; it did not have a true problem that needed fixing. In the interest of both correctness and simplicity I would like to revert to where we were before; let's admit to ourselves that we took things too far. Please consider supporting this move. Thank you!

edgar-bonet commented 7 months ago

I don't see where there could be a problem. Could you please try to provide a detailed scenario where the clock display would jump from :03 to :05?

hartwork commented 7 months ago

@edgar-bonet I'm not sure I can construct a case for the skipping (it may not be possible), but sleeping for a full second is not right: we are running slower than the source then, and a round trip from select to select is not free. The user may not be able to see it, but it's wrong at the core, mathematically. Let us at least replace the `microseconds_remaining / microseconds_per_second` by 0 to be less wrong.

edgar-bonet commented 7 months ago

Sorry, I still cannot see anything wrong. It looks to me like you are not comfortable with the idea of “current iteration time”, and so you call it “wrong”.

I suggest you really try to build a scenario where the clock misbehaves. If you succeed, then we fix it. If you fail, the mere act of trying is likely to give you some insight, so the effort would not be wasted.

Now, one reason I dislike this PR is that the current code is all structured around the concept of event loop, and this PR goes against that concept. The core idea of the event loop is that we let the process sleep until there is an event that needs to be handled. “The clock display is out of date” is a valid event, which warrants waking up the process. We don't just wake up regularly “just in case”. In fact, the very first recommendation of the Select tutorial is:

You should always try to use select() without a timeout. Your program should have nothing to do if there is no data available. Code that depends on timeouts is not usually portable and is difficult to debug.

Obviously, we cannot apply this recommendation verbatim – we depend on timeouts to detect the “clock display is out of date” event. But we can nonetheless follow its spirit and sleep until there is an actual event to process.

hartwork commented 7 months ago

@edgar-bonet this will be an unusual reply.

Let me first clear up something, and then move in your direction. Sounds good? Let's go!

My (I believe) strongest argument about the status quo after #145 is that sleeping for a full second is, in the general case, too much if we want to produce a clock that "runs at the same speed" as the system clock, which "appears to change value once per second at the resolution of interest". If we allow a sleep of one full second, the same rendering may sit there for too long, and we would have to guarantee that we never sleep that long at "the wrong moment" if our mission were a most accurate display. With regard to skipping ":04" between ":03" and ":05": if we sleep at "the right moment" we could sleep up to nearly 2 seconds and not miss the ":04" beat, but we would then be late displaying ":04" by almost a second. So skipping is not a concern when one round trip from select to select takes very little time, but delay in general is. Which brings me to the second half of the reply:

Little time. I did some measurements in the meantime, of two kinds, and as a result, somewhat by accident, looked at this problem from a different angle.

I measured the round-trip time from select to select and it was even lower than I expected: even with signal handling and drawing combined, I never saw more than 25 milliseconds total per round, usually less. That's the first observation.

I measured another thing: use of CPU time (through getrusage). I ran the 500ms version and the on-full-second version for 10 seconds each, then sent SIGINT to the process and looked at how much user and system time each had spent. The picture was consistent, like this:

# git branch --show-current && make >/dev/null && timeout -s INT 1000 ./ttyplot
measure-sleep-fixed-500ms <--
  user time: 0.39610200 seconds
system time: 0.07582100 seconds
               ^^ 
# git branch --show-current && make >/dev/null && timeout -s INT 1000 ./ttyplot 
measure-sleep-on-full-second <--
  user time: 0.22159500 seconds
system time: 0.01250900 seconds
               ^^

So the CPU usage was consistently lower for the on-full-second edition, by a measurable margin; I tried multiple times.

Now, combining these two observations, I come to the following conclusion. If we look at the loop as three intervals during which time passes:

   select
d1
   gettimeofday
d2
   draw
d3
   select

Then we know that d1 <= 1 second and d2 + d3 <= 25 milliseconds in practice, so d1 + d2 + d3 <= 1.025 seconds, and the human eye will not see more than 25 milliseconds of delay in the clock movement.

Now, since 25 milliseconds is on the edge of what human vision can perceive, we can make a conscious decision(!) here to trade 25 milliseconds of delay for reduced power consumption, since we call select, and run the whole loop, only half as often as we would with a fixed 500ms timeout. So now there is a true motive (which still has to be weighed against the strive for simplicity; again, a conscious decision and tradeoff).

I have changed the pull request now to explain the non-obvious so that the next generation that wants to bring 500ms sleeping back has a chance at an informed decision.

I have kept the branch name but have adjusted the pull request title, I hope that's good enough on the "envelope" side of things.

Let me know what you think and how you like the new commit :beers:

hartwork commented 7 months ago

👍

Note that, if this is still bothering you, the approach of #144 could be recycled. If we encapsulate the logic in a dedicated function, and we avoid carrying extra state, and we don't update `now` twice per iteration... then the main loop stays simple and sweet.

@edgar-bonet I see two key ideas combined here:

Extracting the function would get some low-level bits out of the event loop, so that's nice. With regard to protection against skips, I think the question is whether the round trip…

select
gettimeofday
draw
select (again)

…can ever take a full second without getting into museum-level hardware. If we can think of ways it can, and have access to hardware that slow for testing, the defensive addition may help; if not, it will just be zombie code stealing CPU from everyone. Do you see a chance of that loop taking a full second? If yes, our whole "only a few milliseconds" assumption falls apart, and then I would want us to say goodbye to sleeping full seconds altogether as a first priority. I think I'd vote "yes" for extracting the function as a new pull request if you like, and "no" for adding another check against the last painted second at the moment. What do you think?

// Precondition: the current iteration time (now) has already been displayed.

That sentence seems misleading to me because "already" could mean any point in time, arbitrarily long ago. We could maybe say that "`now` has been last updated between the last call to select and prior to the last paint (if any)" to be precise.

edgar-bonet commented 7 months ago

have access to hardware that slow for testing

I have an Acer Aspire One A110L (low-end netbook from 2008) and a first-generation Raspberry Pi from 2013. They could qualify as “slow”, at least relative to current hardware.

Do you see a chance of that loop taking a full second?

The point of being non-blocking is to always keep this short. That being said, if the computer is very heavily overloaded, I guess it can happen. But then the whole system would be slowed down to the point of being unusable, and a clock being laggy would be the least of the user's worries.

I think I'd vote "yes" for extracting the function as a new pull request if you like, and "no" for adding another check against the last painted second at the moment. What do you think?

Not sure I understood...

hartwork commented 7 months ago

I think I'd vote "yes" for extracting the function as a new pull request if you like, and "no" for adding another check against the last painted second at the moment. What do you think?

Not sure I understood...

@edgar-bonet I was trying to say that I would vote for extracting a function without adding an additional call to gettimeofday, a true and pure refactoring.