verigak / progress

Easy to use progress bars for Python
ISC License
1.41k stars 179 forks source link

Added property long_eta #59

Open awiebe opened 6 years ago

awiebe commented 6 years ago

For tasks that run hours not seconds, a total average eta generally becomes more stable and accurate as the task progresses.

A small running average does the unstable "windows file transfer" sort of estimate where the estimated time is wildly unstable, whereas a long running average starts unstable and then converges to an ok approximation.

praiskup commented 6 years ago

IMO, the bug in this project is that the moving average window is too short (sometimes only several ms) -- the higher the next() call frequency is, the worse results the progress bar shows. Correct moving average calculation should have some minimal time window period to give "stable" results (see e.g. https://github.com/python-progress/python-progress where the window is 2 seconds by default, and decide whether you need the long_eta).

That said, "long_eta" calculated across the whole period (hours) would be rarely useful; the upload/download process is dynamic and the speed changes dramatically -- giving the user incorrect estimation. The useful info is "what has been the average speed in last few seconds".

awiebe commented 6 years ago

@praiskup Actually for large batch jobs, the long running average produces much better estimation than even an extended window. As you noted the subtasks may not necessarily uniform in time, however if you take an average over a long period this smooths out.

Model the progress as a scatter plot, sampling the "remaining" at intervals t. Actually this particular method doesn't even require that the intervals be regularly spaced. What you wind up with is a trapezoid that probably overestimates most of the time, but is acceptable for a rough time of a long running job.

Right now I have a big data job, and the estimate is using this logic. It looks like this. | | 1.6% - 9 days, 9:59:48 It's fairly stable and I know that it will take about 9 days 10 hours.

That being said there are some better schemes that can be made using the long estimate, which will converge to a result faster. I actually made some better logic that uses a weighted shift to prevent spikes from moving the estimate too much. But the logic for that was more involved, so I didn't put it in this particular PR.

For example, I'm using a 60/40 weighted split:

    def __init(self)__
        self.prev_long_avg=3600
    def long_eta(self):
        avg = self.elapsed/(self.index+1)
        prev=self.prev_long_avg
        avg= (prev*0.6) + (avg*0.4)
        self.prev_long_avg=avg
        return int(ceil(avg * self.remaining))

Note that this simple linear extrapolation is a time/memory tradeoff. Not too much time and memory should be spent calculating progress, so even though calculating a linear regression might be the way to go, the manner in which long_eta works is acceptable for long tasks and scales O(1) unlike a regression that scales O(n) in time and space. Thus this is a reasonable compromise. If this PR is accepted, I might try making a class SmartBar(Bar), which chooses an estimation method to display based on available information.

praiskup commented 6 years ago

Ah, I thought that you want fix the broken "moving average" speed calculation by taking "average" speed into account, but you have different (pretty corner) use-case in my opinion.

Unless you process something which has "uniform" speed over the whole period, you never want to take absolute "average" into account, otherwise it gives you very misleading info. And if you do, you probably don't want to use "simple moving average" calculator/project. But I can also accept that it is matter of taste.

I'm +1 for additional functionality - since I see that it is useful for you, but if progress should support both calculations now - I'd suggest you to make clear distinction between "moving" average and normal average in documentation and in property names, so it is absolutely clear what is what. Seeing "avg_long" in "simple moving average" project isn't descriptive enough.