Closed. lemon24 closed this issue 3 years ago.
My initial idea of how to do this was to calculate the differences between entry publish times, and average them.
The custom aggregate version is better: it's fast, it works now (SQLite 3.15), and it's more flexible (e.g. you can use other statistics functions, like the mean).
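A minimal sketch of what such a custom aggregate could look like (table and column names are hypothetical, not the actual reader schema; the aggregate computes the mean interval between consecutive published timestamps, fed in ascending order via a subquery):

```python
import sqlite3

class MeanInterval:
    """SQLite aggregate: mean difference between consecutive values.

    Assumes values arrive in ascending order (hence the ORDER BY
    in the subquery below). Returns None for fewer than 2 values.
    """
    def __init__(self):
        self.prev = None
        self.total = 0.0
        self.count = 0

    def step(self, value):
        if self.prev is not None:
            self.total += value - self.prev
            self.count += 1
        self.prev = value

    def finalize(self):
        return self.total / self.count if self.count else None

db = sqlite3.connect(':memory:')
db.create_aggregate('mean_interval', 1, MeanInterval)
db.execute("CREATE TABLE entries (feed TEXT, published REAL)")
db.executemany(
    "INSERT INTO entries VALUES (?, ?)",
    [('f', 0.0), ('f', 86400.0), ('f', 259200.0)],  # days 0, 1, 3
)
row = db.execute("""
    SELECT mean_interval(published)
    FROM (SELECT published FROM entries WHERE feed = 'f' ORDER BY published)
""").fetchone()
print(row[0])  # mean of (1 day, 2 days) = 129600.0 seconds
```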
Update: This is overcomplicated. It's way easier to just calculate an average frequency, i.e. occurrences per unit of time; it also solves a few corner cases (e.g. 0 and 1 entries per time window).
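The occurrences-per-unit-of-time version can be sketched like this (hypothetical helper, not the actual implementation; timestamps are epoch seconds). Note that 0 entries in the window just yields 0.0, and 1 entry yields 1/window, with no special-casing:

```python
DAY = 86400

def avg_frequency(times, window_days, now):
    """Average entries per day over the last window_days."""
    start = now - window_days * DAY
    count = sum(1 for t in times if start < t <= now)
    return count / window_days

now = 100 * DAY
times = [now - d * DAY for d in (1, 5, 40)]  # entry ages in days
print(avg_frequency(times, 30, now))  # 2 entries in the window -> 2/30
```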
With the simple implementation (entries per day), here's some stats about the average frequency for my current feeds:
          count     mean    std     min     50%     90%     99%     max
--------  --------  ------  ------  ------  ------  ------  ------  ------
1 month   157.0000  0.0498  0.1501  0.0000  0.0000  0.1117  0.6256  1.4456
3 months  157.0000  0.0508  0.1063  0.0000  0.0110  0.1248  0.5222  0.7995
1 year    157.0000  0.0535  0.0999  0.0000  0.0137  0.1353  0.4416  0.7611
Looking at individual feeds, the 1-, 3-, and 12-month windows seem like the best choice; having just one doesn't really tell a story.
Assuming I'm going to represent them as a sort of sparkline, it makes sense to show them on a log scale. Based on one and two, I came up with this (it matches this curve):
>>> from math import log10
>>>
>>> def log_scale(n, c):
... return log10(n * c + 1) / log10(c + 1)
...
>>> log_scale(.011, 100) # 3-month median, 1 entry every ~90 days
0.1607622903934876
>>> log_scale(.0508, 100) # 3-month mean, 1 entry every ~20 days
0.39110673045077493
>>> log_scale(.1248, 100) # 3-month p90, 1 entry every ~8 days
0.5636271243604518
>>> log_scale(.5222, 100) # 3-month p99, 1 entry every ~2 days
0.8611767018967853
API time!
We'll add a new attribute to EntryCounts: a tuple subclass that also bundles the periods as an attribute (initially it can just be a 3-tuple; the point is that we can make it more specific later).
Not sure if it's worth having the periods as timedeltas; to get the number of entries back, you'd have to do something like avgs[0] * (avgs.periods[0].total_seconds() / 3600 / 24).
from dataclasses import dataclass
from typing import Optional, Tuple, Union, Sequence, overload, cast


class Averages(Tuple[float, float, float]):

    _periods: Tuple[float, float, float]

    def __new__(
        cls, _values: Sequence[float], *, periods: Sequence[float]
    ) -> 'Averages':
        if not len(_values) == len(periods) == 3:
            raise ValueError
        # Argument 2 to "__new__" of "tuple" has incompatible type "Sequence[float]"; expected "Iterable[_T_co]"
        rv = super().__new__(cls, _values)  # type: ignore[arg-type]
        rv._periods = cast(Tuple[float, float, float], tuple(periods))
        return rv

    @property
    def periods(self) -> Tuple[float, float, float]:
        # "immutable"
        return self._periods

    @property
    def counts(self) -> Tuple[int, int, int]:
        # nice to have
        return cast(
            Tuple[int, int, int],
            tuple(round(f * p) for f, p in zip(self, self.periods)),
        )

    def __repr__(self) -> str:
        return f"{type(self).__qualname__}({super().__repr__()}, periods={self._periods!r})"


@dataclass(frozen=True)
class AveragesVariableLength(Sequence[float]):
    # not used below, likely YAGNI

    _values: Tuple[float, ...]
    periods: Tuple[float, ...]

    # this behaves like a tuple subclass with a periods attribute;
    # we sadly lose the info that values and periods are of a specific length;
    # we could set them to a typevar bound to Tuple[float, ...],
    # but idk how to mark the sequence stuff below as being of that length

    # need the overloads per https://stackoverflow.com/a/46720499
    @overload
    def __getitem__(self, index: int) -> float: ...
    @overload
    def __getitem__(self, index: slice) -> Sequence[float]: ...
    def __getitem__(self, index: Union[int, slice]) -> Union[float, Sequence[float]]:
        return self._values[index]

    def __len__(self) -> int:
        return len(self._values)


@dataclass(frozen=True)
class EntryCounts:
    total: Optional[int] = None
    averages: Optional[Averages] = None


DEFAULT_DAYS = (30.0, 91.0, 365.0)


def get_entry_counts(averages_days: Tuple[float, ...] = DEFAULT_DAYS) -> EntryCounts:
    # fake code to make sure typing works
    # TODO: better name for averages_days
    uf = tuple(1 / d for d in averages_days)
    rv = EntryCounts(1, Averages(uf, periods=averages_days))
    return rv


# x = get_entry_counts().averages
# x = get_entry_counts((30.0, 91, 365.0)).averages
x = get_entry_counts((30.0, int(91))).averages

if x is not None:
    reveal_type(x)
    reveal_type(x.periods)

    # both should be a mypy error for the last get_entry_counts()
    a, b, c = x
    a, b, c = x.periods
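As a quick sanity check of the counts round-trip (hypothetical numbers; since the averages are entries per day, count ≈ frequency × period, modulo float rounding):

```python
periods = (30.0, 91.0, 365.0)
counts = (3, 10, 52)  # hypothetical entry counts per window
freqs = tuple(c / p for c, p in zip(counts, periods))
# what the counts property would compute from the stored frequencies
recovered = tuple(round(f * p) for f, p in zip(freqs, periods))
print(recovered)  # (3, 10, 52)
```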
Remaining work:
I'd like to have a measure of how often a feed gets updated, something like "mean time between updates".
This is useful for e.g. deleting feeds that update too often, or filtering feeds by it.
I'm sure this has been done in other feed readers; it's worth taking a look at how they did it.
Some thoughts:
One corner case we should treat in a useful way is a feed that had weekly updates until a year ago, but then stopped. One way of dealing with this is doing the calculation only for a window of time (e.g. 6 months); this would likely help with less drastic changes in frequency as well, and would prevent bad results for entries that predate first_updated_epoch (for them, it defaults to 1970).
Another corner case is a feed with exactly 1 entry in the time window.
We could eliminate duplicates from the initial import of entries without published/updated by grouping by first_updated_epoch.
Another useful thing would be to compute this metric for an arbitrary selection of entries, e.g. by tag or a full-text search query (how often do we get "python" entries?).
Presumably, if we allow arbitrary selections, the data would be returned by the {get,search}_entry_counts methods (should've called it *_stats). If we don't allow the time window to be configurable, we may want to have 3 different windows (1 month, 6 months, 1 year), a la system load averages. OTOH, YAGNI; we can have just one initially (likely 1 year), and then alias that attribute to something else later.
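A hypothetical SQL sketch combining the window restriction with the duplicate grouping described above (schema and column names are assumptions, not the actual reader schema; entries without published fall back to first_updated_epoch, and rows sharing a first_updated_epoch collapse into one):

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.execute("""
    CREATE TABLE entries (
        feed TEXT,
        first_updated_epoch REAL,
        published REAL
    )
""")
# two rows sharing first_updated_epoch simulate duplicates
# from an initial import of entries without published/updated
db.executemany("INSERT INTO entries VALUES (?, ?, ?)", [
    ('f', 1000.0, None),
    ('f', 1000.0, None),
    ('f', 2000.0, 2000.0),
])

window_start, window_end = 0.0, 86400.0 * 30
(count,) = db.execute("""
    SELECT count(*) FROM (
        SELECT first_updated_epoch
        FROM entries
        WHERE feed = 'f'
          AND coalesce(published, first_updated_epoch)
              BETWEEN :start AND :end
        GROUP BY first_updated_epoch
    )
""", {'start': window_start, 'end': window_end}).fetchone()
print(count)  # the two duplicate rows count as one entry -> 2
```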