lemon24 / reader

A Python feed reader library.
https://reader.readthedocs.io
BSD 3-Clause "New" or "Revised" License

I don't know how often a feed gets updated #249

Closed: lemon24 closed this 3 years ago

lemon24 commented 3 years ago

I'd like to have a measure of how often a feed gets updated, something like "mean time between updates".

This is useful, e.g., for deleting feeds that update too often, or for filtering feeds by update frequency.

I'm sure this has been done in other feed readers, it's worth taking a look at how they did it.

Some thoughts:

One corner case we should treat in a useful way is a feed that had weekly updates until a year ago, but then stopped. A way of dealing with this is doing the calculation only for a window of time (e.g. 6 months); this would likely help with less drastic changes in frequency as well, and would prevent bad results for some entries that predate first_updated_epoch (for them, it defaults to 1970).

Another corner case is a feed with exactly 1 entry in the time window.

We could eliminate duplicates from the initial import of entries without published/updated by grouping by first_updated_epoch (see the sketch after these notes).

Another useful thing would be to compute this metric for an arbitrary selection of entries, e.g. by tag or a full-text search query (how often do we get "python" entries?).

Presumably, if we allow arbitrary selections, the data would be returned by the {get,search}_entry_counts methods (should've called them *_stats).

If we don't allow the time window to be configurable, we may want to have 3 different windows (1 month, 6 months, 1 year), a la system load averages. OTOH, YAGNI; we can have just one initially (likely, 1 year), and then alias that attribute to something else later.
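
As a rough illustration of the time window and dedup ideas above, here's a hedged sketch in plain Python (not reader's implementation; the entry objects and the 6-month window are assumptions, with attribute names mirroring the columns of the entries table):

```python
from datetime import datetime, timedelta, timezone

def update_times(entries, window=timedelta(days=6 * 30)):
    # hypothetical sketch: entries are assumed to have
    # published/updated/first_updated_epoch attributes
    start = datetime.now(timezone.utc) - window
    times = {
        # initial-import duplicates have no published/updated, so they
        # all coalesce to the same first_updated_epoch, and the set
        # collapses them into a single update
        e.published or e.updated or e.first_updated_epoch
        for e in entries
    }
    # only consider updates inside the window
    return sorted(t for t in times if t >= start)
```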

lemon24 commented 3 years ago

My initial idea of how to do this was to calculate the differences between consecutive entry publish times, and average them:

Here's a query that doesn't use window functions (they were added to SQLite in 3.25); running it for all the entries in the past year takes 28s. There are also some timings for a window function version; it's much faster.

```sql
-- https://stackoverflow.com/a/55779275

with

updated_full as (
    select
        coalesce(published, updated, first_updated_epoch) as fu,
        --first_updated_epoch as fu,
        id, feed
    from entries
    where feed like '%99p%'
    and fu >= datetime('now', '-12 months')
    order by fu, id, feed
),

updated as (
    select count(*) as sequence, current.fu as fu
    from updated_full as current
    left join updated_full as others
    where current.fu >= others.fu
    group by current.fu, current.id
),

updated_skip_one as (
    select sequence - 1 as sequence, fu
    from updated
    limit -1 offset 1
),

pairs as (
    select updated.fu as one, updated_skip_one.fu as two
    from updated
    join updated_skip_one using (sequence)
)

select avg(strftime('%s', two) - strftime('%s', one)) / (3600 * 24)
from pairs;

-- whole db 12m (28s): 348.875750942319 kfu (?!), 0.0659692288411146 fu
-- 99p 12m (.011s): 6.14506353930461 kfu, 6.28819444444444 fu

-- using "select ROW_NUMBER() over () as sequence" in updated instead:
-- whole db 12m (~.050s): 0.109319573689652 kfu, 0.0659692288411146 fu
-- 99p 12m (.005s): 6.14506353930461 kfu, 6.14540816326531 fu
```
Here's a version using a custom aggregate function; timings are comparable to the window function version.

```python
from reader import make_reader
from datetime import datetime
import sqlite3
import time


# based on https://github.com/lemon24/reader/blob/0.22/src/reader/core/storage.py#L203-L212
def _datetime_to_s(value):
    if not value:
        return None
    if not isinstance(value, bytes):
        value = value.encode('utf-8')
    dt = sqlite3.converters['TIMESTAMP'](value)
    rv = (dt - datetime(1970, 1, 1)).total_seconds()
    return rv


missing = object()


class AvgOfDifferences:

    def __init__(self):
        self.prev = missing
        self.count = 0
        self.total = 0

    def step(self, value):
        value = _datetime_to_s(value)
        if self.prev is missing:
            self.prev = value
            return
        self.count += 1
        self.total += (value - self.prev)
        self.prev = value

    def finalize(self):
        return self.total / self.count


db = make_reader('db.sqlite')._storage.db
db.create_aggregate("avg_of_differences", 1, AvgOfDifferences)

start = time.perf_counter()
rows = list(db.execute("""
    with updated as (
        select
            coalesce(published, updated, first_updated_epoch) as fu
            --first_updated_epoch as fu
        from entries
        where feed like '%99p%'
        and fu >= datetime('now', '-12 months')
        order by fu, id, feed
    )
    select avg_of_differences(fu) / (3600 * 24)
    from updated
"""))
end = time.perf_counter()

print(*rows, end - start)

"""
-- using python:
-- whole db 12m (~.035s): 0.10931957368965244 kfu, 0.06596922992339282 fu
-- 99p 12m (.014s): 6.14506353930461 kfu, 6.14540816913313 fu
"""
```

The custom aggregate version is better: it's fast, it works now (with SQLite 3.15, no window functions needed), and it's more flexible (e.g. you can use statistics functions other than the mean).

Update: This is overcomplicated. It's much easier to just calculate an average frequency, i.e. occurrences per unit of time; this also handles a few corner cases cleanly (e.g. 0 or 1 entries in the time window).
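
A minimal sketch of that metric in plain Python (hypothetical names, not reader's API; assumes we already have the entries' update datetimes):

```python
from datetime import datetime, timedelta, timezone

def update_frequency(update_times, window_days):
    # average updates per day over the last window_days days;
    # unlike mean-time-between-updates, 0 or 1 updates in the window
    # are not special cases: the result is just 0 or 1 / window_days
    start = datetime.now(timezone.utc) - timedelta(days=window_days)
    count = sum(1 for t in update_times if t >= start)
    return count / window_days

# hypothetical usage: three windows, a la system load averages
now = datetime.now(timezone.utc)
times = [now - timedelta(days=d) for d in (3, 17, 40, 200)]
averages = tuple(update_frequency(times, d) for d in (30, 91, 365))
# (2/30, 3/91, 4/365) ~= (0.0667, 0.0330, 0.0110)
```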

lemon24 commented 3 years ago

With the simple implementation (entries per day), here are some stats about the average frequency for my current feeds:

             count    mean     std     min     50%     90%     99%     max
--------  --------  ------  ------  ------  ------  ------  ------  ------
1 month   157.0000  0.0498  0.1501  0.0000  0.0000  0.1117  0.6256  1.4456
3 months  157.0000  0.0508  0.1063  0.0000  0.0110  0.1248  0.5222  0.7995
1 year    157.0000  0.0535  0.0999  0.0000  0.0137  0.1353  0.4416  0.7611

Looking at individual feeds, the 1-, 3-, and 12-month windows seem like the best choice; having just one doesn't really tell a story.

Assuming I'm going to represent them as a sort of sparkline, it makes sense to show them on a log scale. Based on one and two, I came up with this (it matches this curve):

```python
>>> from math import log10
>>> 
>>> def log_scale(n, c):
...     return log10(n * c + 1) / log10(c + 1)
... 
>>> log_scale(.011, 100)  # 3-month median, 1 entry every ~90 days
0.1607622903934876
>>> log_scale(.0508, 100)  # 3-month mean, 1 entry every ~20 days
0.39110673045077493
>>> log_scale(.1248, 100)  # 3-month p90, 1 entry every ~8 days
0.5636271243604518
>>> log_scale(.5222, 100)  # 3-month p99, 1 entry every ~2 days
0.8611767018967853
```
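
And, since the plan is to render these as a sort of sparkline, a toy sketch of how the log-scaled values could be mapped to block characters (the characters and c=100 are assumptions, not anything reader ships):

```python
from math import log10

def log_scale(n, c):
    return log10(n * c + 1) / log10(c + 1)

BLOCKS = ' ▁▂▃▄▅▆▇█'

def sparkline(frequencies, c=100):
    # one block character per frequency, on a log scale
    return ''.join(
        BLOCKS[round(log_scale(f, c) * (len(BLOCKS) - 1))]
        for f in frequencies
    )

print(sparkline([0.011, 0.0508, 0.1248, 0.5222]))  # ▁▃▅▇
```
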
lemon24 commented 3 years ago

API time!

We'll add a new attribute to EntryCounts; its value will be a tuple subclass that also bundles the intervals as an attribute (initially it can just be a 3-tuple; the point is that we can make it more specific later).

Not sure if it's worth having the periods as timedeltas – to get the number of entries back, you'd have to do something like avgs[0] * (avgs.periods[0].total_seconds() / 3600 / 24).

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union, Sequence, overload, cast

class Averages(Tuple[float, float, float]):

    _periods: Tuple[float, float, float]

    def __new__(
        cls, _values: Sequence[float], *, periods: Sequence[float]
    ) -> 'Averages':
        if not len(_values) == len(periods) == 3:
            raise ValueError
        # Argument 2 to "__new__" of "tuple" has incompatible type "Sequence[float]"; expected "Iterable[_T_co]"
        rv = super().__new__(cls, _values)  # type: ignore[arg-type]
        rv._periods = cast(Tuple[float, float, float], tuple(periods))
        return rv

    @property
    def periods(self) -> Tuple[float, float, float]:
        # "immutable"
        return self._periods

    @property
    def counts(self) -> Tuple[int, int, int]:
        # nice to have
        return cast(
            Tuple[int, int, int], 
            tuple(round(f * p) for f, p in zip(self, self.periods))
        )

    def __repr__(self) -> str:
        return f"{type(self).__qualname__}({super().__repr__()}, periods={self._periods!r})"

@dataclass(frozen=True)
class AveragesVariableLength(Sequence[float]):

    # not used below, likely yagni

    _values: Tuple[float, ...]
    periods: Tuple[float, ...]

    # this behaves like a tuple subclass with a periods attribute

    # we sadly lose the info that values and periods are of a specific length;
    # we could set them to a typevar bound to Tuple[float, ...],
    # but idk how to mark the sequence stuff below as being of that length

    # need the overloads per https://stackoverflow.com/a/46720499

    @overload
    def __getitem__(self, index: int) -> float: ...

    @overload
    def __getitem__(self, index: slice) -> Sequence[float]: ...

    def __getitem__(self, index: Union[int, slice]) -> Union[float, Sequence[float]]:
        return self._values[index]

    def __len__(self) -> int:
        return len(self._values)

@dataclass(frozen=True)
class EntryCounts:
    total: Optional[int] = None
    averages: Optional[Averages] = None

DEFAULT_DAYS = (30.0, 91.0, 365.0)

def get_entry_counts(averages_days: Tuple[float, ...] = DEFAULT_DAYS) -> EntryCounts:
    # fake code to make sure typing works
    # TODO: better name for averages_days
    uf = tuple(1 / d for d in averages_days)
    rv = EntryCounts(1, Averages(uf, periods=averages_days))
    return rv

#x = get_entry_counts().averages
#x = get_entry_counts((30.0, 91, 365.0)).averages
x = get_entry_counts((30.0, int(91))).averages
if x is not None:
    reveal_type(x)
    reveal_type(x.periods)
    # both should be a mypy error for the last get_entry_counts()
    a, b, c = x
    a, b, c = x.periods
```

Here's an abandoned example with a slightly different API that makes EntryCounts generic on the type (length) of the time window tuple (might be useful later).

```python
from dataclasses import dataclass
from typing import Optional, Tuple, TypeVar, Generic, cast, overload

TF = TypeVar('TF', bound=Tuple[float, ...])


@dataclass(frozen=True)
class EntryCounts(Generic[TF]):
    total: Optional[int] = None
    update_frequency: Optional[TF] = None


DEFAULT_DAYS = (30.0, 91.0, 365.0)


# the overload is required due to python/mypy#3737;
# however, the implementation "cannot produce return type of signature";
# we can quiet this down with the "type: ignore[misc]",
# but it removes type checking for the entire function;
# to still have type checking, we move the code to another function,
# one that doesn't have default arguments:
# https://github.com/python/mypy/issues/3737#issuecomment-465355737

@overload
def get_entry_counts() -> EntryCounts[Tuple[float, float, float]]: ...

@overload
def get_entry_counts(update_frequency_days: TF) -> EntryCounts[TF]: ...

def get_entry_counts(  # type: ignore[misc]
    update_frequency_days: Tuple[float, ...] = DEFAULT_DAYS
) -> EntryCounts[Tuple[float, ...]]:
    return _get_entry_counts(update_frequency_days)


def _get_entry_counts(update_frequency_days: TF) -> EntryCounts[TF]:
    # fake code to make sure typing works
    uf = tuple(1 / d for d in update_frequency_days)
    rv = EntryCounts(1, cast(TF, uf))
    return rv


# this is still somewhat wrong in that for
#   get_entry_counts((30.0, int(91)))
# the inferred type is EntryCounts[Tuple[builtins.float, builtins.int*]];
# https://www.python.org/dev/peps/pep-0646/ might help;
# OTOH, we could just annotate the return as Tuple[float, ...] (no length info)
```
lemon24 commented 3 years ago

Remaining work: