lemon24 / reader

A Python feed reader library.
https://reader.readthedocs.io
BSD 3-Clause "New" or "Revised" License

get_entries(sort='recent') heuristic does not work with backdated entries #279

Closed lemon24 closed 2 years ago

lemon24 commented 2 years ago

As of 2.12, the recent entry sort heuristic is defined as follows:

Most recent first. Currently, that means:

  • by import date [added] for entries published less than 7 days ago
  • by published date otherwise (if an entry does not have published, updated is used)

This is to make sure newly imported entries appear at the top regardless of when the feed says they were published (sometimes, it lies by a day or two).

For some feeds, entries can appear in the feed months after their published date. Example: https://peps.python.org/peps.rss
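To make the failure mode concrete, here is a minimal sketch of the heuristic as quoted above, written in plain Python rather than taken from reader's code. The Entry class, recent_sort_key() and sort_recent() are hypothetical names used only for illustration, and falling back to the added date when an entry has neither published nor updated is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# "published less than 7 days ago" threshold from the quoted description.
RECENT_THRESHOLD = timedelta(days=7)


@dataclass
class Entry:
    """Hypothetical stand-in for a feed entry; not reader's Entry class."""
    id: str
    added: datetime
    published: Optional[datetime] = None
    updated: Optional[datetime] = None


def recent_sort_key(entry: Entry, now: datetime) -> datetime:
    # If an entry does not have published, updated is used.
    published = entry.published or entry.updated
    # Assumption: with no dates at all, fall back to the import date.
    if published is None or now - published < RECENT_THRESHOLD:
        return entry.added
    return published


def sort_recent(entries, now):
    # Most recent first.
    return sorted(entries, key=lambda e: recent_sort_key(e, now), reverse=True)


now = datetime(2022, 6, 1, tzinfo=timezone.utc)
entries = [
    # Imported just now, but the feed says it was published 90 days ago.
    Entry('backdated', added=now, published=now - timedelta(days=90)),
    Entry('old', added=now - timedelta(days=30), published=now - timedelta(days=30)),
    Entry('fresh', added=now - timedelta(days=2), published=now - timedelta(days=2)),
]
print([e.id for e in sort_recent(entries, now)])
# ['fresh', 'old', 'backdated']
```

In this sketch the backdated entry is the newest import, yet it sorts last because its published date is older than the 7-day window, so its import date is ignored.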

It should be possible to opt in to see entries when they are added, regardless of published/updated.

Some avenues for doing this:

lemon24 commented 2 years ago

To do:

lemon24 commented 2 years ago

Did some benchmarks / optimizations:

I ran the benchmarks on my database:

$ python -u scripts/bench.py time -n1 -r16 --db db.sqlite --query elon \
    get_entries_all get_entries_read search_entries_recent_all
$ bench.py diff 3.0 noopt --format=percent-decrease | grep -e ^stat -e '^ min'
stat number repeat num_entries get_entries_all get_entries_read search_entries_recent_all
 min      1     16           0           15.7%            18.3%                      3.5%
$ bench.py diff 3.0 idprefix --format=percent-decrease | grep -e ^stat -e '^ min'
stat number repeat num_entries get_entries_all get_entries_read search_entries_recent_all
 min      1     16           0           25.6%            27.5%                      2.7%
$ bench.py diff 3.0 str --format=percent-decrease | grep -e ^stat -e '^ min'
stat number repeat num_entries get_entries_all get_entries_read search_entries_recent_all
 min      1     16           0           25.8%            27.8%                      1.8%

Web app results (on my laptop; on an EC2 instance they look similar):

            page generated in <min> (<avg>) 
            /?limit=64      /
3.0         .169 (~.20)     6.22 (~6.5)    
noopt       .240 (~.25)     6.13 (~6.3)
idprefix    .166 (~.19)     6.06 (~6.3)