lemon24 / reader

A Python feed reader library.
https://reader.readthedocs.io
BSD 3-Clause "New" or "Revised" License

AssertionError when running update_feeds() #335

Closed TheLovinator1 closed 3 months ago

TheLovinator1 commented 4 months ago

assert intent.first_updated is None, intent.first_updated seems to get triggered for a lot of feeds. https://github.com/lemon24/reader/blob/25a7207edac09d1e9e78d2864a640e02738c9904/src/reader/_storage/_entries.py#L275

Example feeds

Stack trace

Traceback (most recent call last):
  File "testboi.py", line 5, in <module>
    reader.update_feeds()
  File "\IEC0pvXw-py3.12\Lib\site-packages\reader\core.py", line 844, in update_feeds
    for url, value in results:
  File "\IEC0pvXw-py3.12\Lib\site-packages\reader\core.py", line 975, in update_feeds_iter
    yield from Pipeline.from_reader(self, map).update(filter)
  File "\IEC0pvXw-py3.12\Lib\site-packages\reader\_update.py", line 388, in update
    raise value
  File "\IEC0pvXw-py3.12\Lib\site-packages\reader\_update.py", line 415, in process_parse_result
    counts = self.update_feed(feed.url, *intents)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "\IEC0pvXw-py3.12\Lib\site-packages\reader\_update.py", line 447, in update_feed
           ^^^^^^^^^^^^^^^^^^^
  File "\IEC0pvXw-py3.12\Lib\site-packages\reader\_storage\_entries.py", line 205, in add_or_update_entries
    self._add_or_update_entries(iterable)
  File "\IEC0pvXw-py3.12\Lib\site-packages\reader\_storage\_entries.py", line 220, in _add_or_update_entries
    self._update_entry(db, intent)
  File "\IEC0pvXw-py3.12\Lib\site-packages\reader\_storage\_entries.py", line 274, in _update_entry
    assert intent.first_updated is None, intent.first_updated
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: 2024-05-21 14:51:35.785094+00:00

The feeds seem to be valid (with some recommendations):

https://validator.w3.org/feed/check.cgi?url=http%3A%2F%2F485i.com%2Ffeed%2F
https://validator.w3.org/feed/check.cgi?url=http%3A%2F%2Fkevinkauzlaric.com%2Ffeed%2F

Example code

from reader import Reader, make_reader

reader: Reader = make_reader(url="testboi.sqlite")
reader.add_feed("http://485i.com/feed/")
reader.update_feeds()
lemon24 commented 4 months ago

Thank you for the report!

Unfortunately, I can't seem to reproduce this on either macOS or Linux (tried both the provided script and the CLI).

(Repro attempt on macOS and Linux, using the CLI.)

```console
$ sw_vers
ProductName:    Mac OS X
ProductVersion: 10.15.7
BuildVersion:   19H2026
$ python --version
Python 3.12.2
$ python -m reader --version
python -m reader 3.13.dev0
$ git log --oneline | head -n1
25a7207 .gitattributes: always use crlf line endings for bat files.
$ rm testboi.sqlite
$ python -m reader --db testboi.sqlite add http://485i.com/feed/
$ python -m reader --db testboi.sqlite update -v
0:00:00.506136 0/1 http://485i.com/feed/ 2 new, 0 modified, 2 total
1 ok, 0 error, 0 not modified; entries: 2 new, 0 modified
```

```console
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.3 LTS
Release:        22.04
Codename:       jammy
$ python --version
Python 3.10.12
$ python -m reader --version
python -m reader 3.12
$ rm testboi.sqlite
rm: cannot remove 'testboi.sqlite': No such file or directory
$ python -m reader --db testboi.sqlite add http://485i.com/feed/
$ python -m reader --db testboi.sqlite update -v
0:00:00.585365 0/1 http://485i.com/feed/ 2 new, 0 modified, 2 total
1 ok, 0 error, 0 not modified; entries: 2 new, 0 modified
```

Based on the backslashes in the traceback, I gather this happened on a Windows machine. I'll report back when I get access to one (don't have one at hand at the moment).

TheLovinator1 commented 4 months ago

I think I was the problem. I tried reinstalling reader before I made the issue but still had the same problem; I removed the virtual environment and created a new one, and it works now. Sorry for taking up your time!

lemon24 commented 4 months ago

No worries!

TheLovinator1 commented 4 months ago
$ python --version
Python 3.12.3
$ python -m reader --version
python -m reader 3.12
$  rm testboi.sqlite
rm: cannot remove 'testboi.sqlite': No such file or directory
$  python -m reader --db testboi.sqlite add https://abidlabs.github.io/feed.xml
$  python -m reader --db testboi.sqlite update -v
0 ok, 0 error, 0 not modified; entries: 0 new, 0 modified
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/lovinator/.cache/pypoetry/virtualenvs/readertest-UrSJG0TM-py3.12/lib/python3.12/site-packages/reader/__main__.py", line 15, in <module>
    cli(prog_name='python -m reader')
  File "/home/lovinator/.cache/pypoetry/virtualenvs/readertest-UrSJG0TM-py3.12/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lovinator/.cache/pypoetry/virtualenvs/readertest-UrSJG0TM-py3.12/lib/python3.12/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/home/lovinator/.cache/pypoetry/virtualenvs/readertest-UrSJG0TM-py3.12/lib/python3.12/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lovinator/.cache/pypoetry/virtualenvs/readertest-UrSJG0TM-py3.12/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lovinator/.cache/pypoetry/virtualenvs/readertest-UrSJG0TM-py3.12/lib/python3.12/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lovinator/.cache/pypoetry/virtualenvs/readertest-UrSJG0TM-py3.12/lib/python3.12/site-packages/reader/_cli.py", line 83, in wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/lovinator/.cache/pypoetry/virtualenvs/readertest-UrSJG0TM-py3.12/lib/python3.12/site-packages/reader/_cli.py", line 107, in wrapper
    rv = fn(*args, **kwargs)
         ^^^^^^^^^^^^^^^^^^^
  File "/home/lovinator/.cache/pypoetry/virtualenvs/readertest-UrSJG0TM-py3.12/lib/python3.12/site-packages/reader/_cli.py", line 158, in wrapper
    return fn(reader, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lovinator/.cache/pypoetry/virtualenvs/readertest-UrSJG0TM-py3.12/lib/python3.12/site-packages/reader/_cli.py", line 359, in update
    for result in bar:
  File "/home/lovinator/.cache/pypoetry/virtualenvs/readertest-UrSJG0TM-py3.12/lib/python3.12/site-packages/reader/_cli.py", line 266, in iter_update_status
    for i, result in enumerate(it):
  File "/home/lovinator/.cache/pypoetry/virtualenvs/readertest-UrSJG0TM-py3.12/lib/python3.12/site-packages/reader/core.py", line 975, in update_feeds_iter
    yield from Pipeline.from_reader(self, map).update(filter)
  File "/home/lovinator/.cache/pypoetry/virtualenvs/readertest-UrSJG0TM-py3.12/lib/python3.12/site-packages/reader/_update.py", line 388, in update
    raise value
  File "/home/lovinator/.cache/pypoetry/virtualenvs/readertest-UrSJG0TM-py3.12/lib/python3.12/site-packages/reader/_update.py", line 415, in process_parse_result
    counts = self.update_feed(feed.url, *intents)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lovinator/.cache/pypoetry/virtualenvs/readertest-UrSJG0TM-py3.12/lib/python3.12/site-packages/reader/_update.py", line 447, in update_feed
    self.storage.add_or_update_entries(entries)
  File "/usr/lib/python3.12/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/lovinator/.cache/pypoetry/virtualenvs/readertest-UrSJG0TM-py3.12/lib/python3.12/site-packages/reader/_storage/_entries.py", line 205, in add_or_update_entries
    self._add_or_update_entries(iterable)
  File "/home/lovinator/.cache/pypoetry/virtualenvs/readertest-UrSJG0TM-py3.12/lib/python3.12/site-packages/reader/_storage/_entries.py", line 220, in _add_or_update_entries
    self._update_entry(db, intent)
  File "/home/lovinator/.cache/pypoetry/virtualenvs/readertest-UrSJG0TM-py3.12/lib/python3.12/site-packages/reader/_storage/_entries.py", line 274, in _update_entry
    assert intent.first_updated is None, intent.first_updated
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: 2024-05-21 22:21:50.848428+00:00

More feeds:

I have no idea if I gave you the wrong feeds earlier or if I am crazy, but I got the problem again lol

I also get the error on my server that runs Arch, so it should not be a Windows problem.

lemon24 commented 4 months ago

Finally, https://abidlabs.github.io/feed.xml fails for me too; I saved it locally and will look into the root cause.

> I have no idea if I gave you the wrong feeds earlier or if I am crazy, but I got the problem again lol

Not crazy :)) This looks like a weird one – at a quick reading of the code, this Should Not Happen™, but it obviously does.

lemon24 commented 4 months ago

git bisect says that 68baa2860618b2d4164bac3468ace4aa6fe36c29 is the first bad commit.

lemon24 commented 4 months ago

OK, the root cause is multiple entries with the same id in the same feed:

$ curl -sL https://abidlabs.github.io/feed.xml | grep -Eo '<id>[^<]+</id>' | sort | uniq -c | sort -rn | head -n3
   3 <id>https://abidlabs.github.io/journal/2018/03/journal</id>
   1 <id>https://abidlabs.github.io/uci-datasets</id>
   1 <id>https://abidlabs.github.io/journal/2018/02/journal</id>

Prior to https://github.com/lemon24/reader/commit/68baa2860618b2d4164bac3468ace4aa6fe36c29, the insert-or-update was atomic.

After it, while from the point of view of the updater (the update intent) all entries are new, by the time the second duplicate entry is added to the database, the first one is already there (so intent.first_updated being None just says the entry was new when the updater checked, but does not guarantee it hasn't made it into the database since).
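
To make that sequence concrete, here is a minimal, hypothetical sqlite3 sketch (not reader's actual storage code) of a non-atomic check-then-insert tripping over a duplicate id:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE entries (feed TEXT, id TEXT, PRIMARY KEY (feed, id))")

# Two copies of the same entry id in one parse result.
feed_entries = [("feed.xml", "entry-1"), ("feed.xml", "entry-1")]

# Step 1: decide which entries are new; both copies pass the check,
# because neither is in the database yet.
new_entries = [
    e for e in feed_entries
    if db.execute("SELECT 1 FROM entries WHERE feed = ? AND id = ?", e).fetchone() is None
]

# Step 2: insert them one by one; by the time the second copy is
# inserted, the first one is already stored, so "was new at check time"
# no longer means "not in the database".
for e in new_entries:
    try:
        db.execute("INSERT INTO entries (feed, id) VALUES (?, ?)", e)
    except sqlite3.IntegrityError as exc:
        print("duplicate id already in the database:", exc)
```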


Now, bug aside, the question is: how should we treat duplicate ids?

For RSS, if the <guid> is missing, we fall back to <link>:

https://github.com/lemon24/reader/blob/48c6bebc93e2c82471fad40ec2a9bdf8c5a1eb84/src/reader/_parser/feedparser.py#L132-L140

If no fallback is found, we raise:

https://github.com/lemon24/reader/blob/48c6bebc93e2c82471fad40ec2a9bdf8c5a1eb84/src/reader/_parser/feedparser.py#L142-L143

... and end up skipping the entry:

https://github.com/lemon24/reader/blob/48c6bebc93e2c82471fad40ec2a9bdf8c5a1eb84/src/reader/_parser/feedparser.py#L100-L105
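
(A simplified, hypothetical sketch of that fallback, not the actual parser code:)

```python
# Hypothetical sketch of the fallback described above; `entry` stands in
# for a parsed entry dict, not reader's real data structure.
def resolve_entry_id(entry: dict) -> str:
    entry_id = entry.get("id") or entry.get("link")
    if not entry_id:
        # the parser raises here, and the caller skips the entry
        raise ValueError("entry has neither an id nor a link")
    return entry_id
```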

However, I don't think anyone anticipated duplicate ids in the same feed – by definition, both RSS <guid> and Atom <id> are meant to be universally unique.

Interestingly, for some of those feeds, the duplicate entries point to the same article, while for some they point to different ones ಠ_ಠ

```python
from collections import Counter
from reader._parser import default_parser

URLS = """\
http://485i.com/feed/
http://kevinkauzlaric.com/feed/
http://aaronshbeeb.com/feed/
https://feeds.feedburner.com/pyuric
https://adriann.github.io/feed.rss
https://abidlabs.github.io/feed.xml
https://adventurist.me/feed.xml
https://ajmalafif.com/rss.xml
""".split()

parser = default_parser()

for url in URLS:
    links_by_id = {}
    try:
        result = parser(url)
    except Exception:
        continue
    for e in result.entries:
        links_by_id.setdefault(e.id, Counter()).update({e.link: 1})
    for id, links in links_by_id.items():
        if sum(links.values()) <= 1:
            continue
        print(id)
        for link, count in links.items():
            print(' ', count, link)
```

```
https://adriann.github.io/learning_japanese.html
  2 https://adriann.github.io/learning_japanese.html
https://adriann.github.io/rust_parser.html
  3 https://adriann.github.io/rust_parser.html
https://abidlabs.github.io/journal/2018/03/journal
  1 https://abidlabs.github.io/journal/2018/03/28/
  1 https://abidlabs.github.io/journal/2018/03/03/
  1 https://abidlabs.github.io/journal/2018/03/02/
https://adventurist.me/posts/00
  2 https://adventurist.me/posts/00
https://adventurist.me/posts/0262
  2 https://adventurist.me/posts/0262
https://ajmalafif.com/learn/build-fancy-landing-pages-with-react-three-fiber-and-threejs/
  4 https://ajmalafif.com/learn/build-fancy-landing-pages-with-react-three-fiber-and-threejs/
https://ajmalafif.com/learn/japanese/
  4 https://ajmalafif.com/learn/japanese/
https://ajmalafif.com/learn/the-joy-of-react/
  6 https://ajmalafif.com/learn/the-joy-of-react/
```

lemon24 commented 4 months ago

> Now, bug aside, the question is: how should we treat duplicate ids?

"fall back to <link> when the id is not unique" is not a (full) solution, since you need to be able to do so in a deterministic way. E.g. say you have a feed that limits entries to 10, and that entry 9 and entry 10 are duplicate ids; if a new entry gets added, old 10 will "fall of the end" of the feed, and you can't know 9 is duplicate anymore.

... although, entry_dedupe should(?) catch this and dedupe the entries accordingly (hopefully without needing any changes).
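
If one wanted to try that, the built-in plugin can be enabled when creating the reader; a minimal sketch (the database path and feed URL are just placeholders):

```python
from reader import make_reader

# Enable the built-in entry_dedupe plugin when creating the reader.
reader = make_reader("db.sqlite", plugins=["reader.entry_dedupe"])
reader.add_feed("https://abidlabs.github.io/feed.xml")
reader.update_feeds()
```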

lemon24 commented 4 months ago

@TheLovinator1, this should now be fixed (i.e. behave as it did pre-3.12); the fix will go out with 3.13. Thank you for reporting and helping debug this!

I have decided to not support duplicate entry ids for now, see the previous two comments and these notes for details.

Remaining work for this issue: Address this race condition once #332 is merged:

https://github.com/lemon24/reader/blob/788ddffe2d8d3e1fc989405ab573fd645977011b/src/reader/_storage/_entries.py#L220-L231