Closed TheLovinator1 closed 5 months ago
Thank you for the report!
Unfortunately, I can't seem to reproduce this on either macOS or Linux (tried both the provided script and the CLI).
Based on the backslashes in the traceback, I gather this happened on a Windows machine. I'll report back when I get access to one (don't have one at hand at the moment).
I think I was the problem. I tried reinstalling reader before opening the issue but still had the same problem; after I removed the virtual environment and created a new one, it works now. Sorry for taking up your time.
No worries!
$ python --version
Python 3.12.3
$ python -m reader --version
python -m reader 3.12
$ rm testboi.sqlite
rm: cannot remove 'testboi.sqlite': No such file or directory
$ python -m reader --db testboi.sqlite add https://abidlabs.github.io/feed.xml
$ python -m reader --db testboi.sqlite update -v
0 ok, 0 error, 0 not modified; entries: 0 new, 0 modified
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/home/lovinator/.cache/pypoetry/virtualenvs/readertest-UrSJG0TM-py3.12/lib/python3.12/site-packages/reader/__main__.py", line 15, in <module>
cli(prog_name='python -m reader')
File "/home/lovinator/.cache/pypoetry/virtualenvs/readertest-UrSJG0TM-py3.12/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lovinator/.cache/pypoetry/virtualenvs/readertest-UrSJG0TM-py3.12/lib/python3.12/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/home/lovinator/.cache/pypoetry/virtualenvs/readertest-UrSJG0TM-py3.12/lib/python3.12/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lovinator/.cache/pypoetry/virtualenvs/readertest-UrSJG0TM-py3.12/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lovinator/.cache/pypoetry/virtualenvs/readertest-UrSJG0TM-py3.12/lib/python3.12/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lovinator/.cache/pypoetry/virtualenvs/readertest-UrSJG0TM-py3.12/lib/python3.12/site-packages/reader/_cli.py", line 83, in wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/lovinator/.cache/pypoetry/virtualenvs/readertest-UrSJG0TM-py3.12/lib/python3.12/site-packages/reader/_cli.py", line 107, in wrapper
rv = fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/lovinator/.cache/pypoetry/virtualenvs/readertest-UrSJG0TM-py3.12/lib/python3.12/site-packages/reader/_cli.py", line 158, in wrapper
return fn(reader, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lovinator/.cache/pypoetry/virtualenvs/readertest-UrSJG0TM-py3.12/lib/python3.12/site-packages/reader/_cli.py", line 359, in update
for result in bar:
File "/home/lovinator/.cache/pypoetry/virtualenvs/readertest-UrSJG0TM-py3.12/lib/python3.12/site-packages/reader/_cli.py", line 266, in iter_update_status
for i, result in enumerate(it):
File "/home/lovinator/.cache/pypoetry/virtualenvs/readertest-UrSJG0TM-py3.12/lib/python3.12/site-packages/reader/core.py", line 975, in update_feeds_iter
yield from Pipeline.from_reader(self, map).update(filter)
File "/home/lovinator/.cache/pypoetry/virtualenvs/readertest-UrSJG0TM-py3.12/lib/python3.12/site-packages/reader/_update.py", line 388, in update
raise value
File "/home/lovinator/.cache/pypoetry/virtualenvs/readertest-UrSJG0TM-py3.12/lib/python3.12/site-packages/reader/_update.py", line 415, in process_parse_result
counts = self.update_feed(feed.url, *intents)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lovinator/.cache/pypoetry/virtualenvs/readertest-UrSJG0TM-py3.12/lib/python3.12/site-packages/reader/_update.py", line 447, in update_feed
self.storage.add_or_update_entries(entries)
File "/usr/lib/python3.12/contextlib.py", line 81, in inner
return func(*args, **kwds)
^^^^^^^^^^^^^^^^^^^
File "/home/lovinator/.cache/pypoetry/virtualenvs/readertest-UrSJG0TM-py3.12/lib/python3.12/site-packages/reader/_storage/_entries.py", line 205, in add_or_update_entries
self._add_or_update_entries(iterable)
File "/home/lovinator/.cache/pypoetry/virtualenvs/readertest-UrSJG0TM-py3.12/lib/python3.12/site-packages/reader/_storage/_entries.py", line 220, in _add_or_update_entries
self._update_entry(db, intent)
File "/home/lovinator/.cache/pypoetry/virtualenvs/readertest-UrSJG0TM-py3.12/lib/python3.12/site-packages/reader/_storage/_entries.py", line 274, in _update_entry
assert intent.first_updated is None, intent.first_updated
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: 2024-05-21 22:21:50.848428+00:00
More feeds:
I have no idea if I gave you the wrong feeds earlier or if I am crazy but I got the problem again lol
I also get the error on my server, which runs Arch, so it should not be a Windows problem.
Finally, https://abidlabs.github.io/feed.xml fails for me; I saved it locally and will look into the root cause.
I have no idea if I gave you the wrong feeds earlier or if I am crazy but I got the problem again lol
Not crazy :)) This looks like a weird one – at a quick reading of the code, this Should Not Happen™, but it obviously does.
git bisect says that 68baa2860618b2d4164bac3468ace4aa6fe36c29 is the first bad commit.
OK, the root cause is multiple entries with the same id in the same feed:
$ curl -sL https://abidlabs.github.io/feed.xml | grep -Eo '<id>[^<]+</id>' | sort | uniq -c | sort -rn | head -n3
3 <id>https://abidlabs.github.io/journal/2018/03/journal</id>
1 <id>https://abidlabs.github.io/uci-datasets</id>
1 <id>https://abidlabs.github.io/journal/2018/02/journal</id>
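The same check can be sketched in Python, for cases where the feed is already saved locally. This is only an illustration: `duplicate_entry_ids` and `SAMPLE` are made-up names, the sample feed is invented, and the sketch only handles Atom `<id>` elements inside `<entry>` (unlike the grep above, it ignores the feed-level `<id>`).

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Atom namespace, as used by <feed xmlns="http://www.w3.org/2005/Atom">.
ATOM = "{http://www.w3.org/2005/Atom}"


def duplicate_entry_ids(feed_xml: str) -> dict[str, int]:
    """Return {entry id: count} for entry ids that occur more than once."""
    root = ET.fromstring(feed_xml)
    ids = [
        (entry.findtext(f"{ATOM}id") or "").strip()
        for entry in root.iter(f"{ATOM}entry")
    ]
    counts = Counter(entry_id for entry_id in ids if entry_id)
    return {entry_id: n for entry_id, n in counts.items() if n > 1}


# Invented sample feed with one repeated entry id.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <id>https://example.com/</id>
  <entry><id>https://example.com/a</id><title>A</title></entry>
  <entry><id>https://example.com/b</id><title>B</title></entry>
  <entry><id>https://example.com/a</id><title>A again</title></entry>
</feed>
"""

print(duplicate_entry_ids(SAMPLE))
```

To run it against a saved copy of a real feed, pass the file contents as a string instead of `SAMPLE`.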
Prior to https://github.com/lemon24/reader/commit/68baa2860618b2d4164bac3468ace4aa6fe36c29, the insert or update was atomic.
After it, although from the point of view of the updater (the update intent) all entries are new, by the time the second entry is added to the database, the first one is already there. So intent.first_updated being None only says the entry was new when the updater checked; it does not guarantee the entry hasn't made it to the database since.
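A minimal sketch of this failure mode, using plain sqlite3. This is not reader's schema or code: the `entries` table, the `is_new` flag (standing in for "first_updated is None"), and both write functions are hypothetical. It only shows why a "new" flag computed up front goes stale when two intents share an id, while a single insert-or-update statement does not care.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE entries (feed TEXT, id TEXT, title TEXT, PRIMARY KEY (feed, id))"
)

# Hypothetical update intents: (feed, id, title, is_new). is_new was decided
# by the updater *before* any of these rows were written.
intents = [
    ("feed", "dup", "first copy", True),
    ("feed", "dup", "second copy", True),
]


def stale_flag_write(db, intents):
    """Trust is_new: INSERT entries flagged as new, UPDATE the rest."""
    for feed, entry_id, title, is_new in intents:
        if is_new:
            db.execute("INSERT INTO entries VALUES (?, ?, ?)", (feed, entry_id, title))
        else:
            db.execute(
                "UPDATE entries SET title = ? WHERE feed = ? AND id = ?",
                (title, feed, entry_id),
            )


def upsert_write(db, intents):
    """Ignore is_new: one insert-or-update statement per entry."""
    for feed, entry_id, title, _ in intents:
        db.execute(
            "INSERT OR REPLACE INTO entries VALUES (?, ?, ?)", (feed, entry_id, title)
        )


try:
    stale_flag_write(db, intents)
    failed = False
except sqlite3.IntegrityError:
    # The second "new" entry collides with the first, which is already stored.
    failed = True
db.rollback()  # discard the partial write

upsert_write(db, intents)
rows = db.execute("SELECT title FROM entries").fetchall()
print(failed, rows)
```

In the sketch the collision surfaces as an IntegrityError rather than an AssertionError, but the cause is the same: the flag describes the database as it was when the updater looked, not as it is when the row is written.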
Now, bug aside, the question is: how should we treat duplicate ids?
For RSS, if the <guid> is missing, we fall back to <link>:
If no fallback is found, we raise:
... and end up skipping the entry:
However, I don't think anyone anticipated duplicate ids in the same feed – by definition, both RSS <guid> and Atom <id> are meant to be universally unique.
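For illustration only, the fallback chain described above (use the guid, fall back to the link, raise if neither exists, skip the entry on error) could look roughly like the sketch below. This is not reader's parser code; `MissingIdError`, `entry_id`, and `usable_entries` are hypothetical names, and entries are plain dicts.

```python
class MissingIdError(Exception):
    """Hypothetical error: the entry has neither a guid nor a link."""


def entry_id(raw: dict) -> str:
    """Return the guid if present, else the link; raise if both are missing."""
    for key in ("guid", "link"):
        value = (raw.get(key) or "").strip()
        if value:
            return value
    raise MissingIdError("entry has no guid and no link")


def usable_entries(raw_entries):
    """Yield (id, entry) pairs, skipping entries that have no usable id."""
    for raw in raw_entries:
        try:
            yield entry_id(raw), raw
        except MissingIdError:
            continue  # nothing to identify the entry by, so skip it


raw_entries = [
    {"guid": "tag:example.com,2024:1", "link": "https://example.com/1"},
    {"link": "https://example.com/2"},  # no guid: falls back to the link
    {"title": "no guid, no link"},      # skipped
]
ids = [resolved for resolved, _ in usable_entries(raw_entries)]
print(ids)
```

Note that nothing in this chain checks whether two entries end up with the same id, which is the gap the duplicate-id feeds fall into.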
Now, bug aside, the question is: how should we treat duplicate ids?
"fall back to <link> when the id is not unique" is not a (full) solution, since you need to be able to do so in a deterministic way. E.g. say you have a feed that limits entries to 10, and that entries 9 and 10 have the same id; if a new entry gets added, old 10 will "fall off the end" of the feed, and you can no longer know that 9 is a duplicate.
... although, entry_dedupe should(?) catch this and dedupe the entries accordingly (hopefully without any change).
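The non-determinism can be sketched with a toy example. Everything here is hypothetical (the `resolve_ids` policy, the entry dicts, and a limit of 3 instead of 10); it only shows that "use the link when the id repeats" gives the same entry different ids depending on which entries are currently visible in the feed.

```python
from collections import Counter


def resolve_ids(window):
    """Hypothetical policy: keep the id unless it repeats within the
    currently visible entries, in which case use the link instead."""
    counts = Counter(entry["id"] for entry in window)
    return [
        entry["link"] if counts[entry["id"]] > 1 else entry["id"]
        for entry in window
    ]


LIMIT = 3  # the feed only ever shows the newest LIMIT entries

other = {"id": "other", "link": "https://example.com/other"}
newer = {"id": "dup", "link": "https://example.com/newer"}
older = {"id": "dup", "link": "https://example.com/older"}
fresh = {"id": "fresh", "link": "https://example.com/fresh"}

before = [other, newer, older]      # newest first; both "dup" entries visible
after = ([fresh] + before)[:LIMIT]  # a new entry pushes `older` off the end

print(resolve_ids(before))
print(resolve_ids(after))
```

Before the new entry arrives, `newer` resolves to its link; afterwards it resolves to the bare id "dup", so it would look like a different entry on the next update.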
@TheLovinator1, this should now be fixed (i.e. behave as it did pre-3.12); the fix will go out with 3.13. Thank you for reporting and helping debug this!
I have decided to not support duplicate entry ids for now, see the previous two comments and these notes for details.
Remaining work for this issue: Address this race condition once #332 is merged:
assert intent.first_updated is None, intent.first_updated
seems to get triggered for a lot of feeds. https://github.com/lemon24/reader/blob/25a7207edac09d1e9e78d2864a640e02738c9904/src/reader/_storage/_entries.py#L275
Example feeds
Stack trace
The feeds seem to be valid (with some recommendations): https://validator.w3.org/feed/check.cgi?url=http%3A%2F%2F485i.com%2Ffeed%2F https://validator.w3.org/feed/check.cgi?url=http%3A%2F%2Fkevinkauzlaric.com%2Ffeed%2F
Example code