Closed maksverver closed 3 weeks ago
I managed to hunt down the root cause of the parser failures. It's caused by this part of the vendored feedparser code (which is different from the latest official feedparser release): https://github.com/lemon24/reader/blob/63e8ec34bb6fd7c68a0e2ffa4178634947472d5c/src/reader/_vendor/feedparser/api.py#L320-L332
Apparently, source.setCharacterStream(stream_factory.get_text_file())
causes the call to saxparser.parse(source)
to fail, with an error message xmlPythonFileRead: result is not a String
printed to stdout (which comes from libxml2 here). (You need to run pytest -s
to see this error message, since pytest swallows output by default.)
There is a secondary problem in that the error is not detected because the code here tries to catch an xml.sax.SAXException
exception, but parse() will not raise exceptions when an error handler is installed, which is done here) so in reality that except
block is never executed and the error (that is stored in feedparser.exc
) will be ignored.
I don't know why the setCharacterStream()
stuff doesn't work, but I can fix the parser tests simply by removing it:
--- a/src/reader/_vendor/feedparser/api.py
+++ b/src/reader/_vendor/feedparser/api.py
@@ -316,13 +316,7 @@ def _parse_file_inplace(
saxparser.setContentHandler(feed_parser)
saxparser.setErrorHandler(feed_parser) # type: ignore[arg-type]
source = xml.sax.xmlreader.InputSource()
-
- # If an encoding was detected, decode the file on the fly;
- # otherwise, pass it as-is and let the SAX parser deal with it.
- try:
- source.setCharacterStream(stream_factory.get_text_file())
- except MissingEncoding:
- source.setByteStream(stream_factory.get_binary_file())
+ source.setByteStream(stream_factory.get_binary_file())
try:
saxparser.parse(source)
So I think there are two action items here:
Great debugging work!
Funnily enough, the api.py code you link above was last touched by myself in https://github.com/kurtmckee/feedparser/pull/302; that PR was merged into feedparser/develop some time ago, so if it is that code that has the problem, we'll have to fix it upstream.
I should note that I am (obviously) not seeing the xmlPythonFileRead: result is not a String
error...
Re. error handling in feedparser.parse():
There is a secondary problem ... parse() will not raise exceptions when an error handler is installed ... so in reality that except block is never executed and the error ... will be ignored.
Specifically, I think the following sequence happens:
unknown feed type
I will assume the same kind of shadowing happens before my https://github.com/kurtmckee/feedparser/pull/302 PR, but it is never apparent because only the setByteStream branch exists (setCharacterStream is never used):
Re. why setCharacterStream() doesn't work as intended:
I will try to trace that issue more carefully later (ran out of time for today), but one possible root cause is that your Python ends up with different ifdef
s than "mainstream" ones.
Meanwhile, can you please help me confirm the text file is indeed returning strings by adding the following to reader._vendor.feedparser.api line 321, and running one of the failing tests with -s
?
file = stream_factory.get_text_file()
print('one:', repr(file.read()[:40]))
print('two:', repr(file.read())
raise Exception('failing intentionally')
(If both of those are strings, it may be reasonable to assume your libxml2 only supports bytes / binary files.)
I think you're right in your analysis except for this:
the LooseFeedParser is then tried
This doesn't happen, because if no exception is raised, then use_strict_parser = False
is never executed, and then the loose parser doesn't get a chance to run (since it's gated by if not use_strict_parser
), which is why you end up with an empty feed and no error even though parsing failed.
That snippet prints:
one: "<?xml version='1.0' encoding='utf-8'?>\n<"
two: ''
which works as expected because the first call consumes all the input. But something weird happens if I change the code to:
print('one:', repr(file.read(40)))
print('two:', repr(file.read()))
Then it prints:
one: "<?xml version='1.0' encoding='utf-8'?>\n<"
two: '<?xml version=\'1.0\' encoding=\'utf-8\'?>\n<rss version="2.0">\n<channel>\n <title>RSS Title</title>\n ..
(I truncated the second line.) Note that the first call returns 40 characters as expected, but the second read call returns the entire file including the first 40 bytes. That's not how read() is supposed to behave; the final read() call should return just the remaining text.
I created a pull request to fix this upstream: https://github.com/kurtmckee/feedparser/pull/469, but I don't think it's the root cause of the problem here.
So I hunted the problem down to libxml2. It looks like it doesn't support text input at all. Here's a test case that you can use to reproduce the "result is not a String" error: https://gist.github.com/maksverver/7ec9221f163070cbd98f4f38a3932036
@lemon24 can you check which XML parser is being used on your system? Is it libxml2 or something else? If libxml2, which version? (I'm using version 2.13.3.)
I tracked down the issue to libxml2 and filed an upstream issue, which was promptly fixed: https://gitlab.gnome.org/GNOME/libxml2/-/issues/790
The summary is that the current version of feedparser (which calls source.setCharacterStream()
) does not work when using libxml2 as the parser implementation. This is the preferred implementation by feedparser, but I'm guessing it's not installed on your Ubuntu installation, which is why these tests pass. (When the preferred implementation is not available, Python falls back to another implementation, probably Expat in this case.)
So I think the issue is resolved when a new libxml2 release contains the fix.
That does leave the other issue I mentioned: fix the error handling around the subsequent parse() call, since the current catch
clause doesn't actually catch any errors.
@lemon24 can you check which XML parser is being used on your system? Is it libxml2 or something else? If libxml2, which version? (I'm using version 2.13.3.)
not installed on your Ubuntu installation ... Python falls back to another implementation, probably Expat in this case
Indeed, xml.sax.expatreader.ExpatParser, on both my old-ish Mac and on an Ubuntu 22.04.
@maksverver thank you for following up in many different places!
I will try to review the remaining feedparser PR by the end of the weekend to help out. Once merged into develop, I can update the vendored version too.
So I think the issue is resolved when a new libxml2 release contains the fix.
@maksverver, are you OK until then?
I was going to suggest monkey-patching feedparser.api.PREFERRED_XML_PARSERS, but that's made a bit more complicated by me vendoring it. I am reluctant to monkeypatch feedparser itself in reader, since other things may be using it, but there should be no issue with monkeypatching the vendored version. (Users can make this choice, since in theory they have more visibility into the deployment environment.)
Possible mitigations on reader side (easiest first):
reader._parser.feedparser.feedparser
, it just needs to be documented.I'm good for now. I plan to wait for the libxml2 patch to be released before I update the AUR package to use the vendored version of feedparser. (If there are any bugs in the 6.0.11 release I haven't run into them yet.)
It would be cool if at some point the feedparser fixes are released so you don't need to bundle the development version with reader. I dislike having different versions of the same library on my system; it often leads to difficult-to-debug problems.
I do find it strange that feedparser seems to prefers libxml2 via PREFERRED_XML_PARSERS, yet all the tests run against Expat only. This seems like a recipe for trouble, if there are bugs on systems with libxml2 installed, they won't be detected by the tests. My preference would be to run the tests against both expat and libxml2 to make sure both work as intended, but I don't know how easy that would be to setup.
libxml2 2.13.4 was released two weeks ago, so this is no longer an issue :)
It would be cool if at some point the feedparser fixes are released
Don't disagree, but it is what it is :)
Reopening this so I can add an environment variable to allow using "system" (non-vendored) feedparser; having no escape hatch for dealing with issues like this one sits a bit wrong with me.
(In theory I could do away with the vendored one entirely and recommend users do pip install 'feedparser @ https://github.com/kurtmckee/feedparser/archive/refs/heads/develop.zip'
, but I think there's value in getting it by default when you pip install reader
.)
Too many failures to list them all, but here is a representative sample:
It looks like no XML feeds (neither RSS nor Atom) are recognized. Deleting the src/reader/_vendor/ subdirectory (and applying this patch to update the imports) fixes the issue.
I'm running on Arch Linux, using system packages. Installed Python package versions: