test_parser.py fails most tests when running with vendored feedparser

maksverver commented 2 months ago

Too many failures to list them all, but here is a representative sample:

$ PYTHONPATH=src/ pytest tests/test_parser.py -k test_feed_root_empty
================================================================================== test session starts ==================================================================================
platform linux -- Python 3.12.5, pytest-8.3.2, pluggy-1.5.0
rootdir: /home/maks/tmp/reader
configfile: pyproject.toml
plugins: typeguard-4.3.0, subtests-0.12.1, requests-mock-1.11.0
collected 278 items / 270 deselected / 8 selected                                                                                                                                       

tests/test_parser.py FFFFFFss                                                                                                                                                     [100%]

======================================================================================= FAILURES ========================================================================================
_____________________________________________________________________________ test_feed_root_empty[False-] ______________________________________________________________________________

data_dir = PosixPath('/home/maks/tmp/reader/tests/data'), scheme = '', relative = False

    @pytest.mark.parametrize('scheme', ['', 'file:', 'file:///', 'file://localhost/'])
    @pytest.mark.parametrize('relative', [False, True])
    def test_feed_root_empty(data_dir, scheme, relative):
        # TODO: this test looks a lot like test_feed_root_nonempty

        if relative and scheme.startswith('file://'):
            pytest.skip("can't have relative URIs with 'file://...'")

        parse = default_parser('')

        # we know this returns the right thing based on all of the tests above
        good_path = data_dir.joinpath('full.rss')
        good_url = str(good_path)
>       good_result = parse(good_url)

tests/test_parser.py:726: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
src/reader/_parser/__init__.py:120: in __call__
    return self._parser(url, http_etag, http_last_modified)
src/reader/_parser/_lazy.py:186: in __call__
    raise result
src/reader/_parser/_lazy.py:149: in parallel
    yield feed, self.parse(feed.url, result)
src/reader/_parser/_lazy.py:285: in parse
    feed, entries = parser(url, result.resource, result.headers)
src/reader/_parser/feedparser.py:60: in __call__
    return _process_feed(url, result)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

url = '/home/maks/tmp/reader/tests/data/full.rss'
d = {'bozo': False, 'entries': [], 'feed': {}, 'headers': {}, 'content-type': '', 'encoding': 'utf-8', 'version': '', 'namespaces': {}}

    def _process_feed(url: str, d: Any) -> tuple[FeedData, list[EntryData]]:
        if d.get('bozo'):
            exception = d.get('bozo_exception')
            if isinstance(exception, _SURVIVABLE_EXCEPTION_TYPES):
                log.warning("parse %s: got %r", url, exception)
            else:
                raise ParseError(url, message="error while parsing feed") from exception

        if not d.version:
>           raise ParseError(url, message="unknown feed type")
E           reader.exceptions.ParseError: unknown feed type: '/home/maks/tmp/reader/tests/data/full.rss'

src/reader/_parser/feedparser.py:79: ParseError
___________________________________________________________________________ test_feed_root_empty[False-file:] ___________________________________________________________________________

data_dir = PosixPath('/home/maks/tmp/reader/tests/data'), scheme = 'file:', relative = False

    @pytest.mark.parametrize('scheme', ['', 'file:', 'file:///', 'file://localhost/'])
    @pytest.mark.parametrize('relative', [False, True])
    def test_feed_root_empty(data_dir, scheme, relative):
        # TODO: this test looks a lot like test_feed_root_nonempty

        if relative and scheme.startswith('file://'):
            pytest.skip("can't have relative URIs with 'file://...'")

        parse = default_parser('')

        # we know this returns the right thing based on all of the tests above
        good_path = data_dir.joinpath('full.rss')
        good_url = str(good_path)
>       good_result = parse(good_url)

tests/test_parser.py:726: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
src/reader/_parser/__init__.py:120: in __call__
    return self._parser(url, http_etag, http_last_modified)
src/reader/_parser/_lazy.py:186: in __call__
    raise result
src/reader/_parser/_lazy.py:149: in parallel
    yield feed, self.parse(feed.url, result)
src/reader/_parser/_lazy.py:285: in parse
    feed, entries = parser(url, result.resource, result.headers)
src/reader/_parser/feedparser.py:60: in __call__
    return _process_feed(url, result)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

url = '/home/maks/tmp/reader/tests/data/full.rss'
d = {'bozo': False, 'entries': [], 'feed': {}, 'headers': {}, 'content-type': '', 'encoding': 'utf-8', 'version': '', 'namespaces': {}}

    def _process_feed(url: str, d: Any) -> tuple[FeedData, list[EntryData]]:
        if d.get('bozo'):
            exception = d.get('bozo_exception')
            if isinstance(exception, _SURVIVABLE_EXCEPTION_TYPES):
                log.warning("parse %s: got %r", url, exception)
            else:
                raise ParseError(url, message="error while parsing feed") from exception

        if not d.version:
>           raise ParseError(url, message="unknown feed type")
E           reader.exceptions.ParseError: unknown feed type: '/home/maks/tmp/reader/tests/data/full.rss'

src/reader/_parser/feedparser.py:79: ParseError
_________________________________________________________________________ test_feed_root_empty[False-file:///] __________________________________________________________________________

data_dir = PosixPath('/home/maks/tmp/reader/tests/data'), scheme = 'file:///', relative = False

    @pytest.mark.parametrize('scheme', ['', 'file:', 'file:///', 'file://localhost/'])
    @pytest.mark.parametrize('relative', [False, True])
    def test_feed_root_empty(data_dir, scheme, relative):
        # TODO: this test looks a lot like test_feed_root_nonempty

        if relative and scheme.startswith('file://'):
            pytest.skip("can't have relative URIs with 'file://...'")

        parse = default_parser('')

        # we know this returns the right thing based on all of the tests above
        good_path = data_dir.joinpath('full.rss')
        good_url = str(good_path)
>       good_result = parse(good_url)

tests/test_parser.py:726: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
src/reader/_parser/__init__.py:120: in __call__
    return self._parser(url, http_etag, http_last_modified)
src/reader/_parser/_lazy.py:186: in __call__
    raise result
src/reader/_parser/_lazy.py:149: in parallel
    yield feed, self.parse(feed.url, result)
src/reader/_parser/_lazy.py:285: in parse
    feed, entries = parser(url, result.resource, result.headers)
src/reader/_parser/feedparser.py:60: in __call__
    return _process_feed(url, result)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

url = '/home/maks/tmp/reader/tests/data/full.rss'
d = {'bozo': False, 'entries': [], 'feed': {}, 'headers': {}, 'content-type': '', 'encoding': 'utf-8', 'version': '', 'namespaces': {}}

    def _process_feed(url: str, d: Any) -> tuple[FeedData, list[EntryData]]:
        if d.get('bozo'):
            exception = d.get('bozo_exception')
            if isinstance(exception, _SURVIVABLE_EXCEPTION_TYPES):
                log.warning("parse %s: got %r", url, exception)
            else:
                raise ParseError(url, message="error while parsing feed") from exception

        if not d.version:
>           raise ParseError(url, message="unknown feed type")
E           reader.exceptions.ParseError: unknown feed type: '/home/maks/tmp/reader/tests/data/full.rss'

src/reader/_parser/feedparser.py:79: ParseError
_____________________________________________________________________ test_feed_root_empty[False-file://localhost/] _____________________________________________________________________

data_dir = PosixPath('/home/maks/tmp/reader/tests/data'), scheme = 'file://localhost/', relative = False

    @pytest.mark.parametrize('scheme', ['', 'file:', 'file:///', 'file://localhost/'])
    @pytest.mark.parametrize('relative', [False, True])
    def test_feed_root_empty(data_dir, scheme, relative):
        # TODO: this test looks a lot like test_feed_root_nonempty

        if relative and scheme.startswith('file://'):
            pytest.skip("can't have relative URIs with 'file://...'")

        parse = default_parser('')

        # we know this returns the right thing based on all of the tests above
        good_path = data_dir.joinpath('full.rss')
        good_url = str(good_path)
>       good_result = parse(good_url)

tests/test_parser.py:726: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
src/reader/_parser/__init__.py:120: in __call__
    return self._parser(url, http_etag, http_last_modified)
src/reader/_parser/_lazy.py:186: in __call__
    raise result
src/reader/_parser/_lazy.py:149: in parallel
    yield feed, self.parse(feed.url, result)
src/reader/_parser/_lazy.py:285: in parse
    feed, entries = parser(url, result.resource, result.headers)
src/reader/_parser/feedparser.py:60: in __call__
    return _process_feed(url, result)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

url = '/home/maks/tmp/reader/tests/data/full.rss'
d = {'bozo': False, 'entries': [], 'feed': {}, 'headers': {}, 'content-type': '', 'encoding': 'utf-8', 'version': '', 'namespaces': {}}

    def _process_feed(url: str, d: Any) -> tuple[FeedData, list[EntryData]]:
        if d.get('bozo'):
            exception = d.get('bozo_exception')
            if isinstance(exception, _SURVIVABLE_EXCEPTION_TYPES):
                log.warning("parse %s: got %r", url, exception)
            else:
                raise ParseError(url, message="error while parsing feed") from exception

        if not d.version:
>           raise ParseError(url, message="unknown feed type")
E           reader.exceptions.ParseError: unknown feed type: '/home/maks/tmp/reader/tests/data/full.rss'

src/reader/_parser/feedparser.py:79: ParseError
______________________________________________________________________________ test_feed_root_empty[True-] ______________________________________________________________________________

data_dir = PosixPath('/home/maks/tmp/reader/tests/data'), scheme = '', relative = True

    @pytest.mark.parametrize('scheme', ['', 'file:', 'file:///', 'file://localhost/'])
    @pytest.mark.parametrize('relative', [False, True])
    def test_feed_root_empty(data_dir, scheme, relative):
        # TODO: this test looks a lot like test_feed_root_nonempty

        if relative and scheme.startswith('file://'):
            pytest.skip("can't have relative URIs with 'file://...'")

        parse = default_parser('')

        # we know this returns the right thing based on all of the tests above
        good_path = data_dir.joinpath('full.rss')
        good_url = str(good_path)
>       good_result = parse(good_url)

tests/test_parser.py:726: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
src/reader/_parser/__init__.py:120: in __call__
    return self._parser(url, http_etag, http_last_modified)
src/reader/_parser/_lazy.py:186: in __call__
    raise result
src/reader/_parser/_lazy.py:149: in parallel
    yield feed, self.parse(feed.url, result)
src/reader/_parser/_lazy.py:285: in parse
    feed, entries = parser(url, result.resource, result.headers)
src/reader/_parser/feedparser.py:60: in __call__
    return _process_feed(url, result)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

url = '/home/maks/tmp/reader/tests/data/full.rss'
d = {'bozo': False, 'entries': [], 'feed': {}, 'headers': {}, 'content-type': '', 'encoding': 'utf-8', 'version': '', 'namespaces': {}}

    def _process_feed(url: str, d: Any) -> tuple[FeedData, list[EntryData]]:
        if d.get('bozo'):
            exception = d.get('bozo_exception')
            if isinstance(exception, _SURVIVABLE_EXCEPTION_TYPES):
                log.warning("parse %s: got %r", url, exception)
            else:
                raise ParseError(url, message="error while parsing feed") from exception

        if not d.version:
>           raise ParseError(url, message="unknown feed type")
E           reader.exceptions.ParseError: unknown feed type: '/home/maks/tmp/reader/tests/data/full.rss'

src/reader/_parser/feedparser.py:79: ParseError
___________________________________________________________________________ test_feed_root_empty[True-file:] ____________________________________________________________________________

data_dir = PosixPath('/home/maks/tmp/reader/tests/data'), scheme = 'file:', relative = True

    @pytest.mark.parametrize('scheme', ['', 'file:', 'file:///', 'file://localhost/'])
    @pytest.mark.parametrize('relative', [False, True])
    def test_feed_root_empty(data_dir, scheme, relative):
        # TODO: this test looks a lot like test_feed_root_nonempty

        if relative and scheme.startswith('file://'):
            pytest.skip("can't have relative URIs with 'file://...'")

        parse = default_parser('')

        # we know this returns the right thing based on all of the tests above
        good_path = data_dir.joinpath('full.rss')
        good_url = str(good_path)
>       good_result = parse(good_url)

tests/test_parser.py:726: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
src/reader/_parser/__init__.py:120: in __call__
    return self._parser(url, http_etag, http_last_modified)
src/reader/_parser/_lazy.py:186: in __call__
    raise result
src/reader/_parser/_lazy.py:149: in parallel
    yield feed, self.parse(feed.url, result)
src/reader/_parser/_lazy.py:285: in parse
    feed, entries = parser(url, result.resource, result.headers)
src/reader/_parser/feedparser.py:60: in __call__
    return _process_feed(url, result)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

url = '/home/maks/tmp/reader/tests/data/full.rss'
d = {'bozo': False, 'entries': [], 'feed': {}, 'headers': {}, 'content-type': '', 'encoding': 'utf-8', 'version': '', 'namespaces': {}}

    def _process_feed(url: str, d: Any) -> tuple[FeedData, list[EntryData]]:
        if d.get('bozo'):
            exception = d.get('bozo_exception')
            if isinstance(exception, _SURVIVABLE_EXCEPTION_TYPES):
                log.warning("parse %s: got %r", url, exception)
            else:
                raise ParseError(url, message="error while parsing feed") from exception

        if not d.version:
>           raise ParseError(url, message="unknown feed type")
E           reader.exceptions.ParseError: unknown feed type: '/home/maks/tmp/reader/tests/data/full.rss'

src/reader/_parser/feedparser.py:79: ParseError
================================================================================ short test summary info ================================================================================
FAILED tests/test_parser.py::test_feed_root_empty[False-] - reader.exceptions.ParseError: unknown feed type: '/home/maks/tmp/reader/tests/data/full.rss'
FAILED tests/test_parser.py::test_feed_root_empty[False-file:] - reader.exceptions.ParseError: unknown feed type: '/home/maks/tmp/reader/tests/data/full.rss'
FAILED tests/test_parser.py::test_feed_root_empty[False-file:///] - reader.exceptions.ParseError: unknown feed type: '/home/maks/tmp/reader/tests/data/full.rss'
FAILED tests/test_parser.py::test_feed_root_empty[False-file://localhost/] - reader.exceptions.ParseError: unknown feed type: '/home/maks/tmp/reader/tests/data/full.rss'
FAILED tests/test_parser.py::test_feed_root_empty[True-] - reader.exceptions.ParseError: unknown feed type: '/home/maks/tmp/reader/tests/data/full.rss'
FAILED tests/test_parser.py::test_feed_root_empty[True-file:] - reader.exceptions.ParseError: unknown feed type: '/home/maks/tmp/reader/tests/data/full.rss'
===================================================================== 6 failed, 2 skipped, 270 deselected in 0.31s ======================================================================
xmlPythonFileRead: result is not a String
xmlPythonFileRead: result is not a String
xmlPythonFileRead: result is not a String
xmlPythonFileRead: result is not a String
xmlPythonFileRead: result is not a String
xmlPythonFileRead: result is not a String

It looks like no XML feeds (neither RSS nor Atom) are recognized. Deleting the src/reader/_vendor/ subdirectory (and applying this patch to update the imports) fixes the issue.

I'm running on Arch Linux, using system packages. Installed Python package versions:

$ python -c '\
import importlib.metadata as im
for d in im.distributions():
    print(d.name, d.version)
' | sort
absl-py 2.1.0
acme 2.11.0
annotated-types 0.7.0
attrs 23.2.1.dev0
autocommand 2.2.2
beautifulsoup4 4.12.3
blinker 1.7.0
boolean.py 4.0
btrfsutil 6.10
build 1.2.1
certbot 2.11.0
certbot-nginx 2.11.0
certifi 2024.7.4
cffi 1.16.0
chardet 5.2.0
charset-normalizer 3.3.2
click 8.1.7
ConfigArgParse 1.5.5
configobj 5.0.8
cryptography 42.0.7
Cython 3.0.11
distro 1.9.0
dnspython 2.6.1
fastjsonschema 2.20.0
feedparser 6.0.11
Flask 2.3.3
future 1.0.0
humanize 4.9.0
idna 3.7
inflect 7.3.1
iniconfig 2.0.0
installer 0.7.0
iotop 0.6
iso8601 2.1.0
itsdangerous 2.1.2
jaraco.context 5.3.0
jaraco.functools 4.0.2
jaraco.text 4.0.0
Jinja2 3.1.4
josepy 1.14.0
license-expression 30.3.1.dev0+gc20b3f6.d20240601
lxml 5.3.0
Markdown 3.7
MarkupSafe 2.1.5
MechanicalSoup 1.3.0
more-itertools 10.3.0
mutagen 1.47.0
namcap 3.5.2
numpy 2.0.1
ordered-set 4.1.0
packaging 24.1
parsedatetime 2.6
perf 0.1
pillow 10.4.0
platformdirs 4.2.2
pluggy 1.5.0
ply 3.11
psutil 6.0.0
pyalpm 0.10.6
pybind11 2.13.4
pycosat 0.6.6
pycparser 2.22
pycryptodomex 3.20.0
pydantic 2.8.2
pydantic_core 2.20.1
pyelftools 0.31
pygdbmi 0.11.0.0
pyOpenSSL 24.2.1
pyparsing 3.1.2
pyproject_hooks 1.1.0
pyRFC3339 1.1
pytest 8.3.2
pytest-subtests 0.12.1
pytz 2024.1
PyYAML 6.0.2
reader 3.14
requests 2.32.3
requests-mock 1.11.0
requests-toolbelt 1.0.0
requests-wsgi-adapter 0.4.1
sentry-sdk 2.13.0
setuptools 69.5.1
setuptools-scm 8.1.0
sgmllib3k 1.0.0
simplejson 3.19.2
six 1.16.0
smbus 1.1
soupsieve 2.6
speedtest-cli 2.1.3
tomli 2.0.1
trove-classifiers 2024.7.22
typeguard 4.3.0
typing_extensions 4.12.2
urllib3 1.26.19
validate 5.0.8
validate-pyproject 0.19
Werkzeug 3.0.1
wheel 0.44.0

maksverver commented 2 months ago

I managed to hunt down the root cause of the parser failures. It's caused by this part of the vendored feedparser code (which is different from the latest official feedparser release): https://github.com/lemon24/reader/blob/63e8ec34bb6fd7c68a0e2ffa4178634947472d5c/src/reader/_vendor/feedparser/api.py#L320-L332

Apparently, source.setCharacterStream(stream_factory.get_text_file()) causes the call to saxparser.parse(source) to fail, with an error message xmlPythonFileRead: result is not a String printed to stdout (which comes from libxml2 here). (You need to run pytest -s to see this error message, since pytest swallows output by default.)

There is a secondary problem: the error is not detected. The code here tries to catch an xml.sax.SAXException, but parse() will not raise exceptions when an error handler is installed (which is done here), so in reality that except block is never executed and the error (which is stored in feedparser.exc) is ignored.
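The silent failure can be reproduced with the standard library alone. This is a minimal sketch, not feedparser's code: `RecordingHandler` and `parse_and_check` are hypothetical names, and it uses Expat rather than libxml2, on the assumption that both drivers report errors through the same handler interface.

```python
import io
import xml.sax
import xml.sax.handler
import xml.sax.xmlreader


class RecordingHandler(xml.sax.handler.ContentHandler, xml.sax.handler.ErrorHandler):
    """Hypothetical handler that stores errors instead of raising them."""

    def __init__(self):
        super().__init__()
        self.exc = None

    def warning(self, exception):
        pass

    def error(self, exception):
        self.exc = exception

    def fatalError(self, exception):
        # The default ErrorHandler.fatalError() raises; this one does not.
        self.exc = exception


def parse_and_check(data: bytes):
    """Return (raised, stored_error) for one parse() call."""
    handler = RecordingHandler()
    parser = xml.sax.make_parser(["xml.sax.expatreader"])
    parser.setContentHandler(handler)
    parser.setErrorHandler(handler)
    source = xml.sax.xmlreader.InputSource()
    source.setByteStream(io.BytesIO(data))
    raised = False
    try:
        parser.parse(source)
    except xml.sax.SAXException:
        raised = True
    return raised, handler.exc


# Malformed input: <channel> is never closed.
raised, stored = parse_and_check(b"<rss><channel></rss>")
print("raised:", raised, "stored error type:", type(stored).__name__)
```

With a handler like this, the caller has to inspect the stored error after parse() returns; an except block alone never sees it.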

I don't know why the setCharacterStream() stuff doesn't work, but I can fix the parser tests simply by removing it:

--- a/src/reader/_vendor/feedparser/api.py
+++ b/src/reader/_vendor/feedparser/api.py
@@ -316,13 +316,7 @@ def _parse_file_inplace(
         saxparser.setContentHandler(feed_parser)
         saxparser.setErrorHandler(feed_parser)  # type: ignore[arg-type]
         source = xml.sax.xmlreader.InputSource()
-
-        # If an encoding was detected, decode the file on the fly;
-        # otherwise, pass it as-is and let the SAX parser deal with it.
-        try:
-            source.setCharacterStream(stream_factory.get_text_file())
-        except MissingEncoding:
-            source.setByteStream(stream_factory.get_binary_file())
+        source.setByteStream(stream_factory.get_binary_file())

         try:
             saxparser.parse(source)

So I think there are two action items here:

Figure out why setCharacterStream() doesn't work as intended.
Fix the error handling around the subsequent parse() call.

lemon24 commented 2 months ago

Great debugging work!

Funnily enough, the api.py code you link above was last touched by myself in https://github.com/kurtmckee/feedparser/pull/302; that PR was merged into feedparser/develop some time ago, so if it is that code that has the problem, we'll have to fix it upstream.

I should note that I am (obviously) not seeing the xmlPythonFileRead: result is not a String error...

Re. error handling in feedparser.parse():

There is a secondary problem ... parse() will not raise exceptions when an error handler is installed ... so in reality that except block is never executed and the error ... will be ignored.

Specifically, I think the following sequence happens:

StrictXMLParser.error() stores the error on self (the parser object), but does not reraise
because no exception is raised, the feed_parser.exc is never copied from the parser to the result bozo_exception
the LooseFeedParser is then tried, which cannot parse the thing, so no version is set (but no bozo/bozo_exception is set either)
reader._parser.feedparser does not see any bozo_exception, and then does not see any version, so it errors with unknown feed type

I will assume the same kind of shadowing happens before my https://github.com/kurtmckee/feedparser/pull/302 PR, but it is never apparent because only the setByteStream branch exists (setCharacterStream is never used):

https://github.com/kurtmckee/feedparser/blob/e21be2051a31aa272276dfc9af08ecab2094e966/feedparser/api.py#L280-L286

Re. why setCharacterStream() doesn't work as intended:

I will try to trace that issue more carefully later (ran out of time for today), but one possible root cause is that your Python ends up with different ifdefs than "mainstream" ones.

Meanwhile, can you please help me confirm the text file is indeed returning strings by adding the following to reader._vendor.feedparser.api line 321, and running one of the failing tests with -s?

file = stream_factory.get_text_file()
print('one:', repr(file.read()[:40]))
print('two:', repr(file.read()))
raise Exception('failing intentionally')

(If both of those are strings, it may be reasonable to assume your libxml2 only supports bytes / binary files.)

maksverver commented 2 months ago

I think you're right in your analysis except for this:

the LooseFeedParser is then tried

This doesn't happen, because if no exception is raised, then use_strict_parser = False is never executed, and then the loose parser doesn't get a chance to run (since it's gated by if not use_strict_parser), which is why you end up with an empty feed and no error even though parsing failed.

That snippet prints:

one: "<?xml version='1.0' encoding='utf-8'?>\n<"
two: ''

which works as expected because the first call consumes all the input. But something weird happens if I change the code to:

    print('one:', repr(file.read(40)))
    print('two:', repr(file.read()))

Then it prints:

one: "<?xml version='1.0' encoding='utf-8'?>\n<"
two: '<?xml version=\'1.0\' encoding=\'utf-8\'?>\n<rss version="2.0">\n<channel>\n    <title>RSS Title</title>\n ..

(I truncated the second line.) Note that the first call returns 40 characters as expected, but the second read() call returns the entire file, including the first 40 characters. That's not how read() is supposed to behave; the final read() call should return just the remaining text.
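For comparison, this is the behaviour a standard text stream shows. The sketch uses io.StringIO with a shortened, made-up document; it is not the stream class from feedparser.

```python
import io

# Shortened stand-in for the feed; only the first 40 characters matter here.
text = "<?xml version='1.0' encoding='utf-8'?>\n<rss version=\"2.0\">...</rss>"

stream = io.StringIO(text)
head = stream.read(40)  # consumes the first 40 characters
rest = stream.read()    # returns only what is left

# The two reads together reproduce the input exactly once.
assert head == text[:40]
assert rest == text[40:]
assert head + rest == text
```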

I created a pull request to fix this upstream: https://github.com/kurtmckee/feedparser/pull/469, but I don't think it's the root cause of the problem here.

maksverver commented 2 months ago

So I hunted the problem down to libxml2. It looks like it doesn't support text input at all. Here's a test case that you can use to reproduce the "result is not a String" error: https://gist.github.com/maksverver/7ec9221f163070cbd98f4f38a3932036

@lemon24 can you check which XML parser is being used on your system? Is it libxml2 or something else? If libxml2, which version? (I'm using version 2.13.3.)
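One way to answer that question is to ask xml.sax which driver it actually picks. This is a sketch under an assumption: the preference list is taken to name the libxml2 SAX driver module, drv_libxml2, which is not shown in this thread. `describe_sax_parser` is a hypothetical helper.

```python
import xml.sax

# Assumption: the preferred-parser list names the libxml2 SAX driver.
PREFERRED = ["drv_libxml2"]


def describe_sax_parser(preferred=PREFERRED):
    """Return the dotted name of the parser class xml.sax selects."""
    # make_parser() tries each named module in order and falls back to
    # the standard library default when none can be imported.
    parser = xml.sax.make_parser(preferred)
    cls = type(parser)
    return f"{cls.__module__}.{cls.__qualname__}"


print(describe_sax_parser())
```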

maksverver commented 2 months ago

I tracked down the issue to libxml2 and filed an upstream issue, which was promptly fixed: https://gitlab.gnome.org/GNOME/libxml2/-/issues/790

The summary is that the current version of feedparser (which calls source.setCharacterStream()) does not work when using libxml2 as the parser implementation. This is the preferred implementation by feedparser, but I'm guessing it's not installed on your Ubuntu installation, which is why these tests pass. (When the preferred implementation is not available, Python falls back to another implementation, probably Expat in this case.)
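The two input paths can be compared side by side. The sketch below pins the parser to Expat, where both paths work; the document and the `ElementNames` and `parse_names` helpers are made up for illustration. Per the summary above, the character-stream path is the one that fails under the affected libxml2 release.

```python
import io
import xml.sax
import xml.sax.handler
import xml.sax.xmlreader

DOC = '<rss version="2.0"><channel><title>RSS Title</title></channel></rss>'


class ElementNames(xml.sax.handler.ContentHandler):
    """Collects element names in document order."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def parse_names(source):
    handler = ElementNames()
    parser = xml.sax.make_parser(["xml.sax.expatreader"])
    parser.setContentHandler(handler)
    parser.parse(source)
    return handler.names


# Path 1: already-decoded text, as with setCharacterStream().
text_source = xml.sax.xmlreader.InputSource()
text_source.setCharacterStream(io.StringIO(DOC))

# Path 2: raw bytes, leaving decoding to the parser.
byte_source = xml.sax.xmlreader.InputSource()
byte_source.setByteStream(io.BytesIO(DOC.encode("utf-8")))

text_names = parse_names(text_source)
byte_names = parse_names(byte_source)
assert text_names == byte_names
```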

So I think the issue is resolved when a new libxml2 release contains the fix.

That does leave the other issue I mentioned: fix the error handling around the subsequent parse() call, since the current catch clause doesn't actually catch any errors.

lemon24 commented 2 months ago

@lemon24 can you check which XML parser is being used on your system? Is it libxml2 or something else? If libxml2, which version? (I'm using version 2.13.3.)

not installed on your Ubuntu installation ... Python falls back to another implementation, probably Expat in this case

Indeed, xml.sax.expatreader.ExpatParser, on both my old-ish Mac and on an Ubuntu 22.04.

lemon24 commented 2 months ago

@maksverver thank you for following up in many different places!

I will try to review the remaining feedparser PR by the end of the weekend to help out. Once merged into develop, I can update the vendored version too.

lemon24 commented 2 months ago

So I think the issue is resolved when a new libxml2 release contains the fix.

@maksverver, are you OK until then?

I was going to suggest monkey-patching feedparser.api.PREFERRED_XML_PARSERS, but that's made a bit more complicated by me vendoring it. I am reluctant to monkeypatch feedparser itself in reader, since other things may be using it, but there should be no issue with monkeypatching the vendored version. (Users can make this choice, since in theory they have more visibility into the deployment environment.)

Possible mitigations on reader side (easiest first):

Expose the feedparser module used by reader in a well known location, to make monkeypatching easier (also helps with #212). It will likely be reader._parser.feedparser.feedparser, it just needs to be documented.
reader itself changes PREFERRED_XML_PARSERS in vendored feedparser depending on the availability of new enough libxml2.
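The second mitigation could look roughly like the sketch below. It is not reader's code: `preferred_xml_parsers` is a hypothetical helper, the caller is assumed to know the libxml2 version, and treating every release below 2.13.4 as affected is a guess, since the thread only confirms 2.13.3 as broken.

```python
# Assumption: releases older than this are treated as affected.
FIRST_FIXED_LIBXML2 = (2, 13, 4)


def preferred_xml_parsers(libxml2_version):
    """Pick a preference list for xml.sax.make_parser().

    libxml2_version is a (major, minor, patch) tuple, or None when
    libxml2 is not installed or its version is unknown.
    """
    if libxml2_version is not None and libxml2_version >= FIRST_FIXED_LIBXML2:
        return ["drv_libxml2"]
    # An empty list makes xml.sax fall back to its default parser.
    return []


# The result would be assigned to PREFERRED_XML_PARSERS in the vendored
# feedparser module before parsing anything.
print(preferred_xml_parsers((2, 13, 3)), preferred_xml_parsers((2, 13, 4)))
```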

maksverver commented 2 months ago

I'm good for now. I plan to wait for the libxml2 patch to be released before I update the AUR package to use the vendored version of feedparser. (If there are any bugs in the 6.0.11 release I haven't run into them yet.)

It would be cool if at some point the feedparser fixes are released so you don't need to bundle the development version with reader. I dislike having different versions of the same library on my system; it often leads to difficult-to-debug problems.

I do find it strange that feedparser seems to prefer libxml2 via PREFERRED_XML_PARSERS, yet all the tests run against Expat only. This seems like a recipe for trouble: if there are bugs on systems with libxml2 installed, they won't be detected by the tests. My preference would be to run the tests against both Expat and libxml2 to make sure both work as intended, but I don't know how easy that would be to set up.

maksverver commented 1 month ago

libxml2 2.13.4 was released two weeks ago, so this is no longer an issue :)

lemon24 commented 1 month ago

It would be cool if at some point the feedparser fixes are released

Don't disagree, but it is what it is :)

Reopening this so I can add an environment variable to allow using "system" (non-vendored) feedparser; having no escape hatch for dealing with issues like this one sits a bit wrong with me.

(In theory I could do away with the vendored one entirely and recommend users do pip install 'feedparser @ https://github.com/kurtmckee/feedparser/archive/refs/heads/develop.zip', but I think there's value in getting it by default when you pip install reader.)
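Such an escape hatch might be as small as the sketch below. The variable name READER_USE_SYSTEM_FEEDPARSER and both helper functions are made up for illustration; the thread does not say what the real switch is called or how it is read.

```python
import importlib
import os

# Hypothetical variable name, chosen only for this sketch.
ENV_VAR = "READER_USE_SYSTEM_FEEDPARSER"


def feedparser_module_name(environ=os.environ):
    """Return the dotted name of the feedparser package to import."""
    if environ.get(ENV_VAR, "") == "1":
        return "feedparser"  # system (non-vendored) package
    return "reader._vendor.feedparser"  # vendored copy, the default


def load_feedparser(environ=os.environ):
    # Import is deferred so the choice is made at call time.
    return importlib.import_module(feedparser_module_name(environ))


print(feedparser_module_name({ENV_VAR: "1"}))
print(feedparser_module_name({}))
```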

lemon24 / reader

test_parser.py fails most tests when running with vendored feedparser #350