Consider supporting alternative feed parsers

lemon24 commented 3 years ago

In light of the various issues feedparser has (see #265), I think it would be wise to consider using other feed parser implementations.

In this issue, we'll:

lemon24 commented 3 years ago

The logical pipeline of parsing a feed:

1. detect encoding (from headers or stream)
2. detect xml:base (from headers; needed for relative link resolution – #125)
3. detect high-level format (XML, JSON)
4. parse XML/JSON stream into intermediary generic Python data structure (ElementTree, JSON dict)
   - can be secure or not (#212)
   - can support malformed high-level markup (both feedparser and lxml can do this; broken XML does exist in the wild)
   - store xml:base
5. detect feed format (RSS, Atom, JSON Feed)
6. convert generic Python data structure into feed data structure
7. resolve relative links
   - optional
8. sanitize content (https://github.com/lemon24/reader/issues/125#issuecomment-522333200, #227)
   - optional
9. unify feed data structure (so it looks the same regardless of feed format)
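
To make the two detection steps (high-level format, then feed format) more concrete, here is a minimal sketch using only the standard library. It is not how feedparser or reader do it; the function names are made up for illustration, and plain `xml.etree.ElementTree` is not safe against malicious XML (see #212), so a real implementation would likely plug in something else at that point.

```python
import json
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
JSON_FEED_VERSION_PREFIX = "https://jsonfeed.org/version/"


def detect_high_level_format(data: bytes) -> str:
    """Guess 'json' or 'xml' from the first meaningful byte (hypothetical helper)."""
    # Skip a UTF-8 BOM and leading whitespace before sniffing.
    stripped = data.lstrip(b"\xef\xbb\xbf \t\r\n")
    if stripped.startswith((b"{", b"[")):
        return "json"
    if stripped.startswith(b"<"):
        return "xml"
    raise ValueError("unknown high-level format")


def detect_feed_format(data: bytes) -> str:
    """Return 'rss', 'atom', or 'jsonfeed' (hypothetical helper)."""
    if detect_high_level_format(data) == "json":
        doc = json.loads(data)
        version = doc.get("version", "") if isinstance(doc, dict) else ""
        if isinstance(version, str) and version.startswith(JSON_FEED_VERSION_PREFIX):
            return "jsonfeed"
        raise ValueError("JSON document is not a JSON Feed")

    # Intermediary generic structure: an ElementTree element.
    # NOTE: not hardened against malicious XML; assumption for brevity.
    root = ET.fromstring(data)
    if root.tag == f"{{{ATOM_NS}}}feed":
        return "atom"
    if root.tag == "rss":
        return "rss"
    raise ValueError(f"unknown XML feed format: {root.tag!r}")
```

Older formats (e.g. RSS 1.0 / RDF) and malformed markup are deliberately ignored here; handling those is a large part of what feedparser provides.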

Currently:

- feedparser goes directly from stream to feed data structure by using xml.sax.
  - lxml has a way of converting an etree into sax events (https://github.com/lemon24/reader/issues/212#issuecomment-956418094).
- For JSON Feed, reader only relies on the inferred MIME type; feedparser does some sniffing to detect the feed format, and we rely on that (that is, reader has no logic to tell RSS apart from Atom etc.).
- Both relative link resolution (requires xml:base) and content sanitization can happen before or after storage; feedparser does them before storage, and I'm not sure if we can use it to do it after (the things it uses probably aren't stable, and they tie into the sax parsing logic).
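
For illustration, resolving relative links after storage could look roughly like the sketch below, assuming the parse step stored the effective base URL next to the entry. This does not use feedparser at all; the `link` and `enclosure_href` field names and both helper functions are hypothetical, and only the feed-level `xml:base` is considered (nested `xml:base` attributes are ignored).

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# ElementTree exposes xml:base under the predefined XML namespace.
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"


def effective_base(root: ET.Element, document_url: str) -> str:
    """Combine the document URL with the root element's xml:base, if any."""
    # urljoin(url, "") returns url unchanged, so a missing xml:base is fine.
    return urljoin(document_url, root.get(XML_BASE, ""))


def resolve_links(entry: dict, base: str) -> dict:
    """Return a copy of entry with relative links made absolute against base."""
    resolved = dict(entry)
    for key in ("link", "enclosure_href"):  # hypothetical field names
        value = resolved.get(key)
        if value:
            resolved[key] = urljoin(base, value)
    return resolved


# At parse time: compute and store the base alongside the raw entry.
root = ET.fromstring('<feed xmlns="http://www.w3.org/2005/Atom" xml:base="/blog/"/>')
base = effective_base(root, "http://example.com/feed.xml")
stored_entry = {"link": "posts/1", "enclosure_href": "/media/a.mp3"}

# Later, after storage: resolve using only the stored data.
resolved_entry = resolve_links(stored_entry, base)
```

The point is that link resolution only needs the base URL, not the SAX parsing state, so it can be deferred as long as xml:base is kept around.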

lemon24 commented 2 years ago

I've pretty much decided to continue using feedparser (https://github.com/lemon24/reader/issues/265#issuecomment-981671759) and not switch to Atoma (https://github.com/lemon24/reader/issues/263), but it's worth documenting the factors that went into the decision.

I looked at feedparser 6.0.8 and Atoma 0.0.17.

| | feedparser | Atoma |
|---|---|---|
| stable | yes | no (0.x) |
| maintainer responsiveness | low | high |
| format detection | yes | yes (tries to parse all formats) |
| JSON feed | no | yes |
| old feed formats | yes | no |
| Atom/RSS extensions | medium | high |
| file objects | yes | yes (no autodetection) |
| memory usage | high (reads feed in memory multiple times) | medium (builds whole etree) |
| typed | no | yes |
| safe XML | no | yes (defusedxml) |
| pluggable XML parser (defusedxml, lxml) | no (yes with global/monkeypatching) | no |
| bad encodings | yes | no |
| malformed feeds | yes | no |
| relative link resolution | yes (can be disabled, exposes XML base) | no |
| HTML sanitization | yes (can be disabled) | no |
| unified feed/entry interface | yes | no |

lemon24 commented 2 years ago

Closing in favor of #265.
