lemon24 / reader

A Python feed reader library.
https://reader.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Consider supporting alternative feed parsers #264

Closed (lemon24 closed this issue 2 years ago)

lemon24 commented 2 years ago

In light of the various issues feedparser has (see #265), I think it would be wise to consider other feed parser implementations.
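As a rough illustration of what supporting more than one parser could look like, here is a minimal sketch that dispatches on MIME type through a small registry. Everything in it (`ParsedFeed`, `PARSERS`, `register`, `parse`) is hypothetical and made up for this example; it is not reader's actual API, and the two toy parsers use only the standard library rather than feedparser or Atoma.

```python
import json
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List, NamedTuple


class ParsedFeed(NamedTuple):
    """Hypothetical parser-independent result (not reader's data model)."""
    title: str
    entry_titles: List[str]


# A parser is any callable that turns raw bytes into a ParsedFeed.
Parser = Callable[[bytes], ParsedFeed]

# Hypothetical registry: MIME type -> parser.
PARSERS: Dict[str, Parser] = {}


def register(mime_type: str) -> Callable[[Parser], Parser]:
    """Decorator that registers a parser for a MIME type."""
    def decorator(func: Parser) -> Parser:
        PARSERS[mime_type] = func
        return func
    return decorator


@register("application/rss+xml")
def parse_rss(data: bytes) -> ParsedFeed:
    # NOTE: the stdlib XML parser is not hardened against malicious input;
    # this is only a toy stand-in for a real RSS parser.
    channel = ET.fromstring(data).find("channel")
    if channel is None:
        raise ValueError("not an RSS feed")
    return ParsedFeed(
        title=channel.findtext("title", default=""),
        entry_titles=[
            item.findtext("title", default="")
            for item in channel.findall("item")
        ],
    )


@register("application/feed+json")
def parse_json_feed(data: bytes) -> ParsedFeed:
    doc = json.loads(data)
    return ParsedFeed(
        title=doc.get("title", ""),
        entry_titles=[item.get("title", "") for item in doc.get("items", [])],
    )


def parse(data: bytes, mime_type: str) -> ParsedFeed:
    """Pick a registered parser by MIME type and run it."""
    try:
        parser = PARSERS[mime_type]
    except KeyError:
        raise ValueError(f"no parser registered for {mime_type!r}") from None
    return parser(data)
```

With something like this, callers only ever see `ParsedFeed`, so swapping the implementation behind a given MIME type would not affect them; whether that is worth the extra layer is exactly the kind of thing to weigh here.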

In this issue, we'll:

lemon24 commented 2 years ago

The logical pipeline of parsing a feed:

Currently:

lemon24 commented 2 years ago

I've pretty much decided to continue using feedparser (https://github.com/lemon24/reader/issues/265#issuecomment-981671759) and not switch to Atoma (https://github.com/lemon24/reader/issues/263), but it's worth documenting the factors that went into the decision.

I looked at feedparser 6.0.8 and Atoma 0.0.17.

| Feature | feedparser | Atoma |
| --- | --- | --- |
| stable | yes | no (0.x) |
| maintainer responsiveness | low | high |
| format detection | yes | yes (tries to parse all formats) |
| JSON feed | no | yes |
| old feed formats | yes | no |
| Atom/RSS extensions | medium | high |
| file objects | yes | yes (no autodetection) |
| memory usage | high (reads feed in memory multiple times) | medium (builds whole etree) |
| typed | no | yes |
| safe XML | no | yes (defusedxml) |
| pluggable XML parser (defusedxml, lxml) | no (yes with global/monkeypatching) | no |
| bad encodings | yes | no |
| malformed feeds | yes | no |
| relative link resolution | yes (can be disabled, exposes XML base) | no |
| HTML sanitization | yes (can be disabled) | no |
| unified feed/entry interface | yes | no |

lemon24 commented 2 years ago

Closing in favor of #265.