Consider using Atoma

lemon24 commented 3 years ago

In light of the various issues feedparser has (see #265), I think it would be wise to consider other feed parser implementations.

In this issue, we'll look at https://github.com/NicolasLM/atoma; my comments [in brackets]:

Features:

RSS 2.0 - RSS 2.0 Specification

Atom Syndication Format v1 - RFC4287

JSON Feed v1 - JSON Feed specification [including v1.1]

OPML 2.0, to share lists of feeds - OPML 2.0

Typed: feeds decomposed into meaningful Python objects

Secure: uses defusedxml to load untrusted feeds [no plain etree, no lxml]

Compatible with Python 3.6+

Non-implemented Features:

XML signature and encryption [likely not needed]

Some Atom and RSS extensions [although feedparser may have more, I don't think reader uses them]

Atom content other than text, html and xhtml [likely OK]

Of note, it:

Seems actively developed.
Supports passing open files (but you need to know what type of feed you have for this).
Does not do sanitization.
Does not support relative link resolution (nor does it expose base).
Does not support various kinds of malformed feeds (see comment below).
Seems to use much less memory.
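Since relative links are not resolved, a caller would have to do it after parsing. A minimal sketch, assuming the feed's own URL is a good enough base (this ignores xml:base, which is not exposed anyway); `resolve_links` is a hypothetical helper, not part of atoma or reader:

```python
from urllib.parse import urljoin


def resolve_links(feed_url, links):
    """Resolve possibly-relative links against the feed URL.

    Hypothetical helper; uses the feed URL as the base, which is only an
    approximation of what xml:base-aware resolution would do.
    """
    # urljoin leaves absolute URLs untouched; empty/None links are passed through.
    return [urljoin(feed_url, link) if link else link for link in links]


resolved = resolve_links(
    'https://example.com/blog/feed.xml',
    ['/posts/1', 'posts/2', 'https://other.example/x', None],
)
```

Here `/posts/1` resolves against the host, `posts/2` against the `/blog/` directory, and the absolute URL is returned as-is.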

lemon24 commented 3 years ago

I did a comparison between feedparser and atoma, by parsing 157 feeds from disk.

atoma seems to be faster and consume significantly less memory (for a fair comparison, feedparser had both sanitization and relative link resolution disabled).

noop doesn't do anything with the feeds, to provide a baseline.

# impl        time (s)  maxrss (MiB)

# Ubuntu 20.04, Python 3.8.10
feedparser         9.0            61
atoma              1.5            28
noop               0.0            20

# macOS Catalina, Python 3.8.10
feedparser        14.5            56
atoma              2.3            29
noop               0.0            18
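To put the numbers above in relative terms, here is a small sketch that computes the speedup and the memory ratio. Subtracting the noop maxrss as a baseline is my own interpretation of "baseline"; the `results` dict and `compare` function are just for illustration:

```python
# (time in seconds, maxrss in MiB), copied from the measurements above.
results = {
    'Ubuntu 20.04': {'feedparser': (9.0, 61), 'atoma': (1.5, 28), 'noop': (0.0, 20)},
    'macOS Catalina': {'feedparser': (14.5, 56), 'atoma': (2.3, 29), 'noop': (0.0, 18)},
}


def compare(rows):
    """Return (speedup, memory ratio) of atoma relative to feedparser."""
    fp_time, fp_rss = rows['feedparser']
    at_time, at_rss = rows['atoma']
    _, base_rss = rows['noop']
    speedup = fp_time / at_time
    # Assumption: memory used by the parser itself is maxrss minus the noop baseline.
    mem_ratio = (fp_rss - base_rss) / (at_rss - base_rss)
    return round(speedup, 1), round(mem_ratio, 1)


summary = {platform: compare(rows) for platform, rows in results.items()}
```

Under that assumption, atoma is roughly 6x faster on both platforms and uses somewhere between a third and a fifth of the memory above baseline.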

Unfortunately, atoma fails to parse some of the feeds:

error: _feeds/https-blog-nelhage-com-atom-xml.atom: Could not parse feed: "rss" does not have a "feed:id"
error: _feeds/https-nedbatchelder-com-blog-rss-xml.rss: Cannot process RSS feed version "None"
error: _feeds/https-ciechanow-ski-atom-xml.atom: Could not parse feed: "rss" does not have a "feed:id"
error: _feeds/http-www-xn-8ws00zhy3a-com-feed.atom: EntitiesForbidden(name='xhtml', system_id=None, public_id=None)
error: _feeds/https-www-reddit-com-r-oilshell-rss.rss: Not a valid XML document
error: _feeds/https-blog-ncase-me-rss.rss: Cannot process RSS feed version "None"
error: _feeds/https-danluu-com-atom-xml.atom: Could not parse feed: "rss" does not have a "feed:id"
error: _feeds/https-blogs-dropbox-com-tech-feed.rss: Could not parse feed: "url" text is required but is empty

The EntitiesForbidden error is due to using defusedxml (https://github.com/lemon24/reader/issues/212#issuecomment-886175089).
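Some of the other errors look like what one might get when a file's extension does not match its actual format (for example, an RSS document handed to the Atom parser); that is a guess, not something verified here. Since atoma needs to be told the feed type up front, a caller could sniff it from the content first. A rough sketch using only the standard library; `sniff_feed_type` is a hypothetical helper, and plain ElementTree is used for illustration only (for untrusted feeds, a hardened parser would be the safer choice):

```python
import io
import json
import xml.etree.ElementTree as ET

ATOM_NS = 'http://www.w3.org/2005/Atom'


def sniff_feed_type(data: bytes) -> str:
    """Guess 'rss', 'atom', or 'json' from raw bytes; raise ValueError otherwise."""
    stripped = data.lstrip()
    if stripped.startswith(b'{'):
        json.loads(stripped)  # JSONDecodeError is a ValueError subclass
        return 'json'
    try:
        # Only the first 'start' event is needed to see the root element.
        for _, elem in ET.iterparse(io.BytesIO(data), events=('start',)):
            root_tag = elem.tag
            break
        else:
            raise ValueError('empty document')
    except ET.ParseError as e:
        raise ValueError(f'not well-formed XML: {e}') from e
    if root_tag == 'rss':
        return 'rss'
    if root_tag == f'{{{ATOM_NS}}}feed':
        return 'atom'
    raise ValueError(f'unknown root element: {root_tag!r}')


kinds = [
    sniff_feed_type(b'<rss version="2.0"><channel/></rss>'),
    sniff_feed_type(b'<feed xmlns="http://www.w3.org/2005/Atom"/>'),
    sniff_feed_type(b'{"version": "https://jsonfeed.org/version/1.1"}'),
]
```

The result could then be used to pick the matching atoma function (e.g. `parse_rss_bytes` vs. `parse_atom_bytes`) instead of trusting the file extension.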

The script I used:

```python
import sys, time, resource
import feedparser, atoma

def feedparser_parse(path, file):
    # Disable sanitization and relative link resolution for a fair comparison.
    return feedparser.parse(
        file,
        resolve_relative_uris=False,
        sanitize_html=False,
    )

def atoma_parse(path, file):
    # Pick parse_rss_file / parse_atom_file based on the file extension.
    return getattr(atoma, f'parse_{path.rpartition(".")[2]}_file')(file)

def noop_parse(*_):
    pass

# Usage: script.py {feedparser,atoma,noop} < list-of-paths
impl = sys.argv[1]
parse = locals()[f'{impl}_parse']

timings = 0
for line in sys.stdin:
    path = line.rstrip()
    with open(path, 'rb') as file:
        try:
            start = time.perf_counter()
            parse(path, file)
            end = time.perf_counter()
            timings += end - start
        except Exception as e:
            print(f'error: {path}: {e}', file=sys.stderr)

# ru_maxrss is in bytes on macOS and in KiB on Linux; report MiB in both cases.
print(
    impl,
    round(timings, 1),
    int(
        resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        / 2 ** (20 if sys.platform == 'darwin' else 10)
    ),
)
```

lemon24 commented 2 years ago

Closing in favor of #265.

lemon24 / reader

Consider using Atoma #263