lemon24 / reader

A Python feed reader library.
https://reader.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Consider using Atoma #263

Closed lemon24 closed 2 years ago

lemon24 commented 3 years ago

In light of the various issues feedparser has (see #265), I think it's wise to consider other feed parser implementations.

In this issue, we'll look at https://github.com/NicolasLM/atoma; my comments [in brackets]:

Features:

  • RSS 2.0 - RSS 2.0 Specification
  • Atom Syndication Format v1 - RFC4287
  • JSON Feed v1 - JSON Feed specification [including v1.1]
  • OPML 2.0, to share lists of feeds - OPML 2.0
  • Typed: feeds decomposed into meaningful Python objects
  • Secure: uses defusedxml to load untrusted feeds [no plain etree, no lxml]
  • Compatible with Python 3.6+

Non-implemented Features:

  • XML signature and encryption [likely not needed]
  • Some Atom and RSS extensions [although feedparser may have more, I don't think reader uses them]
  • Atom content other than text, html and xhtml [likely OK]

Of note, it:

lemon24 commented 3 years ago

I did a comparison between feedparser and atoma, by parsing 157 feeds from disk.

atoma seems to be roughly 6x faster and to use about half the peak memory (for a fair comparison, feedparser had both sanitization and relative link resolution disabled).

noop doesn't do anything with the feeds, to provide a baseline.

# impl        time (s)  maxrss (MiB)

# Ubuntu 20.04, Python 3.8.10
feedparser    9.0       61
atoma         1.5       28
noop          0.0       20

# macOS Catalina, Python 3.8.10
feedparser    14.5      56
atoma         2.3       29
noop          0.0       18
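
To put the numbers above in perspective, here is a small sketch that works out the ratios from the table. The names `results` and `compare` are mine, and subtracting the noop baseline from maxrss is an assumption about how to isolate the memory used by parsing; it was not measured separately.

```python
# Figures copied from the table above: (time, maxrss) per implementation.
results = {
    "ubuntu": {"feedparser": (9.0, 61), "atoma": (1.5, 28), "noop": (0.0, 20)},
    "macos": {"feedparser": (14.5, 56), "atoma": (2.3, 29), "noop": (0.0, 18)},
}


def compare(platform):
    """Return (speedup, memory ratio above the noop baseline) for atoma vs feedparser."""
    data = results[platform]
    fp_time, fp_rss = data["feedparser"]
    at_time, at_rss = data["atoma"]
    _, base_rss = data["noop"]
    speedup = fp_time / at_time
    # Assumption: the noop maxrss approximates interpreter and import overhead.
    memory_ratio = (fp_rss - base_rss) / (at_rss - base_rss)
    return round(speedup, 1), round(memory_ratio, 1)


for platform in results:
    print(platform, *compare(platform))
```

On both platforms the speedup comes out at about 6x; the memory ratio depends heavily on how much of maxrss is attributed to the baseline.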

Unfortunately, atoma doesn't support some of the RSS feeds:

error: _feeds/https-blog-nelhage-com-atom-xml.atom: Could not parse feed: "rss" does not have a "feed:id"
error: _feeds/https-nedbatchelder-com-blog-rss-xml.rss: Cannot process RSS feed version "None"
error: _feeds/https-ciechanow-ski-atom-xml.atom: Could not parse feed: "rss" does not have a "feed:id"
error: _feeds/http-www-xn-8ws00zhy3a-com-feed.atom: EntitiesForbidden(name='xhtml', system_id=None, public_id=None)
error: _feeds/https-www-reddit-com-r-oilshell-rss.rss: Not a valid XML document
error: _feeds/https-blog-ncase-me-rss.rss: Cannot process RSS feed version "None"
error: _feeds/https-danluu-com-atom-xml.atom: Could not parse feed: "rss" does not have a "feed:id"
error: _feeds/https-blogs-dropbox-com-tech-feed.rss: Could not parse feed: "url" text is required but is empty

The EntitiesForbidden error is due to using defusedxml (https://github.com/lemon24/reader/issues/212#issuecomment-886175089).
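
To illustrate the kind of input that triggers this, here is a minimal stdlib-only sketch that flags documents declaring an entity in their DTD. It is not defusedxml's code; `EntityDeclared` and `has_entity_declaration` are hypothetical names, and the sample document is made up to resemble the `name='xhtml'` case in the error above.

```python
import xml.parsers.expat


class EntityDeclared(Exception):
    """Hypothetical stand-in for defusedxml's EntitiesForbidden."""


def has_entity_declaration(data: bytes) -> bool:
    """Return True if the XML document declares any entity in its DTD."""
    parser = xml.parsers.expat.ParserCreate()

    def on_entity_decl(name, is_parameter_entity, value, base,
                       system_id, public_id, notation_name):
        # Abort parsing at the first entity declaration.
        raise EntityDeclared(name)

    parser.EntityDeclHandler = on_entity_decl
    try:
        parser.Parse(data, True)
    except EntityDeclared:
        return True
    return False


# Made-up sample: a feed whose DTD declares an internal entity named "xhtml".
sample = (
    b'<?xml version="1.0"?>'
    b'<!DOCTYPE feed [<!ENTITY xhtml "http://www.w3.org/1999/xhtml">]>'
    b'<feed/>'
)
print(has_entity_declaration(sample))
print(has_entity_declaration(b'<feed/>'))
```

Malformed XML still raises `xml.parsers.expat.ExpatError` here; the sketch only separates "declares entities" from "does not".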

The script I used:

```python
import sys, time, resource
import feedparser, atoma


def feedparser_parse(path, file):
    # Disable the extra processing for a fair comparison.
    return feedparser.parse(
        file,
        resolve_relative_uris=False,
        sanitize_html=False,
    )


def atoma_parse(path, file):
    # Pick parse_atom_file / parse_rss_file based on the file extension.
    return getattr(atoma, f'parse_{path.rpartition(".")[2]}_file')(file)


def noop_parse(*_):
    pass


impl = sys.argv[1]
parse = locals()[f'{impl}_parse']

timings = 0
for line in sys.stdin:
    path = line.rstrip()
    with open(path, 'rb') as file:
        try:
            start = time.perf_counter()
            parse(path, file)
            end = time.perf_counter()
            timings += end - start
        except Exception as e:
            print(f'error: {path}: {e}', file=sys.stderr)

print(
    impl,
    round(timings, 1),
    # ru_maxrss is in bytes on macOS and in KiB on Linux; normalize to MiB.
    int(
        resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        / 2 ** (20 if sys.platform == 'darwin' else 10)
    ),
)
```

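
Some of the errors above (`"rss" does not have a "feed:id"` for files named `.atom`) suggest the file extension does not always match the actual feed format. This is my reading of the messages, not something I verified against the feeds. Under that assumption, the parser could be chosen by sniffing the root element instead. A rough sketch follows; `sniff_feed_kind` and `ROOT_KINDS` are hypothetical names, and it uses plain ElementTree rather than defusedxml, so it is not suitable for untrusted input as written.

```python
import io
import xml.etree.ElementTree as ET

# Root tags (in ElementTree's "{namespace}local" form) mapped to a feed kind.
ROOT_KINDS = {
    "rss": "rss",
    "{http://www.w3.org/2005/Atom}feed": "atom",
    "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF": "rdf",
}


def sniff_feed_kind(data: bytes):
    """Return 'rss', 'atom', 'rdf', or None based on the root element."""
    try:
        # The first "start" event is always the root element.
        for _event, element in ET.iterparse(io.BytesIO(data), events=("start",)):
            return ROOT_KINDS.get(element.tag)
    except ET.ParseError:
        return None
    return None


print(sniff_feed_kind(b'<rss version="2.0"><channel/></rss>'))
print(sniff_feed_kind(b'<feed xmlns="http://www.w3.org/2005/Atom"/>'))
```

In the script, the returned kind could stand in for the file extension when building the `parse_..._file` name. The `'rdf'` kind has no counterpart in the feature list above, so such feeds would still need separate handling.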
lemon24 commented 2 years ago

Closing in favor of #265.