lemon24 / reader

A Python feed reader library.
https://reader.readthedocs.io
BSD 3-Clause "New" or "Revised" License
438 stars 36 forks source link

Consider working around feedparser's issues #265

Open lemon24 opened 2 years ago

lemon24 commented 2 years ago

feedparser has some issues I would like to solve for reader:

Can we work around them by re-implementing parse()?

lemon24 commented 2 years ago

https://gist.github.com/lemon24/10ae478fafb8fc1cb091f04e0ceec03f

Done so far:

Using an alternate SAX parser is just a matter of exposing the global list as an argument to parse(), so it doesn't need a proof of concept.

lemon24 commented 2 years ago

The plan:

I initially wanted to wrap feedparser (like in the gist above), and maybe publish that on PyPI. However, forking is wiser, because it increases the likelihood of actually getting the changes upstream, even if it happens in 6-12 months.

lemon24 commented 2 years ago

feedparser meta-issue: https://github.com/kurtmckee/feedparser/issues/296

lemon24 commented 2 years ago

feedparser PR for reducing memory usage: https://github.com/kurtmckee/feedparser/pull/302

lemon24 commented 2 years ago

maxrss for update_feeds() (1 worker), before/after ea64a42, on my database (~160 feeds), with all the feeds stale:

>>> def fn(before, after, base=0):
...     return (1 - (after-base) / (before-base)) * 100
... 
>>> # 2013 MBP, macOS Catalina, Python 3.10.0
>>> fn(76.8, 62.6)
18.489583333333325
>>> fn(76.8, 62.6, 28.8)
29.58333333333334
>>> # t4g.nano instance, Ubuntu 20.04, Python 3.8.10
>>> fn(76.5, 57.9)
24.31372549019608
>>> fn(76.5, 57.9, 29.75)
39.7860962566845

Same, with all the feeds up-to-date:

>>> # 2013 MBP, macOS Catalina, Python 3.10.0
>>> fn(70.0, 55.8)
20.285714285714285
>>> fn(70.0, 55.8, 28.9)
34.54987834549878
>>> # t4g.nano instance, Ubuntu 20.04, Python 3.8.10
>>> fn(66.3, 52.3)
21.11613876319759
>>> fn(66.3, 52.3, 30.0)
38.56749311294766
Memory measurements obtained with this script: ```python import sys, time, resource def get_maxrss(): return ( resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 2 ** (20 if sys.platform == 'darwin' else 10) ) print(f"before import: {get_maxrss():.1f}") from reader import make_reader print(f"before make_reader(): {get_maxrss():.1f}") reader = make_reader("db.sqlite") print(f"before update_feeds(): {get_maxrss():.1f}") start = time.perf_counter() reader.update_feeds() end = time.perf_counter() timing = end - start print(f"after update_feeds(): {get_maxrss():.1f} (took {timing:.1f}s)") ```