https://gist.github.com/lemon24/10ae478fafb8fc1cb091f04e0ceec03f
Done so far:
Using an alternate SAX parser is just a matter of exposing the global list as an argument to parse(), so it doesn't need a proof of concept.
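A minimal sketch of why exposing it is cheap (not feedparser's actual code; count_elements() and the sax_parsers argument are made up, and I'm going from memory that the global is named PREFERRED_XML_PARSERS): xml.sax.make_parser() already accepts a list of preferred driver module names, so parse() would only need to pass a list through instead of reading a module-level one.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler


class CountingHandler(ContentHandler):
    """Stand-in for feedparser's SAX handler; just counts elements."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1


def count_elements(file, sax_parsers=()):
    # Drivers in sax_parsers are tried first; make_parser() falls back to
    # the default driver (expat) if none of them can be imported.
    parser = xml.sax.make_parser(list(sax_parsers))
    handler = CountingHandler()
    parser.setContentHandler(handler)
    parser.parse(file)
    return handler.count


print(count_elements(
    io.BytesIO(b"<rss><channel><item/></channel></rss>"),
    sax_parsers=["drv_libxml2"],
))  # 3
```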
The plan:
I initially wanted to wrap feedparser (like in the gist above), and maybe publish that on PyPI. However, forking feedparser is wiser, because it increases the likelihood of the changes actually being accepted upstream, even if that takes 6-12 months.
- feedparser meta-issue: https://github.com/kurtmckee/feedparser/issues/296
- feedparser PR for reducing memory usage: https://github.com/kurtmckee/feedparser/pull/302
maxrss for update_feeds() (1 worker), before/after ea64a42, on my database (~160 feeds), with all the feeds stale:
>>> def fn(before, after, base=0):
...     return (1 - (after-base) / (before-base)) * 100
...
>>> # 2013 MBP, macOS Catalina, Python 3.10.0
>>> fn(76.8, 62.6)
18.489583333333325
>>> fn(76.8, 62.6, 28.8)
29.58333333333334
>>> # t4g.nano instance, Ubuntu 20.04, Python 3.8.10
>>> fn(76.5, 57.9)
24.31372549019608
>>> fn(76.5, 57.9, 29.75)
39.7860962566845
Same, with all the feeds up-to-date:
>>> # 2013 MBP, macOS Catalina, Python 3.10.0
>>> fn(70.0, 55.8)
20.285714285714285
>>> fn(70.0, 55.8, 28.9)
34.54987834549878
>>> # t4g.nano instance, Ubuntu 20.04, Python 3.8.10
>>> fn(66.3, 52.3)
21.11613876319759
>>> fn(66.3, 52.3, 30.0)
38.56749311294766
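For context, fn() above is just the percentage reduction in maxrss; I'm assuming the figures are MiB and that base is the process's resident size measured before update_feeds() runs (the actual methodology may differ). A sketch of how numbers like these can be collected, with the database path as a placeholder:

```python
import resource
import sys

from reader import make_reader


def maxrss_mib():
    # ru_maxrss is in kilobytes on Linux, but in bytes on macOS.
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return rss / 2**20 if sys.platform == "darwin" else rss / 2**10


reader = make_reader("db.sqlite")
print("base:", maxrss_mib())
reader.update_feeds(workers=1)
print("after update_feeds():", maxrss_mib())
```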
feedparser has some issues I would like to solve for reader. Can we work around them by re-implementing parse()?
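For what it's worth, a sketch of the shape such a workaround could take (not reader's actual code; _parse_feed() is a made-up name, and response_headers, resolve_relative_uris, and sanitize_html are, as far as I know, public feedparser 6 arguments): keep a single internal seam that currently delegates to feedparser.parse(), so a re-implemented parse() (streaming, alternate SAX driver, no global state) can be swapped in behind it later without touching callers.

```python
import io

import feedparser


def _parse_feed(file, response_headers=None):
    # Assumes the caller does its own HTTP retrieval and passes an open
    # file-like object plus the headers needed for encoding detection.
    # A re-implemented parse() would replace just this call.
    return feedparser.parse(
        file,
        response_headers=response_headers,
        resolve_relative_uris=True,
        sanitize_html=True,
    )


result = _parse_feed(io.BytesIO(b"<rss><channel><title>t</title></channel></rss>"))
print(result.feed.title)  # t
```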