Open lemon24 opened 5 years ago
This:
import reader
r = reader.Reader(':memory:')
r.add_feed('https://rachelbythebay.com/w/atom.xml')
r.update_feeds()
e, = [e for e in r.get_entries() if '2019/08/01/reliability' in e.link]
print(e.content[0].value.split('<p>')[1].splitlines()[2])
import feedparser
f = feedparser.parse('https://rachelbythebay.com/w/atom.xml')
e, = [e for e in f.entries if '2019/08/01/reliability' in e.link]
print(e.content[0].value.split('<p>')[1].splitlines()[2])
Outputs this:
<a href="/w/2019/07/21/reliability/">put forth</a>
<a href="/w/2019/07/21/reliability/">put forth</a>
So this is from feedparser, not reader.
Next steps:
Installing sgmllib3k results in:
<a href="https://rachelbythebay.com/w/2019/07/21/reliability/">put forth</a>
<a href="https://rachelbythebay.com/w/2019/07/21/reliability/">put forth</a>
Ideally, we should pull relative link resolution out of feedparser's control and into reader's (like we did with HTTP requests). This will also allow downloading assets (images etc.) in the future.
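For reference, a rough sketch of what link resolution on reader's side could look like, using only the standard library. This is not what feedparser does internally, and `resolve_links`, `LinkResolver`, and `URL_ATTRS` are hypothetical names, not part of reader or feedparser. It assumes only `href` and `src` need rewriting, and takes the feed URL as the base (a real implementation would also need to honor `xml:base`).

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: only these attributes carry URLs; a real version would cover more.
URL_ATTRS = {'href', 'src'}


class LinkResolver(HTMLParser):
    """Re-serialize HTML, making URL attributes absolute against base_url."""

    def __init__(self, base_url):
        # Keep character references as-is in text so they round-trip unchanged.
        super().__init__(convert_charrefs=False)
        self.base_url = base_url
        self.parts = []

    def _render(self, tag, attrs, close=''):
        out = [tag]
        for name, value in attrs:
            if value is None:
                # Attribute without a value, e.g. <input disabled>.
                out.append(name)
                continue
            if name in URL_ATTRS:
                value = urljoin(self.base_url, value)
            # HTMLParser unescapes attribute values, so escape them again.
            out.append(f'{name}="{escape(value, quote=True)}"')
        return '<' + ' '.join(out) + close + '>'

    def handle_starttag(self, tag, attrs):
        self.parts.append(self._render(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self._render(tag, attrs, ' /'))

    def handle_endtag(self, tag):
        self.parts.append(f'</{tag}>')

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f'&{name};')

    def handle_charref(self, name):
        self.parts.append(f'&#{name};')

    def handle_comment(self, data):
        self.parts.append(f'<!--{data}-->')


def resolve_links(html, base_url):
    parser = LinkResolver(base_url)
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)


print(resolve_links(
    '<a href="/w/2019/07/21/reliability/">put forth</a>',
    'https://rachelbythebay.com/w/atom.xml',
))
```

Note that this does no sanitization at all; it would have to run alongside a sanitizer, not replace one.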
I assume sanitization also doesn't work (it probably relies on sgmllib). This should be documented / fixed ASAP, since it is a security issue.
Update: confirmed, sanitization doesn't work without sgmllib; from feedparser/sgml.py:
sgmllib is not available by default in Python 3; if the end user doesn't have it available then we'll lose illformed XML parsing and content sanitizing
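Since the fallback is silent, one way to notice the problem is to check up front whether the module can be imported. A minimal sketch, assuming we just want to fail loudly; `module_available` and `require_sgmllib` are hypothetical helpers, not part of reader or feedparser:

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(name) is not None


def require_sgmllib():
    # sgmllib was removed from the standard library in Python 3;
    # the sgmllib3k package provides it.
    if not module_available('sgmllib'):
        raise RuntimeError(
            "sgmllib is not importable; without it feedparser skips "
            "content sanitizing and relative link resolution"
        )
```

Calling `require_sgmllib()` before parsing any feed would turn the silent degradation into an explicit error.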
Next steps:
So in the end, I made sgmllib3k a required dependency, and forced sanitization and link resolution on (commit above).
We can consider the problem fixed; the "ideally" part of the comment above can be considered a feature request.
Deploying 1.0 doesn't seem to fix it...
Update: Turns out it's update_feeds()'s fault; see #164 for details.
A few quick thoughts on how re-implementing sanitization would work:
Note: