lemon24 / reader

A Python feed reader library.
https://reader.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Broken relative links #125

Open lemon24 opened 5 years ago

lemon24 commented 5 years ago

This:

# Through reader: print the line containing the link from the affected entry.
import reader
r = reader.Reader(':memory:')
r.add_feed('https://rachelbythebay.com/w/atom.xml')
r.update_feeds()
e, = [e for e in r.get_entries() if '2019/08/01/reliability' in e.link]
print(e.content[0].value.split('<p>')[1].splitlines()[2])

# Through feedparser directly: same feed, same entry, same line.
import feedparser
f = feedparser.parse('https://rachelbythebay.com/w/atom.xml')
e, = [e for e in f.entries if '2019/08/01/reliability' in e.link]
print(e.content[0].value.split('<p>')[1].splitlines()[2])

Outputs this:

<a href="/w/2019/07/21/reliability/">put forth</a>
<a href="/w/2019/07/21/reliability/">put forth</a>

Both print the same unresolved relative link, so this comes from feedparser, not reader.

Next steps:

lemon24 commented 5 years ago

Installing sgmllib3k results in:

<a href="https://rachelbythebay.com/w/2019/07/21/reliability/">put forth</a>
<a href="https://rachelbythebay.com/w/2019/07/21/reliability/">put forth</a>

lemon24 commented 5 years ago

Ideally, we should pull relative link resolution out of feedparser's control and into reader's (like we did with HTTP requests). This will also allow downloading assets (images etc.) in the future.
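
To make the "reader's control" idea concrete, here is a minimal sketch of resolving relative links with only the standard library. It is not reader's or feedparser's actual implementation: `LinkResolver`, `resolve_links` and `URL_ATTRS` are hypothetical names, and only `href` and `src` are treated as URLs. It performs no sanitization.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin

# Attributes treated as URLs in this sketch; an illustrative subset only.
URL_ATTRS = {'href', 'src'}


class LinkResolver(HTMLParser):
    """Re-emit HTML with URL attributes made absolute (hypothetical helper)."""

    def __init__(self, base_url):
        # convert_charrefs=False keeps entities in text as separate events,
        # so they can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self.base_url = base_url
        self.parts = []

    def _format_tag(self, tag, attrs, end=''):
        chunks = [tag]
        for name, value in attrs:
            if value is None:
                # Attribute without a value, e.g. <input disabled>.
                chunks.append(name)
                continue
            if name in URL_ATTRS:
                value = urljoin(self.base_url, value)
            # The parser unescapes attribute values, so escape them again.
            chunks.append('{}="{}"'.format(name, escape(value, quote=True)))
        return '<' + ' '.join(chunks) + end + '>'

    def handle_starttag(self, tag, attrs):
        self.parts.append(self._format_tag(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self._format_tag(tag, attrs, end=' /'))

    def handle_endtag(self, tag):
        self.parts.append('</{}>'.format(tag))

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append('&{};'.format(name))

    def handle_charref(self, name):
        self.parts.append('&#{};'.format(name))

    def handle_comment(self, data):
        self.parts.append('<!--{}-->'.format(data))


def resolve_links(html, base_url):
    """Return html with href/src values resolved against base_url."""
    parser = LinkResolver(base_url)
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)
```

Calling `resolve_links('<a href="/w/2019/07/21/reliability/">put forth</a>', 'https://rachelbythebay.com/w/atom.xml')` yields the absolute link shown above. Note that the parser lowercases tag and attribute names, so the output is normalized rather than byte-identical to the input.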

I assume sanitization also doesn't work (it probably relies on sgmllib). This should be documented / fixed ASAP, since it is a security issue.

Update: nope, sanitization doesn't work without sgmllib; from feedparser/sgml.py:

sgmllib is not available by default in Python 3; if the end user doesn't have it available then we'll lose illformed XML parsing and content sanitizing
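
Since the fallback is silent, one way to notice it is to check for the module up front. A minimal sketch, using only the standard library; `sgmllib_available` and `require_sgmllib` are hypothetical helper names, not part of reader or feedparser:

```python
import importlib.util


def sgmllib_available():
    """Return True if an importable sgmllib module can be found."""
    return importlib.util.find_spec('sgmllib') is not None


def require_sgmllib():
    """Fail loudly instead of silently skipping sanitization."""
    if not sgmllib_available():
        raise RuntimeError(
            "sgmllib is missing; install sgmllib3k, otherwise "
            "content sanitizing and link resolution may not happen"
        )
```

Something like `require_sgmllib()` could run once at import time, so a missing dependency shows up as an error rather than as unsanitized content.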

Next steps:

lemon24 commented 4 years ago

So in the end, I made sgmllib3k a required dependency, and forced sanitization and link resolution on (commit above).
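
For reference, forcing both behaviors on could look roughly like the sketch below. It assumes feedparser 6, where `feedparser.parse()` accepts the `resolve_relative_uris` and `sanitize_html` keyword arguments; `FORCED_OPTIONS` and `parse_forced` are hypothetical names, and this is not necessarily what the commit does.

```python
# Keyword arguments assumed to be accepted by feedparser.parse() (feedparser 6).
FORCED_OPTIONS = {
    'resolve_relative_uris': True,
    'sanitize_html': True,
}


def parse_forced(source):
    """Parse a feed with link resolution and sanitization explicitly on."""
    # Third-party dependency, imported lazily so this module loads without it.
    import feedparser
    return feedparser.parse(source, **FORCED_OPTIONS)
```

Passing the options explicitly means the result does not depend on feedparser's module-level defaults, but both features still need sgmllib to be importable.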

We can consider the problem fixed; the "ideally" part of the comment above can be considered a feature request.

lemon24 commented 4 years ago

Deploying 1.0 doesn't seem to fix it...

Update: Turns out it's update_feeds()'s fault; see #164 for details.

lemon24 commented 10 months ago

A few quick thoughts on how re-implementing sanitization would work:

Note: