document that text/plain content isn't sanitized

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. read the page at http://www.feedparser.org/docs/html-sanitization.html
2. read the code (perhaps even test it out!)
3. do not profit!

What is the expected output? What do you see instead?
3. profit! 

What version of the product are you using? On what operating system?

I got feedparser from the repository(latest). 

Please provide any additional information below.

Ok so I was using feedparser to parse a twitter atom feed.
The problem is that the title contained what I wanted and it 'can contain html' 
but it is not 'sanitized'.

However, as it appears like this:
<title>@X &lt;script&gt;alert(1);&lt;/script&gt;</title>

and NOT like this:
<title type="html">@X &lt;script&gt;alert(1);&lt;/script&gt;</title>

it will not be sanitized - because is_htmlish is False.

----> 
        if is_htmlish and SANITIZE_HTML:
            if element in self.can_contain_dangerous_markup:
                output = _sanitizeHTML(output, self.encoding, self.contentparams.get('type', 'text/html'))

Sure twitter might be "wrong" here. However, the documentation when I first 
used this module stated (and still does) "By default, Universal Feed Parser 
sanitizes HTML markup in several elements, removing HTML tags and attributes 
that could introduce Javascript or other security risks" AND

"These elements are sanitized by default:
...
entries[i].title
..."
.

IMHO the result should be sanitized by default regardless of if it is 
'is_htmlish'.

Original issue reported on code.google.com by db.pub.m...@gmail.com on 15 Feb 2011 at 8:45

GoogleCodeExporter commented 9 years ago

It's the developer's responsibility to check the content type returned with the 
title, which should indicate how to treat the content (i.e. "text/plain" would 
need to be escaped, whereas "text/html" will have been sanitized and should be 
safe to output unescaped).

Without seeing the feed, I took a look at how the content might trace through 
the parser. `_start_title()` assumes that the content is "text/plain", so 
without a type attribute it later falls on `lookslikehtml()` to guess what the 
content is. `lookslikehtml()` takes a very cautious approach to changing the 
content type to avoid data loss. In this case, I expect that `lookslikehtml()` 
will return False, the content type will remain "text/plain", and the text 
won't be sanitized.

The behavior I've described above is expected, and the type should be available 
in the "title_detail" dictionary key, and should be set to "text/plain". 
However, if this isn't the case, please provide the URL to a 
publicly-accessible Twitter feed that's exhibiting this behavior or reply back 
and attach a downloaded copy of the feed.

Original comment by kurtmckee on 16 Feb 2011 at 1:50

Changed title: text/plain content isn't sanitized

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

Yes it as like you said. However, the documentation on html sanitization is not 
_clear_ enough. 

An example of a twitter atom search result is -->  
http://search.twitter.com/search.atom?q=%40twitter

Original comment by db.pub.m...@gmail.com on 16 Feb 2011 at 8:11

GoogleCodeExporter commented 9 years ago

Alright, I think I'm on the same page now. Yes, the documentation requires an 
overhaul; the page you cite still claims that CSS styles are stripped, for 
example, which is no longer true. I would be delighted to receive patches or 
git pull requests at github [1], otherwise I'll work on the documentation when 
I'm able.

[1]: https://github.com/kurtmckee/feedparser/

Original comment by kurtmckee on 16 Feb 2011 at 9:10

Changed state: Accepted
Added labels: Type-Enhancement, Component-Docs

GoogleCodeExporter commented 9 years ago

Fair enough. I mean - the 'basic' part of the django 'escape' function could be 
used (optionally) (that is replace < > & " ' with their html counterparts) .

Original comment by db.pub.m...@gmail.com on 16 Feb 2011 at 9:24

GoogleCodeExporter commented 9 years ago

Original comment by kurtmckee on 27 Nov 2011 at 9:36

Changed title: document that text/plain content isn't sanitized

GoogleCodeExporter commented 9 years ago

Fixed in r642 (and the new documentation should be available shortly; I'll 
publish the link on the frontpage and announce it on the mailing list).

Original comment by kurtmckee on 27 Nov 2011 at 10:49

Changed state: Fixed

pombreda / feedparser

document that text/plain content isn't sanitized #253