Better filtering of user generated content

GoogleCodeExporter commented 9 years ago

Every place I display user generated content in my templates, I need to:

1. Run through oembed
2. Convert to HTML
3. Strip invalid HTML
4. Strip any javascript

I've created a wrapper that handles all of these with one filter. If there is interest, I can put together a patch.

Original issue reported on code.google.com by pete.lin...@gmail.com on 28 Jul 2008 at 6:18

GoogleCodeExporter commented 9 years ago

I'd be interested in taking a look at that. Always up for anything that helps to
alleviate such mechanics. :)

Original comment by bobwayc...@gmail.com on 1 Aug 2008 at 4:25

GoogleCodeExporter commented 9 years ago

Definitely -- eric might have some thoughts too.

Original comment by jtau...@gmail.com on 1 Aug 2008 at 3:07

GoogleCodeExporter commented 9 years ago

Here's the code I threw together for my filter that we could use as a starting point for discussion. Something for Pinax would obviously need to be more flexible, allowing for a choice of markup, enabling or disabling oembed, etc.
========

from BeautifulSoup import BeautifulSoup, Comment
from django import template
from django.utils.safestring import mark_safe
from django.utils.encoding import smart_str
from django.contrib.markup.templatetags.markup import textile
from oembed.core import replace as oembed_replace
import re

register = template.Library()

@register.filter
def user_input(value):
    """
    Modified from http://www.djangosnippets.org/snippets/205/
    1. Replace oembed
    2. Textile
    3. Clean w/ BeautifulSoup
    4. Strip javascript
    """

    #oembed & textile
    html = textile(oembed_replace(smart_str(value)))

    #Beautiful Soup
    valid_tags = 'p i strong ol ul li b u a blockquote pre br img embed'.split()
    valid_attrs = 'href src alt title'.split()
    soup = BeautifulSoup(html)
    for comment in soup.findAll(
        text=lambda text: isinstance(text, Comment)):
        comment.extract()
    for tag in soup.findAll(True):
        if tag.name not in valid_tags:
            tag.hidden = True
        tag.attrs = [(attr, val) for attr, val in tag.attrs
                     if attr in valid_attrs]
    souped = soup.renderContents().decode('utf8')

    #Strip javascript
    #gnarly regex to look for `javascript:` in the text
    regex = re.compile(
        r'j[\s]*(&#x.{1,7})?a[\s]*(&#x.{1,7})?v[\s]*(&#x.{1,7})?a[\s]*(&#x.{1,7})?'
        r's[\s]*(&#x.{1,7})?c[\s]*(&#x.{1,7})?r[\s]*(&#x.{1,7})?i[\s]*(&#x.{1,7})?'
        r'p[\s]*(&#x.{1,7})?t',
        re.IGNORECASE)
    cleaned = regex.sub('', souped)
    return mark_safe(cleaned)
user_input.is_safe = True
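
To make the behaviour of the `javascript:` regex easier to discuss, here is a small standalone sketch that uses only the standard library. It builds what should be the same pattern by joining the letters of "javascript" with the gap expression; the names `GAP`, `JS_RE`, and `strip_javascript` are made up for this illustration and are not part of the filter above.

```python
import re

# Between any two letters, tolerate whitespace and an optional "&#x..." hex entity.
GAP = r'[\s]*(&#x.{1,7})?'

# 'j' + GAP + 'a' + GAP + ... + 't', matched case-insensitively.
JS_RE = re.compile(GAP.join('javascript'), re.IGNORECASE)


def strip_javascript(text):
    """Remove every (possibly obfuscated) occurrence of the word 'javascript'."""
    return JS_RE.sub('', text)


if __name__ == '__main__':
    samples = [
        '<a href="javascript:alert(1)">x</a>',
        'JaVa script:alert(1)',
        'jav&#x09;ascript:alert(1)',
        'I love JavaScript',
    ]
    for sample in samples:
        print(repr(sample), '->', repr(strip_javascript(sample)))
```

Note the last sample: the pattern also removes the word from ordinary prose, so a more flexible version might restrict the check to attribute values such as `href` and `src`.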

Original comment by sgt.hu...@gmail.com on 1 Aug 2008 at 7:30
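
As a rough illustration of the "more flexible" filter mentioned in the comment above (choice of markup, optional oembed), the steps could be passed in as plain callables. This is only a sketch under assumptions: `build_user_input_filter`, `plain_paragraphs`, and `fake_oembed` are hypothetical names, and the stand-in steps exist only so the example runs without Django, textile, or oembed installed.

```python
import html


def build_user_input_filter(markup, oembed=None, sanitizers=()):
    """Return a filter that applies oembed (if given), markup, then each sanitizer."""
    def user_input(value):
        text = str(value)
        if oembed is not None:
            text = oembed(text)          # optional step, skipped when disabled
        result = markup(text)            # caller chooses the markup function
        for sanitize in sanitizers:      # e.g. tag whitelist, javascript stripping
            result = sanitize(result)
        return result
    return user_input


def plain_paragraphs(text):
    """Stand-in markup step: escape HTML and wrap blank-line-separated blocks in <p>."""
    blocks = [block.strip() for block in text.split('\n\n') if block.strip()]
    return ''.join('<p>%s</p>' % html.escape(block) for block in blocks)


def fake_oembed(text):
    """Stand-in oembed step: replace one known URL with a placeholder."""
    return text.replace('http://example.com/video', '[embedded video]')


if __name__ == '__main__':
    simple = build_user_input_filter(plain_paragraphs)
    print(simple('a <b>\n\nc'))

    full = build_user_input_filter(plain_paragraphs, oembed=fake_oembed)
    print(full('see http://example.com/video'))
```

In a real template filter, the markup argument would be something like the textile function and the sanitizers would be the BeautifulSoup and regex steps from the code above, with the final result marked safe only after all sanitizers have run.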

GoogleCodeExporter commented 9 years ago

Can you please attach your code? We need to discuss what to put in the release for DjangoCon.

Original comment by jtau...@gmail.com on 17 Aug 2008 at 4:21

Added labels: Type-Enhancement
Removed labels: Type-Defect

GoogleCodeExporter commented 9 years ago

I just wrapped sgt.hulka's code up for handy reference and attached it. If it's not in the 0.7 release, it should be.

Original comment by pyDanny on 13 Mar 2009 at 2:53

Attachments:

user_content_filter.py

GoogleCodeExporter commented 9 years ago

Original comment by pyDanny on 13 Mar 2009 at 3:28

Added labels: Milestone-0.8

GoogleCodeExporter commented 9 years ago

Original comment by leidel on 13 Mar 2009 at 9:06

Added labels: Milestone-Post-0.7
