Clean up bad unicode chars before posting

GoogleCodeExporter commented 9 years ago

Here's what we added to the post method:

    def _post(self, url, body, headers):
        # Clean up the body. Per section 2.2 of the XML spec, only three
        # characters from the 0x00-0x1F block are allowed: 0x09, 0x0A, 0x0D.
        body = body.replace("\x00","")
        body = body.replace("\x01","")
        body = body.replace("\x02","")
        body = body.replace("\x03","")
        body = body.replace("\x04","")
        body = body.replace("\x05","")
        body = body.replace("\x06","")
        body = body.replace("\x07","")
        body = body.replace("\x08","")
        body = body.replace("\x0b","")
        body = body.replace("\x0c","")
        body = body.replace("\x0e","")
        body = body.replace("\x0f","")
        body = body.replace("\x10","")
        body = body.replace("\x11","")
        body = body.replace("\x12","")
        body = body.replace("\x13","")
        body = body.replace("\x14","")
        body = body.replace("\x15","")
        body = body.replace("\x16","")
        body = body.replace("\x17","")
        body = body.replace("\x18","")
        body = body.replace("\x19","")
        body = body.replace("\x1A","")
        body = body.replace("\x1B","")
        body = body.replace("\x1C","")
        body = body.replace("\x1D","")
        body = body.replace("\x1E","")
        body = body.replace("\x1F","")

Original issue reported on code.google.com by br...@echonest.com on 24 Mar 2008 at 4:03
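
For reference, the same cleanup can be sketched as a single regular expression instead of one replace() call per character. This is a minimal sketch, not code from solrpy: the helper name strip_illegal_xml_chars is hypothetical, and it covers only the 0x00-0x1F range discussed above.

```python
import re

# Control characters forbidden by section 2.2 of the XML 1.0 spec:
# everything in 0x00-0x1F except tab (0x09), line feed (0x0A) and
# carriage return (0x0D).
_ILLEGAL_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_illegal_xml_chars(body):
    """Return body with the forbidden control characters removed.

    Hypothetical helper for illustration; not part of solrpy.
    """
    return _ILLEGAL_XML_CHARS.sub("", body)


print(repr(strip_illegal_xml_chars("a\x00b\tc\x0bd\n")))
```

A compiled pattern scans the body once, while the chained replace() calls scan it once per character.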

GoogleCodeExporter commented 9 years ago

Thanks for opening this, but I don't think it belongs in solrpy. Nothing in
solrpy's code causes this behavior, and stripping the characters here would
almost certainly hide subtle bugs in the client code. For example, if an input
character is outside the spec, why not raise a ValueError rather than silently
remove it?

Original comment by ds...@gefira.pl on 12 Sep 2008 at 10:12

Changed state: WontFix
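
The alternative suggested in this comment, rejecting bad input rather than silently altering it, might look roughly like the sketch below. The function name check_xml_chars and the error message are assumptions made for illustration; nothing like this exists in solrpy as far as this thread shows.

```python
import re

# Same range as in the original patch: 0x00-0x1F minus tab, LF and CR.
_ILLEGAL_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def check_xml_chars(body):
    """Return body unchanged, or raise ValueError on a forbidden character.

    Hypothetical helper for illustration; not part of solrpy.
    """
    match = _ILLEGAL_XML_CHARS.search(body)
    if match is not None:
        raise ValueError(
            "illegal XML character %r at offset %d"
            % (match.group(), match.start())
        )
    return body


try:
    check_xml_chars("bad\x01input")
except ValueError as exc:
    print(exc)
```

With this approach the caller learns where the bad data is and can fix it at the source, at the cost of the post failing outright.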

GoogleCodeExporter commented 9 years ago

I would like to question the reasoning for not fixing this issue in solrpy.

Why should a user of the solrpy API care whether the access method to the Solr
server is in fact XML? In core.py, there is already code to escape special
characters:

    from xml.sax.saxutils import escape, quoteattr

So there is already code in place to escape the input text when creating the
XML request. By your reasoning, solrpy should raise a ValueError whenever a
text field contains '<' or '>'.
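
To make the comparison concrete, here is a small sketch of what the two standard-library helpers do. The sample strings are made up for illustration. Note that escape() rewrites markup characters but leaves control characters untouched, which is why the characters from the original report still get through.

```python
from xml.sax.saxutils import escape, quoteattr

text = "a < b & c > d"
print(escape(text))      # a &lt; b &amp; c &gt; d
print(quoteattr("id"))   # "id"

# escape() only rewrites &, < and > (plus any extra entities passed in),
# so a forbidden control character such as 0x0B passes through unchanged.
dirty = "bad\x0bchar"
print(escape(dirty) == dirty)  # True
```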

Why not use the XML library to build a sane XML request instead of concatenating
unicode strings by hand? See http://hsivonen.iki.fi/producing-xml/

Original comment by henri.o...@gmail.com on 30 Apr 2010 at 1:44
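
As a rough illustration of that last suggestion, a request body could be built with the standard library's xml.etree.ElementTree rather than by string concatenation. This is only a sketch: build_add_request is a hypothetical name, the add/doc/field layout is assumed from Solr's XML update format, and solrpy's actual request code is not shown in this thread.

```python
import xml.etree.ElementTree as ET


def build_add_request(fields):
    """Build an <add><doc>...</doc></add> body from a dict of field values.

    Hypothetical helper for illustration; not part of solrpy.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = value  # markup characters are escaped on serialization
    return ET.tostring(add, encoding="unicode")


print(build_add_request({"id": "1", "title": "a < b"}))
```

The serializer takes care of escaping '<' and '&' in field text. As far as I know it does not validate or strip forbidden control characters, so that question would still need a separate decision.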
