vavavr00m / boto

Automatically exported from code.google.com/p/boto

boto can't handle bucketlistresults with control characters #501

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
If you create an object with the following name in your bucket:
"[control-V][control-V][control-V]"

Your bucket will become unlistable by boto. Any time you try to list the bucket 
contents, you'll get something similar to this:

Traceback (most recent call last):
  File "./boto_tool.py", line 229, in <module>
    sys.exit(list_obj(bucket_name, sys.argv[4:]))
  File "./boto_tool.py", line 134, in list_obj
    for key in bucket.list(prefix = prefix):
  File "/usr/local/lib/python2.6/dist-packages/boto-2.0b5-py2.6.egg/boto/s3/bucketlistresultset.py", line 30, in bucket_lister
    delimiter=delimiter, headers=headers)
  File "/usr/local/lib/python2.6/dist-packages/boto-2.0b5-py2.6.egg/boto/s3/bucket.py", line 348, in get_all_keys
    '', headers, **params)
  File "/usr/local/lib/python2.6/dist-packages/boto-2.0b5-py2.6.egg/boto/s3/bucket.py", line 311, in _get_all
    xml.sax.parseString(body, h)
  File "/usr/lib/python2.6/xml/sax/__init__.py", line 49, in parseString
    parser.parse(inpsrc)
  File "/usr/lib/python2.6/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib/python2.6/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/usr/lib/python2.6/xml/sax/expatreader.py", line 211, in feed
    self._err_handler.fatalError(exc)
  File "/usr/lib/python2.6/xml/sax/handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: <unknown>:2:194: reference to invalid character number

The problem seems to be that Amazon and the libexpat guys disagree over whether 
control characters are valid in XML. Expat takes the view that control 
characters are never legal, unless they're #x9, #xA, or #xD. 

See http://mail.libexpat.org/pipermail/expat-discuss/2010-May/002679.html

Amazon, on the other hand, happily encodes control characters in object names 
using numeric character references. It's hard to see what choice they have, 
since the only spec for a legal object name is:

"The name for a key is a sequence of Unicode characters whose UTF-8 encoding is 
at most 1024 bytes long." 

Since control characters are valid UTF-8, they must have a way to return them 
in the XML.
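
The parser behaviour described above can be reproduced without S3 at all. The 
following is a minimal sketch, assuming Python 3 (the traceback above is from 
Python 2.6) and a hand-written stand-in for the listing response. The SAMPLE 
document and the try_parse helper are hypothetical names for illustration, not 
real S3 output or boto code.

```python
import xml.sax
from xml.sax.handler import ContentHandler

# Hypothetical stand-in for a bucket listing; not captured from S3.
# The key is three control-V characters (0x16) written as character references.
SAMPLE = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<ListBucketResult><Contents>'
    b'<Key>&#x16;&#x16;&#x16;</Key>'
    b'</Contents></ListBucketResult>'
)


def try_parse(body):
    """Return None if the body parses, else the parser's error message."""
    try:
        xml.sax.parseString(body, ContentHandler())
    except xml.sax.SAXParseException as exc:
        return exc.getMessage()
    return None


if __name__ == "__main__":
    # A reference to 0x16 is rejected; a reference to tab (0x9) is accepted.
    print(try_parse(SAMPLE))
    print(try_parse(b'<Key>&#x9;</Key>'))
```

The same check with #x9, #xA, or #xD in place of 0x16 parses cleanly, which 
matches the three exceptions Expat allows.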

Original issue reported on code.google.com by RareCactus@gmail.com on 25 Mar 2011 at 7:45

GoogleCodeExporter commented 9 years ago
This is a long-standing issue and one that I really have no good solution for.  
The XML that is being returned by S3 is illegal and the parser is perfectly 
right to raise an exception.  I would be happy to entertain any possible 
solutions.

Original comment by Mitch.Ga...@gmail.com on 19 Jul 2011 at 1:18

GoogleCodeExporter commented 9 years ago

Original comment by Mitch.Ga...@gmail.com on 19 Jul 2011 at 2:11

GoogleCodeExporter commented 9 years ago
Yeah, there's no obvious solution. It would be nice if the XML parser could 
operate in a less strict mode, but that doesn't seem to be an option.

We could, perhaps, use a regular expression to somehow transform the input 
before giving it to the XML parser. You would have to invent a simple encoding 
scheme for control characters. Then afterwards, another regular expression 
could decode the output.
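
A rough sketch of that idea, assuming Python 3 and not taken from boto itself: 
rewrite only the character references the parser would reject into a 
placeholder, parse, then turn the placeholders back into the real characters. 
The {{ctl:NN}} placeholder syntax and the names encode_illegal_refs, 
decode_placeholders, KeyCollector, and list_keys are all invented here for 
illustration.

```python
import re
import xml.sax
from xml.sax.handler import ContentHandler

# Numeric character references, decimal (&#22;) or hex (&#x16;).
CHAR_REF = re.compile(rb'&#(x[0-9A-Fa-f]+|[0-9]+);')
# Hypothetical placeholder syntax; any scheme that is legal XML text would do.
PLACEHOLDER = re.compile(r'\{\{ctl:([0-9a-f]+)\}\}')


def _is_legal_xml_char(cp):
    """True if the code point is allowed by the XML 1.0 Char production."""
    return (cp in (0x9, 0xA, 0xD) or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD or 0x10000 <= cp <= 0x10FFFF)


def encode_illegal_refs(body):
    """Replace references to illegal characters with {{ctl:NN}} (hex)."""
    def repl(match):
        token = match.group(1)
        cp = int(token[1:], 16) if token.startswith(b'x') else int(token)
        if _is_legal_xml_char(cp):
            return match.group(0)  # leave legal references untouched
        return ('{{ctl:%x}}' % cp).encode('ascii')
    return CHAR_REF.sub(repl, body)


def decode_placeholders(text):
    """Turn {{ctl:NN}} placeholders back into the characters they stand for."""
    return PLACEHOLDER.sub(lambda m: chr(int(m.group(1), 16)), text)


class KeyCollector(ContentHandler):
    """Collects the text of every <Key> element, decoding placeholders."""

    def __init__(self):
        super().__init__()
        self.keys = []
        self._buf = None

    def startElement(self, name, attrs):
        if name == 'Key':
            self._buf = []

    def characters(self, content):
        if self._buf is not None:
            self._buf.append(content)

    def endElement(self, name):
        if name == 'Key':
            self.keys.append(decode_placeholders(''.join(self._buf)))
            self._buf = None


def list_keys(body):
    """Parse a listing body (bytes) and return the decoded key names."""
    handler = KeyCollector()
    xml.sax.parseString(encode_illegal_refs(body), handler)
    return handler.keys
```

Two limitations of this sketch: a key that literally contains text shaped like 
{{ctl:16}} would be decoded incorrectly, and references inside CDATA sections 
are rewritten as well. A real fix would need an escaping rule for the 
placeholder itself.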

It's awkward, but that's what happens when people don't build flexibility into 
libraries!

Original comment by RareCactus@gmail.com on 19 Jul 2011 at 5:10