couchdb-dump cannot deal with unicode characters in doc ids

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Create a document in CouchDB whose id contains Chinese characters, such as "文档"
2. Run couchdb-dump on the database

What is the expected output? What do you see instead?
couchdb-dump crashes upon reaching this document. Here are the last lines of 
the trace:
  File "/pylonsenv/lib/python2.6/site-packages/couchdb/multipart.py", line 122, in __init__
    self._write_headers(headers)
  File "/pylonsenv/lib/python2.6/site-packages/couchdb/multipart.py", line 175, in _write_headers
    self.fileobj.write(headers[name])
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
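
The same class of error can be reproduced outside couchdb-dump. This is a minimal sketch in Python 3 syntax, not the actual multipart.py code: `write_header` is a hypothetical stand-in that makes explicit the ASCII encoding Python 2 applies implicitly when a unicode value is written to a file.

```python
import io


def write_header(fileobj, value):
    # Hypothetical stand-in for the failing write in the trace above:
    # the explicit .encode("ascii") mimics Python 2's implicit ASCII
    # encoding of unicode values written to a byte-oriented file.
    fileobj.write(value.encode("ascii"))


out = io.BytesIO()
write_header(out, "plain-id")      # ASCII-only id: written without error

try:
    write_header(out, "文档")       # non-ASCII id: encoding fails
    error = None
except UnicodeEncodeError as exc:
    error = exc

print(type(error).__name__, error)
```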

What version of the product are you using? On what operating system?
couchdb-python 0.8 against couchdb 1.0.1 on Ubuntu.

Original issue reported on code.google.com by heshim...@gmail.com on 14 May 2011 at 7:58

GoogleCodeExporter commented 9 years ago

I just needed a quick solution to dump the database and reload it in another 
environment, so I made some changes to multipart.py to get past this utf-8 
problem. It did work.

However, I understand that other parts use multipart.py too, and my change 
probably doesn't conform to the MIME standard. If I have time, I'll investigate 
further and provide a patch that does satisfy the MIME standard.

Original comment by heshim...@gmail.com on 14 May 2011 at 8:20

Attachments:

utf-8_dump_load.patch

GoogleCodeExporter commented 9 years ago

Confirmed. There is also an invalid test case for how the multipart module 
handles unicode data: StringIO can handle mixed "str" and "unicode" values, but 
files accept only "str" values.

Original comment by kxepal on 14 May 2011 at 8:20

GoogleCodeExporter commented 9 years ago

Sorry, I was wrong about the tests; StringIO confused me (: Note to self: don't 
rush, sit down and think it over (: 
There is no need to fix the multipart module, only the dump tool, because it 
passes a unicode document id to the multipart writer. That is what 
dump-tool.patch does.

dump-tool-2.patch solves the same problem, but respects the Content-Type header 
and its charset. I suppose that would be the more correct solution.

Original comment by kxepal on 14 May 2011 at 9:20

Attachments:

GoogleCodeExporter commented 9 years ago

Ah, that's much smarter. Thanks!

Original comment by heshim...@gmail.com on 14 May 2011 at 10:43

GoogleCodeExporter commented 9 years ago

Hmm... another thing. I was under the impression that utf-8 encoded strings 
aren't valid ascii. Currently, isn't multipart.py expecting strict ascii 
strings as header?

Original comment by heshim...@gmail.com on 14 May 2011 at 10:48

GoogleCodeExporter commented 9 years ago

Actually, only the first 128 characters encode to valid ASCII in UTF-8. The 
problem was not which characters are in the headers, but the type of string that 
multipart tries to write to the output stream. Files and streams don't expect 
pure unicode strings; they favor byte strings ("bytes" in Python 3 terminology), 
and the multipart module expects this behavior.
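
To illustrate the relationship between ASCII and UTF-8, here is a small sketch in Python 3 syntax, where text and byte strings are distinct types. The two sample ids are only illustrations.

```python
ascii_id = "doc-1"
chinese_id = "文档"

# ASCII-only text produces identical bytes under both codecs.
same_bytes = ascii_id.encode("utf-8") == ascii_id.encode("ascii")

# Each non-ASCII character becomes a multi-byte sequence in UTF-8, and
# every byte of that sequence is >= 0x80, so none of it is valid ASCII.
utf8_bytes = chinese_id.encode("utf-8")
all_high = all(byte >= 0x80 for byte in utf8_bytes)

print(same_bytes, all_high, len(utf8_bytes))
```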

But there was a "hack" that adds the document id to the headers, which the 
couchdb-load tool uses to create a document with the same id. Since a document 
id can be unicode, this "hack" breaks that expectation and makes multipart crash.

You could try reverting the patch and changing the default value of the output 
argument of the dump_db function in dump.py from sys.stdout to StringIO.StringIO. 
The error would not occur, because StringIO can handle both str and unicode values.
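
Python 3 makes the same distinction explicit, which may help to see why the choice of output stream matters. This is only an analogy sketched with the Python 3 `io` module; the Python 2 `StringIO.StringIO` discussed here is more permissive than either class below.

```python
import io

doc_id = "文档"

# A text stream accepts unicode text directly.
text_stream = io.StringIO()
text_stream.write(doc_id)

# A byte stream rejects unencoded text...
byte_stream = io.BytesIO()
try:
    byte_stream.write(doc_id)
    rejected = False
except TypeError:
    rejected = True

# ...but accepts the same text once it is explicitly encoded.
byte_stream.write(doc_id.encode("utf-8"))

print(rejected, byte_stream.getvalue())
```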

Original comment by kxepal on 14 May 2011 at 11:14

GoogleCodeExporter commented 9 years ago

IMO the correct way to have non-ASCII strings in MIME headers would be to use 
RFC 2047 encoding for any non-ascii header values.
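
For reference, the standard library can already produce RFC 2047 encoded words. This is a minimal sketch using `email.header` in Python 3 syntax; it is not the patch discussed in this issue, and the sample id is only an illustration.

```python
from email.header import Header, decode_header, make_header

doc_id = "文档"

# Encode the value as an RFC 2047 "encoded-word"; the result contains
# only ASCII characters and is therefore safe in a MIME header.
encoded = Header(doc_id, charset="utf-8").encode()

# A reader can recover the original text from the encoded word.
decoded = str(make_header(decode_header(encoded)))

print(encoded)
print(decoded == doc_id)
```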

Original comment by djc.ochtman on 14 May 2011 at 12:24

GoogleCodeExporter commented 9 years ago

Correct, but it looks like overhead in this case, because it would be applied 
to only one header while the others should follow RFC 822. Wouldn't it be 
better to use base64 encoding?
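
The base64 alternative could look roughly like this. It is a sketch in Python 3 syntax under the assumption that the id is encoded as UTF-8 first; no such scheme is confirmed anywhere in this thread.

```python
import base64

doc_id = "文档"

# Encode the UTF-8 bytes of the id as base64 so that the header value
# contains only ASCII characters.
header_value = base64.b64encode(doc_id.encode("utf-8")).decode("ascii")

# The loading side reverses both steps to get the original id back.
restored = base64.b64decode(header_value).decode("utf-8")

print(header_value, restored == doc_id)
```

Note that a bare base64 value does not tell the reader which charset was used, whereas an RFC 2047 encoded word carries the charset along with the data (and its "B" encoding is itself base64).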

Original comment by kxepal on 14 May 2011 at 12:50

GoogleCodeExporter commented 9 years ago

Hmm... I'd like to make a note here that kxepal's dump-tool-2.patch actually 
generated some invalid multipart boundaries.

Original comment by heshim...@gmail.com on 2 Jun 2011 at 6:47

GoogleCodeExporter commented 9 years ago

Original comment by djc.ochtman on 21 Sep 2012 at 8:32

GoogleCodeExporter commented 9 years ago

Original comment by wickedg...@gmail.com on 22 Sep 2012 at 12:44

Added labels: Milestone-0.9

GoogleCodeExporter commented 9 years ago

Any progress on this?

Original comment by djc.ochtman on 22 Oct 2012 at 11:26

GoogleCodeExporter commented 9 years ago

Yes, I will submit a patch with tests this week. I agree with you about the 
RFC 2047 specification, so I'm diving into it.

Original comment by kxepal on 22 Oct 2012 at 11:33

GoogleCodeExporter commented 9 years ago

Patch attached. Non-ASCII headers are now encoded following RFC 2047. Actually, 
I would like to rewrite the multipart module on top of the email package, but 
that should probably be a separate issue: some email-specific features would 
need workarounds to keep backward compatibility.

Original comment by kxepal on 24 Apr 2013 at 5:20

Changed state: Accepted

Attachments:

couchdb-python_485.patch

GoogleCodeExporter commented 9 years ago

Sorry, I forgot to clean up the test prints. Reattached.

Original comment by kxepal on 24 Apr 2013 at 5:25

Attachments:

couchdb-python_485.patch

GoogleCodeExporter commented 9 years ago

Pushed a slightly changed patch as rce40fd77ae8d, thanks!

Original comment by djc.ochtman on 25 Apr 2013 at 10:09

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

Original comment by djc.ochtman on 25 Apr 2013 at 11:16

Removed labels: Milestone-0.9

jur9526 / couchdb-python

couchdb-dump cannot deal with unicode characters in doc ids #179