python / cpython

The Python programming language
https://www.python.org/

Add "java modified utf-8" codec #47106

Closed 91e69f45-91d9-4b12-87db-a02908296c81 closed 12 years ago

91e69f45-91d9-4b12-87db-a02908296c81 commented 16 years ago
BPO 2857
Nosy @malemburg, @loewis, @birkenfeld, @abalkin, @vstinner, @ezio-melotti, @serhiy-storchaka
Files
  • utf_8_java.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

```python
assignee = None
closed_at =
created_at =
labels = ['type-feature', 'library', 'expert-unicode']
title = 'Add "java modified utf-8" codec'
updated_at =
user = 'https://bugs.python.org/phr'
```

bugs.python.org fields:

```python
activity =
actor = 'loewis'
assignee = 'none'
closed = True
closed_date =
closer = 'loewis'
components = ['Library (Lib)', 'Unicode']
creation =
creator = 'phr'
dependencies = []
files = ['21965']
hgrepos = []
issue_num = 2857
keywords = ['patch']
message_count = 26.0
messages = ['66843', '66852', '66854', '66855', '66857', '66862', '66866', '67368', '123484', '123770', '135757', '135772', '135776', '135796', '135797', '141938', '141940', '141949', '141955', '141956', '141957', '142017', '159130', '159133', '159136', '159137']
nosy_count = 10.0
nosy_names = ['lemburg', 'loewis', 'georg.brandl', 'phr', 'belopolsky', 'moese', 'vstinner', 'ezio.melotti', 'tchrist', 'serhiy.storchaka']
pr_nums = []
priority = 'normal'
resolution = 'wont fix'
stage = 'test needed'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue2857'
versions = ['Python 3.3']
```

    91e69f45-91d9-4b12-87db-a02908296c81 commented 16 years ago

    For object serialization and some other purposes, Java encodes unicode strings with a modified version of utf-8:

    http://en.wikipedia.org/wiki/UTF-8#Java http://java.sun.com/javase/6/docs/api/java/io/DataInput.html#modified-utf-8

    It is used in Lucene index files among other places.

    It would be useful if Python had a codec for this, maybe called "UTF-8J" or something like that.

    malemburg commented 16 years ago

    What would you use such a codec for?

    From the references you gave, it is only used internally for Java object serialization, so it wouldn't really be of much use in Python.

    91e69f45-91d9-4b12-87db-a02908296c81 commented 16 years ago

    Some Java applications use it externally. The purpose seems to be to prevent NUL bytes from appearing inside encoded strings, which can confuse C libraries that expect NULs to terminate strings. My immediate application is parsing Lucene indexes:

    http://lucene.apache.org/java/docs/fileformats.html#Chars
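
For illustration (not part of the original report): modified UTF-8 writes U+0000 as the overlong two-byte sequence C0 80, so no NUL byte ever appears in the output. A minimal BMP-only sketch with hypothetical helper names:

```python
def mutf8_encode_bmp(s):
    # Hypothetical helper: plain UTF-8, but escape U+0000 as C0 80 so the
    # encoded string never contains a NUL byte (BMP-only sketch; real
    # modified UTF-8 also treats non-BMP characters differently).
    return s.encode('utf-8').replace(b'\x00', b'\xc0\x80')

def mutf8_decode_bmp(b):
    # Undo the NUL escaping, then decode as regular UTF-8.
    return b.replace(b'\xc0\x80', b'\x00').decode('utf-8')

print(mutf8_encode_bmp('a\x00b'))  # b'a\xc0\x80b'
```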

    91e69f45-91d9-4b12-87db-a02908296c81 commented 16 years ago

    Also, according to Wikipedia, Tcl also uses that encoding.

    malemburg commented 16 years ago

    TCL only uses the codec for internal representation. You might want to interface to TCL at the C level and use the codec there, but is that really a good reason to include the codec in the Python stdlib?

    Ditto for parsing Lucene indexes.

    I think you're better off writing your own codec and registering it with the Python codec registry at application start-up time.
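
Registering such an application-local codec is straightforward with the `codecs` module. A minimal sketch; the codec name 'mutf-8', the BMP-only NUL escaping, and the helper names are assumptions for illustration, not stdlib names:

```python
import codecs

def _mutf8_encode(input, errors='strict'):
    # BMP-only sketch: plain UTF-8 with U+0000 escaped as C0 80.
    return input.encode('utf-8', errors).replace(b'\x00', b'\xc0\x80'), len(input)

def _mutf8_decode(input, errors='strict'):
    data = bytes(input).replace(b'\xc0\x80', b'\x00')
    return data.decode('utf-8', errors), len(input)

def _search(name):
    # codecs.lookup() lowercases the requested name; accept both spellings.
    if name in ('mutf-8', 'mutf_8'):
        return codecs.CodecInfo(_mutf8_encode, _mutf8_decode, name='mutf-8')
    return None

codecs.register(_search)
print('a\x00b'.encode('mutf-8'))  # b'a\xc0\x80b'
```

After the call to `codecs.register()`, the usual `str.encode('mutf-8')` and `bytes.decode('mutf-8')` work anywhere in the application.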

    91e69f45-91d9-4b12-87db-a02908296c81 commented 16 years ago

    I'm not sure what you mean by "ditto for Lucene indexes". I wasn't planning to use C code. I was hoping to write Python code to parse those indexes, then found they use this weird encoding. Python's codec set is fairly inclusive already, so this codec sounded like a reasonably useful addition, and it probably shows up in other places as well. It might even be a reasonable internal representation for Python, which as I understand it currently can't represent codepoints outside the BMP. Also, it is used in Java serialization, which I think of as a somewhat weird and wacky thing, but it's conceivable that somebody someday might want to write a Python program that speaks the Java serialization protocol (I don't have a good sense of whether that's feasible).

    Writing an application-specific codec with the C API is doable in principle, but it seems like an awful lot of effort for just one quickie program. These indexes are very large, so writing the codec in Python would probably be painfully slow.

    birkenfeld commented 16 years ago

    Since we also support oddball codecs like UTF-8-SIG, why not this one too?

    Given the importance of UTF-8, it seems a good idea to support common variations.
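
For comparison, here is how the existing utf-8-sig oddball differs from plain utf-8: it only adds a BOM on encode and strips one on decode.

```python
# utf-8-sig prepends the UTF-8 BOM (EF BB BF) on encode and strips it
# on decode; otherwise it behaves exactly like utf-8.
encoded = 'abc'.encode('utf-8-sig')
print(encoded)  # b'\xef\xbb\xbfabc'
assert encoded.decode('utf-8-sig') == 'abc'
```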

    malemburg commented 15 years ago

    Ok, if you can write a patch implementing the codec, we'll add it.

    Please use the name "utf-8-java" and the module name utf_8_java.py.

    abalkin commented 13 years ago

    TCL only uses the codec for internal representation. You might want to interface to TCL at the C level and use the codec there, but is that really a good reason to include the codec in the Python stdlib?

    I wonder if tkinter should use this encoding.

    vstinner commented 13 years ago

    I wonder if tkinter should use this encoding.

    Tkinter is used to build graphical interfaces. I don't think users type NUL bytes on their keyboard, but maybe there is a use case?

    7ca37d35-dd76-41a3-af7b-5fba383a62c4 commented 13 years ago

    I use the hachoir Python package to parse Java .class files and extract the strings from them, and having support for Java modified UTF-8 would have been nice.

    vstinner commented 13 years ago

    utf_8_java.patch: Implement "utf-8-java" encoding.

    For the doc, I just added a "utf-8-java" line to the codec list, but I did not add a paragraph explaining how this codec differs from utf-8. Does anyone have a suggestion?

    malemburg commented 13 years ago

    Thanks for the patch, Victor.

    Some comments on the patch:

    Since the ticket was opened in 2008, the common name of the codec appears to have changed from "UTF-8 Java" to "Modified UTF-8", with "MUTF-8" as a short alias:

    So I guess we should adapt the name to the now-common one, call it "ModifiedUTF8" in the C API, and add these aliases: "utf-8-modified", "mutf-8" and "modified-utf-8".

    vstinner commented 13 years ago

    See also issue bpo-1028.

    vstinner commented 13 years ago

    Benchmark:
    a) ./python -m timeit "(b'\xc3\xa9' * 10000).decode('utf-8')"
    b) ./python -m timeit "(''.join(map(chr, range(0, 128))) * 1000).encode('utf-8')"
    c) ./python -m timeit "f=open('Misc/ACKS', encoding='utf-8'); acks=f.read(); f.close()" "acks.encode('utf-8')"
    d) ./python -m timeit "f=open('Misc/ACKS', 'rb'); acks=f.read(); f.close()" "acks.decode('utf-8')"

    Original -> patched (smallest value of 3 runs):
    a) 85.8 usec -> 83.4 usec (-2.8%)
    b) 548 usec -> 688 usec (+25.5%)
    c) 132 usec -> 144 usec (+9%)
    d) 65.9 usec -> 67.3 usec (+2.1%)

    Oh, decoding 2-byte sequences is faster with my patch. Strange :-)

    But being 25% slower at encoding pure ASCII text is not good news.

    5c59cbd7-8186-4351-8391-b403f3a3a73f commented 12 years ago

    Please do not call this "utf-8-java". It is called "cesu-8" per UTR #26 at:

    http://unicode.org/reports/tr26/

    CESU-8 is *not* a valid Unicode Transformation Format and should not be called UTF-8. It is a real pain in the butt, caused by people who misunderstand Unicode mis-encoding UCS-2 into UTF-8 and screwing it up. I understand the need to be able to read it, but call it what it is, please.

    Despite the talk about Lucene, I note that the Perl port of Lucene uses real UTF-8, not CESU-8.
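
To make the distinction concrete: for a code point outside the BMP, CESU-8 UTF-8-encodes each half of the UTF-16 surrogate pair separately (six bytes), while real UTF-8 produces a single four-byte sequence. The six-byte form can be reproduced with the stdlib's surrogatepass error handler:

```python
s = '\U0001F600'  # a code point outside the BMP

# Real UTF-8: one four-byte sequence.
assert s.encode('utf-8') == b'\xf0\x9f\x98\x80'

# CESU-8 style: UTF-8-encode the two UTF-16 surrogate halves individually,
# yielding six bytes instead of four.
cesu = '\ud83d\ude00'.encode('utf-8', 'surrogatepass')
assert cesu == b'\xed\xa0\xbd\xed\xb8\x80'

# Decoding the six-byte form back also requires surrogatepass.
assert cesu.decode('utf-8', 'surrogatepass') == '\ud83d\ude00'
```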

    birkenfeld commented 12 years ago

    +1 for calling it by the correct name (the docs can of course state that this is equivalent to "Java Modified UTF-8" or however they like to call it).

    malemburg commented 12 years ago

    Tom Christiansen wrote:

    Tom Christiansen <tchrist@perl.com> added the comment:

    Please do not call this "utf-8-java". It is called "cesu-8" per UTR #26 at:

    http://unicode.org/reports/tr26/

    CESU-8 is *not* a valid Unicode Transformation Format and should not be called UTF-8. It is a real pain in the butt, caused by people who misunderstand Unicode mis-encoding UCS-2 into UTF-8 and screwing it up. I understand the need to be able to read it, but call it what it is, please.

    Despite the talk about Lucene, I note that the Perl port of Lucene uses real UTF-8, not CESU-8.

    CESU-8 is a different encoding than the one we are talking about.

    The only difference between UTF-8 and the modified one is the encoding of the U+0000 code point, so that the output does not contain any NUL bytes.

    malemburg commented 12 years ago

    Corrected the title again. See my comment.

    malemburg commented 12 years ago

    Marc-Andre Lemburg wrote:

    Corrected the title again. See my comment.

    Please open a new ticket, if you want to add a CESU-8 codec.

    Looking at the relevant use cases, I'm at most +0 on adding the modified UTF-8 codec. I think such codecs can well live outside the stdlib on PyPI.

    7ca37d35-dd76-41a3-af7b-5fba383a62c4 commented 12 years ago

    Python does have other "weird" encodings like bz2 or rot13.

    Besides, batteries included :)

    vstinner commented 12 years ago

    Python does have other "weird" encodings like bz2 or rot13.

    No, it no longer has such weird encodings.

    serhiy-storchaka commented 12 years ago

    As far as I understand, this codec can be implemented in Python. There is no need to modify the interpreter core.

    import re

    def decode_cesu8(b):
        # Decode as UTF-8, keeping lone surrogates, then combine each
        # surrogate pair into the astral character it represents.
        return re.sub('[\uD800-\uDBFF][\uDC00-\uDFFF]',
                      lambda m: chr(0x10000 + ((ord(m.group()[0]) & 0x3FF) << 10)
                                    + (ord(m.group()[1]) & 0x3FF)),
                      b.decode('utf-8', 'surrogatepass'))

    def encode_cesu8(s):
        # Split each astral character into a surrogate pair, then encode
        # the surrogates individually as UTF-8.
        return re.sub('[\U00010000-\U0010FFFF]',
                      lambda m: chr(0xD800 | ((ord(m.group()) - 0x10000) >> 10))
                                + chr(0xDC00 | (ord(m.group()) & 0x3FF)),
                      s).encode('utf-8', 'surrogatepass')

    def decode_mutf8(b):
        return decode_cesu8(b.replace(b'\xC0\x80', b'\x00'))

    def encode_mutf8(s):
        return encode_cesu8(s).replace(b'\x00', b'\xC0\x80')

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 12 years ago

    Serhiy: your functions do not constitute a Python codec. For example, there is no support for error handlers in them.

    serhiy-storchaka commented 12 years ago

    Serhiy: your functions do not constitute a Python codec. For example, there is no support for error handlers in them.

    Yes, it is not a codec in Python library terminology. It's just a pair of functions, the COder and DECoder, which is enough for the task of hacking Java serialized data. I don't think such a specific task justifies changing the interpreter core.

    However, translators that convert non-BMP characters to surrogate pairs and back would be useful in the standard library. They need to work with non-standard encodings (CESU-8, MUTF-8, cp65001, some Tk/IDLE issues). This is a fairly common task.
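
Such str-to-str translators can be sketched in pure Python; the function names here are illustrative, not a proposed API:

```python
import re

def to_surrogate_pairs(s):
    # Replace each non-BMP character with its UTF-16 surrogate pair.
    return re.sub('[\U00010000-\U0010FFFF]',
                  lambda m: chr(0xD800 | ((ord(m.group()) - 0x10000) >> 10))
                            + chr(0xDC00 | (ord(m.group()) & 0x3FF)),
                  s)

def from_surrogate_pairs(s):
    # Combine each surrogate pair back into a single astral character.
    return re.sub('[\uD800-\uDBFF][\uDC00-\uDFFF]',
                  lambda m: chr(0x10000 + ((ord(m.group()[0]) & 0x3FF) << 10)
                                + (ord(m.group()[1]) & 0x3FF)),
                  s)

assert to_surrogate_pairs('\U0001F600') == '\ud83d\ude00'
assert from_surrogate_pairs('\ud83d\ude00') == '\U0001F600'
```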

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 12 years ago

    Ok, I'm closing this entire issue as "won't fix", then. There apparently is a need for functionality like this, but there is apparently also a concern that this is too specialized for the standard library.

    As it is possible to implement this as a stand-alone library, I encourage interested users to design a package for PyPI that has this functionality collected for reuse. If the library is then widely used after some time, this issue can be reconsidered.