rdmenezes / cefpython

Automatically exported from code.google.com/p/cefpython

Strings should be unicode by default, if bytes is required make it explicit #60

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Right now byte strings are passed to Python in Py2, but unicode
strings are passed in Py3. This will cause trouble in the
future when upgrading existing Python code from Py2 to Py3.

If you write code in Py2 that decodes the byte strings, it will
break in Py3, because calling decode on a unicode string doesn't
make sense. In Py3 the str type has no decode method at all, so
the call raises an AttributeError. In Py2, calling decode on a
unicode string first encodes it to a byte string using the
'ascii' codec and then decodes it back using the requested codec
(for example 'utf-8'), which raises a UnicodeEncodeError for
non-ASCII text.
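One way to avoid the problem in user code is to decode only when the
value really is a byte string. A minimal sketch, assuming a helper
named to_text (a hypothetical name, not part of cefpython):

```python
def to_text(value, encoding="utf-8"):
    """Return value as a text string, decoding only when it is bytes.

    to_text is a hypothetical helper, not a cefpython function.
    In Py2 "bytes" is an alias of "str", so the same check works there.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    # Already text (unicode in Py2, str in Py3): return it unchanged.
    return value


# A non-ASCII sample, encoded to utf-8 bytes as Py2 callbacks would see it.
raw = u"za\u017c\u00f3\u0142\u0107".encode("utf-8")
text = to_text(raw)   # bytes are decoded
same = to_text(text)  # text passes through without a second decode
```

Code that calls to_text instead of calling decode directly should
behave the same whether it receives bytes or unicode.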

Strings that are paths to files should be byte strings in Py2,
otherwise you might get into trouble, see this post:

https://groups.google.com/d/msg/cython-users/Q1_jyOX4tVM/f8vsYDuUWL0J

In Cython code, do an explicit conversion to 'bytes' when you
do not want a unicode string.
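Such an explicit conversion could look roughly like this, sketched
here in plain Python rather than Cython. The name to_bytes and the
utf-8 default are assumptions for illustration, not cefpython code:

```python
def to_bytes(value, encoding="utf-8"):
    """Return value as a byte string, encoding only when it is text.

    to_bytes is a hypothetical helper; the default encoding is an
    assumption and would normally come from an application setting.
    """
    if isinstance(value, bytes):
        # Already a byte string: leave it untouched.
        return value
    return value.encode(encoding)


encoded = to_bytes(u"caf\u00e9")
unchanged = to_bytes(b"plain")
```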

See the ApplicationSettings.unicode_to_bytes_encoding option.
This option is also used when converting bytes to unicode, so
its name is confusing and it should be renamed.

How do we know which encoding should be used when decoding
javascript strings? Web pages might have encodings other than
utf-8. Is there an API in CEF to get the encoding of the current
Frame? Setting a fixed encoding through application settings
doesn't make much sense, as different websites might use
different encodings. Still, the encoding of the strings that users
pass to cefpython should be configurable through some option,
as it might differ from the encoding that the website in the
current context uses.

Original issue reported on code.google.com by czarek.t...@gmail.com on 4 Jun 2013 at 2:10

GoogleCodeExporter commented 9 years ago
Take a look at this video explaining why "Unicode is poison to
python performance":

http://www.youtube.com/watch?v=oK3EQH5Wdqo&feature=youtu.be&t=24m26s

Don't stick unicode everywhere.

Original comment by czarek.t...@gmail.com on 4 Jun 2013 at 5:53

GoogleCodeExporter commented 9 years ago
Strings returned from javascript and strings passed to javascript
should probably always be utf-8, and for these conversions
ApplicationSettings.string_encoding should not be used (not 100%
sure about that, need to test it).

Original comment by czarek.t...@gmail.com on 5 Jun 2013 at 6:52

GoogleCodeExporter commented 9 years ago
Starting with the next commit, all encoding/decoding of strings
will be kept in one file, string_utils.pyx. This will make it
easier to make unicode the default string type, but you still
have to check all the calls to CefToPyString(), CharToPyString(),
VoidPtrToStr() to see what type the string is assigned to,
what the context of the operation is, and whether unicode would
break anything.

The documentation on the wiki pages needs to be updated: "str"
types need to be replaced with "unicode".

That's not all, there are other fixed strings in the code. If
we decide to use unicode then we must stick to it, and all the
strings passed to python should be unicode. This is going to
be a bit of a nightmare: we would be forced to use the u"" syntax
(in Py3.0-3.2 such syntax is disallowed and it is accepted again
from Py3.3 on, and Cython allows it, so it is a bit easier to
write portable code for both Py2/Py3). But what if we pass a
normal byte string instead in Py2? Then this is going to be hell
in user code: concatenating a byte string with a unicode string
implicitly decodes the bytes as 'ascii' in Py2, which raises a
UnicodeDecodeError for non-ASCII data, and in Py3 it throws a
TypeError such as "can't concat bytes to str".
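A small sketch of the mixing problem as it shows up under Py3.
concat_mixed is a hypothetical name used only for this illustration,
and the exact wording of the TypeError message varies between Python
versions:

```python
def concat_mixed(text, raw):
    """Try to concatenate a text string with a byte string.

    Hypothetical helper for illustration. Under Py3 the addition
    raises TypeError; under Py2 the bytes are implicitly decoded
    as 'ascii' instead.
    """
    try:
        return text + raw
    except TypeError as exc:
        return "TypeError: %s" % exc


result = concat_mixed(u"abc", b"def")

# Decoding explicitly before concatenating works on both Py2 and Py3.
fixed = u"abc" + b"def".decode("utf-8")
```

This is why a library that sometimes returns bytes and sometimes
returns unicode pushes the conversion burden onto every caller.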

Original comment by czarek.t...@gmail.com on 5 Jun 2013 at 7:12

GoogleCodeExporter commented 9 years ago
More thoughts on making unicode default in Py27 here:
https://groups.google.com/d/msg/cython-users/VICzhVn-zPw/B0U4_AK36UkJ

Original comment by czarek.t...@gmail.com on 9 Jan 2014 at 7:36

GoogleCodeExporter commented 9 years ago
Marking as Won't Fix. Use Python 3 if you need unified unicode strings. Fixing 
this would break backwards compatibility.

Original comment by czarek.t...@gmail.com on 10 Aug 2014 at 6:03