For #37717, added support for Japanese utf-8 chars.

raphaelmatto commented 8 years ago

This code solves all issues in the EDL for ticket 37717, but I suspect other utf-8 issues will come up. Anyone looking at this: is there a better way to deal with encodings/decodings like this? I haven't done a lot of this, so feel very unsure about my solution even though it works here. I'm curious about big picture strategy for dealing w/utf-8. How often should we do it? Only when necessary, or everywhere? Should we encode/decode right after pulling info from the Shotgun api, or later / only where necessary in the code?

raphaelmatto commented 8 years ago

I noticed a related bug: Japanese chars don't work when input as a Group in Settings. I suspect the different omit statuses may also not work. Unicode support should probably be added for those.

manneohrstrom commented 8 years ago

The 'official' strategy in toolkit apps is to treat everything as utf-8 and avoid any unicode objects. All toolkit engines are setting themselves up so that CStrings are assumed to be in UTF-8 locale, meaning that the display of all characters will be correct even when they are stored in utf-8 encoded byte streams.

The approach that I would recommend would be this:

Convert any unicode objects straight away as they enter the system. This is typically via QT widgets. All other ways data enters the system is typically utf-8. Don't do any conversion where the problem is happening, do the conversion from a "widget value" to an "internal value" straight away as that piece of data enters the system.
No need to remove regex, just don't think of A-Z as a marker for 'all characters'. In Sweden for example, there are three characters (å, ä and ö) that come after z in the alphabet. Instead you can use \w. See this example in core
This is not exclusive to japanese, this utf-8 approach will support all uncicode.

Hope this helps!

sgsteph commented 8 years ago

The approach does not seem right, and it looks like a proper fix would be to ensure that the default encoding is utf8 and not ascii. I gave it a go and called this function in the app init, which seems to fix all unicode problems:

def ensure_utf8_encoding():
    # Make sure we don't use default "ascii" encoding
    import sys
    reload(sys)  # Reload to get back access to setdefaultencoding
    sys.setdefaultencoding("UTF8")

However, this shouldn't be done at the app level, as we are setting a global and this might have some side effects (bearing in mind I'm not a unicode expert at all):

Default encoding might not be ascii in Japan or other countries. Not sure what will happen then ?
If some code is explicitly doing some unicode mangling in some apps, this could break it ?

Following Manne's advices, I'm checking if there is another way to fix it, which would be to ensure we read EDL files as utf8, so basically all our input data is utf8.

Sounds like a good engineering meeting topic !

Apart from that, I noticed that import cut input widgets (e.g. shot names in summary view) do not accept unicode characters. This should be fixed as well.

sgsteph commented 8 years ago

I tried the recommended approach, converting inputs to utf-8. However, this is not enough as strings emitted in signals are converted to unicode, so the conversion needs to happen in receiving slots, involving adding conversions a various random places, just in case. A global approach allowing to not worry about it anymore would be preferable imho.

sgsteph commented 8 years ago

I created a new branch / pull request with what I think is a better approach: https://github.com/shotgunsoftware/tk-multi-importcut/pull/54

The idea is to follow the official TK paradigm and always convert unicode strings to utf-8 encoded strings before using them. This gives us a nice pattern that we can use in all apps I think: from now on, all slots expecting a string should assume it is a unicode string in all our apps.

shotgunsoftware / tk-multi-importcut

For #37717, added support for Japanese utf-8 chars. #53