python / cpython

The Python programming language
https://www.python.org
Other
62.15k stars 29.87k forks source link

Add utilities to "clean" surrogate code points from strings #63014

Open ncoghlan opened 11 years ago

ncoghlan commented 11 years ago
BPO 18814
Nosy @malemburg, @ncoghlan, @pitrou, @vstinner, @ezio-melotti, @stevendaprano, @bitdancer, @yaseppochi, @vadmium, @serhiy-storchaka
Dependencies
  • bpo-22286: Allow backslashreplace error handler to be used on input
  • bpo-23676: Add support of UnicodeTranslateError in standard error handlers
  • Files
  • convert_surrogates.py
  • codecs_convert_escapes.patch
  • codecs_convert_escapes_2.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = None closed_at = None created_at = labels = ['3.8', 'type-feature', 'library'] title = 'Add utilities to "clean" surrogate code points from strings' updated_at = user = 'https://github.com/ncoghlan' ``` bugs.python.org fields: ```python activity = actor = 'ncoghlan' assignee = 'none' closed = False closed_date = None closer = None components = ['Library (Lib)'] creation = creator = 'ncoghlan' dependencies = ['22286', '23676'] files = ['36700', '38506', '38520'] hgrepos = [] issue_num = 18814 keywords = ['patch'] message_count = 55.0 messages = ['195937', '195959', '195960', '196047', '196049', '196054', '196063', '225791', '225793', '225800', '225801', '225806', '225809', '225811', '225817', '225818', '225822', '225858', '225869', '225875', '225889', '225972', '227334', '227339', '227340', '227341', '227342', '227343', '227344', '227345', '227346', '227362', '227364', '227365', '227367', '227368', '227369', '238182', '238193', '238277', '238285', '238286', '238309', '242799', '242810', '242939', '251682', '251693', '251694', '251701', '251708', '251740', '251836', '251854', '314682'] nosy_count = 11.0 nosy_names = ['lemburg', 'ncoghlan', 'pitrou', 'vstinner', 'ezio.melotti', 'Arfrever', 'steven.daprano', 'r.david.murray', 'sjt', 'martin.panter', 'serhiy.storchaka'] pr_nums = [] priority = 'normal' resolution = None stage = 'needs patch' status = 'open' superseder = None type = 'enhancement' url = 'https://bugs.python.org/issue18814' versions = ['Python 3.8'] ```

    Linked PRs

    ncoghlan commented 11 years ago

    Prompted by bpo-18713 and http://lucumr.pocoo.org/2013/7/2/the-updated-guide-to-unicode/, here are some possible utilities we could add to the codecs module to help deal with/debug issues related to surrogate escaped strings:

        def has_escaped_bytes(s):
            """Returns true if string contains surrogate escaped bytes"""
            ...
    
        def replace_escaped_bytes(s):
            """Replaces each surrogate escaped byte with a valid code point"""
            ...
    
        def decode_escaped_bytes(s, nominal_encoding, actual_encoding):
            """Reinterprets incorrectly decoded text using a new encoding"""
            return s.encode(nominal_encoding, 'surrogateescape').decode(actual_encoding)
    bitdancer commented 11 years ago

    The email package needs has_escaped_bytes. Currently it tries to encode to ascii to find out if there are any, which we proved by microbenchmark is the fastest way to do it as things stand.

    What does replace do? Replace it with the unknown character code point? Email also needs that.

    serhiy-storchaka commented 11 years ago

    These functions are trivial.

        def has_escaped_bytes(s):
            try:
                s.encode()
                return False
            except UnicodeEncodeError:
                return True
    
        def replace_escaped_bytes(s):
            return s.encode('utf-8', 'replace').decode()
    pitrou commented 11 years ago

    Can you sum the use cases for these? (don't want to read a blog post, sorry :-))

    vstinner commented 11 years ago

    "The email package needs has_escaped_bytes. Currently it tries to encode to ascii to find out if there are any, which we proved by microbenchmark is the fastest way to do it as things stand."

    In which function you need to check this? What do you do if there are surrogate characters?

    bitdancer commented 11 years ago

    The email package uses surrogateescape to store unknown bytes in unicode strings, just as with the handle-bad-data-from-os API surrogateescape was introduced for. (For the same reason: the source data may have improperly encoded bytes that we must nevertheless preserve). What action certain parts of the code takes differs depending on whether or not there are such encoded bytes in the string. So the code needs to check and branch based on that.

    The utility function is email.utils._has_surrogates. You can grep the email source to see where it is used, if you like.

    ncoghlan commented 11 years ago

    The use case is to take data from a surrogate escaped interface and either filter it out entirely or convert it to a valid Unicode string at the point of *input*, before letting it make its way into the rest of the application. For example, this approach permits the Python equivalent of the Linux console behaviour when it attempts to display the improperly encoded file name.

    This approach is likely to be easier to implement in cross-platform code than dropping down to the bytes interfaces on POSIX systems and using the Unicode APIs on Windows.

    ncoghlan commented 10 years ago

    Based on the latest round of bytes handling discussions on python-dev, I came up with this updated proposal:

        # Constant in the string module (akin to string.ascii_letters et al)
        escaped_surrogates = bytes(range(128, 256)).decode('ascii', errors='surrogateescape')
    
        # Helper to ensure a string contains no escaped surrogates
        # This allows it to be safely encoded without surrogateescape
        _match_surrogates = re.compile('[{}]'.format(escaped_surrogates))
        def clean(s, repl='\ufffd'):
            return _match_surrogates.sub(repl, s)
    
        # Helper to redecode a string that was decoded incorrectly
        # For example, WSGI strings are passed from the server to the
        # framework as latin-1 by default and may need to be redecoded
        def redecode(s, encoding, errors='strict', old_encoding='latin-1', old_errors='strict'):
            return s.encode(old_encoding, old_errors).decode(encoding, errors)

    In addition to the concrete use cases David describes, I think these will also serve a useful documentation purpose, in highlighting the two main mechanisms for "smuggling" raw binary data through text APIs (i.e. surrogate escapes and latin-1 decoding).

    ncoghlan commented 10 years ago

    Note: I dropped the "has_escaped_bytes" idea, as "s != string.clean(s)" is functionally equivalent, albeit a bit more expensive if you don't actually want the cleaned string)

    ezio-melotti commented 10 years ago

    I think similar functions should be added in the unicodedata module rather than the string module or as str methods. If I'm not mistaken this was already proposed in another issue. In C we already added macros like IS_{HIGH|LOW|}_SURROGATE and possibly others to help dealing with surrogates but AFAIK there's no Python equivalent yet. As for the specific constants/functions/methods you propose, IMHO the name escaped_surrogates is not too clear. If it's a string of lone surrogates I would just call it unicodedata.surrogates (and .high_surrogates/.low_surrogates). These can also be used to build oneliner to check if a string contains surrogates and/or to remove them. clean has a very generic name with no hints about surrogates, and its purpose is quite specific. I'm also not a big fan of redecode. The equivalent calls to encode/decode are not much longer and more explicit. Also having to redecode often indicates that there's a bug before that should be fixed instead (if possible).

    ncoghlan commented 10 years ago

    The purpose of these changes it to provide tools specifically for working with surrogate escaped data, not for working with arbitrary lone Unicode surrogates.

    "escaped_surrogates" is not defined by the Unicode spec, it's defined by the behaviour of the surrogateescape error handler that lets us tunnel arbitrary bytes through str objects and reproduce them faithfully at the far end. On reflection, I think codecs would be a better home than string (as that's where the error handler is defined), but it doesn't belong in unicodedata.

    I'd be OK with changing the name of the clean function to "clean_escaped_surrogates".

    Needing redecode is not a bug: it's baked into the WSGI spec in PEP-3333. I would be OK with providing it in wsgiref rather than the codecs or string modules, but I think we should provide it somewhere.

    serhiy-storchaka commented 10 years ago

    I agree with Ezio in all points. escaped_surrogates is inefficient for any purposes and incomplete. _match_surrogates can be created in more efficient way. clean() has too general and misleading name. redecode() looks just ridiculous, this cumbersome 4-argument function just combine two methods.

    ncoghlan commented 10 years ago

    Guys, you're Python 3 unicode experts, already thoroughly familiar with how surrogateescape works. These features are not for you.

    The exact implementations don't matter. These need to exist, and they need to be documented with detailed explanations. People don't yet understand what the surrogateescape error handler is, and what it's for, or how to work with it.

    People consider surrogate escaping this weird mysterious thing. It's not - it's a really simple mapping of the 128 bytes with the high order bit set to a different part of the Unicode code space. These functions make it obvious.

    wsgiref.redecode is a natural consequence of the design of PEP-3333 - it makes sense to offer it there, as a hint that frameworks are likely to need it.

    We have a serious, serious, learnability problem around binary data handling in Python 3. It's nowhere near as bad as the learnability problem for text handling in Python 2, but it still needs more scaffolding to help people grow in understanding.

    ezio-melotti commented 10 years ago

    That's why I think a function like redecode is a bad idea. With Python 2 I've seen lot of people blindingly trying .decode when .encode failed (and the other way around) whenever they were getting an UnicodeError (and the fact that decoding Unicode results in an encode error didn't help). I'm afraid that confused developers will try to (mis)use redecode as a workaround to attempt to fix something that shouldn't be broken in the first place, without actually understanding what the real problem is. There's really no excuse for developers not to know Unicode nowadays (Joel Spolsky said the same thing more than 10 years ago). IME people also tend to overestimate how difficult it is to understand Unicode. While it is true that you can go pretty deep down the rabbit hole, the basic concepts necessary for everyday use are pretty simple and straightforward.

    ncoghlan commented 10 years ago

    The redecode thing is a distraction from my core concern here, so I've split that out to issue bpo-22264, a separate RFE for a "wsgiref.fix_encoding" function.

    For this issue, my main concern is the function to *clean a string of escaped binary data, so it can be displayed easily, or otherwise purged of the escaped characters. Preserving the data by default is good, but you have to know a *lot about how Python 3 works in order to be able figure out how to clean it out.

    For that, not knowing Unicode in general isn't the problem: it's not knowing PEP-383. If we forget the idea of exposing the constant with the escaped values (I agree that's not very useful), it suggests "codecs.clean_surrogate_escapes" as a possible name:

        # Helper to ensure a string contains no escaped surrogates
        # This allows it to be safely encoded without surrogateescape
        _extended_ascii = bytes(range(128, 256))
        _escaped_surrogates = _extended_ascii.decode('ascii',
                                        errors='surrogateescape')
        _match_escaped = re.compile('[{}]'.format(_escaped_surrogates))
        def clean_surrogate_escapes(s, repl='\ufffd'):
            return _match_escaped.sub(repl, s)

    A more efficient implementation in C would also be fine, this is just an easy way to define the exact semantics.

    (I also just noticed that unlike other error handlers, surrogateespace and surrogatepass do not have corresponding codecs.surrogateescape_errors and codecs.surrogatepass_errors functions)

    serhiy-storchaka commented 10 years ago

    What problem is purposed to solve clean_surrogate_escapes()? Could you please provide user scenario or two?

    Possible alternative implementation is:

    def clean_surrogate_escapes(s):
        return s.encode('utf-8', 'surrogatepass').decode('utf-8', 'replace')

    It can be faster for some data (for mostly ASCII with rare surrogates it is superfast). For other data 'utf-16' can be better choice.

    ncoghlan commented 10 years ago

    My main use case is for passing data to other applications that *don't* have their Unicode handling in order - I want to be able to use Python to do the data scrubbing, but at the moment it requires intimate knowledge of the codec error handling system to do it. (I had never even heard of surrogatepass until this evening)

    Situation:

    What I have: data decoded with surrogateescape What I want: that same data with all the surrogates gone, replaced with either the Unicode replacement character or an ASCII question mark (which I want will depend on the exact situation)

    Assume I am largely clueless about the codec system. I know nothing beyond the fact that Python 3 strings may have smuggled bytes in them and I want to get rid of them because they confuse the application I'm passing them to.

    The concrete example that got me thinking about this again was the task of writing filenames into a UTF-8 encoded email, and wanting to scrub the output from os.listdir before writing the list into the email (s/email/web page/ also works).

    For issue bpo-22016 I actually suggested doing this as *another* codec error handler ("surrogatereplace"), but Stephen Turnbull convinced me this original idea was better: it should just be a pure data transformation pass on the string, clearing the surrogates out, and leaving me with data that is identical to that I would have had if "surrogatereplace" had been used instead of "surrogateescape" in the first place.

    As "errors='replace'" already covers the "ASCII ?" replacement case, that means your proposed "redecode" based solution would cover the rest of my use case.

    vstinner commented 10 years ago

    Your clean() function looses information. If a filename contains almost only undecodable characters, it will looks like ����.txt. It's not very useful. I would prefer to escape the byte. Mac OS X (HFS+ filesystem) uses for example %HH format: "\udc80" would be replaced with "%80" for example.

    This format is also used in URLs. For example, "a\xe9b.txt" (latin-1, whereas my locale encoding is UTF-8) is displayed "a�b.txt" in Firefox (when listing a local directory), but Firefox uses the URL "file://.../a%E9b.txt" (hexadecimal in uppercase).

    In the Gnome file browser (Nautilus), "a\xe9b.txt" (latin-1, whereas my locale encoding is UTF-8) is displayed "a�b.txt (invalid encoding)".

    ncoghlan commented 10 years ago

    Ideally we'd have string modification support for all the translations we offer as codec error handlers:

    The reason it's beneficial to be able to do these as string transformations rather than only in the codecs is that you may just be contributing part of the output, with the actual encoding operation handled elsewhere (e.g. you may be storing it in a data structure that will later be encoded as JSON or XML, or my earlier example of generating a list of files to be included in an email). Surrogates are great when you're just passing data straight back to the operating system. They're not so great when you're passing them on to other parts of the application as text. I'd prefer to be able to deal with them closer to the point of origin, at least in some cases.

    Now, some of these things *can* be done today using Serhiy's trick of encoding to UTF-8 and then decoding again:

    data.encode('utf-8', 'surrogatepass').decode('utf-8', 'replace')
    data.encode('utf-8', 'replace').decode('utf-8')
    data.encode('utf-8', 'ignore').decode('utf-8')

    However, these two don't work properly:

    data.encode('utf-8', 'xmlcharrefreplace').decode('utf-8')
    data.encode('utf-8', 'backslashreplace').decode('utf-8')

    The reason those don't work is because they'll encode the *surrogate escaped bytes*, rather than the originals.

    Mapping the escaped bytes to percent encoding has the same problem - you likely want to do a two step transformation (escaped surrogate -> original byte -> percent encoded value), rather than directly percent encoding the already escaped bytes.

    bitdancer commented 10 years ago

    Right now having has_escaped_bytes isn't too important, since I've done nothing to profile and improve the performance of the new email code. But eventually I'll need it, because detecting the existence of escaped bytes is inside some of the inner loops in the header processing code. "s != s.clean_escaped_bytes()" (or whatever you call it) just smells wrong as a way to spell has_escaped_bytes.

    pitrou commented 10 years ago

    data.encode('utf-8', 'replace').decode('utf-8') data.encode('utf-8', 'ignore').decode('utf-8')

    Why not the reverse:

    os.fsencode(data).decode('utf-8', 'replace') os.fsencode(data).decode('utf-8', 'ignore')

    Note that "backslashreplace" needs to be enhanced to work when decoding too. Note that "xmlcharrefreplace" doesn't make sense here: it encodes a *character* reference, but you're precisely trying to represent something which fails interpreting as a character.

    (AFAIK, XML can't represent non-text data, except in NDATA sequences)

    ncoghlan commented 10 years ago

    Note that pairing fsencode with 'utf-8' isn't guaranteed to do the right thing. It would work for the default C locale (since that's ASCII), but not in the general case.

    Enhancing backslashreplace to also work on input is an interesting idea, but worth making it's own RFE: http://bugs.python.org/issue22286

    I also agree we can ignore xmlcharrefreplace here.

    So that leaves the basic pattern as:

    data.encode('utf-8', 'surrogateescape').decode('utf-8', 'replace') data.encode('utf-8', 'surrogateescape').decode('utf-8', 'ignore') data.encode('utf-8', 'surrogateescape').decode('utf-8', 'backslashreplace')

    This wouldn't allow the option of substituting an ASCII question mark, but I'd be OK with that.

    Possible function name and implementation:

        def convert_surrogateescape(data, errors='replace'):
            return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)
    
    Added bonus: pass "errors='strict'" and you'll get an exception if there were any surrogate escaped values in the string. (I take that emergent property as a sign that we're converging on a sensible design here)

    Adding a fast path for keeping track of whether or not a string contains escaped surrogates would then be a separate RFE.

    ncoghlan commented 9 years ago

    Updated issue title to reflect current proposal.

    malemburg commented 9 years ago

    Don't like the function name :-)

    How about codecs.filter_non_utf8_data(), since that's closer to what the function is really doing and doesn't require knowledge about what "surrogateescape" is.

    ncoghlan commented 9 years ago

    The error handler is called "surrogateescape". That means "convert_surrogateescape" is always only a single step away from thinking "I want to remove the smuggled bytes from a surrogateescape'd string", without needing to assume any knowledge on the part of the user other than the name of the error handler and the fact that it is used to smuggle arbitrary bytes through the Python 3 str type.

    Getting from "this string was decoded with the surrogateescape handler and may contain smuggled bytes" to "filter_non_utf8_data" as the relevant cleanup function is a much bigger leap that requires more assumed knowledge on the part of the user, and also one that confuses the conceptual purpose of the function (cleaning up the output of the surrogateescape error handler to ensure it is a pure Unicode string) with the internal details of the proposed approach to implementing that cleanup operation (encoding to UTF-8 with surrogateescape, and then decoding again with a different error handler).

    ncoghlan commented 9 years ago

    The function definition again, this time with a draft docstring:

        def convert_surrogateescape(data, errors='replace'):
            """Convert escaped raw bytes by applying a different error handler
        Uses the "replace" error handler by default, but any input
        error handler may be specified.
        """
        return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)
    ncoghlan commented 9 years ago

    Note I would also be OK with "convert_surrogates", as that's the term that appears in the relevant error message:

    >>> b'\xe9'.decode('ascii', 'surrogateescape').encode()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udce9' in position 0: surrogates not allowed
    pitrou commented 9 years ago

    Le 23/09/2014 12:57, Nick Coghlan a écrit :

    The function definition again, this time with a draft docstring:

    def convert_surrogateescape(data, errors='replace'):
        """Convert escaped raw bytes by applying a different error handler
    
        Uses the "replace" error handler by default, but any input
        error handler may be specified.
        """
        return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)

    'utf-8' is hardcoded?

    ncoghlan commented 9 years ago

    Draft docstring for that version

        def convert_surrogates(data, errors='replace'):
            """Convert escaped surrogates by applying a different error handler
        Uses the "replace" error handler by default, but any input
        error handler may be specified.
        """
        return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)
    ncoghlan commented 9 years ago

    Antoine: what would be the use case for using a different encoding for the temporary bytes object? It's discarded anyway, so the encoding used isn't externally visible.

    pitrou commented 9 years ago

    The encoding used impacts the result:

    >>> s = 'abc\udcc3\udca9'
    >>> s.encode('ascii', 'surrogateescape').decode('ascii', 'replace')
    'abc��'
    >>> s.encode('utf-8', 'surrogateescape').decode('utf-8', 'replace')
    'abcé'

    The original string ('abc\udcc3\udca9') was obtained by decoding a valid utf-8 string with the 'ascii' codec and the 'surrogateescape' error handler.

    If anything, the default encoding should probably be sys.getfilesystemencoding().

    malemburg commented 9 years ago

    On 23.09.2014 13:12, Nick Coghlan wrote:

    Nick Coghlan added the comment:

    Draft docstring for that version

    def convert_surrogates(data, errors='replace'):
        """Convert escaped surrogates by applying a different error handler
    
        Uses the "replace" error handler by default, but any input
        error handler may be specified.
        """
        return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)

    Nick, the doc string is not correct. It is not working on escaped surrogates. Instead it is working on lone surrogates that were used to encode undecodable bytes from some input data.

    The longer story goes like this:

    The "surrogateescape" error handler in the .decode() call that lead up to the data you want this function to take as input, will convert undecodable data to lone low surrogates.

    The function then reverts these bytes back into UTF-8 (which may well not be the original encoding, as Antoine has already pointed out, but that's not really important for the use case), recreating the unencodable bytes and then decodes the result again using the UTF-8 codec using a new error handler.

    So in summary, the function is supposed to retroactively apply a different error handler to the input data, undoing the effects of the "surrogateescapes" error handler.

    The name still doesn't match this functionality.

    BTW: There's a catch in the approach. The encoding used to decode the original data may well be 'ascii'. Now, if the original input data was in fact UTF-8, the input decoding would have mapped the UTF-8 code points to lone surrogates. The above function would then turn these back into UTF-8, redecode and get a completely different string back (since the error handlers would not trigger).

    I'm not sure whether adding such a small function with so many unclear implications is a good idea. Either it should be made more specific, e.g. be reserved for use on data from input streams with known encoding, or be put into the documentation as example for people to use and adapt as necessary.

    bitdancer commented 9 years ago

    And indeed my use case for this has instances of both cases: originally decoded using ASCII and the non-ascii bytes must end up as replaced characters, and originally decoded using utf-8.

    I'm also not sure that it is worth adding this. If you know what you are doing the solution is obvious, and if you don't know what you are doing you shouldn't be using surrogateescape in the first place :)

    Now, if there were or there is intended to be a more efficient C level implementation, that answer might be different.

    bitdancer commented 9 years ago

    Oh, wait, I forgot that the context for this was dealing with unix filenames and/or stdio. So, a function that just uses the fsencoding to do the replace might indeed be appropriate, but in that case should probably live in the os module. os.convert_surrogates?

    ncoghlan commented 9 years ago

    As RDM noted, avoiding the use of surrogateescape isn't feasible when we do it by default on all OS interfaces (including the standard streams when we detect 'ascii' as the filesystem encoding in 3.5+).

    This *needs* to be a case that folks can handle without needing to spend years learning about encodings and error handlers first. That means being able to tell them "use this documented function to remove the surrogates" rather than "use this magic incantation that you don't understand, and that other people may not be able to read".

    I know more about Unicode encodings than the average programmer at this point, yet I still needed to be schooled by true experts in this thread to learn how to solve the problem properly.

    Look at this as an opportunity to encapsulate that knowledge in executable form, as while the code is short, it is conceptually *very* dense.

    If there's a dedicated function, then replacing the encode/decode dance with a faster pure C alternative also becomes a future possibility (with only a recipe, there's no opportunity to ever optimise it).

    With the additional clarification, it is also clear to me that Antoine is correct that the encoding needs to be configurable and should default to the appropriate setting to remove the surrogates from OS provided data.

    With that change:

        def convert_surrogates(data, encoding=None, errors='replace'):
            """Convert escaped surrogates by applying a different error handler
        If no encoding is given, defaults to sys.getfilesystemencoding()
        Uses the "replace" error handler by default, but any input
        error handler may be specified.
        """
        if encoding is None:
            encoding = sys.getfilesystemencoding()
        return data.encode(encoding, 'surrogateescape').decode(encoding, errors)

    Since it's primarily intended for cleaning OS provided data, then I agree os.convert_surrogates() could be a good choice. It would be appropriate to reference it from os.fsdecode() as a way to clean escaped data when the original binary data was no longer available to be decoded again with a different error handler.

    serhiy-storchaka commented 9 years ago

    Good catch Antoine!

    Here is a sample of more complicated implementation.

    ncoghlan commented 9 years ago

    Ah, Serhiy's approach of avoiding the encode/decode dance entirely is an even better idea - replacing the lone surrogates directly with the output of the alternative error handler avoids any need to worry about the original encoding.

    serhiy-storchaka commented 9 years ago

    Proposed preliminary patch adds three functions in the codecs module:

    convert_surrogates(data, errors) -- handle lone surrogates with specified error handler.

    >>> codecs.convert_surrogates('a\u20ac\udca4', 'backslashreplace')
    'a€\\udca4'

    convert_surrogateescape(data, errors) -- handle surrogateescaped bytes with specified error handler

    >>> codecs.convert_surrogateescape('a\u20ac\udca4', 'backslashreplace')
    'a€\\xa4'

    convert_astrals(data, errors) -- handle astral (non-BMP) characters with specified error handler.

    >>> codecs.convert_astral('a\u20ac\U000e007f', 'backslashreplace')
    'a€\\U000e007f'

    Names are discussable.

    I think also about adding two functions or error handlers (that can used with convert_surrogates and convert_astrals) for composing astral characters from surrogate pairs and vice versa.

    ncoghlan commented 9 years ago

    (Serhiy, did you miss uploading the new patch?)

    Regarding the names, we may need to think about the use cases a bit more explicitly to clarify that in terms of the Python codecs API rather than expecting folks to understand the underlying representation. In the case of handling lone surrogates and escaped surrogates, what about:

        rehandle_surrogatepass(data, errors="strict")
        rehandle_surrogateescape(data, errors="strict")

    That is, we know we have data that was decoded with either surrogatepass or surrogateespace (respectively) as the error handler, and we want to process the results of that with a different error handler.

    I believe those two would be enough to address the specific cases this issue was raised to cover, so it may make sense to file a separate issue to discuss the use cases for the custom astral handling.

    Since astrals aren't actually errors in the first place, that could become:

        handle_astrals(data, errors="strict")

    As in "pass every astral code point in this string through the named error handler".

    The astral -> surrogate pair and surrogate pair -> astral converters do sound potentially interesting, but as noted above, I think they may call for a separate issue that better explains the specific use cases.

    serhiy-storchaka commented 9 years ago

    I uploaded the patch just before your comment Nick.

    Here is updated patch. Functions are renamed as Nick suggested, added two more functions: decompose_astrals() and compose_surrogate_pairs(). They are mainly for example here, they can be committed in other issue.

    I hesitate about the rehandle_surrogatepass name. This function handles surrogates than can be created not only with the "surrogatepass" handler, but also with different ways, e.g. with the "surrogateescape" handler, with chr(), handle_astral() or decompose_astrals(). Actually it checks that the string is valid Unicode (not containing surrogates) and handle errors if found with specified error handler.

    May be there is a time for wider discussion on Python-Dev. I especially want to hear opinions of Ezio and Martin.

    ncoghlan commented 9 years ago

    I'd wondered about that with respect to rehandle_surrogatepass.

    The current implementation looks like it processes *all* surrogates (even valid surrogate pairs), so "handle_surrogates" might be a suitable name.

    If the intent is for it to be "handle_lone_surrogates", I'm not sure the current implementation achieves that, as a valid surrogate pair will match re.compile('[\ud800-\uefff]+').

    The rest looks OK to me, including the decompose_astrals() and compose_surrogate_pairs() functions. Regardless of any practical utility, the latter two seem useful for *educational* purposes when it comes to unicode, by making it clear how to switch between the single code point and dual code point representations of the astrals.

    ncoghlan commented 9 years ago

    Oh, and yes, I agree a python-dev discussion would be a good idea.

    From my perspective, "rehandle_surrogateescape" is the key function for making it easier to check for malformed input data from operating system interfaces.

    The other items I don't personally have a use case for, but they seem potentially valuable in make some key Unicode concepts a bit more discoverable.

    serhiy-storchaka commented 9 years ago

    Note that provided Python implementations are rather a proof of concept. After discussion I'll provide more efficient C implementations, that should be 1-2 orders faster (and infinitely fast for common case of ASCII strings).

    1f733644-4c7c-4389-9314-175dea2a2b84 commented 9 years ago

    Please do not add the "rehandle" functions to codecs. They do not change the (duck-typed) representation of data while maintaining the semantics, they change the semantics of data while retaining the representation.

    I suggest a "validation" submodule of the unicodedata package, or perhaps a new "unicodeutils" package, for these functions, as well as those that just detect the surrogates, etc.

    Because they change the semantics of data they should be documented as potentially dangerous because they can't be inverted back to bytes without knowledge of the history of transformations they perform (and not even then in the case of the "replace" error handler). This matters in applications where the input bytes may have been digitally signed, for example.

    ncoghlan commented 9 years ago

    surrogateescape and surrogateepass data *already* can't be inverted back to bytes reliably without knowing the original encoding - if you encode them as something else when they contain surrogates, you'll either get an exception (the default) or mojibake (if you use surrogateescape/surrogateepass as the output error handler). They only work as a transparent pass through if the input and output encodings match.

    I'd be fine with putting these data scrubbing functions somewhere other than in codecs, though (I'm not sure unicodedata is the right place, but a new module like "string.internals" might be, as these functions have more to do with Python's internal text representation than they do anything else. A module like the latter could also be a home for things like a chunking utility that splits a string up into substrings that use as little memory as possible for feeding into a StringIO instance before throwing the original away).

    I also don't think they're urgent - the introduction of /etc/locale.conf makes modern Linux far more consistent in getting locale settings right, and even older platforms tend to get the locale right for user processes.

    ncoghlan commented 9 years ago

    I suggest we defer this one to 3.6 - I still think it's worth doing, but I don't think it's a major barrier to migration, and it would be good to get some real world experience with the new sys.stdin behaviour of defaulting to using surrogateescape in the POSIX locale in 3.5 before committing to a particular design for the surrogate cleaning API.

    I do like the idea of a "string.internals" submodule as a possible home for exposing the Python level API.

    vadmium commented 8 years ago

    [padding]

    I think my suggested colours for the bikeshed would be handle_surrogates() and handle_surrogateescape(). “Rehandle” seems awkward and too assuming to me. And I agree with Serhiy that surrogates are a Unicode thing, not just related to the “surrogatepass” handler.

    Adding them to “codecs” makes sense to me. The most important one, handle_surrogateescape() or equivalent, is closely related to the error handler of that module.

    Having handle_surrogateescape or equivalent would probably be useful for bpo-25184 (displaying an arbitrary file path in a UTF-8 HTML file).

    vstinner commented 8 years ago

    Hum, I suggest to put these functions in a package on PyPI, or recipes on a website like stackoverfkow., and close the issue.

    I'm still not convinced that these functions are useful . Usually we take a function from an existing project used in applications to put it in the stdlib. Here the use case still looks artifical. For example which application requires to escape non-BMP character? How does it handle them currently?

    Threre are too many ways to handle surrogate characters. The common ways to show undecodable bytes are not supported by functions proposed by Serhiy. Example: %80 on Mac OS X. Gnome uses something else.

    It was said that one reason to add new functions is performance. I'm not convinced neither that such function is the bottleneck on any application.

    I prefer to wait until users experiment with their own implementation and see if a common function can be extracted from this to put it in the stdlib.

    ncoghlan commented 8 years ago

    I think moving this forward mainly needs someone with the time and energy wrangle a python-ideas/dev discussion to get some additional feedback on the API design. As I see it, there are 2 main questions to be resolved:

    1. Where to expose these functions

    The default location would be the codecs module, as they're closely related to the error handlers in that module, and the main reasons for needing to clean data at all are handling dirty data produced by an interface that uses surrogatepass or surrogateescape when decoding (handle_surrogates, handle_surrogateescape), or encoding data for use in a context which doesn't correctly handle code points outside the basic multilingual plane (handle_astrals).

    If added to the codecs module, they could be documented in new sections on "Postprocessing decoded text" and "Preprocessing text for encoding".

    The main argument against that would be Stephen's one, which is that these aren't themselves encoding or decoding operations, but rather internal state manipulations on Python strings.

    1. The exact function set to be provided.

    The three potential data cleaning cases currently being considered:

    These seem to cover the essentials to me, and I changed the proposed prefix to "process_*" based on the idea of documentating them as preprocessing and postprocessing steps for encoding and decoding.

    ncoghlan commented 8 years ago

    As far as the rationale for adding the functions at all goes, my main interest is still in having somewhere in the codecs module documentation to *define the problem*, and to my mind that entails also offering a simple way to do the relevant pre-/post-processing.

    The nice aspect of building any related capabilities atop the standard error handlers is that it also means that third party modules can provide custom error handlers to support further escaping techniques, and those will also be available for use in decoding and encoding operations, rather than being specific to pre-/post-processing of the data.

    However, it's also the case that we're generally going to be talking about the combination of encoding misconfiguration *and processing data that gets potentially corrupted by the misconfiguration *and doing something with it that isn't already handled by a surrogateescape round-trip, which is why I suspect in practice most applications are going to be able to get away with ignoring the problem entirely (especially with C.UTF-8 support coming to Fedora 24, so the Fedora/RHEL/CentOS ecosystem will be joining the Debian/Ubuntu ecosystem in offering that by default)