matiasb / python-unidiff

Unified diff python parsing/metadata extraction library
https://pypi.python.org/pypi/unidiff
MIT License

Latest release v0.7.5 does not include the fix for quoted filenames (for non-ASCII filenames) #113

Open jnareb opened 11 months ago

jnareb commented 11 months ago

I was wondering why unidiff fails on changes to files with filenames that include characters outside 7-bit ASCII, and it turns out that the latest release v0.7.5 does not include commit 2771a87 (Support quoted filenames, 2023-06-02).

Could we please get a new release with this fix included?

Thanks in advance.

matiasb commented 10 months ago

Will prepare a release in the upcoming days :+1:

jnareb commented 10 months ago

Unfortunately, commit 2771a878f7bc6619e625feb4dbad3427f57f5237 does not fully solve the problem of c-style quoted filenames.

It lets unidiff parse a patch with quoted filenames, but it then reproduces those filenames in their original quoted form. Shouldn't unidiff decode such a filename to str if possible, and to bytes if not (e.g. invalid UTF-8)?

All the commit does is let unidiff strip the "a/" or "b/" prefix from filenames even when they are in their c-quoted form.
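To make the question concrete, here is a minimal sketch of the decoding behavior asked about above: str when the unquoted bytes are valid UTF-8, bytes otherwise. The name `unquote_filename_sketch` and its helpers are hypothetical and not part of unidiff; the escape set is assumed to match what git emits for quoted paths (single-character escapes plus three-digit octal bytes).

```python
import re

# Single-character escapes used by git's C-style path quoting (assumed set).
_SIMPLE_ESCAPES = {
    b'a': 0x07, b'b': 0x08, b'f': 0x0C, b'n': 0x0A,
    b'r': 0x0D, b't': 0x09, b'v': 0x0B, b'"': 0x22, b'\\': 0x5C,
}

# A backslash followed by a three-digit octal byte (first digit 0-3),
# by any single byte, or by the end of input (a dangling backslash).
_ESCAPE_RE = re.compile(rb'\\([0-3][0-7]{2}|.|\Z)', re.DOTALL)


def unquote_filename_sketch(text):
    """Hypothetical helper: decode a c-quoted filename.

    Returns str if the decoded bytes are valid UTF-8, bytes otherwise.
    Unquoted input is returned unchanged.
    """
    if not (len(text) >= 2 and text.startswith('"') and text.endswith('"')):
        return text

    def replace(match):
        seq = match.group(1)
        if len(seq) == 3:  # octal escape, e.g. \303
            return bytes([int(seq.decode('ascii'), 8)])
        if seq in _SIMPLE_ESCAPES:
            return bytes([_SIMPLE_ESCAPES[seq]])
        raise ValueError('Invalid escape sequence after backslash: %r' % seq)

    raw = _ESCAPE_RE.sub(replace, text[1:-1].encode('utf-8'))
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return raw  # not valid UTF-8: hand back the raw bytes


print(repr(unquote_filename_sketch('"a/\\303\\263.txt"')))
print(repr(unquote_filename_sketch('"a/\\377.txt"')))
```

Returning two different types is a design choice the library would have to weigh; raising an error or using `errors="surrogateescape"` would be alternatives.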

jnareb commented 10 months ago

Here is some slightly ugly code that actually tries to decode a c-quoted filename; it has not been tested on Python 2.

def decode_c_quoted_str(text):
    """C-style name unquoting

    See unquote_c_style() function in 'quote.c' file in git/git source code
    https://github.com/git/git/blob/master/quote.c#L401

    This is a subset of the escape sequences supported by C and C++
    https://learn.microsoft.com/en-us/cpp/c-language/escape-sequences

    :param str text: string which may be c-quoted
    :return: decoded string
    :rtype: str
    """
    # TODO?: Make it a global variable
    escape_dict = {
        'a': '\a',  # Bell (alert)
        'b': '\b',  # Backspace
        'f': '\f',  # Form feed
        'n': '\n',  # New line
        'r': '\r',  # Carriage return
        't': '\t',  # Horizontal tab
        'v': '\v',  # Vertical tab
    }

    quoted = text.startswith('"') and text.endswith('"')
    if quoted:
        text = text[1:-1]  # remove quotes

        buf = bytearray()
        escaped = False  # TODO?: switch to state = 'NORMAL', 'ESCAPE', 'ESCAPE_OCTAL'
        oct_str = ''

        for ch in text:
            if not escaped:
                if ch != '\\':
                    # Encode as UTF-8: ord(ch) fails above 255 and mis-encodes 128-255
                    buf.extend(ch.encode())
                else:
                    escaped = True
                    oct_str = ''
            else:
                if ch in ('"', '\\'):
                    buf.append(ord(ch))
                    escaped = False
                elif ch in escape_dict:
                    buf.append(ord(escape_dict[ch]))
                    escaped = False
                elif '0' <= ch <= '7':  # octal values with first digit over 3 do not fit in a byte
                    oct_str += ch
                    if len(oct_str) == 3:
                        byte = int(oct_str, base=8)  # byte in octal notation
                        if byte > 255:
                            raise ValueError(f'Invalid octal escape sequence \\{oct_str} in "{text}"')
                        buf.append(byte)
                        escaped = False
                        oct_str = ''
                else:
                    raise ValueError(f'Unexpected character \'{ch}\' in escape sequence when parsing "{text}"')

        if escaped:
            raise ValueError(f'Unfinished escape sequence when parsing "{text}"')

        text = buf.decode()

    return text