python / cpython

Bytes version of sys.argv #53022

Closed vstinner closed 10 years ago

vstinner commented 14 years ago
BPO 8776
Nosy @malemburg, @loewis, @rhettinger, @amauryfa, @ncoghlan, @vstinner, @ezio-melotti
Files
  • argvb.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    ```python
    assignee = 'https://github.com/ncoghlan'
    closed_at =
    created_at =
    labels = ['interpreter-core', 'type-feature', 'expert-unicode']
    title = 'Bytes version of sys.argv'
    updated_at =
    user = 'https://github.com/vstinner'
    ```

    bugs.python.org fields:

    ```python
    activity =
    actor = 'ncoghlan'
    assignee = 'ncoghlan'
    closed = True
    closed_date =
    closer = 'ncoghlan'
    components = ['Interpreter Core', 'Unicode']
    creation =
    creator = 'vstinner'
    dependencies = []
    files = ['19355']
    hgrepos = []
    issue_num = 8776
    keywords = []
    message_count = 16.0
    messages = ['106140', '106141', '106143', '106172', '111754', '111757', '111770', '111818', '111819', '119255', '119528', '133608', '217377', '217408', '217416', '217475']
    nosy_count = 8.0
    nosy_names = ['lemburg', 'loewis', 'rhettinger', 'amaury.forgeotdarc', 'ncoghlan', 'vstinner', 'ezio.melotti', 'Arfrever']
    pr_nums = []
    priority = 'normal'
    resolution = 'rejected'
    stage = 'needs patch'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue8776'
    versions = ['Python 3.5']
    ```

    vstinner commented 14 years ago

    In some situations, the encoding of the command line is incorrect or unknown. sys.argv is decoded with the file system encoding, which can be wrong. E.g. see issue bpo-4388 (ok, it's a bug, it should be fixed).

    Like os.environb, it would be useful to have a bytes version of sys.argv, to be able to decide the encoding used to decode each argument, or to manipulate bytes if we don't care about the encoding.

    See also issue bpo-8775, which proposes to add a new encoding to decode sys.argv.

    amauryfa commented 14 years ago

    sys.argv is decoded with the file system encoding

    IIRC this is not exact. Py_Main signature is Py_Main(int argc, wchar_t **argv) then PyUnicode_FromWideChar is used, and there is no conversion (except from UCS4 to UCS2). The wchar_t strings themselves are built with mbstowcs(), the file system encoding is not used.

    vstinner commented 14 years ago

    The wchar_t strings themselves are built with mbstowcs(), the file system encoding is not used.

    Oops sorry, you are right, and it's worse :-) sys.argv is decoded using the locale encoding, but subprocess and co. use the file system encoding for the reverse operation. => it doesn't work if the two encodings are different (bpo-4388, bpo-8775).

    The pseudo-code to create sys.argv on Unix is:

     # argv is a bytes list
     encoding = locale.getpreferredencoding()
     sys.argv = [arg.decode(encoding, 'surrogateescape') for arg in argv]
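
The round trip that this pseudo-code relies on can be sketched as follows. The byte strings and the choice of UTF-8 as the locale encoding are assumptions made for illustration; they stand in for whatever the C-level argv and the real locale provide.

```python
# Raw bytes as a C main() might receive them (illustrative values).
# The second argument is not valid UTF-8.
raw_argv = [b"prog", b"caf\xe9", b"na\xc3\xafve"]

encoding = "utf-8"  # assumed locale encoding for this sketch

# Decode as in the pseudo-code: undecodable bytes become lone surrogates.
decoded = [arg.decode(encoding, "surrogateescape") for arg in raw_argv]

# Encoding with the same codec and error handler restores the original bytes.
restored = [arg.encode(encoding, "surrogateescape") for arg in decoded]

# Encoding with a different codec (Latin-1 here) does not give back the
# original bytes for the argument that was valid UTF-8.
mismatched = decoded[2].encode("latin-1")
```

The last line mirrors the mismatch described above: when the codec used for the reverse operation differs from the one used for decoding, the bytes handed to a child process are not the bytes that were received.
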
    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 14 years ago

    Like os.environb, it would be useful to have a bytes version of sys.argv, to be able to decide the encoding used to decode each argument, or to manipulate bytes if we don't care about the encoding.

    -1. Py_Main expects wchar_t*, so no byte-oriented representation of the command line is readily available.

    vstinner commented 14 years ago

    "no byte-oriented representation of the command line is readily available."

    Why not using the following recipe?

     encoding = locale.getpreferredencoding()
     sys.argvb = [arg.decode(encoding, 'surrogateescape') for arg in argv]
    vstinner commented 14 years ago

    You should read .encode(), not .decode() :-/

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 14 years ago

    Using that approach would work on POSIX systems.

    Another problem I see is synchronizing the two. If some function strips arguments from sys.argv (because it has completed processing), sys.argvb would still keep the arguments. Of course, this could be fixed by having sys.argvb be a dynamic list (i.e. a sequence object) instead.
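
One way to picture such a dynamic sequence is a view that stores nothing itself and converts on every access. This is only a sketch: the class name `BytesArgvView` is hypothetical, and using `os.fsencode`/`os.fsdecode` for the conversion is an assumption, not something proposed in this comment.

```python
import os
from collections.abc import MutableSequence


class BytesArgvView(MutableSequence):
    """Hypothetical bytes view over a list of str arguments.

    Reads encode on the fly and writes decode into the underlying
    list, so the text and bytes sides cannot drift apart.
    """

    def __init__(self, text_argv):
        self._argv = text_argv  # shared reference, not a copy

    def __len__(self):
        return len(self._argv)

    def __getitem__(self, index):
        if isinstance(index, slice):
            return [os.fsencode(arg) for arg in self._argv[index]]
        return os.fsencode(self._argv[index])

    def __setitem__(self, index, value):
        if isinstance(index, slice):
            self._argv[index] = [os.fsdecode(v) for v in value]
        else:
            self._argv[index] = os.fsdecode(value)

    def __delitem__(self, index):
        del self._argv[index]

    def insert(self, index, value):
        self._argv.insert(index, os.fsdecode(value))


argv = ["prog", "--verbose", "input.txt"]  # stand-in for sys.argv
argvb = BytesArgvView(argv)
argv.remove("--verbose")  # a parser strips an option it has handled
```

After the `remove()` call, iterating over `argvb` no longer yields the stripped option, because the view reads from the same list object.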

    vstinner commented 14 years ago

    Using that approach would work on POSIX systems.

    Like os.environb, I think that sys.argvb should not exist on Windows.

    Another problem I see is synchronizing the two

    os.environ and os.environb are synchronized. It would be possible to do the same with sys.argv and sys.argvb. The implementation would be simpler because it's just a list, not a dict.

    malemburg commented 14 years ago

    STINNER Victor wrote:

    STINNER Victor <victor.stinner@haypocalc.com> added the comment:

    > Using that approach would work on POSIX systems.

    Like os.environb, I think that sys.argvb should not exist on Windows.

    > Another problem I see is synchronizing the two

    os.environ and os.environb are synchronized. It would be possible to do the same with sys.argv and sys.argvb. The implementation would be simpler because it's just a list, not a dict.

    +1 on adding sys.argvb for systems that use char* in main().

    vstinner commented 14 years ago

    Since r85765 (issue bpo-4388), Python always uses UTF-8 to decode the command line arguments on Mac OS X, not the locale encoding. This means that the pseudo-code becomes:

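     import locale
     import os
     import sys

     if os.name != 'nt':
         if sys.platform == 'darwin':
             encoding = 'utf-8'
         else:
             encoding = locale.getpreferredencoding()
         # sys.argv holds str, so encode (not decode) to get the bytes back
         sys.argvb = [arg.encode(encoding, 'surrogateescape') for arg in sys.argv]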

    sys.argvb should be synchronized with sys.argv, as os.environb with os.environ.

    vstinner commented 14 years ago

    Prototype (in Python) of argvb.py. Try it with: ./python -i argvb.py.

    It's not possible to create sys.argvb in Python in a module loaded by Py_Initialize(), because sys.argv is created after Py_Initialize().

    vstinner commented 13 years ago

    One year after opening the issue, I don't have any real use case. And there are technical issues with implementing this feature, so I prefer just to close this issue. Reopen it if you really want it, but please give a use case ;-)

    ncoghlan commented 10 years ago

    I'd like to revisit this after PEP-432 is in place, since having to do this dance for arg processing when running on Linux in the POSIX locale is somewhat lame:

        argv = sys.argv
        encoding = locale.getpreferredencoding() # Hope nobody changed the locale!
        fixed_encoding = read_encoding_from("/etc/locale.conf") # For example
        argvb = [arg.encode(encoding, "surrogateescape") for arg in argv]
        fixed_argv = [arg.decode(fixed_encoding, "surrogateescape") for arg in argvb]

    (For stricter parsing, leave out the second "surrogateescape")

    Now, if PEP-432 resolves the system encoding issue such that we are able to use the right encoding even when locale.getpreferredencoding() returns the wrong answer, then it may not be worthwhile to also provide sys.argvb (especially since it won't help hybrid 2/3 code). On the other hand, like os.environb, it does make it easier for POSIX-only code paths that want to handle boundary encoding issues directly to stick with consuming the binary data and avoid the interpreter's automatic conversion to the text domain.

    Note also that os.environb is only available when os.supports_bytes_environ is True, so it would make sense to only provide sys.argvb in the circumstances where we provide os.environb.
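
A minimal sketch of that gating, assuming a helper in user code rather than a real `sys.argvb` attribute. The function name `get_argvb` is hypothetical; `os.supports_bytes_environ` and `os.fsencode` are existing standard library names.

```python
import os
import sys


def get_argvb(argv=None):
    """Return a bytes copy of argv, or None where os.environb is unavailable.

    Hypothetical helper: it reuses os.supports_bytes_environ as the gate,
    so a bytes argv is only offered where os.environb is offered.
    """
    if argv is None:
        argv = sys.argv
    if not os.supports_bytes_environ:
        return None
    return [os.fsencode(arg) for arg in argv]


argvb = get_argvb(["prog", "data.txt"])
```

On platforms where `os.supports_bytes_environ` is False (Windows), the helper returns None instead of a list.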

    rhettinger commented 10 years ago

    Without commenting on this specific proposal, I would like to make an overall observation that Python is impairing its usability by adding too-many-ways-to-do-it in a number of categories (file descriptor variants of file methods, multiple versions of time.time, byte variants of everything that is done with strings). Python 3 was intended to be a cleaner, more learnable version of Python. Instead, it is growing enums, multiple dispatch, and multiple variants of every function. Professional programmers can be well served by some of these tools, but the Python universe is much larger than that and the other users are not being well served by these additions (too many choices impair usability and learnability).

    vstinner commented 10 years ago

    Today I regret os.environb (I added it). If I remember correctly, os.environb was added before PEP-383 (surrogateescape). This PEP makes os.environb almost useless. In Python 3, Unicode is the natural choice, and thanks to PEP-383, it's still possible to use any "raw bytes".

    argvb can be computed in one line: list(map(os.fsencode, sys.argv)).
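
As a rough check of that one-liner on POSIX, the sketch below simulates the interpreter's decoding with `os.fsdecode` and then recovers the bytes. The sample arguments are made up, and the round trip for the undecodable byte assumes a POSIX platform, where the filesystem error handler is surrogateescape.

```python
import os

# Illustrative raw arguments; the second contains a byte (0xff) that is
# not valid in UTF-8 or ASCII.
raw = [b"prog", b"report-\xff.txt"]

# Simulate what the interpreter exposes as text (stand-in for sys.argv).
text_argv = [os.fsdecode(arg) for arg in raw]

# The one-liner from the comment above, applied to the simulated argv.
argvb = list(map(os.fsencode, text_argv))
```

With surrogateescape in effect, `argvb` compares equal to `raw`, which is why a separate bytes attribute adds little.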

    I now suggest closing this issue as wontfix.

    ncoghlan commented 10 years ago

    Makes sense to me. Assuming we eventually manage to resolve the POSIX locale issue, the bytes variant will become even less useful.