python / cpython

The Python programming language
https://www.python.org
Other
62.65k stars 30.05k forks source link

Documentation for format units starting with 'e' is inconsistent #68168

Open larryhastings opened 9 years ago

larryhastings commented 9 years ago
BPO 23980
Nosy @malemburg, @birkenfeld, @vstinner, @larryhastings, @benjaminp, @vadmium, @serhiy-storchaka, @ZackerySpytz
PRs
  • python/cpython#19610
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = 'https://github.com/larryhastings' closed_at = None created_at = labels = ['3.8', 'type-feature', '3.7', '3.9', 'docs'] title = "Documentation for format units starting with 'e' is inconsistent" updated_at = user = 'https://github.com/larryhastings' ``` bugs.python.org fields: ```python activity = actor = 'ZackerySpytz' assignee = 'larry' closed = False closed_date = None closer = None components = ['Documentation'] creation = creator = 'larry' dependencies = [] files = [] hgrepos = [] issue_num = 23980 keywords = ['patch'] message_count = 8.0 messages = ['241292', '241298', '241307', '241313', '241318', '241319', '286430', '286699'] nosy_count = 8.0 nosy_names = ['lemburg', 'georg.brandl', 'vstinner', 'larry', 'benjamin.peterson', 'martin.panter', 'serhiy.storchaka', 'ZackerySpytz'] pr_nums = ['19610'] priority = 'normal' resolution = None stage = 'patch review' status = 'open' superseder = None type = 'enhancement' url = 'https://bugs.python.org/issue23980' versions = ['Python 3.7', 'Python 3.8', 'Python 3.9'] ```

    larryhastings commented 9 years ago

    Documentation is here:

    https://docs.python.org/3/c-api/arg.html#arg-parsing

    The first line of documentation for each format unit follows this convention: formatunit (pythontype) [arguments, to, pyarg_parsetuple]

    These represent the format unit itself, followed by the Python type it consumes in parentheses, followed by the C types it outputs in square brackets. Thus i (int) [int] means the format unit is 'i', it consumes a Python 'int', and it produces a C 'int'. Similarly, s (str) [const char *] means the format unit is 's', it consumes a Python 'str', and it produces a C 'const char *'.

    When you call PyArg_ParseTuple (AndKeywords), you pass in a pointer to the thing you expect. If it gives you an int, you pass in &my_int. So the type of the expression you pass in for 'i' is actually "int *". And the type you pass in for 's' is actually "char **".

    The format units that deal with encodings are a bit weirder. You actually pass in a const char string first, followed by the buffer you want to write data too. Technically the types of the values you pass in for "es" are "const char *, char **". But the documentation for es says es (str) [const char *encoding, char *\buffer]

    This led me to believe that I actually had to pass in a "char ***" for buffer! Which is wrong and doing so makes your programs explode-y.

    The documentation should

    My suspicion is that the things in brackets have to be the precise C type, e.g. "int *" for i, "char **" for s, "const char *, char **" for es.

    malemburg commented 9 years ago

    On 17.04.2015 01:38, Larry Hastings wrote:

    Documentation is here:

    https://docs.python.org/3/c-api/arg.html#arg-parsing

    The first line of documentation for each format unit follows this convention: formatunit (pythontype) [arguments, to, pyarg_parsetuple]

    These represent the format unit itself, followed by the Python type it consumes in parentheses, followed by the C types it outputs in square brackets. Thus i (int) [int] means the format unit is 'i', it consumes a Python 'int', and it produces a C 'int'. Similarly, s (str) [const char *] means the format unit is 's', it consumes a Python 'str', and it produces a C 'const char *'.

    When you call PyArg_ParseTuple (AndKeywords), you pass in a pointer to the thing you expect. If it gives you an int, you pass in &my_int. So the type of the expression you pass in for 'i' is actually "int *". And the type you pass in for 's' is actually "char **".

    The format units that deal with encodings are a bit weirder. You actually pass in a const char string first, followed by the buffer you want to write data too. Technically the types of the values you pass in for "es" are "const char *, char **". But the documentation for es says es (str) [const char *encoding, char *\buffer]

    You need to pass in a variable which will then be set up to point to a buffer which will be written too :-)

    The "e" variants (typically) allocate a buffer for you, since it's pretty much unknown how long the encoded data will be.

    This led me to believe that I actually had to pass in a "char ***" for buffer! Which is wrong and doing so makes your programs explode-y.

    Indeed :-)

    The documentation should

    • explain this first-line convention precisely, and

    • use the types consistently.

    My suspicion is that the things in brackets have to be the precise C type, e.g. "int *" for i, "char **" for s, "const char *, char **" for es.

    The paragraph under "Parsing argument" says:

    """ In the following description, the quoted form is the format unit; the entry in (round) parentheses is the Python object type that matches the format unit; and the entry in [square] brackets is the type of the C variable(s) whose address should be passed. """

    So I guess the "e" descriptions need to have the additional * removed or the paragraph has to be updated and all other listings need to be converted to precise types (that would be my preference).

    I wonder why no one has noticed in all these years. I apparently had understood the listings back to be precise C types back in the days I added the documentation for the "e" codes: https://hg.python.org/cpython-fullhistory/rev/3ae06c57d09e).

    The descriptions for the codes do clarify what is going on, though.

    larryhastings commented 9 years ago

    The "e" variants (typically) allocate a buffer for you, since it's pretty much unknown how long the encoded data will be.

    All four will do it if you pass in a NULL pointer. "es#" and "et#" can reuse an existing buffer, because you can also pass in its size.

    So I guess the "e" descriptions need to have the additional * removed or the paragraph has to be updated and all other listings need to be converted to precise types (that would be my preference).

    Here's the problem with removing the "additional *". The first argument to encoding is a static string. You literally pass in the char * string, not a pointer to a variable containing the address of the string. e.g.

      PyArg_ParseTuple("es", args, "utf-8", &buffer);

    So how do we annotate that? [char, char *]?

    I wonder why no one has noticed in all these years.

    Because nobody ever freaking uses the "e" format units. But by golly I'm going to see that the docs are right and Clinic supports them correctly--or die trying.

    larryhastings commented 9 years ago

    One more idea. We annotate with an & when you pass in a pointer to a variable. So format unit 'i' would get [& int], 's' would get [& char *], and 'es#' would get [char *, & char *, & int].

    malemburg commented 9 years ago

    On 17.04.2015 06:42, Larry Hastings wrote:

    Larry Hastings added the comment:

    One more idea. We annotate with an & when you pass in a pointer to a variable. So format unit 'i' would get [& int], 's' would get [& char *], and 'es#' would get [char *, & char *, & int].

    That sounds like a nice solution.

    malemburg commented 9 years ago

    On 17.04.2015 05:50, Larry Hastings wrote:

    Larry Hastings added the comment:

    > The "e" variants (typically) allocate a buffer for you, since it's pretty > much unknown how long the encoded data will be.

    All four will do it if you pass in a NULL pointer. "es#" and "et#" can reuse an existing buffer, because you can also pass in its size.

    Right.

    > So I guess the "e" descriptions need to have the additional * removed > or the paragraph has to be updated and all other listings need > to be converted to precise types (that would be my preference).

    Here's the problem with removing the "additional *". The first argument to encoding is a static string. You literally pass in the char * string, not a pointer to a variable containing the address of the string. e.g.

    PyArg_ParseTuple("es", args, "utf-8", &buffer);

    So how do we annotate that? [char, char *]?

    A literal static string is really a pointer as well. The compiler will allocate and initialize the string and then provide a pointer to reference it. It is also possible to pass in the pointer directly - it just needs to be a const char *, i.e. one who's target value doesn't change.

    So for "es" this would have to be [const char, char *], but I like your & idea more: [& const char *, & char **]

    > I wonder why no one has noticed in all these years.

    Because nobody ever freaking uses the "e" format units. But by golly I'm going to see that the docs are right and Clinic supports them correctly--or die trying.

    It's not used much, but there are a few cases in the posixmodule and the Mac modules of the Python 2.7 stdlib, as well as in extension modules, of course. The Python 3.4 stdlib does not seem to use them anymore (the uses were replaced with custom solutions or the modules removed).

    I guess people never found out about those parser markers or simply always fetch Unicode as object rather than as encoded string.

    vadmium commented 7 years ago

    The O! and O& units are in a similar situation. They just use a different font and descriptive name, rather than a specific type:

    O! (object) [typeobject, PyObject *] O& (object) [converter, anything]

    Following this lead, you could write:

    ``es`` (:class:`str`) [encoding, char \buffer] ``et`` (. . .) [encoding, char \*buffer] ``es#`` (:class:`str`) [encoding, char \*buffer, int buffer_length] ``et#`` (. . .) [encoding, char \buffer, int buffer_length]

    The text description should explain what *encoding* is, but it appears it may already do that well enough.

    serhiy-storchaka commented 7 years ago

    LGTM.