python / cpython

The Python programming language
https://www.python.org

Tweak doctest 'example' regex to allow a leading ellipsis in 'want' line #80895

Open 3c565929-31be-498b-b929-dd4817d430a0 opened 5 years ago

3c565929-31be-498b-b929-dd4817d430a0 commented 5 years ago
BPO 36714
Nosy @pfmoore, @bskinn

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:
```python
assignee = None
closed_at = None
created_at =
labels = ['3.8', 'type-feature', 'library']
title = "Tweak doctest 'example' regex to allow a leading ellipsis in 'want' line"
updated_at =
user = 'https://github.com/bskinn'
```
bugs.python.org fields:
```python
activity =
actor = 'bskinn'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation =
creator = 'bskinn'
dependencies = []
files = []
hgrepos = []
issue_num = 36714
keywords = []
message_count = 6.0
messages = ['340788', '340918', '349289', '349304', '349308', '349327']
nosy_count = 2.0
nosy_names = ['paul.moore', 'bskinn']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue36714'
versions = ['Python 3.8']
```

3c565929-31be-498b-b929-dd4817d430a0 commented 5 years ago
doctest requires that code examples use ">>> " as PS1 and "... " as PS2 -- that is, each prompt is three printed characters followed by a space:
$ cat ell_err.py
import doctest

class Foo:
    """Test docstring.

    >>>print("This is a test sentence.")
    ...a test...

    """

doctest.run_docstring_examples(
    Foo(),
    {},
    optionflags=doctest.ELLIPSIS,
)

$ python3.8 --version
Python 3.8.0a3
$ python3.8 ell_err.py
Traceback (most recent call last):
    ...
ValueError: line 3 of the docstring for NoName lacks blank after >>>: '    >>>print("This is a test sentence.")'

$ cat ell_print.py
import doctest

class Foo:
    """Test docstring.

    >>> print("This is a test sentence.")
    ...a test...

    """

doctest.run_docstring_examples(
    Foo(),
    {},
    optionflags=doctest.ELLIPSIS,
)

$ python3.8 ell_print.py
Traceback (most recent call last):
    ...
ValueError: line 4 of the docstring for NoName lacks blank after ...: '    ...a test...'

AFAICT, this behavior is consistent across 3.4.10, 3.5.7, 3.6.8, 3.7.3, and 3.8.0a3.

However, in this ell_print.py above, that "PS2" line isn't actually meant to be a continuation of the 'source' portion of the example; it's meant to be the output (the 'want') of the example, with a leading ellipsis to be matched per doctest.ELLIPSIS rules.
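To make the intended 'want' semantics concrete, here is a minimal sketch that bypasses the example parser and hands the two strings straight to `doctest.OutputChecker.check_output`. The variable names are only illustrative; the point is that the output comparison itself is happy with a leading ellipsis, so the obstacle is purely in how the example is split into 'source' and 'want'.

```python
import doctest

checker = doctest.OutputChecker()

want = "...a test...\n"               # the line I'd like treated as 'want'
got = "This is a test sentence.\n"    # what the print() call actually emits

# With ELLIPSIS enabled, the leading "..." matches "This is ".
matches_with_flag = checker.check_output(want, got, doctest.ELLIPSIS)
# Without the flag, the comparison is literal and fails.
matches_without_flag = checker.check_output(want, got, 0)

print(matches_with_flag, matches_without_flag)
```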

The regex currently used to look for the 'source' of an example is (https://github.com/python/cpython/blob/4f5a3493b534a95fbb01d593b1ffe320db6b395e/Lib/doctest.py#L583-L586):

(?P<source>
    (?:^(?P<indent> [ ]*) >>>    .*)    # PS1 line
    (?:\n           [ ]*  \.\.\. .*)*)  # PS2 lines
\n?

Since this pattern is compiled with re.VERBOSE (https://github.com/python/cpython/blob/4f5a3493b534a95fbb01d593b1ffe320db6b395e/Lib/doctest.py#L592), the space-as-fourth-character in PS1/PS2 is not explicitly matched.
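A small sketch of why the space goes unmatched: under `re.VERBOSE`, whitespace outside a character class is dropped from the pattern, so the padding after `>>>` is layout only. The two fragments below are cut down from the PS1 part of the pattern, and the names `loose_ps1` and `strict_ps1` are just for illustration.

```python
import re

# PS1 fragment as written in the pattern above: the spaces are ignored.
loose_ps1 = re.compile(r'>>>    .*', re.VERBOSE)
# Wrapping the space in a character class makes it significant.
strict_ps1 = re.compile(r'>>>[ ]    .*', re.VERBOSE)

no_space = '>>>print("hi")'
with_space = '>>> print("hi")'

print(bool(loose_ps1.match(no_space)), bool(loose_ps1.match(with_space)))
print(bool(strict_ps1.match(no_space)), bool(strict_ps1.match(with_space)))
```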

I propose changing the regex to:

(?P<source>
    (?:^(?P<indent> [ ]*) >>>[ ]    .*)    # PS1 line
    (?:\n           [ ]*  \.\.\.[ ] .*)*)  # PS2 lines
\n?

This will then explicitly match the trailing space of PS1; it shouldn't break any existing doctests, because the parsing code lower down has already been requiring that space to be present in PS1, as shown for ell_err.py above.

This will also require an explicit trailing space to be present in order for a line starting with three periods to be interpreted as a PS2 line of 'source'; otherwise, it will be treated as part of the 'want'. I made this change in my local user install of 3.8's doctest.py, and it works as I expect on ell_print.py, passing the test:

$ python3.8 ell_print.py
$
$ cat ell_wrongprint.py
import doctest

class Foo:
    """Test docstring.

    >>> print("This is a test sentence.")
    ...a foo test...

    """

doctest.run_docstring_examples(
    Foo(),
    {},
    optionflags=doctest.ELLIPSIS,
)

$ python3.8 ell_wrongprint.py
**********************************************************************
File "ell_wrongprint.py", line ?, in NoName
Failed example:
    print("This is a test sentence.")
Expected:
    ...a foo test...
Got:
    This is a test sentence.
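The difference can also be checked without patching doctest.py, by compiling just the 'source' part of each pattern and seeing where the match ends. This is only a sketch: it assumes the `re.MULTILINE | re.VERBOSE` flags, omits the 'want' half of the real pattern, and `source_lines` is a hypothetical helper.

```python
import re

CURRENT_SOURCE = r'''
(?P<source>
    (?:^(?P<indent> [ ]*) >>>    .*)    # PS1 line
    (?:\n           [ ]*  \.\.\. .*)*)  # PS2 lines
\n?
'''

PROPOSED_SOURCE = r'''
(?P<source>
    (?:^(?P<indent> [ ]*) >>>[ ]    .*)    # PS1 line
    (?:\n           [ ]*  \.\.\.[ ] .*)*)  # PS2 lines
\n?
'''

docstring = (
    '    >>> print("This is a test sentence.")\n'
    '    ...a test...\n'
)

def source_lines(pattern, text):
    """Return the lines that the pattern assigns to 'source'."""
    regex = re.compile(pattern, re.MULTILINE | re.VERBOSE)
    return regex.search(text).group('source').split('\n')

current = source_lines(CURRENT_SOURCE, docstring)
proposed = source_lines(PROPOSED_SOURCE, docstring)

print(len(current))   # the "...a test..." line is swallowed into 'source'
print(len(proposed))  # only the PS1 line is 'source'
```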

For completeness, the following piece of regex in the 'want' section (https://github.com/python/cpython/blob/4f5a3493b534a95fbb01d593b1ffe320db6b395e/Lib/doctest.py#L589):

    (?![ ]*>>>)  # Not a line starting with PS1

should probably also be changed to:

    (?![ ]*>>>[ ])  # Not a line starting with PS1

I would be happy to put together a PR for this; I would plan to take a ~TDD-style approach, implementing a few tests first and then making the regex change.

3c565929-31be-498b-b929-dd4817d430a0 commented 5 years ago

Ahh, this *will* break some doctests: any with blank PS2 lines in the 'source' portion without the explicit trailing space:

1] >>> def foo():
2] ...    print("bar")
3] ...
4] ...    print("baz")
5] >>> foo()
6] bar
7] baz

If line 3 contains exactly "..." instead of starting with "... ", it will not be recognized as a PS2 line and the example will be parsed as:

'source'
>>> def foo():
...    print("bar")

'want'
...
...    print("baz")
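A quick way to reproduce that split, as a sketch: `source_re` is a hypothetical helper that builds a condensed, non-verbose version of the 'source' pattern, with the prompt suffix either empty (current behaviour) or `[ ]` (the proposed change).

```python
import re

def source_re(prompt_suffix):
    """Build the 'source' pattern; prompt_suffix is '' or '[ ]'."""
    return re.compile(
        r'(?P<source>(?:^(?P<indent>[ ]*)>>>%s.*)'
        r'(?:\n[ ]*\.\.\.%s.*)*)\n?' % (prompt_suffix, prompt_suffix),
        re.MULTILINE,
    )

example = (
    '>>> def foo():\n'
    '...    print("bar")\n'
    '...\n'                      # bare PS2 line, no trailing space
    '...    print("baz")\n'
    '>>> foo()\n'
    'bar\n'
    'baz\n'
)

current_source = source_re('').search(example).group('source').split('\n')
proposed_source = source_re('[ ]').search(example).group('source').split('\n')

print(len(current_source))   # all four lines of the def are 'source'
print(len(proposed_source))  # 'source' stops before the bare "..." line
```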

IMO this isn't a *terribly* unreasonable tradeoff, though -- it would enable the specific ellipsis use-case as in the OP, at the cost of breaking some doctests, which shouldn't(?) be in any critical paths?

pfmoore commented 5 years ago

It shouldn't be hard to update the regex to accept either "... " followed by other text or "..." on a line on its own, surely?

3c565929-31be-498b-b929-dd4817d430a0 commented 5 years ago

Mm, agreed--that regex wouldn't be hard to write.
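Something along these lines would probably do it. This is a sketch only, not a tested patch: `RELAXED_SOURCE` is a hypothetical name, and the PS2 alternative accepts either "... " plus text or a bare "..." at the end of a line.

```python
import re

RELAXED_SOURCE = re.compile(
    r'(?P<source>(?:^(?P<indent>[ ]*)>>>[ ].*)'
    r'(?:\n[ ]*\.\.\.(?:[ ].*|$))*)\n?',   # "... text" or bare "..."
    re.MULTILINE,
)

blank_ps2 = (
    '>>> def foo():\n'
    '...    print("bar")\n'
    '...\n'
    '...    print("baz")\n'
    '>>> foo()\n'
)
leading_ellipsis = (
    '>>> print("This is a test sentence.")\n'
    '...a test...\n'
)

blank_ps2_source = RELAXED_SOURCE.search(blank_ps2).group('source')
leading_ellipsis_source = RELAXED_SOURCE.search(leading_ellipsis).group('source')

print(blank_ps2_source.count('\n') + 1)         # 4: the bare "..." stays in 'source'
print(leading_ellipsis_source.count('\n') + 1)  # 1: "...a test..." falls to 'want'
```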

The problem is, AFAICT there's irresolvable syntactic ambiguity in a line starting with exactly three periods, if the doctest PS2 specification is not constrained to be exactly "... ". In such a case, "..." could mark either (1) an ellipsis standing in for an entire line of 'want', or (2) a PS2, marking a blank line in 'source'.

I don't really think aggressive lookahead would help much -- an arbitrary number of following lines could contain exactly "...", and the intended transition from 'source' to 'want' could lie at any one of them. The nonrecursive nature of regex is unhelpful here, but I don't think one could even write a recursive-descent parser, or similar, that could be 100% reliable on a single comparison. It would have to test the string against all the various splits between 'source' and 'want' along those "..." lines, and see if any match. Hairy mess.
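To illustrate the "test every split" idea, here is a rough sketch. `candidate_splits` is a hypothetical helper, not anything in doctest; it assumes a bare "..." may be either a blank PS2 line or a whole-line ellipsis, and it simply enumerates every place the 'source'/'want' boundary could sit.

```python
def candidate_splits(example_lines):
    """Return every plausible (source, want) split of one example.

    example_lines[0] is the PS1 line. A line of "... text" is taken as
    PS2; a bare "..." is ambiguous and may start the 'want' instead.
    """
    ps1, rest = example_lines[0], example_lines[1:]

    # Length of the leading run of lines that could be PS2 lines.
    run = 0
    for line in rest:
        stripped = line.strip()
        if stripped == '...' or stripped.startswith('... '):
            run += 1
        else:
            break

    splits = []
    for i in range(run + 1):
        # The boundary may fall before any bare "..." or after the run.
        if i == run or rest[i].strip() == '...':
            splits.append(([ps1] + rest[:i], rest[i:]))
    return splits

example = ['>>> f()', '...', '...', 'done']
splits = candidate_splits(example)
for source, want in splits:
    print(source, '|', want)
```

Each candidate would then have to be executed and compared separately, which is the part that gets hairy.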

AFAICT, defining "... " as PS2, and "..." as 'ellipsis representing a whole line' is the cleanest solution from a logical point of view.

Of course, then it's *visually* confusing, because trailing space. ¯\_(ツ)_/¯

3c565929-31be-498b-b929-dd4817d430a0 commented 5 years ago

I suppose one alternative solution might be to tweak the ELLIPSIS feature of doctest, such that it would interpret a run of >=3 periods in a row (matching regex pattern of "[.]{3,}") as 'ellipsis'.

The regex for PS2 could then have a negative lookahead added, so that it *only* matches three periods, plus optionally other content: '\.\.\.(?!\.)'

That way, a line like "... foo" would retain the current meaning of "'source' line, consisting of PS2 plus the identifier 'foo'", but the meaning of "arbitrary content followed by ' foo'" could be achieved by ".... foo", since the leading "...." would NOT match the negative lookahead for PS2.

In other situations, where "..." is *not* the leading non-whitespace content, the old behavior suffices: the PS2 regex won't match anyways, so it'll be left for ELLIPSIS to process.
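A sketch of how the two pieces might fit together. The names `PS2_PROMPT`, `ELLIPSIS_RUN`, `is_ps2`, and `ellipsis_match` are hypothetical, and `ellipsis_match` is a deliberately simplified stand-in built on a generated regex, not doctest's own matching code.

```python
import re

# PS2 is exactly three periods, NOT followed by a fourth.
PS2_PROMPT = re.compile(r'[ ]*\.\.\.(?!\.)')
# Under this idea, any run of three or more periods is an ellipsis.
ELLIPSIS_RUN = re.compile(r'[.]{3,}')

def is_ps2(line):
    """True if the line would be read as a PS2 'source' line."""
    return PS2_PROMPT.match(line) is not None

def ellipsis_match(want, got):
    """Simplified ELLIPSIS check: each run of periods matches anything."""
    literals = ELLIPSIS_RUN.split(want)
    pattern = '.*'.join(re.escape(piece) for piece in literals)
    return re.fullmatch(pattern, got, re.DOTALL) is not None

print(is_ps2('... foo'))     # three periods: still a PS2 line
print(is_ps2('.... foo'))    # four periods: left for ELLIPSIS
print(ellipsis_match('.... foo', 'bar foo'))
```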

3c565929-31be-498b-b929-dd4817d430a0 commented 5 years ago

On reflection, it would probably be better to limit the ELLIPSIS to 3 or 4 periods ('[.]{3,4}'); otherwise, it would be impossible to express an ellipsis followed by a period in a 'want'.
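A sketch of the difference, using a simplified regex-based matcher rather than doctest's real one; `ellipsis_match`, `UNBOUNDED`, and `BOUNDED` are hypothetical names. With the bounded pattern, five periods split into a four-period ellipsis plus one literal period.

```python
import re

def ellipsis_match(want, got, run_pattern):
    """Simplified ELLIPSIS check with a configurable ellipsis pattern."""
    literals = re.split(run_pattern, want)
    pattern = '.*'.join(re.escape(piece) for piece in literals)
    return re.fullmatch(pattern, got, re.DOTALL) is not None

UNBOUNDED = r'[.]{3,}'
BOUNDED = r'[.]{3,4}'

want = 'Done.....'   # five periods: meant as ellipsis plus a literal period

# Unbounded: all five periods are one ellipsis, so the final period
# cannot be required.
unbounded_without_period = ellipsis_match(want, 'Done and dusted', UNBOUNDED)

# Bounded: four periods form the ellipsis, the fifth is a literal.
bounded_without_period = ellipsis_match(want, 'Done and dusted', BOUNDED)
bounded_with_period = ellipsis_match(want, 'Done and dusted.', BOUNDED)

print(unbounded_without_period, bounded_without_period, bounded_with_period)
```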