python / cpython

The Python programming language
https://www.python.org
Other
62.5k stars 30k forks source link

Iterating a text file by line should not implicitly disable tell #81217

Open 99ffcaa5-b43b-4e8e-a35e-9c890007b9cd opened 5 years ago

99ffcaa5-b43b-4e8e-a35e-9c890007b9cd commented 5 years ago
BPO 37036
Nosy @MojoVampire, @websurfer5

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields: ```python assignee = None closed_at = None created_at = labels = ['3.8', '3.7', 'library', 'expert-IO'] title = 'Iterating a text file by line should not implicitly disable tell' updated_at = user = 'https://github.com/MojoVampire' ``` bugs.python.org fields: ```python activity = actor = 'Jeffrey.Kintscher' assignee = 'none' closed = False closed_date = None closer = None components = ['Library (Lib)', 'IO'] creation = creator = 'josh.r' dependencies = [] files = [] hgrepos = [] issue_num = 37036 keywords = [] message_count = 2.0 messages = ['343402', '343404'] nosy_count = 2.0 nosy_names = ['josh.r', 'Jeffrey.Kintscher'] pr_nums = [] priority = 'normal' resolution = None stage = None status = 'open' superseder = None type = None url = 'https://bugs.python.org/issue37036' versions = ['Python 3.7', 'Python 3.8'] ```

99ffcaa5-b43b-4e8e-a35e-9c890007b9cd commented 5 years ago

TextIOWrapper explicitly sets the telling flag to 0 when .next ( textiowrapper_iternext ) is called ( https://github.com/python/cpython/blob/3.7/Modules/_io/textio.c#L2974 ), e.g. during standard for loops over the file of this form, trying to call tell raises an exception:

    with open(filename) as myfile:
        for line in myfile:
            myfile.tell()

which raises:

OSError: telling position disabled by next() call
while the effectively equivalent:

    with open(filename) as myfile:
        for line in iter(myfile.readline, ''):
            myfile.tell()

works fine.

The implementation of __next and readline is almost identical (next__ is calling readline and handling the EOF sentinel per the iterator protocol, that's all). Given they're implemented identically, I see no reason aside from nannying (discouraging slow operations like seek/tell during iteration by forbidding them on the most convenient means of iterating) to forbid tell after beginning iteration, but not after readline. Given the general Python philosophy of "we're all adults here", I don't see nannying as a good reason, which leaves the performance benefit of avoiding snapshotting as the only compelling reason to do this.

But the performance benefit is trivial; in local tests, the savings from avoiding that work is barely noticeable, per ipython microbenchmarks (on 3.7.2 Linux x86-64):

    >>> %%timeit -r5 from collections import deque; f = open('American-english.txt'); consume = deque(maxlen=0).extend
    ... f.seek(0)
    ... consume(iter(f.readline, ''))
    ...
    ...
    15.8 ms ± 38.4 μs per loop (mean ± std. dev. of 5 runs, 100 loops each)

    >>> %%timeit -r5 from collections import deque; f = open('American-english.txt'); consume = deque(maxlen=0).extend
    ... f.seek(0)
    ... next(f)  # Triggers optimization for all future read_chunk calls
    ... consume(iter(f.readline, ''))  # Otherwise iterated identically
    ...
    ...
    15.7 ms ± 98.5 μs per loop (mean ± std. dev. of 5 runs, 100 loops each)

The two blocks are identical except that the second one explicitly advances the file one line at the beginning with next(f) to flip telling to 0 so future calls to readline don't involve the snapshotting code in textiowrapper_read_chunk.

Calling consume(f) would drop the time to 9.86 ms, but that's saying more about the optimization of the raw iterator protocol over method calls than it is about the telling optimization; I used two arg iter in both cases to keep the code paths as similar as possible so the telling.

For reference, the input file was 931708 bytes (931467 characters thanks to a few scattered non-ASCII UTF-8 characters), 98569 lines long.

Presumably, the speed difference of 0.1 ms can be chalked up to the telling optimization, so removing it would increase the cost of normal iteration from 9.86 ms to 9.96 ms. That seems de minimis to my mind, in the context of text oriented I/O.

Given that, it seems like triggering this optimization via __next__ should be dropped; it's a microoptimization at best, that's mostly irrelevant compared to the overhead of text-oriented I/O, and it introduces undocumented limitations on the use of TextIOWrapper.

The changes would be to remove all use of the telling variable, and (if we want to keep the optimization for unseekable files, where at least no functionality is lost by having it), change the two tests in textiowrapper_read_chunk to test seekable in its place. Or we drop the optimization entirely and save 50+ lines of code that provide a fairly tiny benefit in any event.

99ffcaa5-b43b-4e8e-a35e-9c890007b9cd commented 5 years ago

Left a dangling sentence in there:

"I used two arg iter in both cases to keep the code paths as similar as possible so the telling."

should read:

"I used iter(f.readline, '') in both cases to keep the code paths as similar as possible so the telling optimization was tested in isolation."