Add prefetch() for Buffered IO (experiment)

python / cpython

The Python programming language

https://www.python.org

Other

63.36k stars 30.34k forks source link

Add prefetch() for Buffered IO (experiment) #56262

Open 23dad19e-9ca4-43ba-8483-d783d18dcb4c opened 13 years ago

23dad19e-9ca4-43ba-8483-d783d18dcb4c commented 13 years ago

BPO	12053
Nosy	@pitrou, @vstinner, @benjaminp, @vadmium, @serhiy-storchaka
Files	issue12053-pyio.patch: _pyio prefetch() implementation issue12053-tests.patch: test cases prefetch.patch: C impl + pyio + tests

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields: ```python assignee = None closed_at = None created_at = labels = ['type-feature', 'expert-IO'] title = 'Add prefetch() for Buffered IO (experiment)' updated_at = user = 'https://bugs.python.org/jcon' ``` bugs.python.org fields: ```python activity = actor = 'pitrou' assignee = 'none' closed = False closed_date = None closer = None components = ['IO'] creation = creator = 'jcon' dependencies = [] files = ['22168', '22169', '23308'] hgrepos = [] issue_num = 12053 keywords = ['patch'] message_count = 6.0 messages = ['135731', '137143', '138118', '144848', '240247', '240672'] nosy_count = 8.0 nosy_names = ['pitrou', 'vstinner', 'nadeem.vawda', 'benjamin.peterson', 'stutzbach', 'jcon', 'martin.panter', 'serhiy.storchaka'] pr_nums = [] priority = 'normal' resolution = None stage = None status = 'open' superseder = None type = 'enhancement' url = 'https://bugs.python.org/issue12053' versions = ['Python 3.3', 'Python 3.4'] ```

23dad19e-9ca4-43ba-8483-d783d18dcb4c commented 13 years ago

A prefetch() method for Buffered IO may greatly assist 3rd party buffering among other gains. If nothing else, it is worth experimenting with.

Discussion on the topic is here: http://mail.python.org/pipermail/python-ideas/2010-September/008180.html

A summary of the method proposed (by Antoine Pitrou):

prefetch(self, buffer, skip, minread)

Skip skip bytes from the stream. Then, try to read at least minread bytes and write them into buffer. The file pointer is advanced by at most skip + minread, or less if the end of file was reached. The total number of bytes written in buffer is returned, which can be more than minread if additional bytes could be prefetched (but, of course, cannot be more than len(buffer)).

Arguments:

buffer: a writable buffer (e.g. bytearray)
skip: number of bytes to skip (must be >= 0)
minread: number of bytes to read (must be >= 0 and <= len(buffer))

23dad19e-9ca4-43ba-8483-d783d18dcb4c commented 13 years ago

I started a draft in python. I am attaching the _pyio version along with tests. I will continue work on the C implementation and eventually documentation if this is well received. It seems straightforward, I am interested to see what you guys think.

Also, there are now 2 places which use hasattr(self, "peek"). I was wondering if it would make sense to add peek() to BufferedIOBase and raise UnsupportedOperation or return b"".

Some benchmarks..

$ ./python -m timeit -s "from _pyio import open;f = open('LICENSE', 'rb'); b=bytearray(128)" 'while f.prefetch(b, 4, 4): pass'
_pyio.BufferedIOBase.prefetch:
100000 loops, best of 3: 10.6 usec per loop
_pyio.BufferedReader.prefetch:
100000 loops, best of 3: 6 usec per loop

$ ./python -m timeit -s "from _pyio import open;f = open('LICENSE', 'rb');b=bytearray(4);" 'while f.read(4): f.readinto(b)'
100000 loops, best of 3: 5.07 usec per loop

pitrou commented 13 years ago

I started a draft in python. I am attaching the _pyio version along with tests. I will continue work on the C implementation and eventually documentation if this is well received. It seems straightforward, I am interested to see what you guys think.

Thank you. I think performance measurements are prematurate until we have an optimized C implementation anyway.

I think ultimately we also want a default implementation of read(), peek() and read1() which uses prefetch(), so that BufferedReader implementations only have to implement prefetch(). (care must be taken to avoid infinite loops)

That said, I think the python-dev mailing-list needs to be convinced of the usefulness of prefetch() (if it was only me, there wouldn't be any problem :-)). Perhaps you want to run another discussion there.

23dad19e-9ca4-43ba-8483-d783d18dcb4c commented 13 years ago

Here is an update with the C implementation. I think a working prototype will be helpful before another round on python-dev.

I'm not sure how to handle unseekable, non-blocking streams where the read returns before skip bytes are exhausted. If prefetch() returns 0, then the caller would then have to use tell() to ensure subsequent reads are sane. In other words it seems prefetch() will leave the stream in an unpredictable state. Antoine, what are your thoughts?

vadmium commented 9 years ago

Sounds like this might be more appropriate for the BufferedReader and related classes, and less so for the writer and abstract base class.

The proposed API seems strange to me. Is there an illustration of how it might be used? I suspect it wouldn’t be all that useful, and could more or less be implemented with the existing methods:

def prefetch(buffered_reader, buffer, skip, minread):
    buffered_reader.read(skip)
    consumed = buffered_reader.readinto(buffer[:minread])
    if consumed < minread:
        return consumed
    spare = len(buffer) - consumed
    extra = buffered_reader.peek(spare)[:spare]
    total = consumed + len(extra)
    buffer[consumed:total] = extra
    return total

Maybe it would be better to focus on clarifying or redefining the existing peek() method (bpo-5811), rather than making a brand new do-everything method which only seems to do what the other methods already do.

pitrou commented 9 years ago

The proposed API seems strange to me. Is there an illustration of how it might be used? I suspect it wouldn’t be all that useful, and could more or less be implemented with the existing methods:

True, but having it a Buffered method would allow it to optimize buffer usage and avoid some memory copies (the first read() call and the peek() call, for example).

In any case, Guido was rather against this proposal, so I'm not sure there's much sense in keeping discussing it.