Running a simple python script (python simple.py that contains print("Hello, World!")) currently reads simple.py in full at least 4 times and does over 5 seeks. I have been pulling on that thread, but it interacts with importlib as well as how the Python compiler currently works, and I'm still trying to get my head around it. Would removing more of those overheads be something of interest / should I keep working to get my head around it?
Startup time is always an important thing to optimize for.
I found the source of the remaining lseek(3, 0, SEEK_CUR) by using lldb to set a breakpoint and get a backtrace. In _io.open() (_io_open_impl), buffering is on by default, and that results in a BufferedIO* object being constructed. All of those call _buffered_init, which calls _buffered_raw_tell to figure out the position so the buffered I/O can track positions itself.
In the read-whole-file case that seems to be a lot of unnecessary work (allocate an additional buffer and lock, do the seek, build a list of "chunks" for read that get joined with special fast cases for "only one chunk", etc.), all of which gets discarded shortly after. While it is likely possible to rewrite read_bytes() and read_text() to do less work, open().read() is common enough in code that I'm planning to try to make that do less work overall, while avoiding special-casing the operations.
```
# main
read_file_large: Mean +- std dev: 24.0 us +- 0.4 us

# readall_faster, removes a stat (#120755):
read_file_large: Mean +- std dev: 21.2 us +- 0.6 us

# No isatty call (no PR yet)
read_file_large: Mean +- std dev: 20.2 us +- 0.3 us
```
No patch that removes the BufferedIO/TextIO seek passes tests yet. Planning to see what happens if I stash the whole stat result rather than lots of individual members.
Removing the remaining seek touches a lot of code. BufferedIO already allows the "absolute position" (abs_pos) to be optional / lazily filled. Once that seek is removed, however, TextIO checks "is this seekable" on construction, and that is a boolean member which currently must be filled at construction. Still trying to find a well-contained way to do that.
The isatty call can in theory have a shortcut added around it using the fstat st_mode information. Just by adding an "is this a regular file" check before it (with the member accessible from fileio), I get a speedup on my Mac:
```
main
python ../benchmark.py
.....................
read_file_small: Mean +- std dev: 8.43 us +- 0.12 us
.....................
read_file_large: Mean +- std dev: 24.0 us +- 0.4 us

cmaloney/readall_faster (https://github.com/python/cpython/pull/120755)
.....................
read_file_small: Mean +- std dev: 7.92 us +- 0.07 us
.....................
read_file_large: Mean +- std dev: 21.2 us +- 0.6 us

No isatty (local only currently):
.....................
read_file_small: Mean +- std dev: 7.32 us +- 0.09 us
.....................
read_file_large: Mean +- std dev: 20.2 us +- 0.3 us
```
That reduces the time to read a single large .rst file in the cpython repo by ~15% so far.
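For concreteness, here is a minimal sketch of the isatty shortcut described above (the function name and placement are illustrative only, not the actual fileio change):

```python
import os
import stat

def _probably_a_tty(fd: int) -> bool:
    # Sketch only: a regular file can never be a TTY, so the st_mode from the
    # fstat() that open() already performs lets us skip the isatty()/ioctl call.
    st = os.fstat(fd)
    if stat.S_ISREG(st.st_mode):
        return False
    return os.isatty(fd)
```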
I built a slightly larger benchmark to better match the code I originally ran into this with: "read all the .rst and .py files in the cpython repo":

```python
import pyperf
from pathlib import Path

def read_all(all_paths):
    for p in all_paths:
        p.read_bytes()

def read_file(path_obj):
    path_obj.read_text()

all_rst = list(Path("Doc").glob("**/*.rst"))
all_py = list(Path(".").glob("**/*.py"))
assert all_rst, "Should have found .rst files"
assert all_py, "Should have found .py source files"

runner = pyperf.Runner()
runner.bench_func("read_file_small", read_file, Path("Doc/howto/clinic.rst"))
runner.bench_func("read_file_large", read_file, Path("Doc/c-api/typeobj.rst"))
runner.bench_func("read_all_rst", read_all, all_rst)
runner.bench_func("read_all_py", read_all, all_py)
```
And on my Linux box, read_all_py is ~9% faster with both changes at the moment.
I'm iterating on top of my existing PR code. Removing the extra fstat and the isatty call both require getting and stashing more data out of the initial stat result, so I'm planning to see what the code looks like if it just stashes the whole stat result (although it is fairly big, so possibly with a hatch to unset it on write/seek or manually). This would also be a way to remove the _blksize member, which is also only used at open() time (to figure out the buffer size to pass to BufferedIO*).
Could this be reopened? My plan is to keep working on simplifying what happens in this case, and I'd like to attach that work to this issue (strace PR follow-up test, removing seeks, possibly removing isatty).
See also: gh-90102
With the set of PRs currently out, I have this about as far as I think I can take it without reworking BufferedIO fairly substantially. I've removed the seek a couple of different ways (ex. trying to make it lazier, passing in a "known position"), but they tend to make the code more complex while having no measurable impact, or a negative impact, on performance. For cases which know they're reading a whole file, disabling buffering is profitable (ex: Path.read_bytes() #122111), and it is likely profitable to make it possible to remove buffering for full-file text reading, but I'd rather get a more efficient default I/O stack from open() than make the common open call more complicated in most places (setting buffering=0). I may look into something like a piece tree to make BufferedIO more efficient and correct for its valuable lots-of-small-reads/writes case (see: https://code.visualstudio.com/blogs/2018/03/23/text-buffer-reimplementation). I think other techniques (ex. doing I/O operations in parallel or asynchronously) can likely make a bigger impact in more cases for common code out of the box (ex, see https://www.youtube.com/watch?v=Ul8OO4vQMTw).
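For concreteness, the unbuffered whole-file read referred to above looks roughly like this (a sketch of the idea, not the #122111 patch itself):

```python
# With buffering=0, open() returns the raw FileIO object directly, so no
# BufferedReader buffer/lock is allocated and no position-tracking seek happens.
with open("Doc/c-api/typeobj.rst", "rb", buffering=0) as f:
    data = f.read()   # FileIO.readall(): fstat for a size estimate, then read until EOF

# The default (buffered) form allocates the extra layer being discussed:
with open("Doc/c-api/typeobj.rst", "rb") as f:   # BufferedReader wrapping FileIO
    data = f.read()
```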
My plan is to focus on getting the current set of changes out, as well as possibly talking a bit about what happens in a read(), since it's a lot more complicated than at least my mental model was (ex. os.read isn't an ideal primitive compared to _Py_read). With free threading and multiple interpreters, there are some intriguing possibilities with multi-dispatch and moving to Python-native coroutine I/O code (ex. every open is actually two system calls which could happen in parallel if a platform supports it, with the interpreter running other code while the OS does the needed work). That needs some prototyping and discussion out of scope of this change set.
TL;DR: No more changes planned outside follow-ups. If there are any particular things I can do to help move PRs along, let me know. The plan is to try to help where I can on open PRs and issues in the repository focused around the I/O stack.
I left some investigation comments on https://github.com/python/cpython/pull/121143 regarding the test failure leading to the revert. TL;DR: the system call name being tested for could also be statx, not just something with fstat in the name.
I'll work on opening a new PR for the strace refactor + fix the issues. Thanks for the fast revert and investigation.
- *fstat* + statx as acceptable calls. Reading through various I/O changes a bit, it's important that the just-opened FD is used to prevent TOCTOU security violations, so I don't want to just reduce to stat, which takes a filename. Also included a comment about that in the test code.
- Path().read_bytes() test case to additionally validate the reduction in isatty/ioctl + seek calls from https://github.com/python/cpython/pull/122111

@hauntsaninja, @gpshead: new PR with the strace helper applied + additional fixups from the revert: https://github.com/python/cpython/pull/123413
Note that the number of syscalls could theoretically go as low as 3 or 4 (open, read, read, close) if we added something like a max_read_size param to path.read_bytes(). It doesn't help in the common case, but it can help, for example, when reading lots of small files (like in /proc/), if you know the file is probably going to be less than, for example, 8092 bytes. Memory gets allocated for the full read size, so max_read_size shouldn't be set too high when used.
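A rough sketch of the syscall pattern such a hypothetical max_read_size parameter would enable (the name and API shape here are illustrative only):

```python
import os

def read_small(path: str, max_read_size: int = 8192) -> bytes:
    # open + read + close only: no fstat to learn the size, at the cost of
    # allocating max_read_size up front and truncating anything larger.
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.read(fd, max_read_size)
    finally:
        os.close(fd)
```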
(On Linux at least, os.read() raises IsADirectoryError, so the fstat call isn't needed on Linux to check if the file is a directory.)
> Running a simple python script (python simple.py that contains print("Hello, World!")) currently reads simple.py in full at least 4 times and does over 5 seeks. I have been pulling on that thread, but it interacts with importlib as well as how the Python compiler currently works, and I'm still trying to get my head around it. Would removing more of those overheads be something of interest / should I keep working to get my head around it?
I personally think it would be awesome to somehow get the syscall count reduced during import / startup, which would improve startup time. I believe the code for reading files during import/startup is in importlib._bootstrap_external.FileLoader.get_data().
It's tricky because io.open_code() could be hooked to be pretty much anything (so we probably can't just pass in buffering=0), but ideally the file bytes should be read in full with only the 5 syscalls (no seeks).
Your isatty and readall() fixes are probably already a huge improvement, though I'm guessing there's still the extra lseek() call in readall().
(It's a little funny that the io.open_code() hook is only for opening files, when all that's done is call .read() on the open file. Ideally the hook would be more like io.read_code_bytes(), which would be much easier to optimize.)
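Something like the following purely hypothetical shape (io.read_code_bytes() does not exist today); the point is that a bytes-returning hook leaves the implementation free to use the cheapest read path it likes:

```python
def read_code_bytes(path: str) -> bytes:
    # Hypothetical hook: return the module's bytes however the embedder wants;
    # the default could be a raw, unbuffered whole-file read.
    with open(path, "rb", buffering=0) as f:
        return f.read()
```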
I've been using this code on Linux for the last few years to cut down on import syscalls:

```python
import os
import sys

def faster_read_file_bytes(path: str, size=-1) -> bytes:
    fd = os.open(path, os.O_RDONLY)
    try:
        if size == -1:
            size = os.fstat(fd).st_size
        return os.read(fd, size)
    finally:
        os.close(fd)

def faster_get_data(loader: object, filepath: str) -> bytes:
    return faster_read_file_bytes(filepath)

def install_import_speedup() -> None:
    # Reach into the FileFinder path hook's closure and patch each loader
    # class's get_data so imports skip the full io.open_code() stack.
    assert sys.path_hooks[-1].__closure__
    for loader, _ in sys.path_hooks[-1].__closure__[1].cell_contents:
        loader.get_data = faster_get_data
```
re: max_read_size, my plan / hope as I get into BufferedIO changes is to make it so a buffer can be passed in rather than a max size, so that users can control buffer reuse / avoid an allocation if they want (ex. /proc/ scans like you gave). By default / with no buffer passed, the I/O code would call the allocator to get a new buffer with its reuse logic, but this would also enable code which focuses on scans of small files and similar to work really fast.
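As an illustration of the caller-supplied-buffer idea (the API shape below is hypothetical, not a planned interface, and os.readv is POSIX-only): one preallocated buffer is reused across many small files, so the per-file cost is essentially just the syscalls.

```python
import os

BUF = bytearray(8192)  # reused for every file; sized for "small" files

def scan_files(paths):
    for path in paths:
        fd = os.open(path, os.O_RDONLY)
        try:
            n = os.readv(fd, [BUF])      # read directly into the reusable buffer
        finally:
            os.close(fd)
        yield path, bytes(BUF[:n])       # copy out only the bytes actually read
```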
Unfortunately the _io.open() code which does the stat check for "is this a directory" has no guarantee that the caller will ever call read (and so can't rely on that). One of the paths I'm looking at, though, is being able to do multiple-system-call dispatch (ex. pathlib.Path().read_bytes() knows the whole I/O pattern that will be required and could do that sort of optimization; it could also potentially dispatch it all at once on OSes with the right APIs...). There's also the io.open_code hook, and the "is directory on open" check is cloned a couple of times around the codebase, which adds intricacies to reworking it; see: https://github.com/python/cpython/blob/c6127af8685c2a9b416207e46089cee79d028b85/Modules/main.c#L393-L398 https://github.com/python/cpython/blob/c6127af8685c2a9b416207e46089cee79d028b85/Python/pylifecycle.c#L2758-L2762
Currently for me, though, the cost of the extra seeks (and even isatty) is quite a bit less (<5% overhead on modern macOS or Linux) than the cost of the allocation/construction of a BufferedIO (making a new lock + buffer: >10% perf change), so I'm focused on other directions for a bit... BufferedIO is required for TextIO, which is what open() does by default (read bytes, then convert to text with the TextIO layer).
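For reference, the default stack open() builds in text mode is roughly the following layering (a simplified, hand-built equivalent; the real construction, including the isatty() line-buffering check, lives in _io.open()):

```python
import io

raw = io.FileIO("Doc/howto/clinic.rst", "r")          # open + fstat (directory check, st_blksize)
buffered = io.BufferedReader(raw)                     # allocates buffer + lock, tell()/seek for position
text = io.TextIOWrapper(buffered, encoding="utf-8")   # decodes bytes to str as they are read
print(text.read()[:40])
text.close()
```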
Definitely also looking at general imports, and all the scanning / stat calls required to try to find .pyc files and the like... Many thoughts, but I'm trying to keep this particular project scoped smallish so I can close it out.
> the cost of the allocations/construction of a BufferedIO
Ohh, maybe we could change the default io.open_code() implementation to be unbuffered? Possibly backward incompatible, but it would avoid a bunch of BufferedIO allocations during import/startup.
I think I can get BufferedIO to be cheap; it also shows up in reading folders with lots of .rst, .yaml, .toml, etc. files, which I've run into a lot. With https://peps.python.org/pep-0686/, getting to TextIO is required to decode the UTF-8 as needed, and TextIO requires BufferedIO (https://peps.python.org/pep-3116/#text-i-o, with some caveats as the codebase has evolved from the initial design). A number of the cases in Python startup do have "reduced" I/O stacks that avoid BufferedIO with a custom implementation to improve startup time today :). https://github.com/python/cpython/pull/122111 shows the potential overhead reduction. If people want optimization before BufferedIO changes, I highly recommend using pathlib.Path().read_bytes(). BufferedIO + TextIO do also special-case "read whole file", but there is unfortunately still quite a bit of overhead now that I/O is much faster and lower latency on 2024 hardware + OSes than when the original design and implementation happened.
> One of the paths I'm looking at though is being able to do multiple system call dispatch (ex. pathlib.Path().read_bytes() knows the whole I/O pattern which will be required, and could do that sort of optimization.

> highly recommend using pathlib.Path().read_bytes()
Also, it would be nice to get something like read_bytes and read_text directly in os or io, for those of us who feel like pathlib is heavyweight (though I do see it uses __slots__, so that's a plus).
> getting to TextIO is required to decode the utf-8 as needed, and TextIO requires BufferedIO
I think in my testing in the past I've found read_bytes().decode() is often faster than opening as text, at least for normal-sized files. TextIO is probably still worth it for large text files.
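The comparison being described, for concreteness (both produce a str; the first skips the BufferedReader/TextIOWrapper layers entirely):

```python
from pathlib import Path

p = Path("Doc/howto/clinic.rst")
via_bytes = p.read_bytes().decode("utf-8")   # raw read, then one decode call
via_text = p.read_text(encoding="utf-8")     # FileIO -> BufferedReader -> TextIOWrapper
# Equal here because the file uses "\n" newlines; read_text also performs
# universal-newline translation, so files with "\r\n" would differ.
assert via_bytes == via_text
```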
Planning to close this issue once the two outstanding PRs are closed. There are more improvements I think are possible, but they're large enough in scope as individual changes that they should likely get their own Discourse thread / GitHub issue / etc. In particular I'm looking at:

Something like an @unleash_dragons decorator that could rewrite lots of parallelizable dispatches through I/O into a multi-dispatch, ideally able to use things like io_uring where supported, but WaitForMultipleObjects on Windows as well. A particular motivating pattern I've written a bit:

```python
# Read full contents of a list of files
@unleash_dragons(gather_io=True)
def read_configs(all_paths):
    for p in all_paths:
        p.read_bytes()
```
Currently that does a lot of round-trips from Python -> OS -> Python; pipelining / parallelizing those requests (one io_uring wait call) significantly improves performance.
For the startup "read 4 times" case, I looked a bit and it's really complex to unwind for likely minimal performance gain (the OS file cache saves quite a bit). It's a combination of supporting https://peps.python.org/pep-0263/ and the tokenizer taking a FILE* (https://github.com/python/cpython/blob/main/Parser/tokenizer/tokenizer.h#L9-L11 and https://github.com/python/cpython/blob/main/Parser/peg_api.c#L18-L40). There is a lot of simplification possible, I think, but it would take a lot of refactoring for no particular feature addition or large performance win, especially as that particular invocation case is already often cached/bypassed using other mechanisms (ex. .pyc files).
Some additional complexity in refactoring is the long-standing io.open_code hook: it opens the file through Python's I/O, which then needs to become a FILE*, and the API is somewhat exposed in the C API (https://docs.python.org/3/c-api/file.html#c.PyFile_FromFd).
@cmaloney closed this as completed Nov 3, 2024
Congrats :-)
Thanks for your help in reviewing, helping me learn CPython, and getting the patches landed; sorry there were quite so many follow-up patches, but I think we got there :). I was measuring 3.13 vs. 3.14 main "read a whole file" performance, and it's generally 10-15% faster reading OS-cached files on macOS and Linux for an LTO-optimized build. Nice little wins while keeping overall API guarantees / stability :)
Python 3.14 commit: 7d7d56d8b1147a6b85e1c09d01b164df7c5c4942 (HEAD -> main, origin/main, origin/HEAD)
Python 3.13 commit: 60403a5409ff2c3f3b07dd2ca91a7a3e096839c7 (tag: v3.13.0, v3.13.0)

## Python 3.13
```
(pyperf_313_env) ➜ cpython git:(heads/v3.13.0) python ../bm_readall.py
read_file_small: Mean +- std dev: 6.21 us +- 0.10 us
read_file_large: Mean +- std dev: 18.1 us +- 0.6 us
read_all_rst_bytes: Mean +- std dev: 1.14 ms +- 0.01 ms
read_all_py_bytes: Mean +- std dev: 10.8 ms +- 0.1 ms
read_all_rst_text: Mean +- std dev: 2.09 ms +- 0.05 ms

(pyperf_313_env) ➜ cpython git:(heads/v3.13.0) python ../bm_readall.py
read_file_small: Mean +- std dev: 6.22 us +- 0.09 us
read_file_large: Mean +- std dev: 17.2 us +- 1.1 us
read_all_rst_bytes: Mean +- std dev: 1.10 ms +- 0.01 ms
read_all_py_bytes: Mean +- std dev: 11.7 ms +- 0.4 ms
read_all_rst_text: Mean +- std dev: 2.00 ms +- 0.05 ms

(pyperf_313_env) ➜ cpython git:(heads/v3.13.0) python ../bm_readall.py
read_file_small: Mean +- std dev: 6.06 us +- 0.08 us
read_file_large: Mean +- std dev: 17.0 us +- 0.6 us
read_all_rst_bytes: Mean +- std dev: 1.10 ms +- 0.02 ms
read_all_py_bytes: Mean +- std dev: 11.8 ms +- 0.2 ms
read_all_rst_text: Mean +- std dev: 2.06 ms +- 0.06 ms

(pyperf_313_env) ➜ cpython git:(heads/v3.13.0) python --version
Python 3.13.0
```

## Python 3.14a0+ (gh-120754 done)
```
(pyperf_314_env) ➜ cpython git:(main) python ../bm_readall.py
read_file_small: Mean +- std dev: 5.04 us +- 0.15 us
read_file_large: Mean +- std dev: 15.1 us +- 1.0 us
read_all_rst_bytes: Mean +- std dev: 820 us +- 32 us
read_all_py_bytes: Mean +- std dev: 9.31 ms +- 0.24 ms
read_all_rst_text: Mean +- std dev: 1.76 ms +- 0.08 ms

(pyperf_314_env) ➜ cpython git:(main) python ../bm_readall.py
read_file_small: Mean +- std dev: 5.02 us +- 0.08 us
read_file_large: Mean +- std dev: 14.6 us +- 0.6 us
read_all_rst_bytes: Mean +- std dev: 811 us +- 12 us
read_all_py_bytes: Mean +- std dev: 9.33 ms +- 0.15 ms
read_all_rst_text: Mean +- std dev: 1.75 ms +- 0.06 ms

(pyperf_314_env) ➜ cpython git:(main) python ../bm_readall.py
read_file_small: Mean +- std dev: 5.02 us +- 0.08 us
read_file_large: Mean +- std dev: 15.0 us +- 0.8 us
read_all_rst_bytes: Mean +- std dev: 809 us +- 12 us
read_all_py_bytes: Mean +- std dev: 9.29 ms +- 0.09 ms
read_all_rst_text: Mean +- std dev: 1.74 ms +- 0.02 ms

(pyperf_314_env) ➜ cpython git:(main) python --version
Python 3.14.0a1+
```

# MacOS

MacOS v3.13.0
```
(pyperf_313_venv) ➜ cpython git:(heads/v3.13.0) python ../bm_readall.py
read_file_small: Mean +- std dev: 9.34 us +- 0.26 us
read_file_large: Mean +- std dev: 22.3 us +- 0.6 us
read_all_rst: Mean +- std dev: 2.87 ms +- 0.03 ms
read_all_py: Mean +- std dev: 20.8 ms +- 0.3 ms

(pyperf_313_venv) ➜ cpython git:(heads/v3.13.0) python ../bm_readall.py
read_file_small: Mean +- std dev: 9.30 us +- 0.17 us
read_file_large: Mean +- std dev: 21.9 us +- 0.2 us
read_all_rst: Mean +- std dev: 2.87 ms +- 0.03 ms
read_all_py: Mean +- std dev: 20.8 ms +- 0.2 ms

(pyperf_313_venv) ➜ cpython git:(heads/v3.13.0) python ../bm_readall.py
read_file_small: Mean +- std dev: 9.22 us +- 0.10 us
read_file_large: Mean +- std dev: 21.8 us +- 0.2 us
read_all_rst: Mean +- std dev: 2.87 ms +- 0.04 ms
read_all_py: Mean +- std dev: 20.8 ms +- 0.5 ms

(pyperf_313_venv) ➜ cpython git:(heads/v3.13.0) python ../bm_readall.py
read_file_small: Mean +- std dev: 9.25 us +- 0.07 us
read_file_large: Mean +- std dev: 21.8 us +- 0.2 us
read_all_rst: Mean +- std dev: 2.86 ms +- 0.03 ms
read_all_py: Mean +- std dev: 20.8 ms +- 0.3 ms
```

MacOS v3.14.0a0+, commit 34653bba644aa5481613f398153757d7357e39ea (HEAD -> main, origin/main, origin/HEAD)
```
(pyperf_314_venv) ➜ cpython git:(main) python ../bm_readall.py
read_file_small: Mean +- std dev: 8.18 us +- 0.08 us
read_file_large: Mean +- std dev: 21.1 us +- 0.3 us
read_all_rst: Mean +- std dev: 2.58 ms +- 0.05 ms
read_all_py: Mean +- std dev: 19.0 ms +- 0.2 ms

(pyperf_314_venv) ➜ cpython git:(main) python ../bm_readall.py
read_file_small: Mean +- std dev: 8.17 us +- 0.11 us
read_file_large: Mean +- std dev: 21.1 us +- 0.3 us
read_all_rst: Mean +- std dev: 2.57 ms +- 0.03 ms
read_all_py: Mean +- std dev: 19.1 ms +- 0.3 ms

(pyperf_314_venv) ➜ cpython git:(main) python ../bm_readall.py
read_file_small: Mean +- std dev: 8.16 us +- 0.07 us
read_file_large: Mean +- std dev: 21.2 us +- 0.3 us
read_all_rst: Mean +- std dev: 2.58 ms +- 0.05 ms
read_all_py: Mean +- std dev: 19.0 ms +- 0.2 ms

(pyperf_314_venv) ➜ cpython git:(main) python ../bm_readall.py
read_file_small: Mean +- std dev: 8.20 us +- 0.07 us
read_file_large: Mean +- std dev: 21.1 us +- 0.3 us
read_all_rst: Mean +- std dev: 2.58 ms +- 0.03 ms
read_all_py: Mean +- std dev: 19.1 ms +- 0.4 ms
```

Benchmark script:
```python
import pyperf
from pathlib import Path

all_rst = list(Path("Doc").glob("**/*.rst"))
all_py = list(Path(".").glob("**/*.py"))
assert all_rst, "Should have found rst files. Is the current directory a cpython checkout?"
assert all_py, "Should have found python source files. Is the current directory a cpython checkout?"

def read_all_bytes(all_paths):
    for p in all_paths:
        p.read_bytes()

def read_all_text(all_paths):
    for p in all_paths:
        p.read_text()

def read_file(path_obj):
    path_obj.read_text()

def read_bytes(path_obj):
    path_obj.read_bytes()

# print(f"all_rst={len(all_rst)}|all_py={len(all_py)}")
# Make a fixed number across python versions. File sizes may still vary a bit,
# but this helps a lot for stability.
all_rst = all_rst[:200]
all_py = all_py[:3000]
assert len(all_rst) == 200
assert len(all_py) == 3000

runner = pyperf.Runner()
runner.bench_func('read_file_small', read_file, Path("Doc/howto/clinic.rst"))
runner.bench_func('read_file_large', read_file, Path("Doc/c-api/typeobj.rst"))
runner.bench_func('read_all_rst_bytes', read_all_bytes, all_rst)
runner.bench_func('read_all_py_bytes', read_all_bytes, all_py)
# NOTE: No test to read all `.py` as text as some contain invalid utf-8 byte sequences.
runner.bench_func('read_all_rst_text', read_all_text, all_rst)
```
Maybe you should document this optimization at: https://docs.python.org/dev/whatsnew/3.14.html#optimizations
### Feature or enhancement

Proposal:
I came across some seemingly redundant fstat() and lseek() calls when working on a tool that scanned a directory of lots of small YAML files and loaded their contents as config. In tracing I found most execution time wasn't in the Python interpreter but in system calls (on top of NFS in that case, which made some I/O calls particularly slow). I've been experimenting with a program that reads all .rst files in the Python Docs directory to try to remove some of those redundant system calls.

### Test Program
In my experimentation, with some tweaks to fileio I can remove over 10% of the system calls the test program makes when scanning the whole Doc folder for .rst files, on both macOS and Linux (I don't have a Windows machine to measure on).

### Current State (9 system calls)

Currently, on my Linux machine, reading a whole .rst file with the above code produces this series of system calls:

### Target State (~~7~~ 5 system calls)

It would be nice to get it down to the following (for small files; large files get an additional seek, see the caveat in the PR):
In a number of cases (ex. importing modules) there is often an fstat followed immediately by an open / read of the file (which typically does another fstat), but that is an extension point and I want to keep it out of scope for now.

### Questions rattling around in my head around this

Some of these are likely better for Discourse / longer-form discussion; happy to start threads there as appropriate.
- Running a simple python script (python simple.py that contains print("Hello, World!")) currently reads simple.py in full at least 4 times and does over 5 seeks. I have been pulling on that thread, but it interacts with importlib as well as how the Python compiler currently works, and I'm still trying to get my head around it. Would removing more of those overheads be something of interest / should I keep working to get my head around it?
- The _blksize member of fileio was added in bpo-21679. It is not used much as far as I can tell, either as its reflection _blksize in Python or in the code. The only usage I can find is https://github.com/python/cpython/blob/main/Modules/_io/_iomodule.c#L365-L374, where we could just query for it when needed to save some storage on all fileio objects (a simplified sketch of that usage follows this list). The behavior of using the stat-returned st_blksize is part of the docs, so it doesn't feel like we can fully remove it.
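A simplified Python-level picture of that one usage (the real code is the C in _iomodule.c linked above; treat this as an approximation):

```python
import io

raw = io.FileIO("Doc/howto/clinic.rst", "r")
# _blksize is filled from the fstat() st_blksize at open() time (it already
# defaults to io.DEFAULT_BUFFER_SIZE when st_blksize is unusable); open() only
# reads it here, to size the BufferedReader's buffer.
buffered = io.BufferedReader(raw, buffer_size=raw._blksize)
```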
### Has this already been discussed elsewhere?

This is a minor feature, which does not need previous discussion elsewhere
### Links to previous discussion of this feature:
No response
### Linked PRs