Running a simple python script (python simple.py that contains print("Hello, World!")) currently reads simple.py in full at least 4 times and does over 5 seeks. I have been pulling on that thread, but it interacts with importlib as well as how the Python compiler currently works, and I'm still trying to get my head around it. Would removing more of those overheads be something of interest / should I keep working to get my head around it?
Startup time is always an important thing to optimize for.
I found the source of the remaining lseek(3, 0, SEEK_CUR) by using lldb to set a breakpoint and get a backtrace. In _io.open() (_io_open_impl), buffering is on by default, and that results in a BufferedIO* object being constructed. All of those call _buffered_init, which calls _buffered_raw_tell to figure out the position so the buffered I/O can track positions itself.
In the read-whole-file case that seems to be a lot of unnecessary work (allocate an additional buffer and lock, do the seek, build a list of "chunks" for read that get joined with special fast cases for "only one chunk", etc.), all of which gets discarded shortly after. While it is likely possible to rewrite read_bytes() and read_text() to do less work, open().read() is common enough in code that I'm planning to try to make that do less work overall, while avoiding special-casing the operations.
```
# main
read_file_large: Mean +- std dev: 24.0 us +- 0.4 us

# readall_faster, removes a stat (#120755):
read_file_large: Mean +- std dev: 21.2 us +- 0.6 us

# No isatty call (no PR yet)
read_file_large: Mean +- std dev: 20.2 us +- 0.3 us
```
No patch that removes the BufferedIO/TextIO seek passes tests yet. Planning to see what happens if I stash the whole stat result rather than lots of individual members.
Removing the remaining seek touches a lot of code. BufferedIO already allows the "absolute position" (abs_pos) to be optional / lazily filled. Once that seek is removed, however, TextIO checks "is this seekable" on construction, and that is a boolean member which currently must be filled at construction. Still trying to find a well-contained way to do that.
The isatty call can in theory have a shortcut added around it using the fstat st_mode information. Just by adding an "is this a regular file" check before it (with the member accessible from fileio), I get a speedup on my Mac:
```
main
python ../benchmark.py
.....................
read_file_small: Mean +- std dev: 8.43 us +- 0.12 us
.....................
read_file_large: Mean +- std dev: 24.0 us +- 0.4 us

cmaloney/readall_faster (https://github.com/python/cpython/pull/120755)
.....................
read_file_small: Mean +- std dev: 7.92 us +- 0.07 us
.....................
read_file_large: Mean +- std dev: 21.2 us +- 0.6 us

No isatty (local only currently):
.....................
read_file_small: Mean +- std dev: 7.32 us +- 0.09 us
.....................
read_file_large: Mean +- std dev: 20.2 us +- 0.3 us
```
That reduces the time to read a single large .rst file in the cpython repo by ~15% so far.
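For concreteness, here is a minimal sketch of the isatty shortcut described above (the function name and placement are illustrative only, not the actual fileio change):

```python
import os
import stat

def _probably_a_tty(fd: int) -> bool:
    # Sketch only: a regular file can never be a TTY, so the st_mode from the
    # fstat() that open() already performs lets us skip the isatty()/ioctl call.
    st = os.fstat(fd)
    if stat.S_ISREG(st.st_mode):
        return False
    return os.isatty(fd)
```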
I built a slightly larger benchmark to better match the code I originally ran into this with: "read all the .rst and .py files in the cpython repo":

```python
import pyperf
from pathlib import Path

def read_all(all_paths):
    for p in all_paths:
        p.read_bytes()

def read_file(path_obj):
    path_obj.read_text()

all_rst = list(Path("Doc").glob("**/*.rst"))
all_py = list(Path(".").glob("**/*.py"))
assert all_rst, "Should have found .rst files"
assert all_py, "Should have found .py source files"

runner = pyperf.Runner()
runner.bench_func("read_file_small", read_file, Path("Doc/howto/clinic.rst"))
runner.bench_func("read_file_large", read_file, Path("Doc/c-api/typeobj.rst"))
runner.bench_func("read_all_rst", read_all, all_rst)
runner.bench_func("read_all_py", read_all, all_py)
```
And on my Linux box, read_all_py is ~9% faster with both changes at the moment.
I'm iterating on top of my existing PR code. Removing the extra fstat and the isatty call both require getting and stashing more data out of the initial stat result, so I'm planning to see what the code looks like if it just stashes the whole stat result (although it is fairly big, so possibly with a hatch to unset it on write/seek or manually). This would also be a way to remove the _blksize member, which is also only used at open() time (to figure out the buffer size to pass to BufferedIO*).
Could this be reopened? My plan is to keep working on simplifying what happens in this case, and I'd like to attach that work to this issue (strace PR follow-up test, removing seeks, possibly removing isatty).
See also: gh-90102
With the set of PRs currently out, I have this about as far as I think I can take it without reworking BufferedIO fairly substantially. I've removed the seek a couple of different ways (ex. trying to make it lazier, passing in a "known position"), but they tend to make the code more complex while having no measurable impact, or a negative impact, on performance. For cases which know they're reading a whole file, disabling buffering is profitable (ex: Path.read_bytes() #122111), and it is likely profitable to make it possible to remove buffering for full-file text reading, but I'd rather get a more efficient default I/O stack from open() than make the common open call more complicated in most places (setting buffering=0). I may look into something like a piece tree to make BufferedIO more efficient and correct for its valuable lots-of-small-reads/writes case (see: https://code.visualstudio.com/blogs/2018/03/23/text-buffer-reimplementation). I think other techniques (ex. doing I/O operations in parallel or asynchronously) can likely make a bigger impact in more cases for common code out of the box (ex, see https://www.youtube.com/watch?v=Ul8OO4vQMTw).
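For concreteness, the unbuffered whole-file read referred to above looks roughly like this (a sketch of the idea, not the #122111 patch itself):

```python
# With buffering=0, open() returns the raw FileIO object directly, so no
# BufferedReader buffer/lock is allocated and no position-tracking seek happens.
with open("Doc/c-api/typeobj.rst", "rb", buffering=0) as f:
    data = f.read()   # FileIO.readall(): fstat for a size estimate, then read until EOF

# The default (buffered) form allocates the extra layer being discussed:
with open("Doc/c-api/typeobj.rst", "rb") as f:   # BufferedReader wrapping FileIO
    data = f.read()
```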
My plan is to focus on getting the current set of changes out, as well as possibly talking a bit about what happens in a read(), since it's a lot more complicated than at least my mental model was (ex. os.read isn't an ideal primitive compared to _Py_read). With free threading and multiple interpreters, there are some intriguing possibilities with multi-dispatch and moving to Python-native coroutine I/O code (ex. every open is actually two system calls which could happen in parallel if a platform supports it, with the interpreter running other code while the OS does the needed work). That needs some prototyping and discussion out of scope of this change set.
TL;DR: No more changes planned outside follow-ups. If there are any particular things I can do to help move PRs along, let me know. The plan is to try to help where I can on open PRs and issues in the repository focused around the I/O stack.
I left some investigation comments on https://github.com/python/cpython/pull/121143 regarding the test failure leading to the revert. TL;DR: the system call name being tested for could also be statx, not just something with fstat in the name.
I'll work on opening a new PR for the strace refactor + fix the issues. Thanks for the fast revert and investigation.
- *fstat* + statx as acceptable calls. Reading through various I/O changes a bit, it's important that the just-opened FD is used to prevent TOCTOU security violations, so I don't want to just reduce to stat, which takes a filename. Also included a comment about that in the test code.
- Path().read_bytes() test case to additionally validate the reduction in isatty/ioctl + seek calls from https://github.com/python/cpython/pull/122111

@hauntsaninja, @gpshead: new PR with the strace helper applied + additional fixups from the revert: https://github.com/python/cpython/pull/123413
Note that the number of syscalls could theoretically go as low as 3 or 4 (open, read, read, close) if we added something like a max_read_size param to path.read_bytes(). It doesn't help in the common case, but it can help, for example, when reading lots of small files (like in /proc/), if you know the file is probably going to be less than, for example, 8092 bytes. Memory gets allocated for the full read size, so max_read_size shouldn't be set too high when used.
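A rough sketch of the syscall pattern such a hypothetical max_read_size parameter would enable (the name and API shape here are illustrative only):

```python
import os

def read_small(path: str, max_read_size: int = 8192) -> bytes:
    # open + read + close only: no fstat to learn the size, at the cost of
    # allocating max_read_size up front and truncating anything larger.
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.read(fd, max_read_size)
    finally:
        os.close(fd)
```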
(On Linux at least, os.read() raises IsADirectoryError, so the fstat call isn't needed on Linux to check if the file is a directory.)
> Running a simple python script (python simple.py that contains print("Hello, World!")) currently reads simple.py in full at least 4 times and does over 5 seeks. I have been pulling on that thread, but it interacts with importlib as well as how the Python compiler currently works, and I'm still trying to get my head around it. Would removing more of those overheads be something of interest / should I keep working to get my head around it?
I personally think it would be awesome to somehow get the syscall count reduced during import / startup, which would improve startup time. I believe the code for reading files during import/startup is in importlib._bootstrap_external.FileLoader.get_data().
It's tricky because io.open_code() could be hooked to be pretty much anything (so we probably can't just pass in buffering=0), but ideally the file bytes should be read in full with only the 5 syscalls (no seeks).
Your isatty and readall() fixes are probably already a huge improvement, though I'm guessing there's still the extra lseek() call in readall().
(It's a little funny that the io.open_code() hook is only for opening files, when all that's done is call .read() on the open file. Ideally the hook would be more like io.read_code_bytes(), which would be much easier to optimize.)
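Something like the following purely hypothetical shape (io.read_code_bytes() does not exist today); the point is that a bytes-returning hook leaves the implementation free to use the cheapest read path it likes:

```python
def read_code_bytes(path: str) -> bytes:
    # Hypothetical hook: return the module's bytes however the embedder wants;
    # the default could be a raw, unbuffered whole-file read.
    with open(path, "rb", buffering=0) as f:
        return f.read()
```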
I've been using this code on Linux for the last few years to cut down on import syscalls:

```python
import os
import sys

def faster_read_file_bytes(path: str, size=-1) -> bytes:
    fd = os.open(path, os.O_RDONLY)
    try:
        if size == -1:
            size = os.fstat(fd).st_size
        return os.read(fd, size)
    finally:
        os.close(fd)

def faster_get_data(loader: object, filepath: str) -> bytes:
    return faster_read_file_bytes(filepath)

def install_import_speedup() -> None:
    # Reach into the FileFinder path hook's closure and patch each loader
    # class's get_data so imports skip the full io.open_code() stack.
    assert sys.path_hooks[-1].__closure__
    for loader, _ in sys.path_hooks[-1].__closure__[1].cell_contents:
        loader.get_data = faster_get_data
```
re: max_read_size, my plan / hope as I get into BufferedIO changes is to make it so a buffer can be passed in rather than a max size, so that users can control buffer reuse / avoid an allocation if they want (ex. /proc/ scans like you gave). By default / with no buffer passed, the I/O code would call the allocator to get a new buffer with its reuse logic, but this would also enable code which focuses on scans of small files and similar to work really fast.
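As an illustration of the caller-supplied-buffer idea (the API shape below is hypothetical, not a planned interface, and os.readv is POSIX-only): one preallocated buffer is reused across many small files, so the per-file cost is essentially just the syscalls.

```python
import os

BUF = bytearray(8192)  # reused for every file; sized for "small" files

def scan_files(paths):
    for path in paths:
        fd = os.open(path, os.O_RDONLY)
        try:
            n = os.readv(fd, [BUF])      # read directly into the reusable buffer
        finally:
            os.close(fd)
        yield path, bytes(BUF[:n])       # copy out only the bytes actually read
```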
Unfortunately the _io.open() code which does the stat check for "is this a directory" has no guarantee that the caller will ever call read (and so can't rely on that). One of the paths I'm looking at, though, is being able to do multiple-system-call dispatch (ex. pathlib.Path().read_bytes() knows the whole I/O pattern that will be required and could do that sort of optimization; it could also potentially dispatch it all at once on OSes with the right APIs...). There's also the io.open_code hook, and the "is directory on open" check is cloned a couple of times around the codebase, which adds intricacies to reworking it; see: https://github.com/python/cpython/blob/c6127af8685c2a9b416207e46089cee79d028b85/Modules/main.c#L393-L398 https://github.com/python/cpython/blob/c6127af8685c2a9b416207e46089cee79d028b85/Python/pylifecycle.c#L2758-L2762
Currently for me, though, the cost of the extra seeks (and even isatty) is quite a bit less (<5% overhead on modern macOS or Linux) than the cost of the allocation/construction of a BufferedIO (making a new lock + buffer: >10% perf change), so I'm focused on other directions for a bit... BufferedIO is required for TextIO, which is what open() does by default (read bytes, then convert to text with the TextIO layer).
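For reference, the default stack open() builds in text mode is roughly the following layering (a simplified, hand-built equivalent; the real construction, including the isatty() line-buffering check, lives in _io.open()):

```python
import io

raw = io.FileIO("Doc/howto/clinic.rst", "r")          # open + fstat (directory check, st_blksize)
buffered = io.BufferedReader(raw)                     # allocates buffer + lock, tell()/seek for position
text = io.TextIOWrapper(buffered, encoding="utf-8")   # decodes bytes to str as they are read
print(text.read()[:40])
text.close()
```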
Definitely also looking at general imports, and all the scanning / stat calls required to try to find .pyc files and the like... Many thoughts, but I'm trying to keep this particular project scoped smallish so I can close it out.
> the cost of the allocations/construction of a BufferedIO
Ohh, maybe we could change the default io.open_code() implementation to be unbuffered? Possibly backward incompatible, but it would avoid a bunch of BufferedIO allocations during import/startup.
I think I can get BufferedIO to be cheap; it also shows up in reading folders with lots of .rst, .yaml, .toml, etc. files, which I've run into a lot. With https://peps.python.org/pep-0686/, getting to TextIO is required to decode the UTF-8 as needed, and TextIO requires BufferedIO (https://peps.python.org/pep-3116/#text-i-o, with some caveats as the codebase has evolved from the initial design). A number of the cases in Python startup do have "reduced" I/O stacks that avoid BufferedIO with a custom implementation to improve startup time today :). https://github.com/python/cpython/pull/122111 shows the potential overhead reduction. If people want optimization before BufferedIO changes, I highly recommend using pathlib.Path().read_bytes(). BufferedIO + TextIO do also special-case "read whole file", but there is unfortunately still quite a bit of overhead now that I/O is much faster and lower latency on 2024 hardware + OSes than when the original design and implementation happened.
> One of the paths I'm looking at though is being able to do multiple system call dispatch (ex. pathlib.Path().read_bytes() knows the whole I/O pattern which will be required, and could do that sort of optimization.

> highly recommend using pathlib.Path().read_bytes()
Also, it would be nice to get something like read_bytes and read_text directly in os or io, for those of us who feel like pathlib is heavyweight (though I do see it uses __slots__, so that's a plus).
> getting to TextIO is required to decode the utf-8 as needed, and TextIO requires BufferedIO
I think in my testing in the past I've found read_bytes().decode() is often faster than opening as text, at least for normal-sized files. TextIO is probably still worth it for large text files.
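The comparison being described, for concreteness (both produce a str; the first skips the BufferedReader/TextIOWrapper layers entirely):

```python
from pathlib import Path

p = Path("Doc/howto/clinic.rst")
via_bytes = p.read_bytes().decode("utf-8")   # raw read, then one decode call
via_text = p.read_text(encoding="utf-8")     # FileIO -> BufferedReader -> TextIOWrapper
# Equal here because the file uses "\n" newlines; read_text also performs
# universal-newline translation, so files with "\r\n" would differ.
assert via_bytes == via_text
```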
Planning to close this issue once the two outstanding PRs are closed. There are more improvements I think are possible, but they're large enough in scope as individual changes that they should likely get their own Discourse thread / GitHub issue / etc. In particular I'm looking at:

Something like an @unleash_dragons decorator that could rewrite lots of parallelizable dispatches through I/O into a multi-dispatch, ideally able to use things like io_uring where supported, but WaitForMultipleObjects on Windows as well. A particular motivating pattern I've written a bit:

```python
# Read full contents of a list of files
@unleash_dragons(gather_io=True)
def read_configs(all_paths):
    for p in all_paths:
        p.read_bytes()
```
Currently that does a lot of round-trips from Python -> OS -> Python; pipelining / parallelizing those requests (one io_uring wait call) significantly improves performance.
For the startup "read 4 times" case, I looked a bit and it's really complex to unwind for likely minimal performance gain (the OS file cache saves quite a bit). It's a combination of supporting https://peps.python.org/pep-0263/ and the tokenizer taking a FILE* (https://github.com/python/cpython/blob/main/Parser/tokenizer/tokenizer.h#L9-L11 and https://github.com/python/cpython/blob/main/Parser/peg_api.c#L18-L40). There is a lot of simplification possible, I think, but it would take a lot of refactoring for no particular feature addition or large performance win, especially as that particular invocation case is already often cached/bypassed using other mechanisms (ex. .pyc files).
Some additional complexity in refactoring is the long-standing io.open_code hook: it opens the file through Python's I/O, which then needs to become a FILE*, and the API is somewhat exposed in the C API (https://docs.python.org/3/c-api/file.html#c.PyFile_FromFd).
@cmaloney closed this as completed Nov 3, 2024
Congrats :-)
Thanks for your help in reviewing, helping me learn CPython, and getting the patches landed; sorry there were quite so many follow-up patches, but I think we got there :). I was measuring 3.13 vs. 3.14 main "read a whole file" performance, and it's generally 10-15% faster reading OS-cached files on macOS and Linux for an LTO-optimized build. Nice little wins while keeping overall API guarantees / stability :)
Python 3.14 commit: 7d7d56d8b1147a6b85e1c09d01b164df7c5c4942 (HEAD -> main, origin/main, origin/HEAD)
Python 3.13 commit: 60403a5409ff2c3f3b07dd2ca91a7a3e096839c7 (tag: v3.13.0, v3.13.0)

## Python 3.13
```
(pyperf_313_env) ➜ cpython git:(heads/v3.13.0) python ../bm_readall.py
read_file_small: Mean +- std dev: 6.21 us +- 0.10 us
read_file_large: Mean +- std dev: 18.1 us +- 0.6 us
read_all_rst_bytes: Mean +- std dev: 1.14 ms +- 0.01 ms
read_all_py_bytes: Mean +- std dev: 10.8 ms +- 0.1 ms
read_all_rst_text: Mean +- std dev: 2.09 ms +- 0.05 ms

(pyperf_313_env) ➜ cpython git:(heads/v3.13.0) python ../bm_readall.py
read_file_small: Mean +- std dev: 6.22 us +- 0.09 us
read_file_large: Mean +- std dev: 17.2 us +- 1.1 us
read_all_rst_bytes: Mean +- std dev: 1.10 ms +- 0.01 ms
read_all_py_bytes: Mean +- std dev: 11.7 ms +- 0.4 ms
read_all_rst_text: Mean +- std dev: 2.00 ms +- 0.05 ms

(pyperf_313_env) ➜ cpython git:(heads/v3.13.0) python ../bm_readall.py
read_file_small: Mean +- std dev: 6.06 us +- 0.08 us
read_file_large: Mean +- std dev: 17.0 us +- 0.6 us
read_all_rst_bytes: Mean +- std dev: 1.10 ms +- 0.02 ms
read_all_py_bytes: Mean +- std dev: 11.8 ms +- 0.2 ms
read_all_rst_text: Mean +- std dev: 2.06 ms +- 0.06 ms

(pyperf_313_env) ➜ cpython git:(heads/v3.13.0) python --version
Python 3.13.0
```

## Python 3.14a0+ (gh-120754 done)
```
(pyperf_314_env) ➜ cpython git:(main) python ../bm_readall.py
read_file_small: Mean +- std dev: 5.04 us +- 0.15 us
read_file_large: Mean +- std dev: 15.1 us +- 1.0 us
read_all_rst_bytes: Mean +- std dev: 820 us +- 32 us
read_all_py_bytes: Mean +- std dev: 9.31 ms +- 0.24 ms
read_all_rst_text: Mean +- std dev: 1.76 ms +- 0.08 ms

(pyperf_314_env) ➜ cpython git:(main) python ../bm_readall.py
read_file_small: Mean +- std dev: 5.02 us +- 0.08 us
read_file_large: Mean +- std dev: 14.6 us +- 0.6 us
read_all_rst_bytes: Mean +- std dev: 811 us +- 12 us
read_all_py_bytes: Mean +- std dev: 9.33 ms +- 0.15 ms
read_all_rst_text: Mean +- std dev: 1.75 ms +- 0.06 ms

(pyperf_314_env) ➜ cpython git:(main) python ../bm_readall.py
read_file_small: Mean +- std dev: 5.02 us +- 0.08 us
read_file_large: Mean +- std dev: 15.0 us +- 0.8 us
read_all_rst_bytes: Mean +- std dev: 809 us +- 12 us
read_all_py_bytes: Mean +- std dev: 9.29 ms +- 0.09 ms
read_all_rst_text: Mean +- std dev: 1.74 ms +- 0.02 ms

(pyperf_314_env) ➜ cpython git:(main) python --version
Python 3.14.0a1+
```

# MacOS

MacOS v3.13.0
```
(pyperf_313_venv) ➜ cpython git:(heads/v3.13.0) python ../bm_readall.py
read_file_small: Mean +- std dev: 9.34 us +- 0.26 us
read_file_large: Mean +- std dev: 22.3 us +- 0.6 us
read_all_rst: Mean +- std dev: 2.87 ms +- 0.03 ms
read_all_py: Mean +- std dev: 20.8 ms +- 0.3 ms

(pyperf_313_venv) ➜ cpython git:(heads/v3.13.0) python ../bm_readall.py
read_file_small: Mean +- std dev: 9.30 us +- 0.17 us
read_file_large: Mean +- std dev: 21.9 us +- 0.2 us
read_all_rst: Mean +- std dev: 2.87 ms +- 0.03 ms
read_all_py: Mean +- std dev: 20.8 ms +- 0.2 ms

(pyperf_313_venv) ➜ cpython git:(heads/v3.13.0) python ../bm_readall.py
read_file_small: Mean +- std dev: 9.22 us +- 0.10 us
read_file_large: Mean +- std dev: 21.8 us +- 0.2 us
read_all_rst: Mean +- std dev: 2.87 ms +- 0.04 ms
read_all_py: Mean +- std dev: 20.8 ms +- 0.5 ms

(pyperf_313_venv) ➜ cpython git:(heads/v3.13.0) python ../bm_readall.py
read_file_small: Mean +- std dev: 9.25 us +- 0.07 us
read_file_large: Mean +- std dev: 21.8 us +- 0.2 us
read_all_rst: Mean +- std dev: 2.86 ms +- 0.03 ms
read_all_py: Mean +- std dev: 20.8 ms +- 0.3 ms
```

MacOS v3.14.0a0+, commit 34653bba644aa5481613f398153757d7357e39ea (HEAD -> main, origin/main, origin/HEAD)
```
(pyperf_314_venv) ➜ cpython git:(main) python ../bm_readall.py
read_file_small: Mean +- std dev: 8.18 us +- 0.08 us
read_file_large: Mean +- std dev: 21.1 us +- 0.3 us
read_all_rst: Mean +- std dev: 2.58 ms +- 0.05 ms
read_all_py: Mean +- std dev: 19.0 ms +- 0.2 ms

(pyperf_314_venv) ➜ cpython git:(main) python ../bm_readall.py
read_file_small: Mean +- std dev: 8.17 us +- 0.11 us
read_file_large: Mean +- std dev: 21.1 us +- 0.3 us
read_all_rst: Mean +- std dev: 2.57 ms +- 0.03 ms
read_all_py: Mean +- std dev: 19.1 ms +- 0.3 ms

(pyperf_314_venv) ➜ cpython git:(main) python ../bm_readall.py
read_file_small: Mean +- std dev: 8.16 us +- 0.07 us
read_file_large: Mean +- std dev: 21.2 us +- 0.3 us
read_all_rst: Mean +- std dev: 2.58 ms +- 0.05 ms
read_all_py: Mean +- std dev: 19.0 ms +- 0.2 ms

(pyperf_314_venv) ➜ cpython git:(main) python ../bm_readall.py
read_file_small: Mean +- std dev: 8.20 us +- 0.07 us
read_file_large: Mean +- std dev: 21.1 us +- 0.3 us
read_all_rst: Mean +- std dev: 2.58 ms +- 0.03 ms
read_all_py: Mean +- std dev: 19.1 ms +- 0.4 ms
```

Benchmark script:
```python
import pyperf
from pathlib import Path

all_rst = list(Path("Doc").glob("**/*.rst"))
all_py = list(Path(".").glob("**/*.py"))
assert all_rst, "Should have found rst files. Is the current directory a cpython checkout?"
assert all_py, "Should have found python source files. Is the current directory a cpython checkout?"

def read_all_bytes(all_paths):
    for p in all_paths:
        p.read_bytes()

def read_all_text(all_paths):
    for p in all_paths:
        p.read_text()

def read_file(path_obj):
    path_obj.read_text()

def read_bytes(path_obj):
    path_obj.read_bytes()

# print(f"all_rst={len(all_rst)}|all_py={len(all_py)}")
# Make a fixed number across python versions. File sizes may still vary a bit,
# but this helps a lot for stability.
all_rst = all_rst[:200]
all_py = all_py[:3000]
assert len(all_rst) == 200
assert len(all_py) == 3000

runner = pyperf.Runner()
runner.bench_func('read_file_small', read_file, Path("Doc/howto/clinic.rst"))
runner.bench_func('read_file_large', read_file, Path("Doc/c-api/typeobj.rst"))
runner.bench_func('read_all_rst_bytes', read_all_bytes, all_rst)
runner.bench_func('read_all_py_bytes', read_all_bytes, all_py)
# NOTE: No test to read all `.py` as text as some contain invalid utf-8 byte sequences.
runner.bench_func('read_all_rst_text', read_all_text, all_rst)
```
Maybe you should document this optimization at: https://docs.python.org/dev/whatsnew/3.14.html#optimizations
### Feature or enhancement

Proposal:
I came across some seemingly redundant fstat() and lseek() calls when working on a tool that scanned a directory of lots of small YAML files and loaded their contents as config. In tracing I found most execution time wasn't in the Python interpreter but in system calls (on top of NFS in that case, which made some I/O calls particularly slow). I've been experimenting with a program that reads all .rst files in the Python Docs directory to try to remove some of those redundant system calls.

### Test Program
In my experimentation, with some tweaks to fileio I can remove over 10% of the system calls the test program makes when scanning the whole Doc folder for .rst files, on both macOS and Linux (I don't have a Windows machine to measure on).

### Current State (9 system calls)

Currently, on my Linux machine, reading a whole .rst file with the above code produces this series of system calls:

### Target State (~~7~~ 5 system calls)

It would be nice to get it down to the following (for small files; large files get an additional seek, see the caveat in the PR):
In a number of cases (ex. importing modules) there is often an fstat followed immediately by an open / read of the file (which typically does another fstat), but that is an extension point and I want to keep it out of scope for now.

### Questions rattling around in my head around this

Some of these are likely better for Discourse / longer-form discussion; happy to start threads there as appropriate.
- Running a simple python script (python simple.py that contains print("Hello, World!")) currently reads simple.py in full at least 4 times and does over 5 seeks. I have been pulling on that thread, but it interacts with importlib as well as how the Python compiler currently works, and I'm still trying to get my head around it. Would removing more of those overheads be something of interest / should I keep working to get my head around it?
- The _blksize member of fileio was added in bpo-21679. It is not used much as far as I can tell, either as its reflection _blksize in Python or in the code. The only usage I can find is https://github.com/python/cpython/blob/main/Modules/_io/_iomodule.c#L365-L374, where we could just query for it when needed to save some storage on all fileio objects (a simplified sketch of that usage follows this list). The behavior of using the stat-returned st_blksize is part of the docs, so it doesn't feel like we can fully remove it.
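A simplified Python-level picture of that one usage (the real code is the C in _iomodule.c linked above; treat this as an approximation):

```python
import io

raw = io.FileIO("Doc/howto/clinic.rst", "r")
# _blksize is filled from the fstat() st_blksize at open() time (it already
# defaults to io.DEFAULT_BUFFER_SIZE when st_blksize is unusable); open() only
# reads it here, to size the BufferedReader's buffer.
buffered = io.BufferedReader(raw, buffer_size=raw._blksize)
```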
### Has this already been discussed elsewhere?

This is a minor feature, which does not need previous discussion elsewhere
### Links to previous discussion of this feature:
No response
### Linked PRs