python / cpython

The Python programming language
https://www.python.org

Performance improvement for os.scandir on Windows #122885

Open Michael-K-Stein opened 1 month ago

Michael-K-Stein commented 1 month ago

Feature or enhancement

Proposal:

As has been mentioned in numerous issues (see #119169, for example), os.scandir and os.walk have several performance problems.
Digging into the implementation (see os_scandir_impl in posixmodule.c), I noticed that we currently use the Windows API to list a directory. The relevant functions (FindFirstFileW, FindNextFileW) are themselves wrappers around the native NT API, so building our Python wrapper on top of them seems redundant. I propose calling the native NT functions, for example NtQueryDirectoryFile, directly, since we are already implementing a wrapper ourselves.
After quite a bit of reading the Microsoft documentation and looking at Kernel32.dll and NtDll.dll, I have reached a stable implementation of this proposal. Unfortunately, I could not significantly reduce the number of syscalls performed; however, I did reduce the amount of memory allocation and copying by a ratio of 1:6. Additionally, this new implementation should help future work on Windows file system operations (see #99454 for an easy example).
This improves the performance of both os.scandir and os.walk, which is implemented on top of it.
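
To illustrate why a faster os.scandir also speeds up os.walk, here is a minimal sketch of a top-down walk built on os.scandir. It is a simplification and not the actual CPython implementation: `simple_walk` is a hypothetical name, and it ignores error handling, the `topdown` and `followlinks` options, and treats symlinks to directories as plain files.

```python
import os


def simple_walk(top):
    """Yield (dirpath, dirnames, filenames) like a stripped-down os.walk."""
    dirs, files = [], []
    # One os.scandir call per directory; each entry carries cached type
    # information, so classifying it usually needs no extra system call.
    with os.scandir(top) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                dirs.append(entry.name)
            else:
                files.append(entry.name)
    yield top, dirs, files
    # Recurse into each subdirectory found above.
    for name in dirs:
        yield from simple_walk(os.path.join(top, name))
```

Every directory visited costs exactly one os.scandir iteration, so any per-entry saving inside os.scandir is multiplied across the whole tree.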

Has this already been discussed elsewhere?

This is a minor feature, which does not need previous discussion elsewhere

Links to previous discussion of this feature:

No response


eryksun commented 1 month ago

Python has always limited itself to the Windows API. Back in the 1990s, Microsoft partially documented the user-mode NT API for use by third-party subsystems and by services associated with drivers. They've documented more of the NT API over the past 30 years. However, it has never been intended for direct use by applications. That doesn't stop some developers, but the fact that Microsoft hasn't aggressively discouraged this practice doesn't mean it's actually encouraged.

The Windows API has alternatives to FindFirstFileW() and FindNextFileW(). You can use GetFileInformationByHandleEx() with FileIdBothDirectory[Restart]Info, FileFullDirectory[Restart]Info, or FileIdExtdDirectory[Restart]Info.