python / cpython

The Python programming language
https://www.python.org

[Windows] OneDrive SharePoint sync folder - incorrect values returned by os.listdir and win32file.FindFilesW #102993

Open GordonAitchJay opened 1 year ago

GordonAitchJay commented 1 year ago

This is an issue experienced by a user on StackOverflow, so please excuse the lack of details and MRE. I'm hoping a Windows internals expert and/or a OneDrive dev can shed light on the situation.

Why does os.walk() (Python) ignore a OneDrive directory depending on the number of files in it?

The user has a directory which is a sync/shortcut of a SharePoint folder containing 897 files (all files can be opened; they are downloaded, not on-demand). When calling os.listdir with this directory, an exception is raised: OSError: [WinError 87] The parameter is incorrect:. However, if 2 files are deleted, it returns all the remaining files. If the directory is copied somewhere outside the purview of OneDrive, os.listdir returns all 897 files.

Calling win32file.FindFilesW behaves the same as os.listdir. With 897 files it raises an exception: error: (87, 'FindNextFileW', 'The parameter is incorrect.'). After deleting 2 files, it returns all the files.

When calling win32file.FindFilesIterator when the directory has all 897 files, 443 files are yielded before the error occurs. glob.glob() is the same but doesn't yield . or .. (as expected). Strangely, if only 1 file is deleted, win32file.FindFilesIterator yields only 25 files!
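
To check the partial-enumeration behaviour without pywin32, a small helper can count how many entries a lazy directory iterator yields before it fails. This is only a sketch: `count_until_error` is a hypothetical name, and it assumes that `os.scandir()`, which also enumerates lazily, fails at the same point as `FindFilesIterator`. Nobody in this thread has confirmed that.

```python
import os


def count_until_error(path):
    """Return (entries_seen, error); error is None if iteration completed."""
    count = 0
    try:
        # os.scandir() fetches entries lazily, so a failure partway through
        # the directory is raised from inside the loop, not up front.
        with os.scandir(path) as entries:
            for _ in entries:
                count += 1
    except OSError as exc:
        return count, exc
    return count, None
```

Running it on the affected directory before and after deleting a file would show whether the 443 and 25 counts reported above are stable.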

If the directory is copied to the local OneDrive root directory, os.listdir initially works (while OneDrive has only just started uploading the files). However, after a couple of minutes, once a number of the files have been uploaded, os.listdir raises OSError: [WinError 87] The parameter is incorrect: again. Even before all files have synced, win32file.FindFilesIterator again yields only 443 files.

Explorer always shows the full list of files, and so do cmd's dir and PowerShell's ls and gci.

Calling NtQueryDirectoryFile directly with ctypes always shows the full list of files.

I'm fairly sceptical that CPython is at fault here, but I find it utterly bizarre that cmd's dir and PowerShell's ls and gci, which all call FindNextFileW, work, yet when CPython calls the same function it predictably returns prematurely.

eryksun commented 1 year ago

OneDrive uses placeholder reparse points, which by default are disguised as regular files and directories if the process executable is outside of the "%SystemRoot%" tree. Placeholder reparse points are thus exposed to "cmd.exe" since it's inside "%SystemRoot%". You can check this via dir /al (i.e. list only reparse points). PowerShell 7, on the other hand, is installed in "%ProgramFiles%", so by default placeholders would be disguised for it. However, it opts into exposing placeholders by calling RtlSetProcessPlaceholderCompatibilityMode(PHCM_EXPOSE_PLACEHOLDERS).
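
The rule described above (placeholders are exposed by default only to executables inside the "%SystemRoot%" tree) can be illustrated with a plain path check. This is a rough sketch, not how Windows makes the decision internally. `is_under_system_root` is a hypothetical helper, and it uses `ntpath` so that Windows-style paths can be compared on any platform.

```python
import ntpath


def is_under_system_root(executable, system_root):
    """Return True if the executable path lies inside the system root tree."""
    exe = ntpath.normcase(ntpath.normpath(executable))
    root = ntpath.normcase(ntpath.normpath(system_root))
    try:
        return ntpath.commonpath([exe, root]) == root
    except ValueError:
        # Raised for paths on different drives, or a mix of absolute and
        # relative paths; neither can be inside the system root.
        return False
```

With a typical layout, `is_under_system_root(r"C:\Windows\System32\cmd.exe", r"C:\Windows")` is true, while a Python or PowerShell 7 executable under "%ProgramFiles%" is not, which matches the different defaults described for "cmd.exe" and PowerShell 7.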

CPython has opted to use the default setting that disguises placeholder reparse points. It could be that the filesystem filter driver that handles OneDrive reparse points is failing in some way when placeholders are disguised. To rule out placeholder disguising as the cause of the different behavior, you could ask the SO user to try running the following code before calling os.listdir() on the OneDrive "BigFolder" directory.

import ctypes
ntdll = ctypes.WinDLL('ntdll')
PHCM_EXPOSE_PLACEHOLDERS = 2
ntdll.RtlSetProcessPlaceholderCompatibilityMode.argtypes = (ctypes.c_char,)
ntdll.RtlSetProcessPlaceholderCompatibilityMode(PHCM_EXPOSE_PLACEHOLDERS)
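
A slightly more defensive variant of the snippet above might look like the following. It is a sketch under a few assumptions: `expose_placeholders` is a hypothetical name, the function is skipped on non-Windows platforms and on Windows versions where "ntdll" does not export it, and `c_byte` is used in place of `c_char` on the assumption that the mode is a signed 8-bit value whose return (the previous mode, or a negative value on error, as far as I understand Microsoft's documentation) is easier to inspect as an integer.

```python
import ctypes
import sys

PHCM_EXPOSE_PLACEHOLDERS = 2


def expose_placeholders():
    """Try to expose placeholders; return the call's result, or None if unavailable."""
    if sys.platform != "win32":
        return None
    ntdll = ctypes.WinDLL("ntdll")
    try:
        set_mode = ntdll.RtlSetProcessPlaceholderCompatibilityMode
    except AttributeError:
        # The export is missing on Windows versions that predate it.
        return None
    # Assumption: the mode is a signed 8-bit CHAR, so c_byte is used here
    # instead of the c_char declaration in the snippet above.
    set_mode.argtypes = (ctypes.c_byte,)
    set_mode.restype = ctypes.c_byte
    return set_mode(PHCM_EXPOSE_PLACEHOLDERS)
```

Call `expose_placeholders()` once, before the first `os.listdir()` on the OneDrive directory.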

FindFirstFileW() / FindNextFileW() repeatedly calls NtQueryDirectoryFileEx(): FileBothDirectoryInformation. The "Both" in the name of this info class refers to returning both the normal name and the short name (if any) of each directory entry. It also returns the size of the extended attributes (EaSize) set on a file or directory, if any. The latter field gets reused to return the reparse tag if the entry refers to a reparse point, since a reparse point can't have extended attributes.
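
The reuse of EaSize for the reparse tag can be shown on a synthetic buffer. The sketch below uses the simpler FileFullDirectoryInformation layout (no short name), which has the same EaSize field. `pack_entry` and `parse_entries` are hypothetical helpers that build and walk a hand-made buffer; they do not query a real directory.

```python
import struct

FILE_ATTRIBUTE_REPARSE_POINT = 0x400

# Fixed part of a FILE_FULL_DIR_INFORMATION entry: NextEntryOffset, FileIndex,
# four timestamps, EndOfFile, AllocationSize, FileAttributes, FileNameLength,
# EaSize. The UTF-16 file name follows immediately (68 bytes in).
HEADER = struct.Struct("<II6qIII")


def pack_entry(name, attributes, ea_size_or_tag, last=False):
    """Build one synthetic directory entry (for demonstration only)."""
    raw = name.encode("utf-16-le")
    padded = (HEADER.size + len(raw) + 7) & ~7  # keep entries 8-byte aligned
    next_offset = 0 if last else padded
    header = HEADER.pack(next_offset, 0, 0, 0, 0, 0, 0, 0,
                         attributes, len(raw), ea_size_or_tag)
    return (header + raw).ljust(padded, b"\0")


def parse_entries(buf):
    """Yield (name, ea_size, reparse_tag) by following NextEntryOffset."""
    offset = 0
    while True:
        fields = HEADER.unpack_from(buf, offset)
        next_offset, attributes, name_len, ea = (
            fields[0], fields[8], fields[9], fields[10])
        start = offset + HEADER.size
        name = buf[start:start + name_len].decode("utf-16-le")
        if attributes & FILE_ATTRIBUTE_REPARSE_POINT:
            yield name, 0, ea   # the field holds the reparse tag
        else:
            yield name, ea, 0   # the field holds the EA size
        if not next_offset:
            break
        offset += next_offset
```

A buffer built from `pack_entry("notes.txt", 0x20, 120)` followed by `pack_entry("link", 0x400, 0xA000000C, last=True)` parses into one regular file with an EA size of 120 and one reparse point carrying the symlink tag.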

Apparently my old ctypes code that calls NtQueryDirectoryFile(): FileDirectoryInformation works to list "BigFolder". Note that the FileDirectoryInformation info class doesn't include short names or the EA size / reparse tag. If it turns out that the issue is with disguised placeholders, it may be that the bug is limited to either the NtQueryDirectoryFileEx() system call or the FileBothDirectoryInformation information class. That could be something to explore further.

jwhendy commented 1 year ago

I am far from the Windows internals expertise required to parse all of the above. @GordonAitchJay and @eryksun : would you have a layman's summary on what this means for using Python with OneDrive, any possible workarounds, and what the ETA might look like?

I just started running into this issue this week, also with R. I wrote it off as a fluke until I ran into it with Python as well. I can do os.listdir(f'{base_dir}/..') and see that my target directory is listed, but os.listdir(base_dir) yields:


---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Input In [39], in <cell line: 1>()
----> 1 os.listdir(base_dir)

OSError: [WinError 87] The parameter is incorrect: 'C:/Users/username/OneDrive - Company/path/to/target_dir'

GordonAitchJay commented 1 year ago

@eryksun Thank you for your insight. It's very interesting!

What are the benefits of placeholder reparse points being disguised as regular files and directories if the process executable is outside of the "%SystemRoot%" tree? It's clearly a deliberate decision.

@jwhendy As far as I know, you're only the second Python user to have encountered this problem. I can't replicate it. Which directories do you have this problem with? It doesn't appear to be all directories managed by OneDrive, at least not all the time.

Instead of using os.listdir, you can use ctypes to call the lower level NtQueryDirectoryFile function. @eryksun originally posted an implementation in an answer to a question on StackOverflow. You would only need to make a few minor changes to make it a drop-in replacement for os.listdir.
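
One way to use such a replacement only when it is needed is to wrap `os.listdir` and fall back when it fails with WinError 87. This is a minimal sketch: `listdir_with_fallback` and its parameters are hypothetical, and `fallback` stands for whichever alternative implementation is chosen (for example, a ctypes-based one).

```python
import os

ERROR_INVALID_PARAMETER = 87


def listdir_with_fallback(path, fallback, primary=os.listdir):
    """Call primary(path); on WinError 87, return fallback(path) instead."""
    try:
        return primary(path)
    except OSError as exc:
        # OSError.winerror only exists on Windows, hence the getattr().
        if getattr(exc, "winerror", None) != ERROR_INVALID_PARAMETER:
            raise
        return fallback(path)
```

Any other error, such as a missing directory, is re-raised unchanged, so the wrapper only masks the specific failure reported here.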

Please follow @eryksun's suggestion above which will prevent placeholder reparse points from being disguised as regular files and directories. Before calling os.listdir(base_dir), run this:

import ctypes
ntdll = ctypes.WinDLL('ntdll')
PHCM_EXPOSE_PLACEHOLDERS = 2
ntdll.RtlSetProcessPlaceholderCompatibilityMode.argtypes = (ctypes.c_char,)
ntdll.RtlSetProcessPlaceholderCompatibilityMode(PHCM_EXPOSE_PLACEHOLDERS)

Does os.listdir(base_dir) now return the full list of files?

@eryksun Assuming the call to RtlSetProcessPlaceholderCompatibilityMode works as a workaround, should CPython instead opt into exposing placeholders? What are the downsides? Though it seems silly to me that all software that uses FindFirstFile/FindNextFile/FindClose to list files (seemingly the canonical way) and happens to be outside of the "%SystemRoot%" tree must be updated so it calls RtlSetProcessPlaceholderCompatibilityMode beforehand to avoid this issue.

I don't think there will be an ETA for a fix, since this similar issue raised by @eryksun was rejected because RtlSetProcessPlaceholderCompatibilityMode requires Windows 10.

I suppose it would be possible to change the implementation of os.listdir, since NtQueryDirectoryFile, NtQueryInformationFile, and CreateFileW are all available on Windows XP, but I don't think that will actually happen. glob.glob would need to be changed, too.

eryksun commented 1 year ago

> What are the benefits of placeholder reparse points being disguised as regular files and directories if the process executable is outside of the "%SystemRoot%" tree? It's clearly a deliberate decision.

This is explained in the documentation. Some programs mistakenly handle all reparse points as if they're symbolic links, instead of checking the name-surrogate bit [*] of the reparse tag using the macro IsReparseTagNameSurrogate() or checking for a symlink exactly. Python's os.stat() made this mistake prior to Python 3.7. To work around this, placeholder reparse points were implemented to be disguised by default for non-system processes.
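
The name-surrogate check mentioned above is a single bit test on the reparse tag. The sketch below mirrors what the `IsReparseTagNameSurrogate()` macro does; the Python function name is made up for illustration, and the three tag values are the standard Windows SDK constants.

```python
IO_REPARSE_TAG_MOUNT_POINT = 0xA0000003  # junction
IO_REPARSE_TAG_SYMLINK = 0xA000000C
IO_REPARSE_TAG_CLOUD = 0x9000001A        # cloud-files placeholder

NAME_SURROGATE_BIT = 0x20000000


def is_reparse_tag_name_surrogate(tag):
    """Return True if the tag marks a link-like (name-surrogate) reparse point."""
    return bool(tag & NAME_SURROGATE_BIT)
```

Symlinks and junctions have the bit set, while the cloud placeholder tag does not. A program that treats every reparse point as a symlink therefore misclassifies placeholders.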

> Assuming the call to RtlSetProcessPlaceholderCompatibilityMode works as a workaround, should CPython instead opt into exposing placeholders? What are the downsides?

A downside to exposing placeholder reparse points is that os.scandir() entries may not have updated basic stat data from the target of the placeholder, i.e. timestamps, file attributes, and file size. The entry may also have attributes that it otherwise doesn't have when placeholders are disguised, such as

> I suppose it would be possible to change the implementation of os.listdir, since NtQueryDirectoryFile, NtQueryInformationFile, and CreateFileW are all available on Windows XP

It's unlikely that Python's standard library will ever call NTAPI system calls directly, such as NtQueryDirectoryFile().

The os module could switch to using GetFileInformationByHandleEx() with one of the directory information classes, such as FileIdBothDirectoryInfo. I haven't reproduced this issue locally, but I can implement a demo of listdir() based on GetFileInformationByHandleEx() using ctypes.

I've actually wanted to switch to using FileIdBothDirectoryInfo for a long time, to support the 64-bit FileId and the ChangeTime in the stat() method of os.scandir() entries. (Note that if an entry is a reparse point, the reparse tag is returned in the EaSize field. A reparse point cannot have extended attributes. This is documented in [MS-FSCC] for NTAPI FileIdBothDirectoryInformation.)


[*] Here is some optional background information on the two types of name-surrogate reparse points that are commonly used in the NTFS and ReFS filesystems, and how Python supports them. It's off topic, but I think it's important to understanding the overall system of reparse points, which is a complex subject that's limited to just Windows.

When passed follow_symlinks=False, os.stat() opens symlinks, junctions, and any other name-surrogate type of reparse point. os.symlink() creates symlinks; there's no support for creating junctions. os.readlink() returns the target path of symlinks and junctions. os.unlink() deletes symlinks and junctions instead of deleting their targets, as does shutil.rmtree(), but shutil.copytree() copies and traverses a junction as a regular directory. os.path.islink() is true only for symlinks. In 3.12, os.path.isjunction() was added to test for junctions.
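
The distinctions above can be tried with a small classifier built from the standard library calls named in this paragraph. It is a sketch: `classify` is a hypothetical helper, and `os.path.isjunction()` is looked up with `getattr()` because it only exists in Python 3.12 and later.

```python
import os


def classify(path):
    """Return 'symlink', 'junction', 'directory', 'file', or 'other'."""
    # os.path.isjunction() was added in 3.12; treat it as always false before.
    isjunction = getattr(os.path, "isjunction", lambda p: False)
    if os.path.islink(path):      # true only for symlinks
        return "symlink"
    if isjunction(path):
        return "junction"
    if os.path.isdir(path):
        return "directory"
    return "file" if os.path.isfile(path) else "other"
```

The link checks come first because `os.path.isdir()` follows links and would otherwise report a directory symlink or a junction as a plain directory.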

eryksun commented 1 year ago

Here's a first draft ctypes-based prototype of scandir() and listdir() functions that call GetFileInformationByHandleEx() to query the directory information class FileIdBothDirectoryInfo. If the filesystem doesn't support FileIdBothDirectoryInfo, the implementation falls back on the older information class FileFullDirectoryInfo (e.g. currently the fallback is needed to list the named-pipe filesystem, "\\.\pipe\"). The path can be str, bytes, or a path-like object. It can also be a file descriptor for an open directory. A directory can be opened with os.open() by using the flag O_OBTAIN_DIR (0x2000).

If you can reproduce the reported problem with OneDrive and os.listdir(), please test these two functions, and let me know if they work.

import os
import stat
import msvcrt
import collections
import ctypes

from ctypes import wintypes

kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)

ERROR_INVALID_FUNCTION = 1
ERROR_NO_MORE_FILES = 18
ERROR_NOT_SUPPORTED = 50
ERROR_INVALID_PARAMETER = 87
ERROR_MORE_DATA = 234
ERROR_DIRECTORY = 267

INVALID_HANDLE_VALUE = wintypes.HANDLE(-1).value
FILE_TYPE_DISK = 1
FILE_READ_DATA = 1
FILE_SHARE_READ = 1
OPEN_EXISTING = 3
FILE_FLAG_BACKUP_SEMANTICS = 0x02000000
O_OBTAIN_DIR = 0x2000 # os.open() flag that opens with backup semantics

FileBasicInfo = 0
FileIdBothDirectoryInfo = 10
FileIdBothDirectoryRestartInfo = 11
FileFullDirectoryInfo = 14
FileFullDirectoryRestartInfo = 15

FILE_INFO_BY_HANDLE_CLASS = wintypes.ULONG
LPSECURITY_ATTRIBUTES = wintypes.LPVOID

kernel32.CreateFileW.restype = wintypes.HANDLE
kernel32.CreateFileW.argtypes = (
    wintypes.LPCWSTR,      # In     lpFileName
    wintypes.DWORD,        # In     dwDesiredAccess
    wintypes.DWORD,        # In     dwShareMode
    LPSECURITY_ATTRIBUTES, # In_opt lpSecurityAttributes
    wintypes.DWORD,        # In     dwCreationDisposition
    wintypes.DWORD,        # In     dwFlagsAndAttributes
    wintypes.HANDLE)       # In_opt hTemplateFile

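kernel32.GetFileInformationByHandleEx.argtypes = (
    wintypes.HANDLE,           # In  hFile
    FILE_INFO_BY_HANDLE_CLASS, # In  FileInformationClass
    wintypes.LPVOID,           # Out lpFileInformation
    wintypes.DWORD)            # In  dwBufferSize

# Declare the HANDLE parameters of the remaining calls explicitly so that
# handle values are passed at full pointer width instead of as a C int.
kernel32.GetFileType.restype = wintypes.DWORD
kernel32.GetFileType.argtypes = (wintypes.HANDLE,)
kernel32.CloseHandle.argtypes = (wintypes.HANDLE,)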

stat_result = collections.namedtuple('stat_result',
    ('st_mode', 'st_ino', 'st_dev', 'st_nlink', 'st_uid', 'st_gid', 'st_size',
     'st_atime', 'st_mtime', 'st_ctime', 'st_btime', 'st_atime_ns',
     'st_mtime_ns', 'st_ctime_ns', 'st_btime_ns', 'st_change_time',
     'st_change_time_ns', 'st_file_attributes', 'st_reparse_tag'))

class FILE_BASIC_INFO(ctypes.Structure):
    _fields_ = (('CreationTime',   wintypes.LARGE_INTEGER),
                ('LastAccessTime', wintypes.LARGE_INTEGER),
                ('LastWriteTime',  wintypes.LARGE_INTEGER),
                ('ChangeTime',     wintypes.LARGE_INTEGER),
                ('FileAttributes', wintypes.DWORD))

class FILE_BASE_DIR_INFO(ctypes.Structure):
    __slots__ = ()

    @property
    def FileName(self):
        length = self._FileNameLength
        if not length:
            return ''
        addr = ctypes.addressof(self) + type(self)._FileName.offset
        size = length // ctypes.sizeof(wintypes.WCHAR)
        return (wintypes.WCHAR * size).from_address(addr).value

    @property
    def EaSize(self):
        # Since a reparse point cannot have extended attributes, the EaSize
        # field is reused to store the reparse tag if the entry is a reparse
        # point. This behavior is documented in [MS-FSCC].
        # https://learn.microsoft.com/openspecs/windows_protocols/ms-fscc/e8d926d1-3a22-4654-be9c-58317a85540b
        if not (self.FileAttributes & stat.FILE_ATTRIBUTE_REPARSE_POINT):
            return self._EaSize
        return 0

    @property
    def ReparseTag(self):
        # See the comment about EaSize.
        if self.FileAttributes & stat.FILE_ATTRIBUTE_REPARSE_POINT:
            return self._EaSize
        return 0

class FILE_FULL_DIR_INFO(FILE_BASE_DIR_INFO):
    __slots__ = ()
    _fields_ = (('_NextEntryOffset', wintypes.DWORD),
                ('_FileIndex',       wintypes.DWORD),
                ('CreationTime',     wintypes.LARGE_INTEGER),
                ('LastAccessTime',   wintypes.LARGE_INTEGER),
                ('LastWriteTime',    wintypes.LARGE_INTEGER),
                ('ChangeTime',       wintypes.LARGE_INTEGER),
                ('EndOfFile',        wintypes.LARGE_INTEGER),
                ('AllocationSize',   wintypes.LARGE_INTEGER),
                ('FileAttributes',   wintypes.DWORD),
                ('_FileNameLength',  wintypes.DWORD),
                ('_EaSize',          wintypes.DWORD),
                ('_FileName',        wintypes.WCHAR * 1))

class FILE_ID_BOTH_DIR_INFO(FILE_BASE_DIR_INFO):
    __slots__ = ()
    _fields_ = (('_NextEntryOffset', wintypes.DWORD),
                ('_FileIndex',       wintypes.DWORD),
                ('CreationTime',     wintypes.LARGE_INTEGER),
                ('LastAccessTime',   wintypes.LARGE_INTEGER),
                ('LastWriteTime',    wintypes.LARGE_INTEGER),
                ('ChangeTime',       wintypes.LARGE_INTEGER),
                ('EndOfFile',        wintypes.LARGE_INTEGER),
                ('AllocationSize',   wintypes.LARGE_INTEGER),
                ('FileAttributes',   wintypes.DWORD),
                ('_FileNameLength',  wintypes.DWORD),
                ('_EaSize',          wintypes.DWORD),
                ('_ShortNameLength', wintypes.BYTE),
                ('_ShortName',       wintypes.WCHAR * 12),
                ('FileId',           wintypes.LARGE_INTEGER),
                ('_FileName',        wintypes.WCHAR * 1))

class DirEntry:
    __slots__ = ('_dirpath', '_info')

    def __init__(self, dirpath, info):
        self._dirpath = dirpath
        self._info = info

    def __repr__(self):
        return '<{} {!r}>'.format(self.__class__.__name__, self.name)

    @classmethod
    def _listbuf(cls, buf, info_class, dirpath):
        result = []
        if info_class == FileIdBothDirectoryInfo:
            info_struct = FILE_ID_BOTH_DIR_INFO
        elif info_class == FileFullDirectoryInfo:
            info_struct = FILE_FULL_DIR_INFO
        else:
            raise ValueError('unsupported information class')
        base_size = ctypes.sizeof(info_struct) - ctypes.sizeof(wintypes.WCHAR)
        offset = 0
        while True:
            tmp = info_struct.from_buffer(buf, offset)
            if tmp._FileNameLength and tmp.FileName not in ('.', '..'):
                info = info_struct()
                size = base_size + tmp._FileNameLength
                ctypes.resize(info, size)
                ctypes.memmove(ctypes.byref(info), ctypes.byref(tmp), size)
                entry = cls(dirpath, info)
                result.append(entry)
            if tmp._NextEntryOffset:
                offset += tmp._NextEntryOffset
            else:
                break
        return result

    def _is_name_surrogate(self):
        return bool(self._info.ReparseTag & 0x20000000)

    def _is_reparse_point(self):
        return bool(self._info.FileAttributes &
                    stat.FILE_ATTRIBUTE_REPARSE_POINT)

    @property
    def name(self):
        if isinstance(self._dirpath, bytes):
            return os.fsencode(self._info.FileName)
        return self._info.FileName

    @property
    def path(self):
        return os.path.join(self._dirpath, self.name)

    def stat(self, follow_symlinks=True):
        def nt_time_as_posix_ns(t):
            if t == 0:
                return 0
            # NT has an epoch of 1601, and its time unit is 100 ns.
            return (t - 116444736000000000) * 100

        if (self._is_reparse_point() and
               (follow_symlinks or not self._is_name_surrogate())):
            return os.stat(self.path)
        if self._info.ReparseTag == stat.IO_REPARSE_TAG_SYMLINK:
            mode = stat.S_IFLNK
        elif self._info.FileAttributes & stat.FILE_ATTRIBUTE_DIRECTORY:
            mode = stat.S_IFDIR
        else:
            pipe_paths = ('\\\\.\\pipe', '\\\\?\\pipe')
            drive = os.path.splitdrive(os.fsdecode(self._dirpath))[0]
            if drive and os.path.normcase(drive) in pipe_paths:
                mode = stat.S_IFIFO
            else:
                mode = stat.S_IFREG
        file_id = getattr(self._info, 'FileId', 0)
        atime_ns = nt_time_as_posix_ns(self._info.LastAccessTime)
        mtime_ns = nt_time_as_posix_ns(self._info.LastWriteTime)
        # BUGBUG: POSIX st_ctime should be the metadata change time, and
        # st_btime should be the creation (birth) time. But Python
        # follows the Windows C runtime implementation, which back in the
        # days of MS-DOS in the 1980s, before there was even a POSIX
        # standard, chose to redefine Unix st_ctime as the creation time.
        # They should have added a new field for the creation time, and
        # they should have ignored st_ctime until they had a filesystem
        # that supported it, i.e. NTFS on Windows NT in 1993.
        ctime_ns = nt_time_as_posix_ns(self._info.CreationTime)
        btime_ns = nt_time_as_posix_ns(self._info.CreationTime)
        change_time_ns = nt_time_as_posix_ns(self._info.ChangeTime)
        return stat_result(
                st_mode=mode,
                st_ino=file_id,
                st_dev=0,
                st_nlink=0,
                st_uid=0,
                st_gid=0,
                st_size=self._info.EndOfFile,
                st_atime=atime_ns // 10**9,
                st_mtime=mtime_ns // 10**9,
                st_ctime=ctime_ns // 10**9,
                st_btime=btime_ns // 10**9,
                st_atime_ns=atime_ns,
                st_mtime_ns=mtime_ns,
                st_ctime_ns=ctime_ns,
                st_btime_ns=btime_ns,
                st_change_time=change_time_ns // 10**9,
                st_change_time_ns=change_time_ns,
                st_file_attributes=self._info.FileAttributes,
                st_reparse_tag=self._info.ReparseTag)

    def inode(self):
        if (not hasattr(self._info, 'FileId') or
              (self._is_reparse_point() and not self._is_name_surrogate())):
            return os.stat(self.path).st_ino
        return self._info.FileId

    def is_dir(self, follow_symlinks=True):
        if self._is_reparse_point():
            if follow_symlinks or not self._is_name_surrogate():
                return os.path.isdir(self.path)
            if self._info.ReparseTag == stat.IO_REPARSE_TAG_SYMLINK:
                return False
        if self._info.FileAttributes & stat.FILE_ATTRIBUTE_DIRECTORY:
            return True
        return False

    def is_file(self, follow_symlinks=True):
        if self._is_reparse_point():
            if follow_symlinks or not self._is_name_surrogate():
                return os.path.isfile(self.path)
            if self._info.ReparseTag == stat.IO_REPARSE_TAG_SYMLINK:
                return False
        if self._info.FileAttributes & stat.FILE_ATTRIBUTE_DIRECTORY:
            return False
        pipe_paths = ('\\\\.\\pipe', '\\\\?\\pipe')
        drive = os.path.splitdrive(os.fsdecode(self._dirpath))[0]
        if drive and os.path.normcase(drive) in pipe_paths:
            return False
        return True

    def is_symlink(self):
        return self._info.ReparseTag == stat.IO_REPARSE_TAG_SYMLINK

    def is_junction(self):
        return self._info.ReparseTag == stat.IO_REPARSE_TAG_MOUNT_POINT

def scandir(path=None):
    """Return an iterator of DirEntry objects for given path."""
    if path is None:
        path = os.getcwd()

    def isdir():
        info = FILE_BASIC_INFO()
        if kernel32.GetFileInformationByHandleEx(
                hFile, FileBasicInfo, ctypes.byref(info),
                ctypes.sizeof(info)):
            return info.FileAttributes & stat.FILE_ATTRIBUTE_DIRECTORY
        return False

    def readdir():
        nonlocal info_class
        if kernel32.GetFileInformationByHandleEx(
                hFile, info_class, buf, ctypes.sizeof(buf)):
            return True
        error = ctypes.get_last_error()
        if error == ERROR_NO_MORE_FILES:
            return False
        elif (info_class == FileIdBothDirectoryRestartInfo and
                error in (ERROR_INVALID_FUNCTION,
                          ERROR_NOT_SUPPORTED,
                          ERROR_INVALID_PARAMETER)):
            info_class = FileFullDirectoryRestartInfo
            return readdir()
        elif error == ERROR_MORE_DATA:
            ctypes.resize(buf, ctypes.sizeof(buf) * 2)
            return readdir()
        raise ctypes.WinError(error)

    def ScandirIterator():
        try:
            while True:
                yield from DirEntry._listbuf(buf, info_class, dirpath)
                if not readdir():
                    break
        finally:
            if close:
                os.close(fd)

    close = False
    try:
        if isinstance(path, int):
            fd = path
            hFile = msvcrt.get_osfhandle(fd)
            if kernel32.GetFileType(hFile) != FILE_TYPE_DISK:
                raise ValueError('if path is a file descriptor, it must '
                                 'refer to a file on a volume device')
            dirpath = ''
        else:
            path = os.fspath(path)
            hFile = kernel32.CreateFileW(
                        os.fsdecode(path), FILE_READ_DATA, FILE_SHARE_READ,
                        None, OPEN_EXISTING, FILE_FLAG_BACKUP_SEMANTICS, None)
            if hFile == INVALID_HANDLE_VALUE:
                raise ctypes.WinError(ctypes.get_last_error())
            try:
                fd = msvcrt.open_osfhandle(hFile, os.O_RDONLY)
            except:
                kernel32.CloseHandle(hFile)
                raise
            close = True
            dirpath = path
        if not isdir():
            raise ctypes.WinError(ERROR_DIRECTORY)
        buf = (ctypes.c_char * 65536)()
        info_class = FileIdBothDirectoryRestartInfo
        readdir()
        if info_class == FileIdBothDirectoryRestartInfo:
            info_class = FileIdBothDirectoryInfo
        elif info_class == FileFullDirectoryRestartInfo:
            info_class = FileFullDirectoryInfo
    except:
        if close:
            os.close(fd)
        raise

    return ScandirIterator()

def listdir(path=None):
    """Return a list containing the names of the files in the directory."""
    return [e.name for e in scandir(path)]
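
The timestamp conversion inside DirEntry.stat() can be checked in isolation on any platform. The sketch below derives the constant 116444736000000000 from the two epochs instead of hard-coding it. The names `NT_EPOCH`, `POSIX_EPOCH`, and `EPOCH_DELTA_100NS` are introduced here for illustration.

```python
from datetime import datetime, timedelta, timezone

NT_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)
POSIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Distance between the two epochs, expressed in NT's 100 ns time units.
EPOCH_DELTA_100NS = (POSIX_EPOCH - NT_EPOCH) // timedelta(microseconds=1) * 10


def nt_time_as_posix_ns(t):
    """Convert an NT timestamp (100 ns units since 1601) to POSIX nanoseconds."""
    if t == 0:
        # As in the prototype, an unset NT timestamp maps to 0.
        return 0
    return (t - EPOCH_DELTA_100NS) * 100
```

An NT timestamp one second past the POSIX epoch, `EPOCH_DELTA_100NS + 10_000_000`, converts to exactly 10**9 nanoseconds.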