llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
28.24k stars 11.66k forks source link

Windows llvm::sys::fs::remove algorithm is *VERY* subtly incorrect, resulting in unpredictable "directory not empty" errors #27760

Open llvmbot opened 8 years ago

llvmbot commented 8 years ago
Bugzilla Link 27386
Version trunk
OS Windows NT
Attachments The MSBuild output of two tests failing because of this bug. There's a third failure (tools/llvm-symbolizer/ppc64.test), unrelated to this bug., By the time I check the folder, it is empty. Windows "marks" files for deletion, and then later actually unlinks/deletes them. Here, the contents of this directory weren't actually deleted by the time that a unittest tried to delete the directory.
Reporter LLVM Bugzilla Contributor
CC @AaronBallman,@d0k,@compnerd,@dyung,@hfinkel,@pogo59,@rnk,@vedantk

Extended Description

Under heavy IO load, removing directories on Windows with llvm::sys::fs:remove will fail unpredictably.

I've just observed the TestWriteWithSystemEncoding test in ./llvm/unittests/Support/ProgramTest.cpp failing because of this. That test creates a kinda-unique directory named "program-test-%%%%%%" (where % is a pseudorandom alphanumeric), creates a file named "international-file.txt" in it, and roundtrip-tests some text (writes some text with one encoding, reads it with another, checks if it's not garbled). After that, it deletes the file (llvm::sys::fs::remove(file_pathname.str())) and then the directory (llvm::sys::fs::remove(TestDirectory.str())). Essentially, when Windows "Deletes" a file, it doesn't actually Delete the file; when the system is under load, "international-file.txt" may STILL be in the directory by the time of llvm::sys::fs::remove(TestDirectory.str()).

The details of WHY this will fail are very counter-intuitive, and you should see "CppCon 2015: Niall Douglas “Racing The File System" (https://youtu.be/uhRWMGBjlO8?t=7m35s) for specifically why the code is wrong. Lots of programs get this wrong, Douglas mentions CPython, and I've seen it in Chromium.

Sidenote: The documentation for RemoveDirectory (https://msdn.microsoft.com/en-us/library/windows/desktop/aa365488.aspx) is kinda crappy, because it says "RemoveDirectory removes a directory junction, even if the contents of the target are not empty; the function removes directory junctions regardless of the state of the target object.", but "Creating and Deleting Directories" (https://msdn.microsoft.com/en-us/library/windows/desktop/aa363872.aspx) says: "To delete an existing directory, use the RemoveDirectory or RemoveDirectoryTransacted function. Before removing a directory, you must ensure that the directory is empty and that you have the delete access privilege for the directory. To do the latter, call the GetSecurityInfo function."

llvmbot commented 8 years ago

Do you have benchmarks comparing its performance to the existing implementation?

Actually, no. Right there, that's enough of a reason for me to agree.

Given that IFileOperation is basically the workhorse of file deletion and is what you get every time you press Delete in windows explorer, I have to imagine that Microsoft has tried very hard to optimize it to the best of their ability, and that if it's slower than some other hand-rolled mechanism, there is a good reason for it. Otherwise why wouldn't they use the faster mechanism?

I'm not saying it's impossible they have a poor implementation of IFileOperation, but I think it would be worth doing due diligence and producing some benchmarks to solidify this belief. There's point committing complicated code if simple code will do the job.

llvmbot commented 8 years ago

Do you have benchmarks comparing its performance to the existing implementation? I have had probelms deleting from explorer too, but the problems I've had have not been of a racy nature. I've never experienced a problem where if I try to delete something N times it fails the first N-1 times and succeeds the N'th time. The problems have always been "this file appears to be locked for no reason and there's nothing I can to delete it".

Given that IFileOperation is basically the workhorse of file deletion and is what you get every time you press Delete in windows explorer, I have to imagine that Microsoft has tried very hard to optimize it to the best of their ability, and that if it's slower than some other hand-rolled mechanism, there is a good reason for it. Otherwise why wouldn't they use the faster mechanism?

I'm not saying it's impossible they have a poor implementation of IFileOperation, but I think it would be worth doing due diligence and producing some benchmarks to solidify this belief. There's point committing complicated code if simple code will do the job.

llvmbot commented 8 years ago

Maybe I'm missing something, but how about just using the Windows function whose entire purpose is to recursively remove a directory and all the files in it?

SHFILEOPSTRUCT FileOp = {0}; FileOp.hwnd = nullptr; FileOp.wFunc = FO_DELETE; FileOp.pFrom = ; FileOp.pTo = nullptr; FileOp.fFlags = FOF_NO_UI; SHFileOperation(&FileOp);

Mainly, It's horrifically slow. I'm also not confident that it's robust just because explorer uses it... I've had issues deleting directory trees in explorer under heavy load in the past.

AaronBallman commented 8 years ago

If we don't care about supporting pre-Vista, then the "recommended" way to do this is through the IFileOperation interface, btw.

We don't care about supporting pre-Vista. We only care about Windows 7 and later (see WindowsSupport.h:30).

llvmbot commented 8 years ago

If we don't care about supporting pre-Vista, then the "recommended" way to do this is through the IFileOperation interface, btw.

Note this is the same code explorer uses when you select a directory and hit Delete, so it definitely is robust under system load.

llvmbot commented 8 years ago

Maybe I'm missing something, but how about just using the Windows function whose entire purpose is to recursively remove a directory and all the files in it?

SHFILEOPSTRUCT FileOp = {0}; FileOp.hwnd = nullptr; FileOp.wFunc = FO_DELETE; FileOp.pFrom = ; FileOp.pTo = nullptr; FileOp.fFlags = FOF_NO_UI; SHFileOperation(&FileOp);

llvmbot commented 8 years ago

Minor comments, the code needs to keep the UTF16 (widenPath) support, see stuff at end of WindowsSupport.h and maybe sys::fs::createTemporaryFile (or the lower level createUniqueEntity) could be reused to create the temporary filename.

llvmbot commented 8 years ago

(part 2, what I think the patch will do)

Since LLVM doesn't use the NT kernel/native API, we'll need to do it a little bit differently.

The biggest difference is that we will use CryptGenRandom, the Win32 public facing API for generating random bytebuffers - internally it calls RtlGenRandom - and it's already used in LLVM (see ./llvm/lib/Support/Windows/Process.inc function Process::GetRandomNumber).

We will:

  1. Open the file with CreateFileW (with dwShareMode set to FILE_SHARE_DELETE, dwFlagsAndAttributes set to FILE_FLAG_DELETE_ON_CLOSE)
  2. Create a 64 character SmallVector
  3. Get a handle to a crypto provider with CryptAcquireContextW, store it in a ScopedCryptContext, and pass that to CryptGenRandom to fill the SmallVector with random data
  4. Rename the file with MoveFileW, passing the SmallVector.data() casted to a PCWSTR (??)
  5. Delete the file with DeleteFileW (not sure why AFIO explicitly deletes the file, after opening it with FILE_DELETE_ON_CLOSE, I'm looking into this)
  6. Close the file handle.

Questions remaining:

  1. Should we use llvm::sys::fs::rename instead of MoveFileW?
  2. Should we handle paths longer than "MAX_PATH"? (Which is totally supported, with a funky syntax)
  3. What's going on in AFIO's to_hex_string? All the bit shifting is giving me a headache. I'm going to ask Niall, the docs say that the bitshifting is part of packing bytes into the string, but I'm not sure why the bitshifting is necessary?
  4. Do we need to call RemoveDirectoryW to delete an empty directory, or will DeleteFile do?
llvmbot commented 8 years ago

So here's what I'm thinking about the patch:

It'll use the same basic algorithm that Niall Douglas used in Boost AFIO v1 (see: https://github.com/BoostGSoC13/boost.afio/blob/b35ed5b9783a4dfde9b3d611dfb940739f303c97/include/boost/afio/v2/detail/impl/afio_iocp.ipp#L744), except using the Win32 API instead of the NT kernel/native API.

(separating into two comments so that my text doesn't evaporate)

AFIO:

  1. Opens the file with NtCreateFile (with ShareAccess set to FILE_SHARE_DELETE, CreateOptions set to FILE_DELETE_ON_CLOSE).
  2. then (in random_string) uses the NT kernel/native API RtlGenRandom to generate a cryptographically random bytebuffer(in random_fill)
  3. converts the bytebuffer to a std::string (in to_hex_string)
  4. renames the file to the random name with NtSetInformationFile (with FileInformationClass set to FileRenameInformation)
  5. deletes the file by calling NtSetInformationFile with FileDispositionInformation (where FILE_DISPOSITION_INFORMATION.DeleteFile is set to true)
llvmbot commented 8 years ago

So, is this a bug in remove or in the users?

Should we just change ProgramTest.cpp to not use a directory and document the restriction on remove? Anyone that really needs to delete directories then should use rename + remove.

I think I'll try and write a patch to fix this (in llvm::sys::fs::remove). It's better for the function to just do the right thing, because this issue is subtle enough to begin with that it's unlikely someone will get it right in their own implementation.

llvmbot commented 8 years ago

If I understand correctly, the problem happens even when erasing one file. When the system is loaded Windows may defer the actual file deletion. This may lead to all sorts of problem beside erasing the containing directory since remove() user expects the file be gone upon return.

Just below Windows\Path.inc remove(), rename() waits in a loop (with sleep of 1ms):

// Retry while we see recoverable errors. // System scanners (eg. indexer) might open the source file when it is written // and closed.

bool TryReplace = true;

for (int i = 0; i < 2000; i++) { if (i > 0) ::Sleep(1);

if (TryReplace) {

...

so remove() may do the same, waiting for the file/path to be actually deleted before return. Or it can do the rename/delete combo which makes more sense to avoid the wait loop.

llvmbot commented 8 years ago

So, is this a bug in remove or in the users?

Should we just change ProgramTest.cpp to not use a directory and document the restriction on remove? Anyone that really needs to delete directories then should use rename + remove.

llvmbot commented 8 years ago

The test had similar problem to what you describe: a file deleted by rm still existed when the next command is run.

At first some workarounds were tried to avoid removing the file and immediately using it but it still often failed and was disabled on Windows:

http://lists.llvm.org/pipermail/cfe-commits/Week-of-Mon-20150608/130450.html

The rm command may have the same problem you found in llvm::sys::fs:remove and the other programs mentioned.

llvmbot commented 8 years ago

Maybe bash rm on Windows has the same problem?

http://lists.llvm.org/pipermail/cfe-commits/Week-of-Mon-20150608/130437.html

Heh, I wouldn't be surprised. I'm a bit rusty on my shell scripting - what do the changed lines in that commit do?

llvmbot commented 8 years ago

Maybe bash rm on Windows has the same problem?

http://lists.llvm.org/pipermail/cfe-commits/Week-of-Mon-20150608/130437.html

Heh, I wouldn't be surprised. I'm a bit rusty on my shell scripting - what do the changed lines in that commit do?

llvmbot commented 8 years ago

Maybe bash rm on Windows has the same problem?

http://lists.llvm.org/pipermail/cfe-commits/Week-of-Mon-20150608/130437.html

llvmbot commented 8 years ago

This is the source code of llvm::sys::fs::remove, for your convenience.

llvmbot commented 8 years ago

Note that the function with the bug is in ./llvm/lib/Support/Windows/Path.inc, NOT in ./llvm/lib/Support/Unix/Path.inc

Sidenote: maybe "System Libraries" is the proper component?