rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org
Other
97.6k stars 12.62k forks source link

fs::remove_dir_all rarely succeeds for large directories on windows #29497

Open Diggsey opened 8 years ago

Diggsey commented 8 years ago

I've been trying to track this one down for a while with little success. So far I've been able to confirm that no external programs are holding long-standing locks on any of the files in the directory, and I've confirmed that it happens with trivial rust programs such as the following:

use std::fs;

fn main() {
    println!("{:?}", fs::remove_dir_all("<path>"));
}

I've also confirmed that deleting the folder from explorer (equivalent to using SHFileOperation), or using rmdir on the command line will always succeed.

Currently my best guess is that either windows or other programs are holding transient locks on some of the files in the directory, possibly as part of the directory indexing windows performs to enable efficient searching.

Several factors led me to think this:

Maybe there should be some sort of automated retry system, such that temporary locks do not hinder progress?

edit: The errors returned seem somewhat random: sometimes it gives a "directory is not empty" error, while sometimes it will show up as an "access denied" error.

retep998 commented 8 years ago

I think this is just race conditions at work. When you delete a file another program might get a notification and check out what happened and that might hold a temporary lock on something. Or it might be that the previous deletions weren't actually done by the time you tried to delete the directory.

petrochenkov commented 8 years ago

Hm, https://github.com/CppCon/CppCon2015/blob/master/Tutorials/Racing%20the%20Filesystem/Racing%20the%20Filesystem%20-%20Niall%20Douglas%20-%20CppCon%202015.pdf slide "3. Deleting a directory tree" may be related. I accidentally watched this presentation yesterday. If I understood correctly, DeleteFileW used in remove_dir_all doesn't actually delete a file, but only marks it for deletion, it still can be alive when we try to delete the parent directory causing the "directory is not empty" error.

Diggsey commented 8 years ago

@petrochenkov Great link - that seems like it's exactly the problem I'm experiencing.

It would be great to see some of these portable race-free idioms implemented in std.

alexcrichton commented 8 years ago

I'd be interested to dig a bit more into this to see what's going on. According to the DeleteFile documentation the only reason the file would be flagged for deletion would be if there are active open handles, which seems like a legitimate race condition? Note though that the same docs also recommend using SHFileOperation for recursively deleting a tree.

Those slides are certainly an interesting read though! I'd be a little wary of putting it into the standard library as the "transactional semantics" are a little weird. For example if you successfully rename a file to a temporary directory and then fail to delete it, what happens?

It may be the case that the best implementation for this function on Windows is to just use SHFileOperation, and that may end up just taking care of these sorts of issues on its own.

pitdicker commented 8 years ago

I will give fixing this a try.

On Windows XP we could use SHFileOperation, and on Vista and up IFileOperation. For Windows store applications there is no easy function. But I would still like to try implementing it by walking the directories and deleting recursively.

Current problems with remove_dir_all:

And what I hope will fix it:

I think we know deleting will fail when opening the file with FILE_FLAG_DELETE_ON_CLOSE fails, so before moving.

pitdicker commented 8 years ago

O, and I walked against this bug when rustbuild (is that the current name) failed after make clean because of this.

alexcrichton commented 8 years ago

@pitdicker thanks for taking this on! We may want to discuss this a bit before moving much more on a solution. One question is how principled should something like fs::remove_dir_all be. For example if fs::remove_dir_all automatically handled read-only files, should fs::remove_file also do the same? Currently we have this "duality" where some fs functions are straight counterparts to the underlying system operations, but some are somewhat fancier like create_dir_all and remove_dir_all. We'd just want to make sure to have an explicitly drawn line here. Also note I'm not really thinking of an RFC here, just some more discussion before jumping to a PR.

I personally feel that remove_dir_all is fine to be fancy and do lots of implicit operations. I would continue to want, however, that remove_file and remove_dir are as straightforward as possible (e.g. don't mirror what's happening here unless it's a similarly small set of operations as to what happens today). Your proposed strategy seems reasonable to me at first glance, and I wouldn't mind reviewing more over a PR.

Curious what others think, though? cc @rust-lang/libs

Diggsey commented 8 years ago

@alexcrichton Hmm, I would prefer remove_file and remove_dir to try to give the same guarantees they have on linux if possible. Specifically that if they succeed, the file or directory should be completely gone.

It's inevitable that people will build their own abstractions over the top of these primitive operations, and I'd very much like to avoid these kind of bugs where it's sufficiently rare that it's almost impossible to track down the true cause of the problem.

retep998 commented 8 years ago

Specifically that if they succeed, the file or directory should be completely gone.

This wouldn't really work because files don't necessarily get deleted right away. You can't just move the file somewhere else and then delete it because there isn't always a reasonable place on the same volume to move the file to. Sitting around and indefinitely waiting for the file to vanish isn't ideal either. I'd personally just use IFileOperation for remove_dir_all and call it a day.

Diggsey commented 8 years ago

What if the file is opened with FILE_FLAG_DELETE_ON_CLOSE, no share flags specified, and then is immediately closed? Successfully opening the file means that you have exclusive access, and so closing it should delete the file.

retep998 commented 8 years ago

No share flags means nobody else can open it to read/write/delete it. However another handle can still be opened without read/write/delete permissions. Also what happens if you fail to open the file with no share flags? Does it error or fallback to the normal deletion method that provides no guarantees?

pitdicker commented 8 years ago

Great to see so many comments. Just what I hoped for!

@alexcrichton I completely agree to keep remove_file and remove_dir simple, and only let remove_dir_all do magic. Actually removing read-only files is pretty difficult when you also don't want to break hard links. http://stackoverflow.com/questions/3055668/delete-link-to-file-without-clearing-readonly-bit is an example, but I am sure I found some better way somewhere.

@Diggsey I would also like cross-platform consistency, but sometimes Windows is just to different... I think the standard library should at least expose primitives that only need one system call. Otherwise we lose performance and maybe get more vulnerable to race conditions. Maybe we can add a remove_no_matter_what in the future :)

@retep998 I have not yet written the code, so I don't know if this will really solve the problem... But if this doesn't work out I am all for IFileOperation. Meanwhile it is a nice way for me and maybe the standard library to get better at avoiding filesystem races.

I would have to test what happens with FILE_FLAG_DELETE_ON_CLOSE, but in theory if the file opens successfully we only have to move it to avoid a race of a couple of milliseconds. Directly after opening the file no one else can open the file anymore (it is not a share flag).

What is difficult is where to move a file / dir to. I think the parent of the dir we are removing is good enough. It should be on the same volume, and I think we also have write permission to it because otherwise deleting a dir is not possible (wild speculation here, it is late :))

brson commented 8 years ago

I feel ok about going to extra effort to make remove_dir_all work as expected.

alexcrichton commented 8 years ago

@pitdicker perhaps you could draw up a list of the possible implementation strategies for these functions? That'd also help in terms of weighing the pros/cons of each strategy and we could get a good idea of which terminology is associated with what. So far it sounds like IFileOperation may be the best option here, but I'm personally at least not following what FILE_FLAG_DELETE_ON_CLOSE would mean, especially in the case that you can't open it as someone else does.

pitdicker commented 8 years ago

To be fair IFileOperation is a bit to diffcult for me to implement. It should "just work" when removing directories. But as far is I can see it is an entirally different kind of API. For example it does not work on handles but IShellItems, and is mostly made for gui applications.

The steps I proposed are more ugly. They use the same simpler but more low-level APIs as the rest of the standard library. But there is a little window where files may temporarily show up in the parent directory of the one we removed. But these files would otherwise be the ones that cause remove_dir_all to fail currently, which is not all that often.

On Windows deleting a file is not guaranteed to be complete before the function returns. Instead a file is scheduled for deletion, when no one else has it open anymore. When Unix on the other hand deletes the file, it looks really deleted. But deletion of the inode is scheduled until everyone is done with it.

FILE_FLAG_DELETE_ON_CLOSE is a flag we can set when opening a file, together with DELETE permission as access mode. It will schedule the file (or directory) for deletion. When the flag is set (i.e. as soon as open succeeds), it is not possible for others to open the file anymore. The only question is what happens if others already have the file open. If they did not have the file open with FILE_SHARE_DELETE we should fail to open it (as we do not have DELETE permission).

Using this flag should be identical to using DeleteFile. The only reason I want to use it, is that it keeps a hanle open. And with this handle the file can be moved to a temporary location to not block removing the parent dir.

But I am not sure of anything yet. Somehow I just can't get it to work the renaming part yet, so all this is theory, not tested :(.

Diggsey commented 8 years ago

Having looked at the IFileOperation, I agree with @pitdicker that it doesn't seem like the right kind of API for low-level filesystem operations: it's very much oriented towards interactive use. (The .net framework doesn't use it either, but then Directory.Delete suffers from the exact same problem!)

I'm starting to think that maybe just using the current implementation and retrying on failure is the best option. http://stackoverflow.com/a/1703799 (The second code example has the advantage that it only retries more than once if some progress is made, and retries more often, for more deeply nested directories)

pitdicker commented 8 years ago

YES! I've got the manual method working.

Now to clean up my messy code, and do a lot of testing...

pitdicker commented 8 years ago

@Diggsey Thanks for looking up how .NET handles this!

And I like the simplicity of just retrying on failure. But I don't like that it needs a little luck. So I prefer moving files the instant before they are deleted.

timvisee commented 7 years ago

Has a fix already been applied to std, or do we have a working fix available yet? I'm currently bumping into the same issue on Windows, and this issue seems to be open for quite a while. Sadly, the problem seems to be quite inconsistent on my machine.

Also, I don't think that retrying to delete a file a few times sound like a good solution to put into std.

orvly commented 7 years ago

I just ran into a related problem myself, where after calling remove_dir_all on a directory which had just a single empty sub-directory, it completed successfully, but then trying to re-create this directory hierarchy immediately failed with Access Denied.
(I imagine this should be a rather common scenario with testing code which relies on the filesystem, first cleaning up any artifacts from the previous run and then re-creating them).

The problem is basically not just with remove_dir_all, but with any attempt to delete a directory - it might not be executed immediately, so any action done right after it, which assumes it doesn't exist, might fail.

But after going to the issue referenced right above this comment (995 in rustup) I found out about this new "remove_dir_all" crate from @aaronepower (hosting just this single function), which uses a trick - it moves the directory first, then deletes the moved directory (as documented extensively there).
That solved my problem as well as the more general problem of deleting a directory on Windows. remove_dir_all crate

Could its implementation replace the one in std::fs?

steveklabnik commented 5 years ago

Triage: I'm not aware of any movement on this issue.

hntd187 commented 5 years ago

I can confirm this still happens @steveklabnik

Error: Os { code: 145, kind: Other, message: "The directory is not empty." }

Just a folder with some plain text json files in it. Also, not a large directory or have any nested structure.

XAMPPRocky commented 4 years ago

Hey, so there is a solution to this. Someone needs to merge the remove_dir_all code into the standard library. This is a crate that created as a quick workaround for crates from an implementation from #31944 that never got merged. It's been pretty stable over the past few years, and is used in cargo and rustup.

Since creating the crate, I no longer have a Windows machine. As a result I do not intend to maintain the crate. It would be great if someone could port the code into std so that everyone can have a reliable implementation on windows. :)

ehuss commented 4 years ago

I would not consider the current remove_dir_all crate as a complete solution to this. It is better, but it still frequently fails in some situations. See rust-lang/cargo#7042 as an example where clearing Cargo's test suite doesn't succeed. See that PR for a link to cygwin's implementation which always works for me, and some commentary from a former cygwin developer.

I'm not opposed to merging it, just saying it isn't a complete fix.

programmerjake commented 4 years ago

Some potentially useful reference code: Wine's implementation of SHFileOperation

ChrisDenton commented 3 years ago

I think there are (at least) three issues here:

  1. The main issue is that a file can be pending deletion for awhile, which prevents deleting its parent directory. This should already be fixed by recent versions of Windows but a solution that works for older versions would be welcome.
  2. If a process is still running then its .exe file cannot be deleted. I've seen this cause issues where a build system tries to cleanup exe files and fails. More generally, a file could be held open without the "FILE_SHARE_DELETE" share mode (which means there's a file lock preventing deletion attempts until after the file handle is closed).
  3. People thinking the DOS FILE_ATTRIBUTE_READONLY is the same as the POSIX read permission. This is especially an issue with applications ported from Linux. This ship has sailed at this point so maybe we should just force delete readonly files.

I think the first can be mitigated slightly by ignoring any ERROR_DIR_NOT_EMPTY messages at first, then recursively deleting any left over directories afterwards. Maybe in a retry loop that again skips ERROR_DIR_NOT_EMPTY until the final attempt.

I'm uncertain if the second can be sensibly solved by Rust's std. At least not with the current API. It's just a matter of waiting. For example, the process may need to finish running and then the .exe file handle has to be closed. This could take any amount of time. I guess we could retry a few times before giving up?

The third can be solved by either unsetting readonly attributes or using the FILE_DISPOSITION_IGNORE_READONLY_ATTRIBUTE flag (on OS versions that support it).

XAMPPRocky commented 3 years ago

I'm uncertain if the second can be sensibly solved by Rust's std. At least not with the current API. It's just a matter of waiting. For example, the process may need to finish running and then the .exe file handle has to be closed. This could take any amount of time. I guess we could retry a few times before giving up?

Just my opinion, but retrying seems fine to me, this is already what some APIs do, such as symlinking in std on Windows, if it receives a certain error from Windows it just tries again with some other flags. The same could work here. I think the main thing to be aware of with retrying though is for it to still feel responsive and design it so that someone can't cause a denial-of-service attack by getting this API to repeatedly try to delete a running exe (such as itself).

I think you'd want something like having just a single timeout for a fixed amount of time, only trigger timeout countdown if we've failed, and to return ErrorKind::TimedOut or similar if it failed.

You could maybe also probably be smarter about it and do something like collect all entries that fail, and try them a second time after you've deleted everything else, so that it's more likely that the OS has had enough time to finish up whatever it was doing.

pwinckles commented 2 years ago

It looks like one of the commits that addressed the remove_dir_all security vulnerability may have also resolved this issue, new retry logic starting on line 1010. I haven't had a chance to test it out on a Windows system yet.

rbtcollins commented 2 years ago

@ChrisDenton Looks like FILE_DISPOSITION_IGNORE_READONLY_ATTRIBUTE is only available for ZwSetInformationFile - Cygwin uses the underlying NT calls, but I had the impression rust is trying to avoid getting that low level.

@pwinckles The loop there may reduce the incidence, but as it is reentering into the entire thing its going to repeat IO for the depth of the point where the error occurred, though it will at least bail out fast. I wouldn't want to use that version on rustups deletion of docs trees, for instance; though I am still interested in getting a reliable version into the core.

ChrisDenton commented 2 years ago

@rbtcollins FILE_DISPOSITION_IGNORE_READONLY_ATTRIBUTE is in the standard headers (WinBase.h) and is also in the windows-rs crate.

Btw, the code currently in the standard library was written with some amount of haste so isn't necessarily the best example. I do plan to get back to it but lately I've been distracted with other issues.

rbtcollins commented 2 years ago

@ChrisDenton thats super interesting - and not visible on the MSDN doc pages :/

I meant no negativity about the current code - it is great the CVE was solved, and there time mattered a lot... and this bug is really (to my mind) about figuring out what is sensible for the stdlib - tradeoffs between complexity, generality, least surprise etc abound.

ChrisDenton commented 2 years ago

@ChrisDenton thats super interesting - and not visible on the MSDN doc pages :/

The newer information classes are listed (e.g. FileDispositionInfoEx and FileRenameInfoEx) but there are not yet platform version information or dedicated pages listing the constants as there are with the others. It probably just needs someone to put in the work of writing the docs (which tbh is rarely the most popular task). In any case being in public headers and used by MSVC STL is quite a vote of confidence that it isn't going away.

I meant no negativity about the current code - it is great the CVE was solved, and there time mattered a lot... and this bug is really (to my mind) about figuring out what is sensible for the stdlib - tradeoffs between complexity, generality, least surprise etc abound.

Sorry! I didn't take what you said as negativity. I was meaning that more as a warning that, while I'm happy it fixes the CVE, I would still like to see the code cleaned up and improved.

rbtcollins commented 1 year ago

@ChrisDenton so FileDispositionInfoEx notes that deleting a file with posix semantics - " When FILE_DISPOSITION_POSIX_SEMANTICS is set, the link is removed from the visible namespace as soon as the POSIX delete handle has been closed, but the file's data streams remain accessible by" - does this mean that the directory delete will recieve ERROR_DIR_NOT_EMPTY while those processes with handles open continue to consume the file? Or is it genuinely unlinked from the directory and thus the directory is deletable?

ChrisDenton commented 1 year ago

@rbtcollins The HANDLE that sets the disposition must be closed for it to take full effect. On setting FILE_DISPOSITION_POSIX_SEMANTICS, but before closing, the file is locked such that the file can no longer be opened. However, it still remains in the directory and so attempting to delete the directory will produce ERROR_DIR_NOT_EMPTY. So essentially setting the disposition should be immediately followed by a close.

Other file handles to the same file can remain open. They have no effect at all on the POSIX delete (unless of course they lock the file).

rbtcollins commented 1 year ago

Thank you @ChrisDenton .

I'm having a little trouble figuring out the version support for this comment of yours: FILE_DISPOSITION_IGNORE_READONLY_ATTRIBUTE flag (on OS versions that support it). It looks like its definitely present in Windows 10.1809. Is there some easy lookup matrix for these things? In particular to figure out whether to use it conditionally, or just document that successful deletion of readonly trees would require that version ...

ChrisDenton commented 1 year ago

The documentation is pretty limited atm so the best bet is the C/C++ headers which indeed indicate that it's available on 1809 and later (aka _WIN32_WINNT_WIN10_RS5).

I'd suggest trying the FILE_DISPOSITION_IGNORE_READONLY_ATTRIBUTE flag first, checking the error returned, and then progressively falling back if necessary. The std currently checks for any of ERROR_NOT_SUPPORTED, ERROR_INVALID_FUNCTION or ERROR_INVALID_PARAMETER and then falls back to non-posix delete. A more involved solution could fallback to dropping FILE_DISPOSITION_IGNORE_READONLY_ATTRIBUTE first and if that works then remembering (in an atomic) that the ignore readonly attribute isn't supported on this OS.