ocaml / opam

opam is a source-based package manager. It supports multiple simultaneous compiler installations, flexible package constraints, and a Git-friendly development workflow.
https://opam.ocaml.org

Windows: removing of files / folder can fail randomly on Windows #4487

Open MSoegtropIMC opened 3 years ago

MSoegtropIMC commented 3 years ago

There is one stability issue with using opam on Windows: removing files or folders is never guaranteed to work, because some other process (Explorer, the Windows indexer, a virus scanner, you name it) might hold a handle on the file, and then the removal fails. On Unix, open files are referenced via inode, so deleting a file doesn't prevent another process from keeping it open. On Windows, open files are referenced by name, and any handle to an open file makes it impossible to delete it. I know that this is a design bug of Windows, but I still see frequent opam failures because removing some temporary file failed, even though I have a decent virus scanner and have configured it to leave the relevant folders alone. It is hard to track what happens: if I run Procmon in parallel, the issue is much harder to reproduce (I have opam-based scripts which fail about 50% of the time otherwise).
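A small way to observe the difference described above, as a sketch rather than anything taken from opam: the hypothetical helper `can_delete_while_open` opens a temporary file with Python's default `open` and then tries to delete it. On Unix the delete is expected to succeed; on Windows it is expected to fail with a `PermissionError` for as long as the handle is held.

```python
import os
import tempfile


def can_delete_while_open() -> bool:
    """Try to delete a file while this process still holds it open."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    # Keep a handle open, much like a virus scanner or indexer would.
    handle = open(path, "rb")
    try:
        os.remove(path)
        deleted = True
    except PermissionError:
        deleted = False
    finally:
        handle.close()
    if not deleted:
        # The handle is closed now, so the cleanup can go through.
        os.remove(path)
    return deleted


if __name__ == "__main__":
    print("deleted while open:", can_delete_while_open())
```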

I know it is a grotesque hack, but would it be possible to just retry if removing a file or folder fails? This is really the only stability issue I have on Windows. I waste a lot of CPU and personal time on it, and it is also a major issue for CI.

dra27 commented 3 years ago

Sorry for the slow reply. This "feature" of Windows regularly gets in the way, yes! Retrying is definitely a worthwhile option, but it's also worth trying to get to a place where this doesn't matter as much. One way, for example, is by not depending on precise temporary file names. I remember having to do that in the OCaml testsuite: all the tests created a program called "program", but you got much more reliable performance on Windows by using a different name for each and trusting that at some point Windows would finally erase program1.exe, program2.exe once the virus scanner and whatever else had had enough! Often these "zombie" files can be renamed and even moved, but not deleted, so another option in this case would be to move them to a "trash" folder and retry another time.
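The "different name for each" idea can be sketched as follows. This is an illustration only, not how the OCaml testsuite does it: `fresh_program_path` is a hypothetical helper, and it relies on `tempfile.mkstemp` to pick a name that does not collide with a lingering file from an earlier run.

```python
import os
import tempfile


def fresh_program_path(directory, stem="program", suffix=".exe"):
    """Reserve a uniquely named file such as program<random>.exe in `directory`.

    Because the name is new on every call, a leftover file from an earlier
    run that some other process still holds open cannot get in the way.
    """
    fd, path = tempfile.mkstemp(prefix=stem, suffix=suffix, dir=directory)
    os.close(fd)  # the caller overwrites this placeholder with the real output
    return path


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    print(fresh_program_path(workdir))
    print(fresh_program_path(workdir))
```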

MSoegtropIMC commented 3 years ago

The removals which get in the way are typically of build folders after a build has finished successfully. I guess the main reason to remove these in opam is to save disk space, so renaming wouldn't help. In my experience, the removal typically works 1s later.

MSoegtropIMC commented 3 years ago

P.S.: this experience comes from a set of meta-build shell scripts I maintained before switching to opam. There I had a removal retry with a 1s sleep, and the 2nd try usually worked. If I remember right, I had a timeout of 5 minutes, which was never exceeded unless I had e.g. a file in the build folder open in an editor. In that case it would make sense to show a message "Cannot delete XYZ - do you have it open in an editor?". Not nice, but better than failing for reasons users don't understand.
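The retry scheme described here (1s sleep between attempts, 5-minute timeout, then a readable message) could look roughly like the sketch below. It is not the original shell script and not opam code; `remove_with_retry` and its `remove` and `sleep` parameters are hypothetical names, the latter two added only so the loop can be exercised without a real locked file.

```python
import shutil
import time
from pathlib import Path


def _remove_once(path):
    """Delete a single file or a whole directory tree."""
    if path.is_dir() and not path.is_symlink():
        shutil.rmtree(path)
    else:
        path.unlink()


def remove_with_retry(path, timeout=300.0, delay=1.0,
                      remove=_remove_once, sleep=time.sleep):
    """Remove `path`, retrying every `delay` seconds for up to `timeout` seconds.

    Returns the number of attempts made. After the timeout, raises OSError
    with a message the user can act on.
    """
    path = Path(path)
    deadline = time.monotonic() + timeout
    attempts = 0
    while True:
        attempts += 1
        try:
            remove(path)
            return attempts
        except FileNotFoundError:
            # Already gone (possibly removed by an earlier partial attempt).
            return attempts
        except OSError as exc:
            if time.monotonic() >= deadline:
                raise OSError(
                    f"Cannot delete {path} - do you have it open in an editor?"
                ) from exc
            sleep(delay)
```

With the defaults this mirrors the numbers in the comment: a 1s wait between attempts and a 300s overall limit.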

dra27 commented 3 years ago

Indeed - the renaming part of it is because I'd prefer (Windows) opam to waste space rather than waste time - in other words, if the delete fails, I'd just like something which "marks" the file/directory for future garbage collection without blocking other opam operations from proceeding.
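One way to picture the "mark for future garbage collection" idea is the sketch below. It is an assumption-laden illustration, not opam's behaviour: `remove_or_trash`, `empty_trash` and the trash directory are hypothetical, and the trash directory is assumed to live on the same volume as the path being removed so that the move is a plain rename.

```python
import os
import shutil
import uuid
from pathlib import Path


def _delete(path):
    """Delete a single file or a whole directory tree."""
    if path.is_dir() and not path.is_symlink():
        shutil.rmtree(path)
    else:
        path.unlink()


def remove_or_trash(path, trash_dir, delete=_delete):
    """Delete `path`; if that fails, move it into `trash_dir` and carry on.

    Returns "deleted" or "trashed". Nothing blocks waiting for the other
    process to let go of the file.
    """
    path = Path(path)
    try:
        delete(path)
        return "deleted"
    except FileNotFoundError:
        return "deleted"
    except OSError:
        trash_dir = Path(trash_dir)
        trash_dir.mkdir(parents=True, exist_ok=True)
        # A random prefix keeps repeated failures from colliding in the trash.
        os.replace(path, trash_dir / f"{uuid.uuid4().hex}-{path.name}")
        return "trashed"


def empty_trash(trash_dir, delete=_delete):
    """Best-effort cleanup; returns how many entries still could not be deleted."""
    trash_dir = Path(trash_dir)
    if not trash_dir.is_dir():
        return 0
    remaining = 0
    for entry in list(trash_dir.iterdir()):
        try:
            delete(entry)
        except OSError:
            remaining += 1
    return remaining
```

A later invocation could call `empty_trash` at startup or at exit, so the disk space is eventually reclaimed without holding up the operation that hit the failure.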

MSoegtropIMC commented 3 years ago

Yes, although I believe that dependability is more important than speed. A failed removal is quite rare on a per-package basis, but it happens frequently when I build, say, 100 opam packages in a batch. Such a build takes 2 hours anyway, and if it takes a few seconds longer, I don't really care. If you implement a retry with a 1s wait, I would guess that on average the delay per package is around 10ms.
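The estimate can be checked with a little arithmetic. The sketch below assumes that a single 1s retry always succeeds, which is a simplification; the function names are hypothetical. Under that assumption, an average delay of 10ms per package corresponds to roughly 1 removal in 100 failing once, and a 100-package batch would gain about one second on a 2-hour build.

```python
def average_delay_s(failure_rate, retry_wait_s=1.0):
    """Expected extra wait per package if one retry always succeeds."""
    return failure_rate * retry_wait_s


def implied_failure_rate(avg_delay_s, retry_wait_s=1.0):
    """Failure rate that would produce the given average delay."""
    return avg_delay_s / retry_wait_s


if __name__ == "__main__":
    rate = implied_failure_rate(0.010)       # 10 ms average, 1 s wait
    total = 100 * average_delay_s(rate)      # batch of 100 packages
    share = total / (2 * 3600)               # against a 2-hour build
    print(f"implied failure rate: {rate:.1%}")
    print(f"extra time for 100 packages: {total:.1f} s")
    print(f"share of a 2-hour build: {share:.3%}")
```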