golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
122.96k stars 17.53k forks source link

x/tools/gopls/internal/lsp/filecache: TestConcurrency failure on windows #57747

Closed findleyr closed 1 year ago

findleyr commented 1 year ago
#!watchflakes
post <- goos == "windows" && pkg == "golang.org/x/tools/gopls/internal/lsp/filecache"

Just noticed this failure on windows: https://build.golang.org/log/bff6275bfcc6f47e62fef12d9b1607ddfe68a0ca

--- FAIL: TestConcurrency (10.81s)
    filecache_test.go:99: rename C:\Users\gopher\AppData\Local\gopls\606f95f2\TestConcurrency\ae\aef468845cf8c257a7f1cf8fddd43ab5ebdbd0eb629eff15d05306cdcb975ccc.tmp.557174086837030208 C:\Users\gopher\AppData\Local\gopls\606f95f2\TestConcurrency\ae\aef468845cf8c257a7f1cf8fddd43ab5ebdbd0eb629eff15d05306cdcb975ccc: Access is denied.
FAIL
FAIL    golang.org/x/tools/gopls/internal/lsp/filecache 12.369s

Could this be an artifact of the builder, or a problem with the filecache design on windows? We need to understand this, as the filecache is a critical new component of gopls.

CC @adonovan

adonovan commented 1 year ago

From the log and source, it's obviously a failure of MoveFileEx on Microsoft Windows. Seems like this is a fraught operation. Some digging led me to:

We might want to investigate ReplaceFile:

bcmills commented 1 year ago

(CC @golang/windows)

From what I understand, guaranteed-atomic renames are not really a thing in the Win32 API.

The caches in cmd/go instead use a combination of idempotent writes (in the build cache) and file-locking (in the module cache). Both of those do appear to work reliably. Note that for idempotent writes you must avoid os.WriteFile, because that begins by truncating away the existing contents (and truncating away previously-written bytes is not idempotent!).

We also have robustio.Rename, which will generally retry past a rename failure on Windows but seems like kind of a code smell. (It papers over the failure with a retry rather than avoiding it in the first place, and as a result it can add a lot of latency as compared with a more principled approach.)

adonovan commented 1 year ago

We also have robustio.Rename, which will generally retry past a rename failure on Windows but seems like kind of a code smell. (It papers over the failure with a retry rather than avoiding it in the first place, and as a result it can add a lot of latency as compared with a more principled approach.)

filecache is already using robustio.Rename, so this issue is really just an indication that its probabilistic workaround isn't as reliable as we would like.

The caches in cmd/go instead use a combination of idempotent writes (in the build cache)

By idempotent write, I assume you mean that if you write a file whose name is sha256(s) then its contents will be exactly s. We don't currently have that property in filecache, since it would require that you know the hash of the value before you access the cache, whereas in our case we only know the hash of the recipe, not of the cake that would be produced by executing the recipe, and we don't guarantee that recipes are fully deterministic: so long as the cake has the right density, we don't care where the CO2 bubbles are.

and file-locking (in the module cache).

Thanks, I'll look into locking.

From what I understand, guaranteed-atomic renames are not really a thing in the Win32 API.

Yeah. It's sad that this still hasn't been addressed in the quarter century since I first hit this problem. It's not as if Windows exploits the lack of transactional renames to make its directory operations faster than POSIX---quite the opposite in fact, it is notoriously slow. (Some of that is no doubt due to hooks for Antivirus, backup, GUIs, search index updates, and other things that Linux doesn't support, of course.)

Apparently, support for file system transactions (more general than just rename within a directory) was added to MS Windows Vista (2007) but has since been deprecated.

adonovan commented 1 year ago

No less an authority than the late Jim Gray of MSR says in To BLOB or Not To BLOB: Large Object Storage in a Database or a Filesystem? that

To perform a safe write, a new version of the file is created with a temporary name. Next, the new data is written to the temporary file and those writes are flushed to disk. Finally, the new file is renamed to the permanent file name, thereby deleting the file with the older data. Under UNIX, rename() is guaranteed to atomically overwrite the old version of the file. Under Windows, the ReplaceFile() call is used to atomically replace one file with another.

But I'm unable to reconcile that with these terrifying error codes documented for ReplaceFile:

ERROR_UNABLE_TO_MOVE_REPLACEMENT1176 (0x498) | The replacement file could not be renamed. If lpBackupFileName was specified, the replaced and replacement files retain their original file names. Otherwise, the replaced file no longer exists and the replacement file exists under its original name. ERROR_UNABLE_TO_MOVE_REPLACEMENT_21177 (0x499) | The replacement file could not be moved. The replacement file still exists under its original name; however, it has inherited the file streams and attributes from the file it is replacing. The file to be replaced still exists with a different name. If lpBackupFileName is specified, it will be the name of the replaced file.

Perhaps the operation is atomic within a directory, ACLs permitting, and not otherwise?

bcmills commented 1 year ago

It's complicated. 😅 (See previously #8914.)

Per https://learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-replacefilew#remarks,

The ReplaceFile function combines several steps within a single function. An application can call ReplaceFile instead of calling separate functions to save the data to a new file, rename the original file using a temporary name, rename the new file to have the same name as the original file, and delete the original file.

From that description, I would suppose that if the file exists, it does have either the old contents or the new contents. However, if the two renames are not a single transaction then there may be an interval after the “rename the original file” step but before the “rename the new file” step in which the file ceases to exist entirely. 🤔

If that gap is possible, then it seems that simultaneous ReplaceFile calls could cause at least one of the calls to fail. Consider the interleaving:

So I would expect that ReplaceFile would still at least be prone to spurious failures, which would still have the latency problems associated with retries.

gopherbot commented 1 year ago

Change https://go.dev/cl/461855 mentions this issue: internal/robustio: use SetFileInformationByHandle(FileRenameInfo)

adonovan commented 1 year ago

Inspired by https://www.youtube.com/watch?v=uhRWMGBjlO8&t=2161s I tried an approach based on SetFileInformationByHandle(FileRenameInfo), but that doesn't seem to be any better. See https://go.dev/cl/461855.

I notice that LLVM's solution to the same problem involved trying up to 200 retries (!!):

   // We normally expect this loop to succeed after a few iterations. If it
   // requires more than 200 tries, it's more likely that the failures are due to
   // a true error, so stop trying.
   for (unsigned Retry = 0; Retry != 200; ++Retry) {

Sigh.

gopherbot commented 1 year ago

Change https://go.dev/cl/461800 mentions this issue: gopls/internal/lsp/filecache: use file locking on Windows

gopherbot commented 1 year ago

Change https://go.dev/cl/463776 mentions this issue: internal/lockedfile: fix build constraints on solaris