Closed z2oh closed 5 months ago
CC: @rnk @aganea
I do find this to be a nicer solution as we are more precisely able to identify the scenario rather than guessing. I expect that ntdll.dll to be mapped in, so the additional linkage shouldn't be too expensive, but I know that @aganea did prefer to minimise the dependencies.
I do find this to be a nicer solution as we are more precisely able to identify the scenario rather than guessing. I expect that ntdll.dll to be mapped in, so the additional linkage shouldn't be too expensive, but I know that @aganea did prefer to minimise the dependencies.
Yes it is already mapped at runtime through advapi32
. I think the cost for adding ntdll
would be negligible.
My understanding is that RtlGetLastNtStatus()
isn't exposed because it's internal detail that wraps NtCurrentTeb()->LastStatusValue
. I think in most cases NtCurrentTeb()->LastStatusValue
is in sync with LastErrorValue
(which GetLastError()
returns) but there's no guarantees.
One thing I dislike maybe with the above diff is that if status()
returns ERROR_FILE_NOT_FOUND
and in some other usage we create a file (with the same name) immediately afterwards, that creation might fail if the file wasn't deleted yet by the OS. The usual (silly) way of avoiding that is to loop until the GetFileAttributesW
really tells you that the file is deleted. But that might take up to several seconds, it is all dependent on internal kernel threads/queues.
I suppose we could either return a new type file_type::being_deleted
and handle the loop externally; or loop in status()
if we detect STATUS_DELETE_PENDING
and until GetLastError()
returns ERROR_FILE_NOT_FOUND
or something along those lines.
Also, the call to CreateFileW
in status()
could exhibit the same issue?
+@mstorsjo
if status() returns ERROR_FILE_NOT_FOUND and in some other usage we create a file (with the same name) immediately afterwards, that creation might fail if the file wasn't deleted yet by the OS.
Isn't this true generally? In that there will always be a window between the return of status and the call to create file in which another process could have done something that would cause create file to fail? Though, I appreciate this is probably uncommon in practice and this patch would increase the likelihood of this case.
or loop in status() if we detect STATUS_DELETE_PENDING and until GetLastError() returns ERROR_FILE_NOT_FOUND or something along those lines.
This seems reasonable. I'd expect that window to be small, so the call is likely to succeed in a few iterations (or there's an actual error). I'll give this a try and make sure that assumption is true.
Also, the call to CreateFileW in status() could exhibit the same issue?
Hmm good point. I never encountered this in my testing.
I like the idea of using NtCurrentTeb()->LastStatusValue
instead to get at the lower level NT error code. I think treating files pending deletion as "not found" is reasonable.
The only way I can see this going wrong is if a program checks for existence, gets file-not-found, and then proceeds to attempt to create a file at this location rather than attempting a different file name. However, such a program should recover from failing to create a unique file anyway after the file creation attempt, so that's not a big concern.
I think I also would favour NtCurrentTeb()->LastStatusValue
rather than explicitly linking against an undeclared function from ntdll.
I wonder if @cjacek have any wisdom to share here, about what's portable in practice and what's not, when it comes to constructs like these.
Generally yes, I think that NtCurrentTeb()->LastStatusValue
should be safe. I used Wine testing infrastructure to confirm that, see: https://testbot.winehq.org/JobDetails.pl?Key=145102. I modified existing Wine tests to check LastStatusValue
(see patch.diff
on that page) and it behaves as expected on all Windows versions from 7 to 11.
ntdll.dll
is loaded to all Windows processes, that's where the startup code lives and that's what makes syscalls possible on Windows. Many of its functions are undocumented and claimed as "internal", but in practice so many applications depend on them one way or another that it's unrealistic for future Windows to change that fundamentally. The same applies to TEB layout.
RtlGetLastNtStatus
is just a tiny helper that returns LastStatusValue
from TEB. LastStatusValue
is set whenever NT-style error status is converted to win32-style error using RtlNtStatusToDosError. Most of kernel32 functions are in fact relatively thin wrappers around ntdll functions and need to perform error code conversion on error. GetFileAttributesW does exactly that.
(Wine is based on black box testing, so the actual Windows code almost surely looks different, but externally observed behavior should match; compatibility varies between functions, but this one is extremely popular, so it's pretty well tested and it's relatively safe to assume that Windows is similar enough).
BTW, the delay in deletion comes from Windows semantic that the file is not removed as long as any process has any handle to the file still open, so it may be postponed indefinitely. (And when the last handle is closed, I'd expect deletion to be reflected in VFS immediately, without an additional time window when no handles are opened but file still being considered as in a process of removing, but I didn't test that to confirm).
Thank you so much for the insight! I'll switch to using NtCurrentTeb()->LastStatusValue
and file a PR.
I hit a snag here: the TEB
structure in the Windows SDK doesn't expose its members, unlike the one defined in Wine. Given this, perhaps its better to call RtlGetLastNtStatus
? Although I appreciate the downsides here. I am surprised at how tricky it is to get at the underlying NTSTATUS
.
I would assume the error occurs on this line? https://github.com/apple/llvm-project/blob/next/clang/lib/Index/IndexUnitWriter.cpp#L247
With the domain knowledge, the llvm::errc::permission_denied
seems to make sense in this case, it simply means "someone else (another process) is taking care of that unit file, I (this process) don't have to worry about it". In that case it'd best maybe to return true?
if (std::error_code EC = llvm::sys::fs::status(UnitPath.c_str(), UnitStat)) {
if (EC == llvm::errc::permission_denied)
return true;
if (EC != llvm::errc::no_such_file_or_directory) {
llvm::raw_string_ostream Err(Error);
Err << "could not access path '" << UnitPath
<< "': " << EC.message();
return std::nullopt;
}
return false;
}
That error could mean other things, like "I don't have ACL permissions to access that file" but regardless there's nothing more that this process can do in that case.
Now like you suggest, we could check RtlGetLastNtStatus()
and act on STATUS_DELETE_PENDING
. But if we return llvm::errc::no_such_file_or_directory
from status()
in that case, the current process will do more work uselessly, since the other process is probably in currently moving out of the way the old unit file, to then move the temporary file into place. That is, https://github.com/llvm/llvm-project/blob/main/llvm/lib/Support/Windows/Path.inc#L519-L523. But maybe it's ok to occasionally do double the work, in two different processes.
It feels overall that there's a higher level synchronization mechanism missing in the index generation. The process that wants to write the unit should acquire a global mutex/named semaphore, to advertise other processes that it is writing that unit, and others should not attempt to write it.
Getting back to status()
, from an API level, if we returned llvm::errc::no_such_file_or_directory
could be a bit misleading. Like @cjacek pointed out above, it could take a while until the file is deleted. If we advertise "the file is not there", other API consumers might want to create that file, and that creation might fail if the file is still pending to delete. I find for those cases llvm::errc::permission_denied
to better reflect the reality. One thing we could do however since we know about STATUS_DELETE_PENDING
, is looping/blocking in status()
for a bit, like rename_handle()
does, in a hope that the status will change to something else. Which it probably will in 99% of the cases. But the caller will have to deal with llvm::errc::permission_denied
anyway if we want 100% certainty. Or we could just return a new error code ("delete_pending") that better reflects the reality, and lets API users deal with this uncertainty.
Yes, that is the line in question and your patch would work. This is similar to https://github.com/apple/llvm-project/pull/8577, although returning true
is better than my approach to treat the errors the same way.
I'm hesitant about the second interpretation of the error:
That error could mean other things, like "I don't have ACL permissions to access that file" but regardless there's nothing more that this process can do in that case.
It's true that there's nothing else the process could do here, but this error condition should probably be fatal, unlike the ambiguous delete_pending
case. In this particular care, there will be a failure elsewhere when trying to actually write out the index file if the process really doesn't have permissions, but it seems wrong to me to generally to conflate these two conditions (as the Windows API does).
I agree it's not great to mislead consumers by reporting that the file is deleted when it is not, especially when it may never be deleted due to a stuck process or similar. Looping in status
to cover 99% of the cases seems reasonable from an API perspective. If the loop completes, I think it's fair for the caller to fail if the error is actually unexpected (like in this case). Returning a new error code (delete_pending
) here is even better. As an implementation detail, looping like this seems undesirable, but perhaps it's the best way. As you pointed out, this strategy is already used elsewhere with success.
I will prepare a patch with the looping solution returning a new delete_pending
code if the loop completes unsuccessfully. Still open to any and all feedback/discussion.
To summarize the potential solutions:
Solution | Notes |
---|---|
Do nothing; callers must handle permission_denied |
This conflates the pending_delete case with bona fide permission issues. This behavior is specific to the Windows implementation of the status function. |
If the last ntstatus is STATUS_DELETE_PENDING , return no_such_file_or_directory |
This can mislead API consumers because the file is not actually deleted, and a subsequent write may fail. Write failures should be handled anyway, as even if the file is truly deleted, another process could still interfere in the window before writing. That said, a stuck process could be holding on to the pending-deletion file handle, which would mean the return of status does not at all accurately reflect the state of the file. |
If the last ntstatus is STATUS_DELETE_PENDING , keep trying for a while to see if the file is successfully deleted. If not, return a new error code delete_pending . |
This will catch most cases where there is a race to rename some destination file. If a process is stuck, the delete_pending case is returned and accurately reflects the state of the file. Checking this in a loop is a bit undesirable, but has been used with success elsewhere in the filesystem API. |
If the last ntstatus is STATUS_DELETE_PENDING , keep trying for a while to see if the file is successfully deleted. If not, continue to return permission_denied . |
Same as above, but doesn't require a new error code to be handled. |
If the last ntstatus is STATUS_DELETE_PENDING , immediately return a new error code delete_pending . |
This avoids the looping of the above two solutions without loss-of-information in the return. The caller can loop if they want, but I suspect in reality this error code would often not be handled and races would continue to be fatal. |
As a note on TEB
and RtlGetLastNtStatus()
, the offset of LastStatusValue
in the TEB
structure has been stable since at least Windows NT 3.1 (https://www.geoffchappell.com/studies/windows/km/ntoskrnl/inc/api/pebteb/teb/index.htm) so that's unlikely to change. That said, Chromium and Gecko both use RtlGetLastNtStatus()
to avoid having to rely on this offset, so I think that is a better approach: https://github.com/mozilla/gecko-dev/blob/f6e3b81aac49e602f06c204f9278da30993cdc8a/ipc/chromium/src/chrome/common/ipc_channel_win.cc#L522C1-L541C2
This is another instance of the same root cause of https://github.com/llvm/llvm-project/issues/83046
On Windows, if a file is queried with
GetFileAttributesW
after it has been marked for deletion (but not yet deleted), the query will fail with the error code being set toERROR_ACCESS_DENIED
.Apple's symbol index generation code writes index unit files out with temporary names, and then calls
rename
to place the file in the final location. Multiple clang processes may produce the same index unit file, and they race to write the file out to the final destination (overwriting is okay, because the unit file is the same). But before generating index information, the final destination file is first checked to see if it's up-to-date which would avoid doing work to generate duplicate information. This check is done by callingfs::status
on the final destination file, and if this occurs while another clang process has just called to rename its temporary index file to the final destination file, there is a small window where the destination file is marked for deletion andfs::status
returns an unexpectedpermission_denied
error.Ideally, I think the
status
function should detect this case on Windows and returnfile_not_found
instead ofpermission_denied
, but I initially had trouble finding a way to do this. I filed https://github.com/apple/llvm-project/pull/8577 to treatpermission_denied
asfile_not_found
in this specific case (which fixes the build failures), but this solution is not very satisfying in that it leaks Windows-specific idiosyncrasies.However, a colleague pointed out that the underlying
NTSTATUS
code actually does disambiguate this case with0xC0000056 STATUS_DELETE_PENDING
(which is mapped to Win32ERROR_ACCESS_DENIED
). Querying this error code is not super straightforward (at least I couldn't find a straightforward way), but I was able to identify and predicate on this case in Windows-specific code with the following patch:I prefer this solution to the alternative because the additional complexity is nicely encapsulated, but it is not without caveats, namely having to link
ntdll
and a lack of confidence in compatibility w.r.tRtlGetLastNtStatus
(I couldn't find any Microsoft documentation on this function).Maybe I'm missing something and there's an easy way to inspect this error code? Is the cost of this solution as it stands too high?
cc @compnerd