microsoft / ProjFS-Managed-API

A managed-code API for the Windows Projected File System
Other
145 stars 34 forks source link

Odd file behavior in virtual branches, perhaps after installing Windows Updates #70

Open FeralSquid opened 3 years ago

FeralSquid commented 3 years ago

Hello,

My organization has been using a ProjFS-based file syncing setup since late last year. Since around March of this year, the number of machines using this system has grown from ~400 to ~700 now. This is the primary way people get and interact with the files they work on constantly every day.

I give this background because, as of yesterday, I've now seen what appears to be a new and very odd issue for a 2nd time (on a different machine) within the last ~week.

Nothing about the code managing ProjFS or the backing store information has changed in any substantial way for a few months now (some minor inspection improvements is about it, nothing that impacts interactions with ProjFS).

The only potentially interesting data point I have is that a few days before the first case our IT group rolled out a series of Windows Updates. That said, in both cases I had confirmed that the users did NOT have any pending windows updates at the time (they had installed and rebooted since), as that can cause all kinds of odd behavior.

The updates installed were KB5004245, KB5003539, KB5004748, KB5004772. The last batch of windows updates were installed a month prior (before this issue was first reported).

In both cases, the odd behavior mysteriously went away after a system restart.

Because that appears to fix it, and because our users sometimes like to attempt to fix things themselves and not let us know, it is possible that there have been other cases of this that I'm just not aware of. That said, the timing of 2 cases within a week, and just after the updates, combined with the reboot-as-a-fix, seems a little suspicious.

Anyway, on to the odd behavior...

The user notices the issue when a ProjFS file fails to be read, and so causes some user application to fail to start (if it is an exe/dll/etc) or fail to load some content file.

This behavior, though, only impacts a subset of the ProjFS files under a given virtual root, while other virtual files work perfectly fine (I am able to trigger placeholder creation, hydration, PrjDeleteFile back to virtual and repeat with other files). In the first case, all ProjFS files in a given directory were impacted (there were some full files mixed in the directory that worked normally). I'm not sure if it was all files in the same directory for the 2nd case as the user rebooted before I could test that case thoroughly.

In the first case, though, I did have time to try a handful of things on these badly behaving files, with odd results:

...then, in C#, I tried: var path = // bad path var fi = new FileInfo(path); // fi.Exists => false // fi.Attributes => 0xFFFFFFFF

var di = new DirectoryInfo(Path.GetDirectoryName(path)); var file = di.EnumerateFiles().First(f => string.Compare(f.FullName, path) == 0); // file is NOT NULL ... it was enumerated // file.Exists => TRUE !! // file.Attributes => 0x00440001 (RECALL_ON_DATA_ACCESS | RECALL_ON_OPEN | READONLY) -- looks like a placeholder, but no REPARSE_POINT attribute??

using (var fs = file.OpenRead()) { fs.ReadByte(); } // throws FileNotFoundException

FindFirstFileEx w/ FIND_FIRST_EX_ON_DISK_ENTRIES_ONLY lists no items in the directory !? (full files had been deleted at this point, leaving only the badly behaving files in the directory) FindFirstFileEx without that flag lists the bad file with the above attributes

Calling CreateFile() directly to try and get any reparse point data also failed to invalid path.

I also tried:

... and re-running all of the above tests, all with the same result.

Throughout all of this, my ProjFS application logged no errors, and I suspect got not callbacks (though I can't fully prove that).

The file in question had been newly created on this machine the day before, then submitted to version control and the user had later sync'd. So, the file should have transitioned from Full-Writable -> Full-ReadOnly -> PrjDeleteFile to make it virtual -> Placeholder (app-triggered) and broken. All of this occurred a day or two after taking the windows updates.

I didn't notice any obvious critical or error system events in that time period.

And then, as a bit of a last ditch, I asked the user to reboot ... and after they did, the badly behaving placeholder files suddenly worked perfectly!

Rebooting also "magically" fixed the 2nd case.

Any ideas? Any other things you'd want me to try if/when I see the next case?

Thanks!

cgallred commented 3 years ago

I don't entirely know how to explain that behavior, especially that a reboot fixed things.

Because of the way the .NET System.IO file classes behave, that C# program unfortunately isn't telling you what you may think it is. System.IO.FileInfo doesn't open a file handle to determine whether a file "exists". Instead, it uses FindFirstFile to look for it in the parent directory. If the file appears in that enumeration, then it "exists". It does not open a handle to the file to confirm its presence on disk. The fact that your program shows you the RECALL_ON_OPEN bit in the file's attributes is a clue to this behavior. RECALL_ON_OPEN is a synthetic attribute bit that is set only for virtual files (the ProjFS file system filter inserts the bit for files that the provider returns in its enumeration). So the file "exists" in the sense that a directory enumeration shows it, but it is not a placeholder. BTW, even if it were a ProjFS placeholder you wouldn't see the REPARSE_POINT attribute bit in an enumeration; we hide the fact that there's a reparse point for app compat reasons. There are a lot of apps out there that make incorrect assumptions about what it means for a file to have a reparse point on it.

Your provider must have gotten directory enumeration callbacks, otherwise the C# app wouldn't have seen anything in the enumeration (since what it saw was a virtual file).

This also explains why FindFirstFileEx(...FIND_FIRST_EX_ON_DISK_ENTRIES_ONLY...) sees nothing, but without the ON_DISK flag it sees what the C# program did. With the flag no enumeration callbacks are sent to your provider.

So did your C# app use a bogus path for the file? That's what I understood from the comment "bad path" on the first line. If that's the case, then it seems like there could be a bug either in how we're sending directory enumeration callbacks, or there's a bug in your provider in how it services the callbacks. Somewhere along the line, a bad file path turned into something that your provider was able to enumerate (hence the virtual files returned for enumeration), but that it could not produce placeholder information for (hence the failure to actually open the file when you called fs.OpenRead()).

FeralSquid commented 3 years ago

sorry, "bad path" was meant to mean the full path to the misbehaving file. The path itself is a correct path to a virtual file.

I have noticed that .net File attributes returned are not always accurate/complete with virtual or placeholder files, which is why I was surprised that I got the same attributes returned when I enumerated with a FindFirstFileEx (without the ON_DISK flag). FindFirstFileEx that does list REPARSE_POINT as an attribute in normal situations when enumerating with the ON_DISK flag, at least, though I can't say I've tested it without the ON_DISK flag as I generally am doing it to avoid invoking callbacks.

If/when I see this again I will verify if I'm getting callbacks or not related to these misbehaving files.

FeralSquid commented 3 years ago

ok, we just had another case and I was able to remote to the machine and debug my host app while a directory was in a bad state.

In this case there was a single directory that was behaving oddly, and was slightly different from above, so may not be exactly the same issue ... but has a lot of similarities.

Here is what I observed in this case:

I did dump the reparse data again to compare now that it was working and the thing that changed were the 16 bytes before the ProviderId, but that appeared to have changed to a new, consistent value for all directory placeholders created with this instance of my host app.

It seems like there must be some state that is going wrong in how directories are mapped to callback host applications (especially given the 'virtualization provider not available' errors for this directory while all the surrounding directories were under the same one virtual root mapped to my one provider app instance)?

I would think that it was in the directory reparse data, but the fact that the other similar issues were resolved with a reboot only makes it seem like there must be some in-memory mapping that is getting mucked up somehow (and nuking the directory placeholder, as we did in this case, also resets that state similar to a reboot)?

Thanks for you help!

abhijeet-gautam commented 3 years ago

If you still have the reparseData dumps would you mind sharing those (please obscure the data that shouldn't be visible) ? I am hoping you did see a ProjFs reparseTag for the misbehaving directory \bin\online_Release\deployment as well.

FeralSquid commented 3 years ago

Unfortunately I don't have the raw reparse data dumps, but as part of my tests I did look at them and compare the misbehaving directory's reparse data to its working parent directory's reparse data, and they were identical except for the path strings (and reparse data size, because of the string length difference).

I will attach the raw dump if/when we see another case.

fanzeyi commented 3 years ago

Hi all, I'd like to report a similar case we have been seeing with our project (https://github.com/facebookexperimental/eden).

We have received several user reports where files listed in ProjectedFS is not accessible despite the file itself is available. The file will appear in file explorer as well as directory listing in PowerShell. I confirmed the file in question was included in our reply to GetDirectoryEnumerationCallback.

However, any attempt to directly access the file will result as a file not found error from Windows. For example, a Python tool we have to check ProjectedFS state via ctypes will report FileNotFoundError: [WinError -2147024894] The system cannot find the file specified. Other debugging tools will report similar errors. We are not able to query the reparse data for the file as fsutil reparsepoint query will report Error: The system cannot find the file specified..

We also do not observe any getPlaceholderInfo calls to the file in question in our provider even though we did see getPlaceholderInfo calls for the variant of the same file (filename.exe, filename.lnk and filename.exe.lnk). I believe these calls are fallback path of some code trigger by the initial file not found error.

The machine seeing is issue is also on the latest Windows version (21H1). The only pending patch not installed is kb4023057.

Similar to OP, we tested both restarting the virutalization provider and clearing the negative path cache but no avail. The only workaround we have found so far is to reboot the machine.

abhijeet-gautam commented 3 years ago

@fanzeyi thanks for reporting the issue. If you still have the repro would you mind capturing and sharing ProjFs logs ? If yes,

  1. please copy the following into a .cmd file and run from an admin cmd prompt.
  2. Repro the scenario - where it reports the file not found error.
  3. Stop the .cmd script by pressing any key.
  4. A file projfstrace_000001.etl would be generated at the path you ran the script in 1.
  5. Please send the file to us for investigation.
@echo off
logman create trace -n ProjFS_ALL -o projfstrace.etl -nb 128 640 -bs 128
logman update trace -n ProjFS_ALL -p {b89e90a5-b619-50dd-ac92-a91115f73059} 0x7FFFFFFF 255
logman update trace -n ProjFS_ALL -p {B6D7DC51-78CF-4E85-8BAC-488A9F47A0BB} 0x7FFFFFFF 255
logman update trace -n ProjFS_ALL -p {baca0e64-1bd1-506a-a250-a10ab355fbd2} 0x7FFFFFFF 255
logman update trace -n ProjFS_ALL -p {32F63487-712D-4324-A29F-CBF4940C8C0C} 0x7FFFFFFF 255
logman start -n ProjFS_ALL

echo Trace started.  You may now run your scenario.
echo When complete, hit any key in this window to stop the scenario.
pause

logman stop -n ProjFS_ALL
logman delete -n ProjFS_ALL

@FeralSquid you could help us with the traces as well.

xavierd commented 3 years ago

Adding some context. Looking back at the internal EdenFS user support, it looks like the first time this issue was reported was in April this year, but some users may have hit the issue sooner than that.

xavierd commented 3 years ago

One of our user just hit what appears to be this issue.

At first, listing files in a directory under the virtualization root showed all but one file. Thinking that this might be an EdenFS bug, they then restarted EdenFS, but trying to open the file afterwards gave them the "The provider that supports file system virtualization is temporarily unavailable". Creating a file in that directory also failed with the same error, which appears oddly similar to @FeralSquid's comment from above.

The reparse point on the directory appears to be in order:

PS C:\open\fbsource> fsutil reparsepoint query C:\path\to\directory 
Reparse Tag Value : 0x9000001c
Tag value: Microsoft
Tag value: Directory

Reparse Data Length: 0x198
Reparse Data:
0000:  02 00 00 00 02 00 00 00  c6 c5 ce d1 c9 d7 b6 43  ...............C
0010:  98 f1 ca dd e1 6a 5a 2e  00 00 00 00 00 00 00 00  .....jZ.........
0020:  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
0030:  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
0040:  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
0050:  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
0060:  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
0070:  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
0080:  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
0090:  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
00a0:  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
00b0:  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
00c0:  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
00d0:  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
00e0:  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
00f0:  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
0100:  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
[... path omitted ...]

Querying the file in question fails:

PS C:\open\fbsource> fsutil reparsepoint query C:\path\to\directory\BUCK
Error:  The provider that supports file system virtualization is temporarily unavailable.

I've pointed the user to this issue and they collected the trace requested above. What's the best way to send you this trace file?

cgallred commented 3 years ago

How big is the trace file?

xavierd commented 3 years ago

The zipped version is ~20KB, but I'm more concerned about potential private information contained in it to post it on Github :)

cgallred commented 3 years ago

Oh, I don't want you to post it to GitHub in any case. :-) I was thinking along the lines of email vs. some file share service.

Please attach it to an email and send it to msfltdev (at) microsoft (dot) com.

FeralSquid commented 3 years ago

Hello all,

I was just able to grab the logs from a new live repro case. I've emailed them to the address above.

Any luck finding something in xavierd's logs?