Open randomascii opened 3 years ago
Thanks for filing here Bruce.
Update: This bug has now been fixed internally and will work its way through our engineering system into a future OS release.
Summary of the fix:
Fixed perf bug in VspQueryCopyFreeBitmap.
Bruce Dawson on the Google Chrome team pointed out a bug where Chromium builds had several hiccups in I/O. Others have also hit the same bug. The root cause is that the search for free regions of the volsnap CoW bitmap was incorrectly unbounded and could take multiple milliseconds on a 1TB drive.
Thanks for reporting this issue. Glad to have fixed it!
Will update this thread with details of which builds this fix will first arrive in.
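To illustrate the root cause described above, here is a minimal sketch of an unbounded versus a bounded search for a free region in an allocation bitmap. It is not the volsnap code: `find_clear_run`, its `max_scan` parameter, and the list-of-ints bitmap are hypothetical names chosen only for this example.

```python
from typing import Optional, Sequence, Tuple


def find_clear_run(
    bits: Sequence[int],
    start: int,
    min_len: int,
    max_scan: Optional[int] = None,
) -> Tuple[Optional[int], int]:
    """Find the first run of at least min_len clear (0) bits at or after start.

    Returns (run_start or None, number_of_bits_examined).
    If max_scan is given, at most that many bits are examined (hypothetical
    bound, standing in for whatever limit a real fix might apply).
    """
    end = len(bits) if max_scan is None else min(len(bits), start + max_scan)
    run_start = None
    examined = 0
    for i in range(start, end):
        examined += 1
        if bits[i] == 0:
            if run_start is None:
                run_start = i
            if i - run_start + 1 >= min_len:
                return run_start, examined
        else:
            run_start = None  # a set bit ends the current candidate run
    return None, examined


# A mostly allocated bitmap: one million set bits, then a small free region.
bitmap = [1] * 1_000_000 + [0] * 8

unbounded = find_clear_run(bitmap, start=0, min_len=8)
bounded = find_clear_run(bitmap, start=0, min_len=8, max_scan=4096)

print(unbounded)  # (1000000, 1000008)
print(bounded)    # (None, 4096)
```

The unbounded call has to walk the entire allocated prefix before it finds anything, so its cost grows with the size of the volume. The bounded call gives up after a fixed amount of work, which keeps the worst case predictable at the price of sometimes not finding a region.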
Any updates yet on what build this will be fixed in? We still regularly hit this issue when imaging new machines, so I'm assuming it's not live yet?
I've sent a note to the team and am waiting on a status update. I'll post here as soon as I get an answer.
Any updates on when this will be out?
a year later, any update?
When will this be fixed? I have seen several incidents of this issue in the field and even have one repro. When will this be backmerged to all still-supported OS editions?
Hey folks! I just pinged the team for an update on this. Thanks for your patience and I'll post here as soon as I hear back.
Hey folks! It looks like this issue should have been fixed in Windows 11. If running Win11 and still experiencing the issue please comment and I can give these data points to the team.
pov: your device doesn't support windows 11 with its high requirements
So Windows Server 2022 should also not exhibit the issue? Is there a ticket ID I could reference for a backport to Server 2016? Is there any chance of that, or was the fix part of a complete rewrite of the storage subsystem, as happened for memory management or the UI subsystem?
@bitcrazed: Could I get some information on which server OS versions have the fix? Would a backport to earlier OS versions be possible? The issue happens too often, with bad effects.
Hey folks. To help us better understand the scope of impact of this issue, could you share where this is impacting you, how many users/machines, what workloads are impacted, etc?
Many thanks in advance.
This issue caused severe performance impact at one of our customer sites. Multiple users were affected (<=10).
At least for me this is affecting servers with large RAIDs where formatting the drive is not an option for production machines. If the issue keeps coming back it is a bad situation to be in.
Which of these mitigations are actually helping?
It would be good to have some tool to trace the duration of NtfsFreeRecentlyDeallocated so one can easily check if the issue is there. Currently there is nothing except ETW profiling available. NTFS uses WPP where one would need to author a custom TMF file or have private symbols.
Can someone please have a look into the source code? Which build number did the fix go into? Server 2022 is build number 20348, which is a Windows 10 build; Windows 11 build numbers are 22000 and above.
I did a little digging with the dev team and confirmed it was fixed in Windows 11 and is fixed in Windows Server 2025 - you can validate the preview here: https://www.microsoft.com/en-us/evalcenter/evaluate-windows-server-2025?msockid=0eceedf674a061483924f949751a6064.
However, it is not fixed in Windows Server 2022. To backport, I would need customer impact information. If your company has an MS field contact, please have them contact me, or you can DM me on twitter/x @TheAdamBr if you don't feel comfortable sharing info publicly.
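For anyone trying to work out whether a given machine should have the fix, here is a rough sketch based only on the statements in this thread (fixed in Windows 11 and Windows Server 2025, not fixed in Windows Server 2022). The thread does not name the exact build the fix landed in, so using the first Windows 11 build (22000) as the threshold is an assumption, and `fix_expected` and `describe` are hypothetical helper names.

```python
import sys

# Build numbers mentioned in this thread.
SERVER_2022_BUILD = 20348
WINDOWS_11_FIRST_BUILD = 22000  # assumed threshold, not confirmed by Microsoft


def fix_expected(build: int) -> bool:
    """Guess whether a Windows build should contain the fix.

    Assumption: every build at or above the first Windows 11 build has it.
    """
    return build >= WINDOWS_11_FIRST_BUILD


def describe(build: int) -> str:
    """Return a short human-readable verdict for a build number."""
    if fix_expected(build):
        return f"build {build}: fix expected (Windows 11 era or later)"
    if build == SERVER_2022_BUILD:
        return f"build {build}: Windows Server 2022, reported as not fixed"
    return f"build {build}: predates Windows 11, fix not expected"


if __name__ == "__main__" and sys.platform == "win32":
    # sys.getwindowsversion() only exists on Windows.
    print(describe(sys.getwindowsversion().build))
```

This only encodes what was said here. A backport would change the answer for older builds, so treat the result as a hint rather than a guarantee.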
@AdamBraden: The bug seems to be related to VSS (Volume Shadow Copy), which was not needed on that data drive. Disabling it did solve the issue for us. But it was a long journey which needed several calls to get a definitive answer. On the C drive I have not seen this in the wild yet. Thanks for confirming that the issue is finally solved on Windows Server 2025 and Windows 11.
So, the issue was found in October of 2019, it was formally reported in November of 2020 (shouldn't really be necessary, but okay), it was "fixed" promptly, but the fix doesn't ship to server SKUs (where it is needed most?) until late 2024.
Given the severity of the bug, in some scenarios, this seems disappointingly slow. And communication could have been better - I still can't really tell where it is fixed and where it isn't.
I'm glad it never affected me. I'm just the random Internet person who identified the problem using trace data from a third party. I guess I'm cynically wondering why I was needed to help resolve this process, and what this apparent need says about Microsoft's performance culture, and why Microsoft didn't find and fix the problem using their own trace data.
@randomascii: Support was better some years ago, before MS transformed all of its technical support guys (at least in Germany) into Cloud Solution Architects. In parallel, they outsourced support to e.g. Egypt and other countries where labor costs are lower. In the end the new guys call back to HQ in Seattle, where even more overworked guys (at least that is my impression) are now handling too many tickets.
The support process usually boils down to
This script collects everything although some more streamlined settings could collect data much faster (especially ETW) but you can never talk to the actual guy who looks at the data.
In some cases on some machines if system restore is enabled then RtlFindNextForwardRunClear may end up spinning in a seven-instruction loop for multiple seconds while holding a lock. This prevents basic operations like WriteFile from completing. In the case where this was first hit this caused a 64-processor machine to grind to a halt, repeatedly.
A full explanation can be found here:
https://randomascii.wordpress.com/2019/10/20/63-cores-blocked-by-seven-instructions/
I have heard that this bug has been fixed but was asked to file an issue to formally track it: https://twitter.com/richturn_ms/status/1330947602129448961
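A toy model of the stall described above may help readers who have not seen the blog post. It is a sketch, not the kernel code: one thread holds a lock while doing a long "scan" (simulated here with `time.sleep` instead of a real spin in RtlFindNextForwardRunClear), and a second thread that only wants to do a quick "write" has to wait for that same lock. The function name `measure_writer_stall` is made up for this example.

```python
import threading
import time


def measure_writer_stall(scan_seconds: float) -> float:
    """Return how long a 'writer' waits for a lock held by a long 'scan'."""
    lock = threading.Lock()
    scan_started = threading.Event()

    def scanner() -> None:
        with lock:
            scan_started.set()
            # Stands in for the long bitmap search done while holding the lock.
            time.sleep(scan_seconds)

    worker = threading.Thread(target=scanner)
    worker.start()
    scan_started.wait()  # make sure the scanner already owns the lock

    begin = time.perf_counter()
    with lock:  # stands in for a write that needs the same lock
        waited = time.perf_counter() - begin
    worker.join()
    return waited


stall = measure_writer_stall(0.2)
print(f"writer stalled for {stall:.3f} s")
```

The writer's own work is trivial, yet it is delayed for roughly the whole duration of the scan. With many processors all needing the same lock, every one of them queues up behind the single scanning thread.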