microsoft / Windows-Dev-Performance

A repo for developers on Windows to file issues that impede their productivity, efficiency, and efficacy
MIT License

RtlFindNextForwardRunClear holds NTFS lock for multiple seconds #64

Open randomascii opened 3 years ago

randomascii commented 3 years ago

In some cases, on machines with system restore enabled, RtlFindNextForwardRunClear may end up spinning in a seven-instruction loop for multiple seconds while holding a lock. This prevents basic operations like WriteFile from completing. In the case where this was first hit, it repeatedly caused a 64-processor machine to grind to a halt.

A full explanation can be found here:

https://randomascii.wordpress.com/2019/10/20/63-cores-blocked-by-seven-instructions/

I have heard that this bug has been fixed but was asked to file an issue to formally track it: https://twitter.com/richturn_ms/status/1330947602129448961

bitcrazed commented 3 years ago

Thanks for filing here Bruce.

bitcrazed commented 3 years ago

Update: This bug has now been fixed internally and will work its way through our engineering system into a future OS release.

Summary of the fix:

Fixed perf bug in VspQueryCopyFreeBitmap.

Bruce Dawson on the Google Chrome team pointed out a bug where Chromium builds had several hiccups in I/O. Others have also hit the same bug. The root cause is that the search for free regions of the volsnap CoW bitmap was incorrectly unbounded and could take multiple milliseconds on a 1TB drive.
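For readers who want a feel for why an unbounded bitmap search hurts, here is a toy model in Python. It is only a sketch: the real RtlFindNextForwardRunClear and VspQueryCopyFreeBitmap code is not shown in this thread, the bitmap here uses one byte per bit for simplicity, and the `max_scan` parameter is a hypothetical stand-in for whatever bound the actual fix applies.

```python
from typing import Optional, Tuple


def find_next_clear_run(bitmap: bytearray, start: int,
                        max_scan: Optional[int] = None) -> Tuple[int, int, int]:
    """Find the first run of clear (0) entries at or after `start`.

    Returns (run_start, run_length, entries_examined). run_length is 0 when
    no clear entry was found inside the scanned window.
    `max_scan` is a hypothetical bound, not a documented Windows parameter.
    """
    end = len(bitmap) if max_scan is None else min(len(bitmap), start + max_scan)
    i = start
    # Skip over set entries. On a nearly full bitmap this loop dominates,
    # and any lock held by the caller stays held for the whole scan.
    while i < end and bitmap[i]:
        i += 1
    run_start = i
    # Measure the length of the clear run that follows.
    while i < end and not bitmap[i]:
        i += 1
    return run_start, i - run_start, i - start


if __name__ == "__main__":
    n = 1_000_000
    bitmap = bytearray([1]) * n      # almost everything allocated
    bitmap[-4:] = b"\x00" * 4        # a small free region at the very end
    print(find_next_clear_run(bitmap, 0))                 # (999996, 4, 1000000)
    print(find_next_clear_run(bitmap, 0, max_scan=1024))  # (1024, 0, 1024)
```

The unbounded call walks the whole bitmap before it finds anything, while the bounded call gives up after a fixed amount of work and lets the caller decide what to do next. Whether the shipped fix works this way is an assumption.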

Thanks for reporting this issue. Glad to have fixed it!

Will update this thread with details of which builds this fix will first arrive in.

zjturner commented 3 years ago

Any updates yet on what build this will be fixed in? We still regularly hit this issue when imaging new machines, so I'm assuming it's not live yet?

AvriMSFT commented 3 years ago

I've sent a note to the team and am waiting on a status update. I'll post here as soon as I get an answer.

klauroblox commented 3 years ago

Any updates on when this will be out?

ghost commented 2 years ago

A year later, any update?

Alois-xx commented 2 years ago

When will this be fixed? I have seen several incidents of this issue in the field and even have one repro. When will this be backmerged to all still-supported OS editions?

AvriMSFT commented 2 years ago

Hey folks! I just pinged the team for an update on this. Thanks for your patience and I'll post here as soon as I hear back.

AvriMSFT commented 2 years ago

Hey folks! It looks like this issue should have been fixed in Windows 11. If running Win11 and still experiencing the issue please comment and I can give these data points to the team.

thatsofia commented 2 years ago

pov: your device doesn't support windows 11 with its high requirements

Alois-xx commented 2 years ago

> Hey folks! It looks like this issue should have been fixed in Windows 11. If running Win11 and still experiencing the issue please comment and I can give these data points to the team.

So Windows Server 2022 should also not exhibit the issue? Is there a ticket ID I could reference for a backport to Server 2016? Is there any chance of that, or was this part of a complete rewrite of the storage subsystem, as happened for memory management or the UI subsystem?

AloisKraus commented 1 year ago

@bitcrazed: Could I get some information on which server OS versions have the fix? Would a backport to earlier OS versions be possible? This happens too often, with bad effects.

bitcrazed commented 1 year ago

Hey folks. To help us better understand the scope of impact of this issue, could you share where this is impacting you, how many users/machines, what workloads are impacted, etc?

Many thanks in advance.

karthikkbgl commented 1 year ago

> Hey folks. To help us better understand the scope of impact of this issue, could you share where this is impacting you, how many users/machines, what workloads are impacted, etc?
>
> Many thanks in advance.

This issue had caused severe performance impact in of our customer sites. Multiple users were affected (<=10)

AloisKraus commented 1 year ago

At least for me this is affecting servers with large RAIDs where formatting the drive is not an option for production machines. If the issue keeps coming back it is a bad situation to be in.

Which of these mitigations are actually helping?

It would be good to have some tool to trace the duration of NtfsFreeRecentlyDeallocated so one can easily check if the issue is there. Currently there is nothing except ETW profiling available. NTFS uses WPP where one would need to author a custom TMF file or have private symbols.
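Short of ETW, one rough user-mode check is to time small writes on the affected volume and flag the ones that stall. The sketch below does that in Python. It is not a trace of NtfsFreeRecentlyDeallocated: it only observes the symptom described at the top of this issue (writes that fail to complete for a long time), and slow writes can have many other causes. The function name, the 0.5 second threshold, and the sample count are all arbitrary choices for illustration.

```python
import os
import tempfile
import time


def probe_write_stalls(directory, samples=50, interval_s=0.0,
                       stall_threshold_s=0.5, payload=b"x" * 4096):
    """Time small flushed writes in `directory`.

    Returns (latencies, stalls), where stalls is a list of
    (sample_index, seconds) for writes at or above the threshold.
    """
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory, prefix="stall_probe_")
    try:
        for _ in range(samples):
            t0 = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # push the write to the volume, not just the cache
            latencies.append(time.perf_counter() - t0)
            if interval_s:
                time.sleep(interval_s)
    finally:
        os.close(fd)
        os.remove(path)
    stalls = [(i, s) for i, s in enumerate(latencies) if s >= stall_threshold_s]
    return latencies, stalls


if __name__ == "__main__":
    # Point this at a directory on the volume you suspect; the temp
    # directory is used here only so the example runs anywhere.
    latencies, stalls = probe_write_stalls(tempfile.gettempdir(), samples=20)
    print(f"max write latency: {max(latencies):.4f}s, stalls: {stalls}")
```

If stalls show up, an ETW trace is still needed to confirm that the time is actually being spent in the bitmap search rather than somewhere else.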

AndreasDiet commented 1 year ago

Can someone please have a look into the source code? Which build number did the fix go into? Server 2022 is build number 20348, which is a Windows 10 build. Windows 11 build numbers are 22000 and higher.
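To make the build-number reasoning concrete, here is a small Python check. It rests on a loud assumption: that the fix tracks the build number and is present from the first Windows 11 build (22000) onward. Nothing in this thread states the exact build the fix landed in, so treat the result as a guess to verify, not an answer.

```python
import sys

# Build numbers quoted in this thread.
SERVER_2022_BUILD = 20348
WINDOWS_11_FIRST_BUILD = 22000


def fix_expected(build: int) -> bool:
    """ASSUMPTION: the fix is present from the first Windows 11 build onward.

    The thread only says "fixed in Windows 11"; the exact build is unknown.
    """
    return build >= WINDOWS_11_FIRST_BUILD


if __name__ == "__main__":
    print(fix_expected(SERVER_2022_BUILD))  # False
    if sys.platform == "win32":
        # sys.getwindowsversion() is only available on Windows.
        local_build = sys.getwindowsversion().build
        print(local_build, fix_expected(local_build))
```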

AdamBraden commented 3 months ago

I did a little digging with the dev team and confirmed it was fixed in Windows 11 and is fixed in Windows Server 2025 - you can validate the preview here: https://www.microsoft.com/en-us/evalcenter/evaluate-windows-server-2025?msockid=0eceedf674a061483924f949751a6064.

However, it is not fixed in Windows Server 2022. To backport, I would need customer impact information. If your company has an MS field contact, please have them contact me, or you can DM me on twitter/x @TheAdamBr if you don't feel comfortable sharing info publicly.

AloisKraus commented 3 months ago

@AdamBraden: The bug seems to be related to VSS (Volume Shadow Copy), which was not needed on that data drive. Disabling it solved the issue for us. But it was a long journey which needed several calls to get a definitive answer. On the C drive I have not seen this in the wild yet. Thanks for confirming that the issue is finally solved on Windows Server 2025 and Windows 11.

randomascii commented 3 months ago

So, the issue was found in October of 2019, it was formally reported in November of 2020 (shouldn't really be necessary, but okay), it was "fixed" promptly, but the fix doesn't ship to server SKUs (where it is needed most?) until late 2024.

Given the severity of the bug, in some scenarios, this seems disappointingly slow. And communication could have been better - I still can't really tell where it is fixed and where it isn't.

I'm glad it never affected me. I'm just the random Internet person who identified the problem using trace data from a third party. I guess I'm cynically wondering why I was needed to help resolve this process, and what this apparent need says about Microsoft's performance culture, and why Microsoft didn't find and fix the problem using their own trace data.

AloisKraus commented 3 months ago

@randomascii: Support was better some years ago, before MS transformed all technical support guys (at least in Germany) into Cloud Solution Architects. In parallel they outsourced support to countries where labor costs are lower, e.g. Egypt. In the end the new guys call back to HQ in Seattle, where even more overworked guys (at least that is my impression) are now handling too many tickets.

The support process usually boils down to

  1. Repro issue
  2. Download https://learn.microsoft.com/en-us/troubleshoot/windows-client/windows-tss/introduction-to-troubleshootingscript-toolset-tss
  3. Run issue under this tool which takes hours to complete, or never completes.
  4. Send data back and hope for the best.

This script collects everything, although more streamlined settings could collect the data much faster (especially ETW), and you can never talk to the actual guy who looks at the data.