Open itoleck opened 2 years ago
here is also an interesting thing I found some years ago. The Intel Management Engine causes boot delays.
Btw, you should also explain a bit (non) paged pool memory tracing. I also made a guide which shows how to find usage of tags that are used by Windows itself
This is a long one, get some hot cocoa.
Scenario:
A domain joined Windows 10 virtual machine was slow to boot to the user's desktop. Initial investigation did not show very high CPU usage. Storage IO may be high. There is no information on the type of storage underlying the virtual hard disk (SD or HDD).
Analysis:
Winlogon Issue #1
Open trace and add the 'Other' - 'Boot Phases' graph to the analysis view. If your Windows Performance Analyzer does not have the Other-Boot Phases graph available download the newest Windows ADK installer for your Windows version and uninstall the current version and install the newer version.
The 'Boot Phases' graph shows the 5 different phases of boot, (Pre Session Init, Session Init, Winlogon, Explorer Init and Post Boot). Each of these phases should take as little time as possible to complete. Post boot is not relevant in most cases as it is after the user logon unless there is a performance issue after the user logon launching applications. Focus on the Winlogon and Explorer Init in this trace as they are both fairly long. In this trace the 'Session Init' is a bit long as well but focusing time on the possibility of gaining back 5 seconds .vs the other 2 phases is not a good use of time.
Select the Winlogon phase bar in the graph and right-click the graph and Zoom to the Winlogon phase. Each graph added to the same analysis view will have the same zoom level. This way you can focus on 1 phase at a time.
Some Windows internals knowledge will go a long way in finding root causes of slow performance in ETW traces. Winlogon phase comprises of many different sub phases. View the sub phases by adding in the 'System Activity' - 'Windows Logon' graph to the analysis view. There are 3 Notifications in the table (CreateSession, Logon and StartShell). There is also a time before the CreateSession activity that is empty whitespace where it looks like nothing was happening. This is where the internals knowledge comes to play. Auto-Start services start at the beginning of the Winlogon phase. All Auto-Start services up to the 'ProfSvc_Group' service group must have started before Windows will create the first session.
Investigate the services by adding the 'System Activity' - 'Services graph' in the analysis view. You can also left-click and drag in the graph to select the empty space between 0 seconds and ~39 seconds if you would like to increase your focus.
In the Services graph add the 'Started At (s)' column to the very beginning of the column list and sort Ascending. Sorting descending on the 'Service Init Time (s)' is also a good way to find service start issues. You should see that service, System Optimizer' is taking a long time to start at ~16.5 seconds until ~37.8 seconds. At this popint you can remove or update the service or change the start type to manual but let's see if there is a way to get more information on the service first.
You may want to view more information (Publisher, Version, etc.) on the process that the service runs. That can be done in the 'System Activity' - 'Images' graph; add it to the analysis view. Sort by Process and add the following columns to the view: File Description, File Version, Binary File Version, Productname, Company Name and Product version.
Open the Process System Optimizer.exe (740) and you will see all of the images/modules that the service loaded when started. Find the 'System Optimizer.exe image name and open. You should see the 'Product name', 'Binary File Version' and 'Product Version' have values, while the other columns added do not. At this point there is no way to tell what the service is or where it came from.
At this point the analysis view screen should be filling up with graphs. Minimize the graphs on the left side caret/arrow. Add the 'Computation' - 'CPU Usage (Precise)' graph to the analysis view. You can check if the trace has stack information available by going to the menu and selecting 'Trace' - 'System Configuration' and viewing Trace Statistics. If there are StackWalk events listed there should be stack information in the trace. Usually StackWalk events are the highest count of events in a verbose trace.
There are a few reasons that the System Optimizer process may be taking a long time to start. One is that the service is doing some processing and needs to finish the work and the other is that the processes threads are waiting of something else to finish. Check the CPU usage by sorting by % CPU Usage descending. See that the process is only 0.11% weight. There are 2 CPUs available so if a thread was fully utilizing a CPU the weight would be 50 maximum for 1 thread/CPU. It does not look like the issue is too much processing.
Setup the columns for the 'CPU Usage (Precise)' graph as follows. New Process | New Thread | New Thread Stack | Ready Thread Stack | Readying Process | Readying Thread |GOLDBAR| Waits (us) Max | Wait (us) Sum | Count |BLUEBAR| % CPU Usage Sum
Before you can view detailed stack information you must verify your symbols configuration and 'Load Symbols'.
Find the System Optimizer process and open the caret. See the threads. All processes have at least 1 thread, but many processes have many threads which can make finding any threads that are waiting difficult. Sorting by either Waits (us) Max and Waits (us) Sum can help find issues in the threads. Sometimes you may have to open each thread and look at the stack. In this case we know that the time the service takes to start is almost exactly 20 seconds. There is 1 thread with a Waits(us) Max of 20,074,833.700us, thread 768. Open thread 768.
Open each stack entry that has close to the 20 seconds of wait time. The first stack entry is 'ntdll.dll!_RtlUserThreadStart' then 'sechost.dll!ScSvcctrlThreadW' then 'System.ServiceProcess.ni.dll!System.ServiceProcess.ServiceBase.ServiceMainCallback(Int32, IntPtr)' and 'mscorlib.ni.dll!System.Threading.WaitHandle.WaitOne()'. You should get to the first 'Ready Thread Stack' or [Root], open this and do the same opening each entry. At the end you should see the process and thread that the stack was waiting for, System Optimizer.exe (740) Thread 772. You must now repeat the process with thread 772.
Open thread 772 in the System Optimizer.exe (740) process. You should see that the Wait (us) max is still about 20 seconds. This means that the previous thread was not to blame. It was waiting for this thread or another down the chain to finish. Open each entry. You can use the left keyboard key to open stack entries easily. At the end of the Ready Thread Stack entries you should see the process and thread that this thread was waiting for which is Idle (0) thread 0. This means that the waiting was in the 772 thread. Somewhere in the stack of the 772 thread is the key to the issue. look at the functions and find 'mscorlib.ni.dll!System.Threading.Thread.Sleep(Int32)' and the other sleep functions. Notice the count of events is 1. This means that the service has a sleep timer of 20 seconds built in. That does not seem like an optimization to me.
Winlogon Issue #2
After fixing the service issue the Winlogon boot phase should take 20 less seconds, that's pretty good. Let's see if there is anything else that can be optimized. Go back to the Winlogon graph and look for long running notifiction steps in Winlogon. You should see that Logon notification is long at 34.274647200 seconds. Open the Logon item to find SessionID 1, open session 1. You should see 5 Subscribers, GPClient, Profiles, Sens, SessionEnv and TermSrv. The Profiles Subscriber is long at 32.723633100 seconds.
The Profiles Subscriber is when the user's profile is loaded after logon. Select the Profiles row and zoom into the graph. The user's profile consists of registry settings, a local or remote drive location and roaming information if enabled. A good next place to look would be the storage.
2 storage related graphs are available, Disk Usage and File I/O. Start with Disk Usage and add it to the analysis view. Change the columns to the following view.
Disk | Priority | Process | IO Type | Path Tree | IO Init Stack |GOLDBAR| Size (B) Sum | Count |BLUEBAR| Disk Service Time (us) Sum
Sort by 'Disk Service Time (us) Sum descending. Open disk 0, priority normal and find Process svchost.exe (364). This is the process with the highest latency. Open the process and view the reads, writes and flush IO Types. Read should be the highest. Open the 'Path Tree' and find that $Mft is the highest latency. $Mft is the file table for the formatted partition. In this case C:. Open the 'IO Init Stack' and see that the 'userenv.dll!CopyFileFunc' entry is the highest latency. userenv.dll is a dll for the user profiles. Note that the actual user profile location c:\users is very low latency(118ms total) and size(~1.5MB total).
At this point we know there is a lot of activity in the file table but not in actual files. Let's look at FileIO graph. Add it to the analysis view.
Change the columns to the following view. Sort by 'Duration (us) Sum' descending.
Process | Event Type |GOLDBAR| File Path | Thread | Duration (us) Sum | Size (B) Sum | Result |BLUEBAR| Count
Explorer Init Issue
The Explorer Init boot phase is long at 37.53 seconds. Un-zoom the Boot Phases graph and select the Explorer Init phase and right-click and zoom.
In this boot phase auto start applications are being started from multiple locations such as the 'Run' and 'RunOnce' registry locations, Startup start menu folder and scheduled tasks. Look for new processes in this phase by adding the 'System Activity' - 'Processes' graph to the analysis view.
Change the columns to the following view.
Process |GOLDBAR| Parent Process ID | Command Line |BLUEBAR| Start Time (s) Min | End Time (s) Max
Sort by Start Time (s) Min ascending. Since the Explorer phase starts at ~94 seconds shift-select all rows that have higher start times than 93 seconds. Right-click the rows and select 'Filter To Selection'.
Find that there are some processes starting at this time and they all take some time before starting the next. Two stand outs are the 2 processes wermgr.exe. This is Windows Error Reporting. Either an application crashed or the process is packaging and sending data of previous crashes to Microsoft for analysis. In 1 of the wermgr.exe command lines you can see the -upload switch. So this looks like an upload.
By looking at the 'Parent Process ID' column for wermgr.exe you can see that the parent process is 364. You can find what service this is by going to the Menu -> Trace -> System Configuration -> Services. You should find that one of the services in process 364 is Scheduler, the Task Scheduler. The Windows Error Report was sent when the task was started after logon.
Remediation:
Remove the sleep function in the System Optimizer.exe service. The service had a 20 second delay which can be removed with a code update.
Remove the empty files from the user's documents folder. There was 17.23 seconds of IO to the user's documents folder so some of this can be saved.
Change Scheduled task for Windows Error Reporting to start after logon only after idle.