spaghettidba / WorkloadTools

A collection of tools to collect, analyze and replay SQL Server workloads, on premises and in the cloud
MIT License

File capture replay takes very very long #59

Closed noamo48 closed 4 years ago

noamo48 commented 5 years ago

I'll preface by saying thank you for writing this and making it freely available. The issue is that I used SqlWorkload to capture 8 hours' worth of transactions on a prod instance. I have now been replaying that capture against my test instance for over 48 hours and it's still going. Why is it taking so long? Is it just repeating the replay over and over?

spaghettidba commented 5 years ago

Can the target server keep up with the load? I suggest looking at the performance data captured while observing the replay workload. Are any queries taking significantly longer than in the original capture?

noamo48 commented 5 years ago

I am testing against an Azure SQL Managed Instance with the same amount of resources as the VM from which I captured. All I have from the VM is the sqlite db of captured events... can I analyze this to compare against the Azure results? My plan is to replay against a clone of the VM once I'm finished replaying against the Azure SQL Managed Instance, and then compare those two result sets.

BTW, based on manual tests the Azure SQL Managed Instance has been as fast as, or faster than, the VM, so I don't think it's incapable of handling the load.

spaghettidba commented 5 years ago

> All I have from the VM is the sqlite db of captured events... can I analyze this to compare against the Azure results?

Yes, you can use a file listener and an analysis consumer to dump the performance data to an analysis database. If you want to compare the initial workload with the one you're replaying, you also need to capture the workload on the managed instance to file or directly to the analysis database.
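A minimal SqlWorkload configuration for that step might look like the sketch below. This is an assumption-laden example, not a verified config: the server name, database name, schema, and file path are placeholders, and the `FileWorkloadListener`/`AnalysisConsumer` type names and `SynchronizationMode`/`UploadIntervalSeconds` properties are taken from my reading of the WorkloadTools README, so check them against the documentation for your version.

```json
{
    "Controller": {
        "Listener": {
            "__type": "FileWorkloadListener",
            "Source": "C:\\temp\\capture.sqlite",
            "SynchronizationMode": "false"
        },
        "Consumers": [
            {
                "__type": "AnalysisConsumer",
                "ConnectionInfo": {
                    "ServerName": "AnalysisServer",
                    "DatabaseName": "WorkloadAnalysis",
                    "SchemaName": "baseline"
                },
                "UploadIntervalSeconds": 60
            }
        ]
    }
}
```

With `SynchronizationMode` off, the file listener pushes events through as fast as the consumer can take them, which is what you want when you only need to dump the captured performance data into the analysis database rather than re-enact the original timing.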

Another thing to keep in mind is latency. Where is the replay running from? A VM in Azure?

noamo48 commented 5 years ago

Oh yes, I'm definitely dumping into an analysis db for the workload I'm replaying against the Azure instance; I just didn't do so during the initial capture, so I can't compare against VM performance yet.

Is there any way to know how much longer is left on this replay? For example, does the analysis db tell me how many queries have already been run, and does the capture db tell me how many queries in total will be replayed? If so, I could subtract the former from the latter and figure out at least how many queries remain.

Regarding latency, I'm testing from an Azure VM in the same Vnet as the Azure SQL managed instance.

BTW thank you for your prompt help!

spaghettidba commented 5 years ago

I'm sorry, I don't think there is a way to tell how much is left. I'll make sure to add that to the output in future releases.

So, if your analysis db is being populated with data from the replay, you could open it with WorkloadViewer and see whether the performance numbers are what you're expecting. Things like queries taking ages should be a red flag.

noamo48 commented 5 years ago

I did analyze what has been recorded so far, using WorkloadViewer, and fortunately (or unfortunately) all of the average query durations look optimal. The Azure SQL Managed Instance metrics also show very little resource consumption, and the VM from which I'm running the replay isn't heavily loaded either, in CPU, RAM, or disk IOPS. So I'm not sure why the replay is taking so long to finish. But it's still going...

noamo48 commented 5 years ago

Just FYI: I was able to open the sqlite capture db, count the total number of events, and compare that to the sum of execution counts from WorkloadViewer. Based on that, it seems like the replay should finish shortly.
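That count can be scripted. Here is a minimal sketch that assumes the capture file stores its rows in a table named `Events`; the table name is an assumption, so verify it against your own file (e.g. with `.tables` in the `sqlite3` shell) before relying on it.

```python
import sqlite3


def count_captured_events(sqlite_path, table="Events"):
    """Count the total events in a WorkloadTools sqlite capture file.

    NOTE: the default table name "Events" is an assumption about the
    capture file's schema; pass a different name if yours differs.
    """
    conn = sqlite3.connect(sqlite_path)
    try:
        # A single aggregate query; no rows are materialized in Python.
        return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    finally:
        conn.close()
```

Subtracting the sum of execution counts shown in WorkloadViewer from this total gives a rough "events remaining" figure, as described above.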

spaghettidba commented 5 years ago

Ah, then we have found a bug. Sorry about that.

noamo48 commented 5 years ago

Well, it finished the replay a few minutes after my last comment was posted. So all in all, the replay takes forever, but the tool is great. And this is coming from someone who tried every which way to measure performance metrics with DEA. So thank you for sharing your hard work!