spaghettidba / WorkloadTools

A collection of tools to collect, analyze and replay SQL Server workloads, on premises and in the cloud
MIT License
238 stars 54 forks source link

WorkloadTools issue with Commvault Backups #146

Closed SimonBillinghurst closed 11 months ago

SimonBillinghurst commented 11 months ago

I have a listener writing to a WorkloadFileWriterConsumer which starts recording just fine, but as soon as our Commvault backup process kicks in and starts its incremental backup of the database no more events are being recorded.

I can see the logs just stops writing, I don't see any relevant events in the event viewer. As you can see below it is recording events, then just stops, I can see in the sqlite database where it records the Commvault queries and then nothing is recorded from that point on other then empty 4 and 6 events.

2023-11-23 11:58:10.8112 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : Summary info written (1 rows) 2023-11-23 11:58:10.8112 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : 278 rows aggregated 2023-11-23 11:58:10.8112 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : 54 rows written 2023-11-23 11:58:10.8268 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : Performance counters written 2023-11-23 11:58:10.8268 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : Wait stats written 2023-11-23 11:59:10.8267 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : Summary info written (2 rows) 2023-11-23 11:59:10.8267 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : 1442 rows aggregated 2023-11-23 11:59:10.8267 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : 92 rows written 2023-11-23 11:59:10.8424 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : Performance counters written 2023-11-23 11:59:10.8580 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : Wait stats written 2023-11-23 12:00:10.8869 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : Summary info written (2 rows) 2023-11-23 12:00:10.8869 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : 97 rows aggregated 2023-11-23 12:00:10.8869 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : 35 rows written 2023-11-23 12:00:10.9025 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : Performance counters written 2023-11-23 12:00:10.9181 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : Wait stats written 2023-11-23 12:01:10.9121 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : Summary info written (0 rows) 2023-11-23 12:01:10.9121 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : 0 rows aggregated 2023-11-23 12:01:10.9121 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : 0 rows written 2023-11-23 12:01:10.9277 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : Performance counters written 2023-11-23 12:01:10.9277 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : Wait stats written 2023-11-23 12:02:10.9524 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : Summary info written (0 rows) 2023-11-23 12:02:10.9524 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : 0 rows aggregated 2023-11-23 12:02:10.9524 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : 0 rows written 2023-11-23 12:02:10.9524 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : Performance counters written 2023-11-23 12:02:10.9681 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : Wait stats written 2023-11-23 12:03:11.0010 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : Summary info written (0 rows) 2023-11-23 12:03:11.0010 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : 0 rows aggregated 2023-11-23 12:03:11.0010 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : 0 rows written 2023-11-23 12:03:11.0010 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : Performance counters written 2023-11-23 12:03:11.0167 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : Wait stats written 2023-11-23 12:04:11.0267 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : Summary info written (0 rows) 2023-11-23 12:04:11.0267 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : 0 rows aggregated 2023-11-23 12:04:11.0423 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : 0 rows written 2023-11-23 12:04:11.0423 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : Performance counters written 2023-11-23 12:04:11.0423 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : Wait stats written

spaghettidba commented 11 months ago

Would you mind testing this with version 1.7.0 that I just released?

SimonBillinghurst commented 11 months ago

Thanks for the reply, unfortunately this did not help, same issue.

spaghettidba commented 11 months ago

Bummer. Do events show up in the source XEvents session? If you watch live data, do events appear?

SimonBillinghurst commented 11 months ago

Sorry I am not sure what you mean. I am not a DBA by any means I am code jokey who is trying to work out what impact changing the compatibility mode of our SQL database from 2008 to something newer. So while I can right SQL statements my knowledge of underlying SQL is limited.

Can you explain in a bit more detail where you would like me to look.

Sorry

spaghettidba commented 11 months ago

No worries, I ran into the same bug. I'm sorry about the issue, I will try to solve it ASAP.

spaghettidba commented 11 months ago

It looks like we have a bug here, but I am unable to fix it. It has nothing to do with CommVault, but it has to do with something in the code that stops the collection of events from the stream of the XEvents session. I verified that it's not happening in 1.6.2, so it's something introduced from 1.6.5. @a-teece are you able to reproduce it? It's mostly your code, but I have a hard time understanding what change exactly triggered the problem. It looks like StreamXEventDataReader.ReadEvents blocks when it performs eventsEnumerator.MoveNext(), but I really can't tell why. Help!

a-teece commented 11 months ago

@spaghettidba I've been through all the changes in StreamXEventDataReader itself and nothing has functionally changed - lots of changing variables to be inferred and use var, lots of braces on separate lines, etc. All just code consistency without changing the logic.

Which means it's something above StreamXEventDataReader in the stack. Still looking....

spaghettidba commented 11 months ago

Yes, I came to the same conclusion. It must be something else, but I really can't tell what. :(

spaghettidba commented 11 months ago

BTW, are you able to reproduce this behavior? For me it stops producing results in the analysis database after 1 or 2 minutes in average, in general before the 5 minutes mark.

a-teece commented 11 months ago

@spaghettidba Going to try now. Most of my work was on replaying the working because the capture was done on a much older build. I'll go through the changes that are common to capture/reply and see if something that was changed to solve replay issues had an unintended consequence on the capture.

a-teece commented 11 months ago

@spaghettidba I can't replicate, however my fork doesn't have your latest changes, it only has the last ones I sent you. So I'm going to update it and try again which would help narrow down the change.

a-teece commented 11 months ago

@spaghettidba I can't replicate at all when running from Visual Studio, latest committed code, against SQL 2017.

spaghettidba commented 11 months ago

I see... It's weird. For me it's consistent: 1.7.0 stops recording at some point. Same for 1.6.5 If I try with 1.6.2, it works flawlessly. I'll see if I can find the problem. Thanks for your help!

a-teece commented 11 months ago

@spaghettidba Just trying now on an installed 1.7.0

a-teece commented 11 months ago

@spaghettidba Running for 10 min on a deployed 1.7.0, SQL 2017.

Final try is i'm letting my test DB go quiet for a little bit. See if 0 events for awhile causes it.

a-teece commented 11 months ago

@spaghettidba I'm still going after an hour and at numerous points I shutdown my workload to give long pauses of 0 activity.

What SQL version can you replicate on; I'm using 2017. What config file; I'm using the default AnalysisSample.json with different server names, db names and passwords

spaghettidba commented 11 months ago

I'm using 2022 both for replay target and analysis target. I'm using a very simple analysis.json, same as you are (sample with changed server and database). I'm running a full replay with 1.6.2 to see if I'm losing events, after that I'll try another run with 1.7.0 to reproduce the incorrect behavior and compare results.

a-teece commented 11 months ago

@spaghettidba I need to push my dev machine to SQL 2022 anyway so will try and find to check with that too

spaghettidba commented 11 months ago

Thank you very much! I can confirm that 1.6.2 runs perfectly. Now I'll try again 1.7.0 and will post the results.

spaghettidba commented 11 months ago

@a-teece confimed, 1.7.0 stops collecting Extended Events data after one or two minutes. One thing to try to reproduce the behavior might be this: I have several timeouts in this replay, so things that run slower on 2022 compared to 2019 (where the production workload was recorded). Also the operation mentioned by @SimonBillinghurst is a backup, so likely runs longer than a collection window of 1 minute. Could it be that long running processes break the XE collection? A simple WAITFOR DELAY '00:10:00' in the workload might help troubleshooting.

spaghettidba commented 11 months ago

Found it! Super sneaky default value assignment introduced in 1.6.5 to remove \0 chars at the end of strings:

StreamXEventDataReader.cs:269

Current code:

            // check whether last char is a null char (\0)
            // because this breaks writing this string to the sqlite database
            // which considers it as a BLOB
            if(result is string stringValue)
            {
                while (stringValue.EndsWith("\0"))    // <--- see? checks stringValue but assigns result in the loop
                {
                    result = stringValue.Remove(stringValue.Length - 1);
                }
            }
            return result;

New code:

            // check whether last char is a null char (\0)
            // because this breaks writing this string to the sqlite database
            // which considers it as a BLOB
            if(result is string stringValue)
            {
                while (stringValue.EndsWith("\0"))    // <--- see? checks stringValue but assigns result in the loop
                {
                    stringValue = stringValue.Remove(stringValue.Length - 1);
                }
                result = stringValue;
            }
            return result;

I'm releasing 1.7.1 soon

a-teece commented 11 months ago

@spaghettidba Really sorry this was in my change. Not sure why I cannot reproduce on my dataset. Assuming that none of my workload includes a value that ends with \0

spaghettidba commented 11 months ago

No need to be sorry! You introduced great changes to the code and I'm immensely grateful for your contributions! Moreover, I think it was the code analyzer to propose that change, you simply accepted it. I'm not sure where these \0 chars come from, but it seems like some applications end their commands with that control sequence and it breaks some other parts of the code later down the line in WorkloadTools, so I simply get rid of them.

SimonBillinghurst commented 11 months ago

Downloaded the latest version and now seems to be working. I really appreciate the effort guys, thanks very much.