spaghettidba / WorkloadTools

A collection of tools to collect, analyze and replay SQL Server workloads, on premises and in the cloud
MIT License

FileWorkloadListener very slow with large files #136

Closed: a-teece closed this issue 7 months ago

a-teece commented 9 months ago

We have taken a full-day capture of a very busy system using ExtendedEventsWorkloadListener. This has resulted in a 205 GB SQLite file with 295,940,938 events.

I appreciate this may be above-average, but using a load testing tool to replicate a real-world workload size seems reasonable.

When starting the replay, the FileWorkloadListener took hours just to output the information line "Info - WorkloadTools.Listener.File.FileWorkloadListener : The source file contains 295940938 events." (which, for large numbers, would be nicer with a little regional formatting).

It then took a VERY long time after that message before playback started.

I have found, using DB Browser for SQLite, that the same basic "SELECT COUNT(*) FROM Events" statement takes hours too. However, running "VACUUM" first (which itself took several hours) "fixed" this. It shaved 5 GB off the file, but the restructuring of the internal data has had a big impact: the SELECT COUNT(*) query now runs in 65s, which is quite an improvement over several hours!

I do not use SQLite, so I have very limited knowledge, but I believe there are options to auto-vacuum as it goes along, or when the file closes.
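
For example (untested, and based on only a day of reading the SQLite docs), something along these lines might be what I mean; my understanding is that auto_vacuum has to be set when the file is first created, or the database has to be VACUUMed once after changing it:

PRAGMA auto_vacuum = INCREMENTAL;  -- set when the capture file is first created
-- then periodically during the capture, or when the file is closed:
PRAGMA incremental_vacuum;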

As a minimum I hope that this Issue report helps anyone else who is working with a large capture file.

spaghettidba commented 9 months ago

Thank you for reporting this! Well, 205 GB is a big file. I am not sure SQLite was designed for this. It is also a lot of text, which is another factor that hurts performance. If I recall correctly, I opted against auto vacuum to cope with the insert rate. So what do you suggest? Running vacuum before closing the capture (which would take hours, as you said)? Adding an option to run vacuum before the replay? Thanks again for your valuable feedback!

a-teece commented 9 months ago

I'm not sure I have many suggestions for SQLite, as I learnt all of this yesterday while investigating why it was being so slow. Once I realised the SQLite engine itself was slow outside of WorkloadTools, that is where I focused.

I think maybe a command-line switch to enable vacuum when closing the file, with appropriate documentation of the pros/cons. Running it before the replay could also work, but I don't think that makes as much sense: you only need to vacuum once, and you only capture once, but you may replay many times and wouldn't want to vacuum every time.

It is also worth noting that the vacuum command filled up the C: drive on the server with a temp file (over 100 GB) and failed. I had to research how to force a different location for the temp data.
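
For anyone else hitting this, the sort of thing I mean is below (the pragma is marked deprecated in the SQLite docs, and the TEMP / SQLITE_TMPDIR environment variables are the other route):

PRAGMA temp_store_directory = 'D:\sqlite_temp';  -- example path only, point it at a drive with space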

I guess, thinking out loud, my suggestion might be to optimise the use of SQLite: if we know a SELECT COUNT(*) is slow, then don't do one. Just have a "CaptureInfo" table that is written at the end of the capture with the count in it, based on a simple in-memory int64 in WorkloadFileWriterConsumer? Could the Events also be split across multiple tables by event_sequence or something, as a kind of manual horizontal partitioning, to avoid any one table getting 200m rows?
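
For the count, just a rough sketch of what I mean (table and column names made up):

CREATE TABLE IF NOT EXISTS CaptureInfo (event_count INTEGER NOT NULL);
-- written once at the end of the capture, from the in-memory counter
INSERT INTO CaptureInfo (event_count) VALUES (295940938);
-- the replay then reads this instead of a COUNT(*) over ~296 million rows
SELECT event_count FROM CaptureInfo;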

Or would a different file format simply be better, given that the FileWorkloadListener seems to have very simple requirements? Even the filtering: I would imagine that unless there are supporting indexes in the SQLite file, using a filter may make it worse. Given that the file is local rather than on a remote db server, is there even any perf benefit in applying the filter in SQLite vs in the Read() method itself? It's the same CPU doing the logic, unlike when you're shuffling data from a SQL Server to an application only to discard 20% of it.

spaghettidba commented 9 months ago

> I guess, thinking out loud, my suggestion might be to optimise the use of SQLite: if we know a SELECT COUNT(*) is slow, then don't do one

Well, looking at the code, I don't really need the count; it is used only to calculate the progress. I guess I can figure out something else.

spaghettidba commented 9 months ago

Looks like I can read the count from metadata. Do you still have the sqlite file available? Can you please run this for me and see if it is fast and accurate enough?

ANALYZE main;
select stat from sqlite_stat1 WHERE tbl = 'Events';
a-teece commented 9 months ago

Because I didn't know what I was doing with SQLite, I kept a backup before running VACUUM. Running your command now; it hasn't returned quickly. Will post back later.

spaghettidba commented 9 months ago

Ouch! I'm going through the docs and it seems it will scan the whole table. It is probably better to run this:

PRAGMA analysis_limit=1000;
ANALYZE main;
select stat from sqlite_stat1 WHERE tbl = 'Events';

... to avoid scanning the whole thing. The Events table never gets deletes, so it should not need re-analyzing, but you never can tell.

a-teece commented 9 months ago

I was surprised that VACUUM made a difference, considering it seemed like a zero-delete table; how could it fragment to the point of having perf issues on a basic query?

Either way, on the un-vacuumed file this query has been running for a few minutes now. Running just "select stat from sqlite_stat1 WHERE tbl = 'Events';" on the vacuumed file, though, gives an error that no such table sqlite_stat1 exists. Could that be because ANALYZE or OPTIMIZE have never been run?

spaghettidba commented 9 months ago

Quite possible. I'm not a sqlite expert by any means

a-teece commented 9 months ago

So your second query with the analysis_limit did return (I think that took about an hour). The stat it returned is accurate: 295940938 1001 2. I'm not sure what the 1001 2 means.

a-teece commented 9 months ago

FYI, it doesn't appear to be just the COUNT(*).

I'm running the replay again now on a machine and have noticed that the SQLite temp file "etilqs_xxxxxxxxxxxxxx" in my temp folder is 173 GB in size! It's like the act of using the SQLiteDataReader results in the whole result set being copied to a temp file, which explains the second big "pause" before the replay starts. Maybe this is so it is a repeatable read?

So would it make more sense for the data reader to do SELECT row_id FROM Events, then per-event fire a second query WHERE row_id = $row_id, like for ReadWaits, etc.? I know not everyone is using a 200 GB trace. But if my 200 GB trace produces a 173 GB temp file, it isn't unreasonable to suggest that a 20 GB trace produces a 17 GB temp file, which might catch people by surprise on their servers!
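
i.e. roughly this (sketch only, column names as above):

SELECT row_id FROM Events;                    -- stream only the keys
SELECT * FROM Events WHERE row_id = $row_id;  -- then one small lookup per event, as for ReadWaits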

I did try changing the Journal Mode to WAL, as I thought it might result in a repeatable read without large temp storage, but the file wouldn't open in SQLWorkload any more (old version of SQLite, maybe?). Given that the replay is a read-only activity, I wondered if a journal mode of OFF might avoid it.

spaghettidba commented 9 months ago

That's great feedback, thank you. I'll try to capture a big workload and reproduce all the above to see what I can do to address all the issues. I can't promise super fast turnaround, sorry.

spaghettidba commented 9 months ago

I had a look at the code and the sqlite database is already opened with the following options in the connection string:

Read Only=True;Journal Mode=Off;Synchronous=Off;

I'm afraid there is no way to avoid the creation of a big etilqs file when scrolling a DataReader, unless you read in batches, based on rowid. Honestly, I'm not sure this would make things faster, but I need to try.
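
Something along these lines, I mean (untested sketch, keyed on the row_id column):

SELECT * FROM Events
WHERE row_id > $last_row_id
ORDER BY row_id
LIMIT 10000;  -- repeat, remembering the last row_id returned by each batch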

a-teece commented 9 months ago

So I spent all day yesterday in the source code for WorkloadTools, trying to get it all working. I have a fork on GitHub and will publish the fixes and send you a PR soon.

The biggest problem I found (and the one that has caused us to think that a proposed Azure MI infrastructure performs awfully compared to our SQL Server) is that even though I was asking for the replay to be synchronized, it wasn't, so the tool was only working in stress test mode. After putting lots of debug lines into the timing code where you calculate the offset, I found the offsets were all over the place, from +60000000ms to -128384329832932ms, with start times of row 1 and row 2 that were on different days (24hr trace). This morning I have traced that to the column you sort by, event_sequence, being null (at least in some rows) - a bug in the capture side, I assume? I'm working around this by sorting by starttime then row_id instead for now. This also explains why I got so many FK errors during the replay: it was doing things in an almost random order!

So the fixes I've made are:

Generally the NLog usage seems a little strange: outputting exception.Message, exception.StackTrace, etc., and messages with parameters in-lined instead of passed as parameters. Generally you should only need to output the exception object, and have variables as parameters for better de-duplication of unique messages in some targets. I think some work on how the NLog output is written to files/displayed in the console would then mean these little hacks aren't required and the detail will be much richer. I'll try and take a look if I get more time; it isn't critical.

You're welcome to look at my fork now; I'll submit a PR once I finish testing.

spaghettidba commented 9 months ago

This is awesome! Thank you very much for all the effort that you put into contributing to the project!

Regarding your points, it would be great if I could trace the root cause of some rows being recorded with event_sequence = NULL. I am a bit concerned about sorting by starttime, since the column is not indexed. How is performance using this sort? Regarding NLog, I'm not sure I understand your point. Do you have an example?

Thanks again for your contributions!

a-teece commented 9 months ago

Regarding the perf, it takes ~30 minutes to read the "full" data into the data reader, i.e. SQLite applies the sort and loads the result into its temp file. Strangely this isn't actually slower... I wonder if the NULLs are causing a problem for an index on event_sequence?

Regarding NLog (and it needed an update to 4.5 to work with named variables instead of indexed ones), the best practice is to NOT do:

logger.Info($"This is my message for user {username}");

but instead DO:

logger.Info("This is my message for user {username}", username);

This means that the username is passed as a parameter to NLog, which can pass it on to the target. For the text writer you're using, it actually makes no difference really. But if you were to output to a more advanced target, or wanted to parse the text file into a tool such as Graylog, DataDog, etc., then having the variables separate from the message allows you to do things like "give me all the messages with the variable Username = 'Andrew'". By setting the layout of the text file target to include "withException=true", if you use logger.Error(exception, "message"); then you get all the details of the exception; no need to try and output different parts of it as text too.

I'm quite inexperienced with NLog directly, preferring Serilog... but they share the same concepts. If NLog is the same as Serilog, then best practice is also not to end log messages with a ".", unlike exception messages, which should always end with a ".".

I'm still working through some things, because when replaying my workload it ends up getting behind, so my new warning message "The last xxxxx Commands requested a delay but replay is behind were processed immediately which may indicate that the target system cannot keep up with the workload" triggers, and the SQL Server is pounded into 90% CPU usage! Even though the trace and replay are being done on identical-spec virtual infrastructure. So I suspect something is still going wrong with the "wait" timing code for this to be occurring.

I'll try and look in the trace to see if I can see anything common regarding the null event_sequence. The capture was done from SQL 2022, so I needed to use the fix from issue #134.

spaghettidba commented 9 months ago

Awesome, thanks for the feedback. Regarding NLog, I trust your experience. I'm a DBA, not a real developer, so you're the authority here. Very keen to see your PR!

a-teece commented 9 months ago

So all of the events with no event_sequence are of type 4, 5 or 6.

For PerformanceCounter(4), 100% of the rows have a null event_sequence. For timeout (5), 100% of the rows have a null event_sequence. For WAIT_stats (6), 100% of the rows have a null event_sequence.

FYI, I'm not sure there is a supporting index for event_sequence either; the existing index is on session_id ASC, event_sequence DESC. Adding an index on starttime, row_id makes the lazy loading pointless: the data reader opens almost instantly and SQLite doesn't copy everything to the temp file.
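
The index I added is essentially this (index name made up here; column names as referenced above, adjust to the actual schema):

CREATE INDEX IX_Events_starttime_rowid ON Events (starttime, row_id);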

spaghettidba commented 9 months ago

Ah, I see. It's by design: events of type 4,5 and 6 have no event_sequence. I really don't understand how this breaks the replay, because those events don't get replayed at all.

Thanks for looking at the indexes. So you mean that we don't need to do anything to the code if we add an index on starttime, row_id? That would be awesome!

a-teece commented 9 months ago

It breaks the sort because one of those, randomly, is replayed first. That sets the base startDate variable to a value that might be halfway through the workload.

The knock-on effect of that is that some events are incorrectly treated as overdue, while other events may sleep for a very long time.

Add an index on starttime, row_id and then change the code to sort by those, and all looks good.

My changes have ended up going off in all sorts of directions, though, so you'll have to go through the PR carefully to see if you agree. I'm still facing issues where the replay causes 100% CPU on the target SQL Server of identical spec to the source. Fixing the "sleep time" was only the first issue. I've also found the workload tool itself is wasting CPU cycles on spins, which cause the workload to get behind, which causes it to skip the delays, which causes the high CPU.

spaghettidba commented 9 months ago

Ahhhhhh!!!!!! So maybe I could skip those with a WHERE clause (they are not needed for the replay). So the index could be on event_type, start_time, row_id.
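
Roughly something like this, I mean (just a sketch, index name made up):

CREATE INDEX IX_Events_replay ON Events (event_type, start_time, row_id);

SELECT * FROM Events
WHERE event_type NOT IN (4, 5, 6)  -- these event types are never replayed
ORDER BY start_time, row_id;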

Regarding spins and sleeps, I had to add them in the code to avoid looping forever on the queue and burning CPU. Some async/await magic could probably fix the problem but, as I said, I'm a DBA pretending to be a programmer... Thanks again for your help!

a-teece commented 9 months ago

Would filtering them when loading them with the FileWorkloadListener be safe for all consumers?

For instance are they used if the listener is FileWorkloadListener and the consumer is Analysis?

I thought about switching to async, but that is a very significant rewrite. These days it is my starting point on new code, though: it is easy to include a sync method call in async code, but very hard to include an async call from synchronous code.

My "fix" for the infinate loop is to allow the ReplayWorker task to end when it's out of commands. The ReplayConsumer will then start a task back up again if new commands arrive for it, or it will eventally be removed as per your original code. TBH all I had to do was delete a bunch of things and whats left is simpler and faster.

spaghettidba commented 9 months ago

Great point about the other consumer types. I'm looking forward to seeing the PR!