RebeccaMahany opened this issue 4 months ago
Maybe related to https://github.com/kolide/launcher/issues/1442
Ooh, interesting. I also noticed that this latest report coincided with kolide_wmi logs, and we've previously flagged those queries as potentially not performant.
Findings thus far:
- I added logging for slow-running queries in https://github.com/kolide/launcher/pull/1823. What I found aligned with what seph saw when looking at Honeycomb: the problem doesn't appear to be the queries themselves, since even very simple queries could be very slow. Typically the wall_time was high while the system/user time was negligible, which points to time spent waiting rather than computing (see the timing sketch after this list).
- We theorized that the test machines themselves might be struggling, so I increased the size of the test VMs. However, this had no effect -- the test VMs still took an incredibly long time to move through their distributed queue.
- I looked at related traces this morning: `code.function:github.com/kolide/launcher/pkg/osquery.(*Extension).GetQueries` and `code.function:github.com/kolide/launcher/pkg/osquery.(*Extension).WriteResults`. Whenever these traces took a while, the time was spent almost entirely in the portion where they communicate with K2. However, that still wasn't consistently slow enough to explain the overall issue (see the span sketch after this list).
- Since we're also seeing an issue where the live queries created by the tests never have their results processed by K2, I checked whether K2 is slow to process distributed query results. It's possible they're processed slowly, but they do get processed, and we aren't seeing queue delays on the K2 side. This doesn't seem to be the explanation we're looking for.
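The slow-query logging mentioned in the first finding boils down to comparing wall-clock time against a threshold. Here's a minimal sketch of that idea, not launcher's actual implementation from the PR above -- `runQuery`, the threshold, and the example query are hypothetical stand-ins:

```go
package main

import (
	"log/slog"
	"os"
	"time"
)

// runQuery is a hypothetical stand-in for however the distributed query
// actually executes (e.g. over the osquery extension socket).
func runQuery(sql string) error {
	time.Sleep(10 * time.Millisecond) // placeholder work
	return nil
}

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stderr, nil))
	const slowThreshold = 2 * time.Second // hypothetical cutoff

	sql := "SELECT version FROM os_version"
	start := time.Now()
	err := runQuery(sql)
	elapsed := time.Since(start)

	// A high wall_time with negligible system/user time means the query
	// spent its life waiting on something, not computing.
	if elapsed > slowThreshold {
		logger.Warn("slow query",
			"sql", sql,
			"wall_time", elapsed.String(),
			"err", err,
		)
	}
}
```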
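On the trace breakdown: the `code.function` attributes above are OpenTelemetry semantic-convention attributes, and separating out the K2 portion of a trace comes down to wrapping just that round-trip in its own child span. A minimal sketch with the OTel Go API, where `fetchQueriesFromK2` is a hypothetical stand-in for the real transport call:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
)

// fetchQueriesFromK2 is a hypothetical stand-in for the transport call
// that performs the actual round-trip to K2.
func fetchQueriesFromK2(ctx context.Context) error {
	return nil
}

// getQueries wraps only the K2 round-trip in a child span, so the time
// spent talking to K2 shows up separately from the rest of the work.
func getQueries(ctx context.Context) error {
	ctx, span := otel.Tracer("launcher").Start(ctx, "transport.GetQueries")
	defer span.End()

	return fetchQueriesFromK2(ctx)
}

func main() {
	// With no TracerProvider configured this uses the no-op tracer,
	// which keeps the sketch runnable on its own.
	_ = getQueries(context.Background())
}
```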
The automated tests are flaky on Windows right now specifically because launcher does not receive and process a live query within the first 5 minutes after osquery starts during launcher startup. I've seen this failure pretty consistently in the tests, and have now seen a report of what seems to be a similar issue -- I think this is worth investigating.
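Conceptually, the failing condition amounts to a poll-until-deadline loop like the sketch below. Everything here is hypothetical rather than the tests' actual code: `liveQueryResultsReceived`, the polling interval, and the demo timeout (the real window is the 5 minutes described above).

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// liveQueryResultsReceived is a hypothetical check for whether launcher
// has picked up the live query and written results back.
func liveQueryResultsReceived() bool {
	return false
}

// waitForLiveQuery polls until the live query is processed or the
// deadline expires; the real tests use a 5-minute window.
func waitForLiveQuery(timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if liveQueryResultsReceived() {
			return nil
		}
		time.Sleep(interval)
	}
	return errors.New("live query not processed before deadline")
}

func main() {
	// Short demo durations; the tests described above would pass
	// 5*time.Minute as the timeout.
	if err := waitForLiveQuery(2*time.Second, 500*time.Millisecond); err != nil {
		fmt.Println(err) // this is the failure mode the flaky tests hit
	}
}
```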