kolide / launcher

Osquery launcher, autoupdater, and packager
https://kolide.com/launcher

Queries may be slow to run on Windows in the first ~5-10 minutes after launcher startup #1784

Open RebeccaMahany opened 1 month ago

RebeccaMahany commented 1 month ago

The automated tests are currently flaky on Windows, specifically because launcher does not receive and process a live query within 5 minutes of osquery starting at launcher startup. I've seen this pretty consistently in the tests, and have now seen a report of an issue that seems very similar, so I think this is worth investigating.

directionless commented 1 month ago

Maybe related to https://github.com/kolide/launcher/issues/1442

RebeccaMahany commented 1 month ago

Ooh, interesting. I also noticed that this last report coincided with kolide_wmi logs, and I know we've flagged those as potentially not performant before.

RebeccaMahany commented 3 weeks ago

Findings thus far:

I added logging for slow-running queries in https://github.com/kolide/launcher/pull/1823. What I found aligned with what seph saw when looking at Honeycomb: the problem doesn't seem to be with the queries themselves. Even very simple queries could be very slow. Also, wall_time was typically high, while system/user time was negligible.

We theorized that the test machines themselves might be struggling, so I increased the size of the test VMs. However, this had no effect -- the test VMs still took an incredibly long time to move through their distributed queue.

I looked at related traces this morning -- code.function:github.com/kolide/launcher/pkg/osquery.(*Extension).GetQueries and code.function:github.com/kolide/launcher/pkg/osquery.(*Extension).WriteResults. I mostly saw that whenever these traces took a while, it was almost entirely during the portion where they communicate with K2. However, that still wasn't consistently slow enough to explain this overall issue.

RebeccaMahany commented 3 weeks ago

I tested whether distributed query results are slow to be processed by K2, since we're also seeing an issue where the live queries created by the tests never have their results processed by K2. The distributed query results may be slow to process, but they do get processed, and we aren't seeing queue delays on the K2 side. This doesn't seem to be the explanation we're looking for.