sosreport / sos

A unified tool for collecting system logs and other debug information
http://sos.rtfd.org
GNU General Public License v2.0

Can we have busy disk stats? #2134

Open pponnuvel opened 4 years ago

pponnuvel commented 4 years ago

Can we have a plugin for getting info on disk I/O (or how busy disks are)?

It's often useful to know how loaded the disks are when debugging "slow" I/O activities. As far as I can tell, this information isn't available in sosreports via any other means right now.

Something like iostat -x is what I am thinking of. Thoughts?

bmr-cymru commented 4 years ago

Currently if either sar or pcp is installed and configured we will collect some of the historic performance data they record; this may or may not be useful for you depending on how they have been set up and the specific metrics you care about.

The problem with running tools like iostat (or generally xstat commands that have an interval/count sampling model) is that the first iteration of data is misleading (data since boot) so we can't simply fire the command and gather whatever comes out. There's also a problem here with how we currently schedule command execution in sos: if a report contained multiple xstat invocations they would all be on a different timescale - potentially with significant skew between the different start times.

We did discuss an API for collecting performance data that would allow these to all be started roughly together, but that was some time ago and it didn't get past the discussion stage (iirc folks were telling us they preferred to use things like collectl or pcp for that). It's something that could be revisited if there's enough interest; the team is currently working on sos-4.0, so it may come down to whether or not there's enough space in the schedule. @TurboTurtle would be the best person to know that.

TurboTurtle commented 4 years ago

We (well, @pmoravec) overhauled the pcp collection not too long ago, as that's been on the rise in popularity (at least in the RH family distros) to use for performance data collection. As @bmr-cymru mentions though it is not guaranteed to have the needed data all the time.

Is there specific data within iostat that default/standard pcp configurations don't capture for you? I'm just trying to gauge if this is "needed" data or "helpful but we also have this other thing that provides similar information".

As for timelines with 4.0, I don't initially think there'd be enough time to make API level changes for standardizing performance metrics, but we'd need to (re-)discuss that in depth. 4.0 closes in August, so it'd be tight regardless. 4.1 may be more reasonable which would be 6 months after 4.0, so in the Feb 2021 window.

TurboTurtle commented 4 years ago

Of course if this is absolutely needed data, we could also simply fire off an initial iostat command within the plugin, throw it away, and then capture a second "real" run of the command. We'd run into the timescale differences that Bryn mentions, but perhaps that's not a huge issue to the engineers digging through sosreports.
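For illustration, the "throw the first run away" idea could look roughly like the sketch below. It is not sos plugin code: `split_reports`, `second_report` and `collect_iostat` are hypothetical helper names, and it assumes that each report in `iostat -d -x` output starts with a line beginning with `Device`. Newer sysstat releases also document an `iostat -y` option that omits the since-boot report, which may make the discard step unnecessary where it is available.

```python
import subprocess


def split_reports(output, marker="Device"):
    """Split iostat-style output into reports.

    Assumption: every report starts with a line beginning with `marker`
    (true for `iostat -d -x` on the sysstat versions I have seen).
    Anything before the first marker (the "Linux ..." banner) is dropped.
    """
    reports, current = [], None
    for line in output.splitlines():
        if line.startswith(marker):
            current = []
            reports.append(current)
        if current is not None:
            current.append(line)
    return ["\n".join(r).rstrip() for r in reports]


def second_report(output, marker="Device"):
    """Return the second report (the first real interval), or None."""
    reports = split_reports(output, marker)
    return reports[1] if len(reports) > 1 else None


def collect_iostat(interval=1):
    """Run iostat for two reports and keep only the second one.

    The first report covers the time since boot, so it is discarded.
    """
    out = subprocess.run(
        ["iostat", "-d", "-x", str(interval), "2"],
        capture_output=True, text=True, check=True,
    ).stdout
    return second_report(out)
```

Calling `collect_iostat()` on a host with sysstat installed should return only the interval-based device table; the parsing helpers can be exercised on canned output without iostat being present.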

bmr-cymru commented 4 years ago

I think the ideal would be to collect a single run of the command, but for multiple intervals (either a set number of intervals, or however many intervals complete by the time we're finished with collect(), for example).

I think one reason I shied away from the performance counter command API was that it would introduce a need to schedule things across plugins: I'm still inclined to think that this is a bit much work for 4.0, but I do see a way for it to work that could maybe land in a subsequent release.

The sort of thing I'm thinking of is a global list of periodic performance counter collecting commands: plugins would add those with something like:

    def add_stat_cmd(self, cmd, interval=1, count=None):
        ...  # proposed signature only, not an existing sos API

    [...]
        self.add_stat_cmd("iostat -x", interval=1)

Then, at the end of setup() / transition to collect() we kick all of these commands off as quickly as possible in separate threads. They then either run to completion if count is not None, or we kill them at the end of collect(). That way everything is on roughly the same timebase - it's not as good a method as say PCP uses for its sampling timelines but it would mean that you could e.g. fire off iostat, mpstat, vmstat, cifsiostat and others that obey the command [args] interval count convention from different plugins and still expect reasonably comparable data from them.
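A very rough sketch of that scheduling idea, purely to make the shape concrete. Nothing here exists in sos: `StatCollector`, `start_all` and `finish_all` are made-up names, and instead of separate threads it simply launches each command with `subprocess.Popen`, which already runs them concurrently. Output goes to temporary files so a chatty command cannot block on a full pipe.

```python
import shlex
import subprocess
import tempfile


class StatCollector:
    """Hypothetical global registry of periodic stat commands."""

    def __init__(self):
        self.pending = []   # (argv, count) registered during setup()
        self.running = []   # (argv, count, process, output file)

    def add_stat_cmd(self, cmd, interval=1, count=None):
        # Follow the "command [args] interval count" convention.
        argv = shlex.split(cmd) + [str(interval)]
        if count is not None:
            argv.append(str(count))
        self.pending.append((argv, count))
        return argv

    def start_all(self):
        # Launch everything back to back so the samplers share
        # roughly the same timebase. A missing binary raises
        # FileNotFoundError here; real code would need to handle that.
        for argv, count in self.pending:
            out = tempfile.TemporaryFile(mode="w+")
            proc = subprocess.Popen(argv, stdout=out,
                                    stderr=subprocess.DEVNULL)
            self.running.append((argv, count, proc, out))
        self.pending = []

    def finish_all(self):
        # Called at the end of collect(): open-ended samplers are
        # terminated, counted ones are left to run to completion.
        results = {}
        for argv, count, proc, out in self.running:
            if count is None:
                proc.terminate()
            proc.wait()
            out.seek(0)
            results[" ".join(argv)] = out.read()
            out.close()
        self.running = []
        return results
```

Plugins would call `add_stat_cmd("iostat -x", interval=1)` during setup; the framework would call `start_all()` once on the transition to collect() and `finish_all()` when collection ends. The since-boot first sample would still be in each output and would need to be dropped or flagged by whoever reads it.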

This would bring the bigger picture perf stuff more "in-house" and would mean we could get something useful without full-blown frameworks like PCP or SAR (ime many systems do have sysstat etc. installed).

pponnuvel commented 4 years ago

> Is there specific data within iostat that default/standard pcp configurations don't capture for you? I'm just trying to gauge if this is "needed" data or "helpful but we also have this other thing that provides similar information".

It belongs to the latter category. After a closer look, I realize that sar collects similar information. I think sar is more likely to be available than pcp on most systems, as I don't see pcp data as often. So I like Bryn's idea, with its focus on collecting perf data.