Closed · bobheadxi closed 4 years ago
cc @daxmc99
@uwedeportivo would a query like the following (or similar) suffice?
(rate(container_fs_reads_total{name=~".*zoekt.*"}[5m]) + rate(container_fs_writes_total{name=~".*zoekt.*"}[5m]))
Metrics reference: https://github.com/google/cadvisor/blob/master/docs/storage/prometheus.md#prometheus-container-metrics
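If a query like that ends up on a dashboard, it could also be precomputed as a Prometheus recording rule so the dashboard doesn't re-evaluate it on every refresh. A minimal sketch; the group and record names here are made up:

```yaml
groups:
  - name: zoekt-disk-io                       # assumed group name
    rules:
      - record: zoekt:container_fs_ops:rate5m # hypothetical record name
        expr: |
          rate(container_fs_reads_total{name=~".*zoekt.*"}[5m])
            + rate(container_fs_writes_total{name=~".*zoekt.*"}[5m])
```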
Still unable to figure out what is up here - both container_fs_reads_total and container_fs_io_* are in the diskIO category, but we have the former and not the latter. Inspection of cAdvisor /validate doesn't turn up anything that looks suspicious, except:
```
Block device setup: [Supported, but not recommended]
None of the devices support 'cfq' I/O scheduler. No disk stats can be reported.
Disk "dm-0" Scheduler type "none".
Disk "sda" Scheduler type "mq-deadline".
Disk "sdb" Scheduler type "none".
Disk "sdc" Scheduler type "mq-deadline".
```
https://github.com/google/cadvisor/issues/1946 indicates that this is not a configuration issue.
Some other metrics I see we're missing:
container_fs_read_seconds_total
container_fs_write_seconds_total
A few of these missing stats seem to be collected only for cAdvisor itself.
Also of note: https://github.com/sourcegraph/deploy-sourcegraph-dogfood-k8s/pull/1580 doesn't seem to have made a difference - comparison of /validate on cAdvisor nodes in Cloud and k8s.sgdev.org indicates that cgroup stats are already inspected via the rootfs mount.
oops was accidentally closed
we might be hitting restrictions of the persistent volume. i'll bring it up in today's sync
@bobheadxi I can sync with you to help here
post-sync with @daxmc99: container_fs_io_* is per-node, which is why it only appears (seemingly) for cAdvisor pods - since cAdvisor is deployed as a DaemonSet, this was actually indicating that the metric is only available per-node. This is not super useful for us, since we want metrics per-service.
a useful approximation might be container_fs_reads_total and container_fs_writes_total, which are available per-service:
we could set up something like this for:
question: how do we set an alert on this? an idea is to "scale" the total read/write by repos cloned to get a number relative to deployment size, a la this query:
(sum by(name)(rate(container_fs_reads_total{name=~".*gitserver.*"}[5m])) + sum by(name)(rate(container_fs_writes_total{name=~".*gitserver.*"}[5m]))) / ignoring(name) group_left sum(src_gitserver_repo_cloned)
i don't really know what this number means yet, or if it is a useful measure, but @daxmc99 pointed out that just the shape of this chart seems to indicate some kind of abnormality going on with gitserver
cc @keegancsmith and @slimsag
We need a way to correlate that with a service doing something. @efritz did code intel deploy something new recently that would query gitserver a lot more and periodically?
An alert for this is tricky. I would suspect sustained high IO is a sign of under-provisioning or a runaway process. Sustained IO would also potentially be useful for an admin to know about, so I would make a very-hard-to-trigger alert in this case; otherwise there isn't really much action an admin can take. From the sourcegraph.com perspective we can make it a bit easier to fire, since we have more context to debug it.
What's recently? The only thing that hits gitserver with any fervor is the indexer (periodically every 30m) and the worker (on uploads, but only relative to the depth of the repo tree).
I'll merge https://github.com/sourcegraph/sourcegraph/pull/12379 just to see if it helps.
What's recently?
looks like it started ~8/11:
An alert for this is tricky. I would suspect sustained high IO is a sign of under provisioning / runaway process. Sustained IO would also potentially be useful for an admin to know about.
how does a sustained minimum per gitserver > x for y time sound? Something like this query:
min by(name)(sum by(name) (rate(container_fs_reads_total{name=~".*gitserver.*",name!~".*(POD|jaeger-agent).*"}[5m]) + rate(container_fs_writes_total{name=~".*gitserver.*",name!~".*(POD|jaeger-agent).*"}[5m])))
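Spelled out as a Prometheus alerting rule, the "sustained minimum > x for y time" idea might look like the sketch below. The alert name, the 2000 ops/s threshold, and the 30m duration are placeholders for illustration, not agreed values:

```yaml
groups:
  - name: gitserver-disk-io           # group name is an assumption
    rules:
      - alert: GitserverSustainedHighIO   # hypothetical alert name
        expr: |
          min by(name) (sum by(name) (
            rate(container_fs_reads_total{name=~".*gitserver.*",name!~".*(POD|jaeger-agent).*"}[5m])
            + rate(container_fs_writes_total{name=~".*gitserver.*",name!~".*(POD|jaeger-agent).*"}[5m])
          )) > 2000
        for: 30m        # "y time": only fire if the condition holds continuously
        labels:
          severity: warning
```

The `for:` clause is what makes this "hard to trigger": brief IO spikes won't fire it, only sustained load does.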
From sourcegraph.com perspective we can make it a bit easier to fire since we have more context to debug it.
🤔 this is a separate topic, but since we don't have support for custom thresholds (for example, generating different thresholds for dot-com) we should either:
Yeah, this is tough. We want to make it easy for anyone to add alerts to sourcegraph.com without worrying about the impact on customer instances, but if we do that, customer alerts end up becoming stagnant. I think the same applies to the latter observation about "two observables". A label for "large instance" metrics makes sense to me: if someone is having issues on a smaller instance they can still see what the large-instance alerts are, but those alerts aren't normally visible/firing.
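In plain Prometheus terms, one way to express the "large instance" idea would be a label on the alert itself, which Alertmanager routing can then match on. Purely a sketch; the label name is invented:

```yaml
# Sketch: tag alerts intended for large deployments (e.g. sourcegraph.com)
# so Alertmanager routes can silence or deprioritize them elsewhere.
- alert: GitserverSustainedHighIO   # hypothetical alert name
  expr: |
    min by(name) (sum by(name) (
      rate(container_fs_reads_total{name=~".*gitserver.*",name!~".*(POD|jaeger-agent).*"}[5m])
      + rate(container_fs_writes_total{name=~".*gitserver.*",name!~".*(POD|jaeger-agent).*"}[5m])
    )) > 2000
  for: 30m
  labels:
    instance_size: large   # invented label; smaller instances route on it to mute the alert
```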
Your suggested query sounds good, but it looks like it would trigger a bunch right now! I think the correct course of action is for someone to own finding out why gitserver has such elevated IO levels at the moment; then we can either fix the problem or adjust the alert so it doesn't fire. For now, if we have an alert along these lines it should be silenced.
In our next hack session @bobheadxi we can maybe try to work out which repo is causing this pattern of access via Honeycomb. I would suspect the IO access patterns correlate with the size of the HTTP response, so that will likely tell us the problem (without looking, I bet we are constantly trying to fetch the ghdump repo or something).
Dear all,
This is your release captain speaking. 🚂🚂🚂
Branch cut for the 3.19 release is scheduled for tomorrow.
Is this issue / PR going to make it in time? Please change the milestone accordingly. When in doubt, reach out!
Thank you
Note: this was resolved in https://github.com/sourcegraph/sourcegraph/pull/13062 by switching to alternative metrics that do seem to be available.
IO metrics have been requested a few times (eg); this should be possible via container_fs_io_* from cAdvisor, but I think we're missing some mounts (cgroup?): https://github.com/google/cadvisor/issues/2174#issuecomment-467133687 https://github.com/google/cadvisor/issues/1718#issuecomment-363495282
Missing metrics seem to be from the diskIO group in https://github.com/google/cadvisor/blob/master/docs/storage/prometheus.md#prometheus-container-metrics