Closed · bobheadxi closed 4 years ago
cc @daxmc99
@uwedeportivo would a query like the following (or similar) suffice?
(rate(container_fs_reads_total{name=~".*zoekt.*"}[5m]) + rate(container_fs_writes_total{name=~".*zoekt.*"}[5m]))
Metrics reference: https://github.com/google/cadvisor/blob/master/docs/storage/prometheus.md#prometheus-container-metrics
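If a query like that ends up on a dashboard, it could also be precomputed as a Prometheus recording rule so the dashboard doesn't re-evaluate it on every refresh. A minimal sketch; the group and record names here are made up:

```yaml
groups:
  - name: zoekt-disk-io                       # assumed group name
    rules:
      - record: zoekt:container_fs_ops:rate5m # hypothetical record name
        expr: |
          rate(container_fs_reads_total{name=~".*zoekt.*"}[5m])
            + rate(container_fs_writes_total{name=~".*zoekt.*"}[5m])
```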
Still unable to figure out what is up here - both container_fs_reads_total and container_fs_io_* are in the diskIO category, but we have the former and not the latter. Inspection of cAdvisor /validate doesn't turn up anything that looks suspicious, except:
```
Block device setup: [Supported, but not recommended]
None of the devices support 'cfq' I/O scheduler. No disk stats can be reported.
Disk "dm-0" Scheduler type "none".
Disk "sda" Scheduler type "mq-deadline".
Disk "sdb" Scheduler type "none".
Disk "sdc" Scheduler type "mq-deadline".
```
https://github.com/google/cadvisor/issues/1946 indicates that this is not a configuration issue.
Some other metrics I see we're missing:
container_fs_read_seconds_total
container_fs_write_seconds_total
A few of these missing stats seem to be collected only for cAdvisor itself.
Also of note: https://github.com/sourcegraph/deploy-sourcegraph-dogfood-k8s/pull/1580 doesn't seem to have made a difference - comparison of /validate on cAdvisor nodes in Cloud and k8s.sgdev.org indicates that cgroup stats are already inspected via the rootfs mount.
oops was accidentally closed
we might be hitting restrictions of the persistent volume. i'll bring it up in today's sync
@bobheadxi I can sync with you to help here
post-sync with @daxmc99: container_fs_io_* is per-node, which is why it only appears (seemingly) for cAdvisor pods - since cAdvisor is deployed as a DaemonSet, this was actually indicating that the metric is only available per-node. This is not super useful for us, since we want metrics per-service.
a useful approximation might be container_fs_reads_total and container_fs_writes_total, which are available per-service:
we could set up something like this for:
question: how do we set an alert on this? an idea is to "scale" the total read/write by repos cloned to get a number relative to deployment size, a la this query:
(sum by(name)(rate(container_fs_reads_total{name=~".*gitserver.*"}[5m])) + sum by(name)(rate(container_fs_writes_total{name=~".*gitserver.*"}[5m]))) / ignoring(name) group_left sum(src_gitserver_repo_cloned)
i don't really know what this number means yet, or if it is a useful measure, but @daxmc99 pointed out that just the shape of this chart seems to indicate some kind of abnormality going on with gitserver
cc @keegancsmith and @slimsag
We need a way to correlate that with a service doing something. @efritz did code intel deploy something new recently that would query gitserver a lot more and periodically?
An alert for this is tricky. I would suspect sustained high IO is a sign of under-provisioning or a runaway process. Sustained IO would also potentially be useful for an admin to know about, so I would make a very-hard-to-trigger alert in this case; otherwise there isn't really much action an admin can take. From the sourcegraph.com perspective we can make it a bit easier to fire, since we have more context to debug it.
What's recently? The only thing that hits gitserver with any fervor is the indexer (periodically every 30m) and the worker (on uploads, but only relative to the depth of the repo tree).
I'll merge https://github.com/sourcegraph/sourcegraph/pull/12379 just to see if it helps.
What's recently?
looks like it started ~8/11:
An alert for this is tricky. I would suspect sustained high IO is a sign of under provisioning / runaway process. Sustained IO would also potentially be useful for an admin to know about.
how does a sustained minimum per gitserver > x for y time sound? Something like this query:
min by(name)(sum by(name) (rate(container_fs_reads_total{name=~".*gitserver.*",name!~".*(POD|jaeger-agent).*"}[5m]) + rate(container_fs_writes_total{name=~".*gitserver.*",name!~".*(POD|jaeger-agent).*"}[5m])))
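Spelled out as a Prometheus alerting rule, the "sustained minimum > x for y time" idea might look like the sketch below. The alert name, the 2000 ops/s threshold, and the 30m duration are placeholders for illustration, not agreed values:

```yaml
groups:
  - name: gitserver-disk-io           # group name is an assumption
    rules:
      - alert: GitserverSustainedHighIO   # hypothetical alert name
        expr: |
          min by(name) (sum by(name) (
            rate(container_fs_reads_total{name=~".*gitserver.*",name!~".*(POD|jaeger-agent).*"}[5m])
            + rate(container_fs_writes_total{name=~".*gitserver.*",name!~".*(POD|jaeger-agent).*"}[5m])
          )) > 2000
        for: 30m        # "y time": only fire if the condition holds continuously
        labels:
          severity: warning
```

The `for:` clause is what makes this "hard to trigger": brief IO spikes won't fire it, only sustained load does.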
From sourcegraph.com perspective we can make it a bit easier to fire since we have more context to debug it.
🤔 this is a separate topic, but since we don't have support for custom thresholds (for example, generating different thresholds for dot-com) we should either:
Yeah, this is tough. We want to make it easy for anyone to add alerts to sourcegraph.com without worrying about the impact on customer instances, but if we do that, customer alerts end up becoming stagnant. I think the same applies to the latter observation about "two observables". A label for "large instance" metrics makes sense to me: if someone is having issues on a smaller instance they can still see what the large-instance alerts are, but those alerts aren't normally visible/firing.
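In plain Prometheus terms, one way to express the "large instance" idea would be a label on the alert itself, which Alertmanager routing can then match on. Purely a sketch; the label name is invented:

```yaml
# Sketch: tag alerts intended for large deployments (e.g. sourcegraph.com)
# so Alertmanager routes can silence or deprioritize them elsewhere.
- alert: GitserverSustainedHighIO   # hypothetical alert name
  expr: |
    min by(name) (sum by(name) (
      rate(container_fs_reads_total{name=~".*gitserver.*",name!~".*(POD|jaeger-agent).*"}[5m])
      + rate(container_fs_writes_total{name=~".*gitserver.*",name!~".*(POD|jaeger-agent).*"}[5m])
    )) > 2000
  for: 30m
  labels:
    instance_size: large   # invented label; smaller instances route on it to mute the alert
```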
Your suggested query sounds good, but it looks like it would trigger a bunch right now! I think the correct course of action is for someone to own finding out why gitserver has such elevated IO levels at the moment; then we can either fix the problem or adjust the alert so it doesn't fire. For now, if we have an alert along these lines it should be silenced.
In our next hack session @bobheadxi we can maybe try to work out which repo is causing this pattern of access via Honeycomb. I would suspect the IO access patterns correlate with the size of the HTTP response, so that will likely tell us the problem (without looking, I bet we are constantly trying to fetch the ghdump repo or something).
Dear all,
This is your release captain speaking. 🚂🚂🚂
Branch cut for the 3.19 release is scheduled for tomorrow.
Is this issue / PR going to make it in time? Please change the milestone accordingly. When in doubt, reach out!
Thank you
Note: this was resolved in https://github.com/sourcegraph/sourcegraph/pull/13062 by switching to alternative metrics that do seem to be available.
IO metrics have been requested a few times (eg); this should be possible via container_fs_io_* from cAdvisor, but I think we're missing some mounts (cgroup?): https://github.com/google/cadvisor/issues/2174#issuecomment-467133687 https://github.com/google/cadvisor/issues/1718#issuecomment-363495282
Missing metrics seem to be from the diskIO group in https://github.com/google/cadvisor/blob/master/docs/storage/prometheus.md#prometheus-container-metrics