grafana / alloy

OpenTelemetry Collector distribution with programmable pipelines
https://grafana.com/oss/alloy
Apache License 2.0

Improve operational confidence #253

Open tpaschalis opened 9 months ago

tpaschalis commented 9 months ago

We should build on our dogfooding experience to improve the operational confidence of users and help them run large-scale Grafana Agent deployments without worries.

### Tasks
- [ ] Add mixin dashboards for most common component namespaces 
- [ ] Ensure all mixin dashboards are consistent  
- [ ] Create opinionated set of alerts for grafana-agent-flow mixin 
- [ ] Write runbooks for mixin alerts

agologan commented 9 months ago

Grafana Agent is very versatile and can replace most, if not all, of our telemetry needs. Flow in particular makes it seem that you can drop all your use cases into a single River configuration, when in fact I'm starting to feel that you should probably run at least two deployments (one DaemonSet and one StatefulSet), as some components work better in one deployment type than the other.

I would appreciate some guidance in the docs on this (per-component topology recommendations) so others can avoid killing multiple nodes like I did by running the default DaemonSet configuration with `prometheus.operator.servicemonitors` in the River configuration, without host filtering and without limits.

Later edit: A friend pointed out k8s-monitoring-helm, which is a great example of running multiple agent deployments to cover different use cases.

ptodev commented 8 months ago

Hi @agologan 👋 Do you mind raising a separate issue please? I suppose the issue should be for enhancing the Deploy doc. Please label it as type/docs so that our docs team can track it. This issue here is not so much for docs - it's more for dashboards, alerts, and runbooks.

ptodev commented 8 months ago

@tpaschalis @rfratto I'd be happy to help with this issue, but I'll need more information on what we need to improve. The issue description is quite broad. I don't know what the highest priority issues are.

agologan commented 8 months ago

After some careful consideration, I have marked my comment above as off-topic. While not completely irrelevant, my testimony is not very actionable, and I wouldn't want to waste maintainers' time with it unless there's a widespread indication that the docs need improving.

rfratto commented 8 months ago

> I'd be happy to help with this issue, but I'll need more information on what we need to improve. The issue description is quite broad. I don't know what the highest priority issues are.

@ptodev I've added some extra information in a task list, but it is admittedly still vague. I will be spending time soon to create a more concrete list of tasks.

github-actions[bot] commented 7 months ago

This issue has not had any activity in the past 30 days, so the needs-attention label has been added to it. If the opened issue is a bug, check to see if a newer release fixed your issue. If it is no longer relevant, please feel free to close this issue. The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your issue will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity. Thank you for your contributions!