opserver / Opserver

Stack Exchange's Monitoring System
https://opserver.github.io/Opserver/
MIT License

SignalFx Dashboard Module #376

deanward81 closed this 4 years ago

deanward81 commented 4 years ago

Adds a SignalFxDataProvider that renders sparklines, etc. It uses the SignalFlow socket API for metrics and the REST API to retrieve metadata.

SignalFx does not provide 100% of the data that Bosun used to give us (e.g. it doesn't have detailed hardware information), so we're just providing metrics and minimal metadata here.

TODO

Once those are completed, we should be good to merge this in.

deanward81 commented 4 years ago

@NickCraver OK, I think everything is ready to rock with the exception of incidents. Suggest we tackle those in a separate PR; SignalFx's API docs on that front are a little lacking.

I'm gonna spend a little moar time documenting the code, but functionality-wise I think it's complete.

NickCraver commented 4 years ago

Awesome! Poking at it locally and pushing a few fixes... and this cracked me up, the K8s nodes: [image] (that's never worked before - maybe we should limit the dashboard to like 5-10 or something?)

NickCraver commented 4 years ago

@deanward81 looking awesome! I'm seeing a lot of n/a on the dashboard, which we can dig into today or whenever. It's happening on the web tier and such as well, which makes me think it's happening randomly after some limit cutoff in the API, so we're not getting values for those servers. Here's local: [image] ...but checking the metric directly (e.g. memory.used), the data is there, so it smells like an API limit of the first n results, maybe? [image]

Not a problem, but a possible enhancement on the data fetch: locally I'm using this excludePattern as a first pass:

{
  "Modules": {
    "Dashboard": {
      "providers": {
        "signalfx": {
          "realm": "us1",
          "accessToken": "..."
        }
      },
      "excludePattern": ".*\\.ds\\.stackexchange\\.com|\\.k8s",
      "categories": [
        {
          "name": "Database Servers", // Name for this group of servers
          "pattern": "-sql", // Regex pattern of server names to put in this group
          "cpuWarningPercent": 20,
          "cpuCriticalPercent": 60,
          "memoryWarningPercent": 98,
          "memoryCriticalPercent": 99.2,
          "primaryInterfacePattern": "-TEAM$"
        },
        {
          "name": "Web Servers",
          "pattern": "-web|-promoweb|-vmweb",
          "cpuWarningPercent": 25,
          "memoryWarningPercent": 75,
          "primaryInterfacePattern": "-TEAM$|-TEAM · Local"
        }
      ]
    }
  }
}

...any chance their API supports host exclusions? It would just mean fetching a lot less data, etc. The *.ds.stackexchange.com entries are unfortunately all duplicates after some option change, so I'm wondering if we can filter them out before they reach the client side - was gonna dig into the API docs a bit this morning.
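For reference, the client-side behavior that the config above describes can be sketched roughly as follows. This is only an illustration in Python, not Opserver's actual code (which is .NET): the two regexes are copied from the config, while the host names, the `filter_and_group` helper, the "Other" bucket, and the choice of unanchored, first-match-wins matching are all assumptions.

```python
import re

# Patterns copied from the config above (JSON escaping removed).
EXCLUDE_PATTERN = re.compile(r".*\.ds\.stackexchange\.com|\.k8s")
CATEGORIES = [
    ("Database Servers", re.compile(r"-sql")),
    ("Web Servers", re.compile(r"-web|-promoweb|-vmweb")),
]


def filter_and_group(hosts):
    """Drop excluded hosts, then bucket the rest by the first matching category.

    Assumption: patterns are applied as unanchored searches and the first
    matching category wins; unmatched hosts land in a hypothetical "Other".
    """
    groups = {name: [] for name, _ in CATEGORIES}
    groups["Other"] = []
    for host in hosts:
        if EXCLUDE_PATTERN.search(host):
            continue  # excluded hosts never reach the dashboard
        for name, pattern in CATEGORIES:
            if pattern.search(host):
                groups[name].append(host)
                break
        else:
            groups["Other"].append(host)
    return groups


# Made-up host names, purely for illustration.
hosts = [
    "ny-sql01",
    "ny-web01",
    "ny-web01.ds.stackexchange.com",  # duplicate entry, excluded
    "node1.k8s",                      # Kubernetes node, excluded
    "ny-redis01",
]
groups = filter_and_group(hosts)
```

Filtering this way still downloads data for every host before discarding the excluded ones, which is why a server-side exclusion would mean fetching less.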

deanward81 commented 4 years ago

@NickCraver hmmm, yeh I see the same. Pretty sure this isn’t any kind of throttling; the SignalFlow docs make no mention of such things. I’ll dig a little and see what’s going on!

Re. exclusions: we can absolutely do it on their side, but... the exclusion list for prod is kind of silly, so we'd need to see if we can send a regex up for the filtering!
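One way the server-side option might look, sketched in Python under several assumptions: that the SignalFlow program text accepts `filter(dimension, term, ...)` with glob-style `*` wildcards and `not` negation (to my knowledge it takes globs rather than regexes, but this should be checked against the SignalFlow docs), that hosts are identified by a `host` dimension, and that wildcards are accepted at both ends of a term. `build_program` is a hypothetical helper, not part of this PR, and it only builds the program string; it does not talk to SignalFx.

```python
def build_program(metric, dimension, exclude_globs):
    """Return SignalFlow program text that drops series matching any glob.

    Assumptions: filter() accepts several terms (OR-ed together) and can be
    negated with `not`. Terms are not escaped, so they must not contain
    single quotes.
    """
    if not exclude_globs:
        return f"data('{metric}').publish()"
    terms = ", ".join(f"'{glob}'" for glob in exclude_globs)
    return (
        f"data('{metric}', "
        f"filter=not filter('{dimension}', {terms})).publish()"
    )


# Glob approximations of the excludePattern regex from the config above.
program = build_program(
    "memory.used",
    "host",
    ["*.ds.stackexchange.com", "*.k8s*"],
)
print(program)
```

If globs turn out to be all that is supported, a long production exclusion list would have to be expressed as a list of such terms rather than a single regex, which is the awkwardness raised above.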