googleforgames / agones

Dedicated Game Server Hosting and Scaling for Multiplayer Games on Kubernetes
https://agones.dev
Apache License 2.0

Metrics: Export to Stackdriver is not working #1330

Closed aLekSer closed 4 years ago

aLekSer commented 4 years ago

No Stackdriver metrics are exported because of an error in the resource labels:

Failed to export to Stackdriver: rpc error: code = InvalidArgument
 desc = One or more TimeSeries could not be written:
 Unrecognized resource label: instance_id: timeSeries[2,3,12];
 Unrecognized resource label: namespace_id: timeSeries[1,11];
 Unrecognized resource label: pod_id: timeSeries[0,5,7];
 Unrecognized resource label: zone: timeSeries[4,6,8-10]

What happened:

There are no Stackdriver metrics on the dashboard, which was working several months ago. New errors appear in the Agones Controller logs.

What you expected to happen: Stackdriver metrics are correctly visualised.

How to reproduce it (as minimally and precisely as possible): https://agones.dev/site/docs/guides/metrics/#stackdriver-installation

What should be done to fix an issue

Anything else we need to know?: There are two pull requests that resolve the issue described above. They contain fixes for:

func getMonitoredResource(projectID string) (*monitoredres.MonitoredResource, error) {
...
}

Environment:

aLekSer commented 4 years ago

We can switch from the getMonitoredResource() function to monitoredresource.Autodetect() after updating OpenCensus to version 0.22 and contrib.go.opencensus.io/exporter/stackdriver to v0.12.0, as is done in the latest example here: https://github.com/census-ecosystem/opencensus-go-exporter-stackdriver/blob/6ee7f9652d2a9e707fea22c56d06235db6289426/examples/stats/main.go#L51

markmandel commented 4 years ago

@bbf Have you seen this on recent releases?

bbf commented 4 years ago

While I have not tested any recent releases, I can imagine why a few things stopped working. I'm very interested in overhauling the Stackdriver integration of Agones, so if possible give me some time to look into it.

It was already in my plans to propose some changes to have a better alignment between Agones and the new monitoring agent used by GKE on Stackdriver, so addressing that while fixing this bug might be ideal.

@markmandel / @aLekSer WDYT?

aLekSer commented 4 years ago

Hello, I got it working yesterday by updating the exporter's monitored resource. I will send a PR; it involves updating OpenCensus to 0.22, and that update has slowed me down a bit. @bbf I will send a draft PR soon so you can review it.

aLekSer commented 4 years ago

Switching to Autodetect() did not work either, even with recent versions of OpenCensus and stackdriver-exporter: https://github.com/census-ecosystem/opencensus-go-exporter-stackdriver/blob/master/monitoredresource/gcp_metadata_config.go#L100 I will rewrite getMonitoredResource() as a quick fix. After that, we need to understand why Autodetect():

    resT, lab :=  monitoredresource.Autodetect().MonitoredResource()
    logger.Info("Monitored Resource: ", resT, " ", lab)

returns the following on the GKE test cluster:

Monitored Resource: gke_container map[cluster_name:test-cluster container_name:agones-controller instance_id:1205178163407041488 namespace_id: pod_id:agones-controller-59bd95c448-dwp88 project_id:agones-alexander zone:us-west1-c]

The resource type that works is k8s_container, as in the upcoming PR.
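The labels rejected in the error message (instance_id, namespace_id, pod_id, zone) are those of the legacy gke_container resource type, whereas k8s_container defines project_id, location, cluster_name, namespace_name, pod_name and container_name. A minimal sketch of translating one set into the other follows. The monitoredResource struct is a local stand-in for the real MonitoredResource message so the example runs with the standard library only, and both toK8sContainer and the label mapping are assumptions made for illustration, not the code from the PR:

```go
package main

import "fmt"

// monitoredResource is a local stand-in for the MonitoredResource message
// (a resource type plus string labels). It is NOT the real genproto type.
type monitoredResource struct {
	Type   string
	Labels map[string]string
}

// gkeToK8sLabel maps legacy gke_container label names to k8s_container
// label names. The mapping is an assumption for illustration; instance_id
// has no counterpart in k8s_container and is dropped.
var gkeToK8sLabel = map[string]string{
	"project_id":     "project_id",
	"cluster_name":   "cluster_name",
	"container_name": "container_name",
	"zone":           "location",
	"namespace_id":   "namespace_name",
	"pod_id":         "pod_name",
}

// toK8sContainer (hypothetical helper) builds a k8s_container resource
// from a set of gke_container labels.
func toK8sContainer(legacy map[string]string) monitoredResource {
	labels := make(map[string]string, len(gkeToK8sLabel))
	for oldName, newName := range gkeToK8sLabel {
		labels[newName] = legacy[oldName]
	}
	return monitoredResource{Type: "k8s_container", Labels: labels}
}

func main() {
	// Label values copied from the Autodetect() output quoted above.
	legacy := map[string]string{
		"cluster_name":   "test-cluster",
		"container_name": "agones-controller",
		"instance_id":    "1205178163407041488",
		"namespace_id":   "",
		"pod_id":         "agones-controller-59bd95c448-dwp88",
		"project_id":     "agones-alexander",
		"zone":           "us-west1-c",
	}
	res := toK8sContainer(legacy)
	fmt.Println(res.Type, res.Labels)
}
```

Note that namespace_id is empty in the autodetected output, so a real fix would still need another source for the namespace, which is outside the scope of this sketch.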

aLekSer commented 4 years ago

We also see errors from the Prometheus exporter:

textPayload: "2020/02/07 15:14:14 Failed to export to Prometheus: inconsistent label cardinality: expected 1 label values but got 0 in []string(nil)

This appears to be https://github.com/census-instrumentation/opencensus-go/issues/659, which has a fix in https://github.com/census-instrumentation/opencensus-go/pull/989
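For readers unfamiliar with this message: it reports that a metric declared with one label name was recorded with zero label values. The check can be sketched as below. checkCardinality and the label name "type" are hypothetical and only mirror the wording of the log line above; this is not the Prometheus client's or OpenCensus's actual code:

```go
package main

import "fmt"

// checkCardinality (hypothetical helper) returns an error when the number
// of label values does not match the number of declared label names.
func checkCardinality(labelNames, labelValues []string) error {
	if len(labelNames) != len(labelValues) {
		return fmt.Errorf(
			"inconsistent label cardinality: expected %d label values but got %d in %#v",
			len(labelNames), len(labelValues), labelValues)
	}
	return nil
}

func main() {
	// One declared label name but no values: the failing case from the log.
	fmt.Println(checkCardinality([]string{"type"}, nil))
	// One name and one value: consistent, so no error is returned.
	fmt.Println(checkCardinality([]string{"type"}, []string{"Ready"}))
}
```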

markmandel commented 4 years ago

I defer these things to you two :smile: my knowledge of metrics is very low.

I definitely advocate for a working solution :grin:

markmandel commented 4 years ago

@cyriltovena have you got any feedback here?

aLekSer commented 4 years ago

On master, Prometheus currently works, but this error message still appears in the Agones Controller logs. PR #1335 makes Stackdriver work again. The update to OpenCensus 0.22 can be done after this fix, to split up the work. I considered doing the update in the same PR, but, as in #893, all tests would need to be updated.

markmandel commented 4 years ago

Is this fixed now?

aLekSer commented 4 years ago

Stackdriver will be fixed once the PR is merged. I am now capturing screenshots from Grafana to compare with the earlier ones made by @cyriltovena as part of #1479.