That PR does not support GKE, but this PR https://github.com/GoogleCloudPlatform/opentelemetry-operations-python/pull/24 in the new Google repo does.
Thanks, looks good to me. Since this is not part of this repo, will that work without using the `CloudMonitoringMetricsExporter`? I'm using the `OpenCensusExporter` as part of an OT SDK --> OT agent --> OT collector pipeline --> StackdriverExporter (Go) architecture, and would like this feature in the first part (the SDK). I think those resource attributes could be added by the SDK (explicit setup), and then propagated along the pipeline.
Resource detection will work without using `CloudMonitoringMetricsExporter`; you basically just want lines 24-32 from https://github.com/GoogleCloudPlatform/opentelemetry-operations-python/blob/master/docs/examples/tools/cloud_resource_detector/resource_detector_metrics.py#L24. That will pass the resource info to `OpenCensusExporter`, but I don't believe it will be in the format `OpenCensusExporter` requires, since that exporter was built before the resource detector existed. The OC exporter will need to be updated to send resource information, which should be simple enough looking at the code. AFAIK there is no general way to signal OT exporters which resource keys you care about, but I will double check the spec.
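For reference, a minimal sketch of what those referenced lines amount to (assuming the unreleased `opentelemetry.tools` package layout discussed below; import paths may differ):

```python
# Minimal sketch of the resource-detection setup, assuming the unreleased
# opentelemetry.tools package layout; import paths may differ.
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.resources import get_aggregated_resources
from opentelemetry.tools.resource_detector import GoogleCloudResourceDetector

# Run the GCP detector and merge its attributes into a single Resource.
resources = get_aggregated_resources([GoogleCloudResourceDetector()])

# Attach the detected Resource to the MeterProvider so every metric
# recorded through it carries the detected resource attributes.
metrics.set_meter_provider(MeterProvider(resource=resources))
```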
@AndrewAXue thanks, this should be doable. What package does `opentelemetry.tools` live in? I'm getting an error saying it can't find it:
```
Traceback (most recent call last):
  File "/Users/ocervello/.virtualenvs/flask/lib/python3.7/site-packages/flask/cli.py", line 235, in locate_app
    __import__(module_name)
  File "/Users/ocervello/Workspace/dev/gunicorn-opentelemetry-poc/flask-app/app.py", line 28, in <module>
    from opentelemetry.tools.resource_detector import GoogleCloudResourceDetector
ModuleNotFoundError: No module named 'opentelemetry.tools'
```
So the resource detectors are actually unreleased; you'll have to manually clone the repo https://github.com/GoogleCloudPlatform/opentelemetry-operations-python and `pip install -e` the "opentelemetry-exporter-cloud-trace" package that contains that folder.
Okay, I've tried it, but the resource still shows `global` and I'm still not seeing any resource labels. Here is the code I'm using:
```python
import config
import os
import logging
import time
from flask import Flask
from opentelemetry import trace, metrics
# from opentelemetry.ext.flask import FlaskInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchExportSpanProcessor
from opentelemetry.sdk.metrics import Counter, MeterProvider
from opentelemetry.sdk.resources import get_aggregated_resources
from opentelemetry.ext.opencensusexporter.metrics_exporter import OpenCensusMetricsExporter
from opentelemetry.ext.opencensusexporter.trace_exporter import OpenCensusSpanExporter
from gke_detector import GoogleCloudResourceDetector

OTEL_AGENT_ENDPOINT = os.environ['OTEL_AGENT_ENDPOINT']

span_exporter = OpenCensusSpanExporter(service_name="flask-app-tutorial",
                                       endpoint=OTEL_AGENT_ENDPOINT)
exporter = OpenCensusMetricsExporter(service_name="flask-app-tutorial",
                                     endpoint=OTEL_AGENT_ENDPOINT)
resources = get_aggregated_resources([GoogleCloudResourceDetector()])

# Metrics
metrics.set_meter_provider(MeterProvider(resource=resources))
meter = metrics.get_meter(__name__, True)
metrics.get_meter_provider().start_pipeline(meter, exporter, 5)

# Traces
trace.set_tracer_provider(TracerProvider(resource=resources))
trace.get_tracer_provider().add_span_processor(
    BatchExportSpanProcessor(span_exporter))

# Custom metrics
pid = os.getpid()
staging_labels = {"environment": "staging", "pid": str(pid)}
requests_counter = meter.create_metric(
    name="flask_app_hello_requests",
    description="Hello requests count",
    unit="1",
    value_type=int,
    metric_type=Counter,
    label_keys=(
        "environment",
        "pid",
    ),
)

# Flask application
app = Flask(__name__)
# FlaskInstrumentor().instrument_app(app)

# Logging setup
gunicorn_logger = logging.getLogger('gunicorn.error')
app.logger.handlers = gunicorn_logger.handlers
app.logger.setLevel(gunicorn_logger.level)
app.logger.info(f'Otel agent endpoint: {OTEL_AGENT_ENDPOINT}')


@app.route("/")
def hello():
    app.logger.info("Received hello request !")
    requests_counter.add(1, staging_labels)
    app.logger.debug("Counter was incremented.")
    return "Hello World!"


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=config.PORT, debug=config.DEBUG_MODE)
```
Looking at the `OpenCensusSpanExporter` code https://github.com/open-telemetry/opentelemetry-python/blob/master/ext/opentelemetry-ext-opencensusexporter/src/opentelemetry/ext/opencensusexporter/trace_exporter/__init__.py, it doesn't look like it supports exporting info about Resources at all. So while you're passing that resource info to the exporter, it's just not doing anything with it.
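For context, a hedged sketch of the kind of change such an update implies, assuming the opencensus-proto `Resource` message (an explicit `type` field plus string labels); the helper name here is hypothetical:

```python
# Hypothetical helper: map an OpenTelemetry SDK Resource onto the
# opencensus-proto Resource message. Illustrative only.
from opencensus.proto.resource.v1 import resource_pb2


def map_resource(otel_resource):
    oc_resource = resource_pb2.Resource()
    attrs = dict(otel_resource.attributes)
    # OC carries an explicit type field, while OT does not, so reuse the
    # gcp.resource_type attribute if a resource detector set one.
    oc_resource.type = str(attrs.pop("gcp.resource_type", ""))
    for key, value in attrs.items():
        oc_resource.labels[str(key)] = str(value)
    return oc_resource
```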
@ocervell It has to be added to the exporter as well. I'm adding it in #679.
Olivier, is it possible for the ocagent to detect the GKE resource for you?
@aabmass I've tried the k8sprocessor (https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/master/processor/k8sprocessor) without much luck yet. There is another way to get GKE labels using a Prometheus scrape config, but IMO the best option would be to bake autodetection into the SDK: there would be little relabeling to do with OpenTelemetry processors (which are hard to configure and test for now), and resource labels would stay consistent.
@ocervell, updated #937 to send the resources. It should work along with Andrew's resource detector. However, I'm not sure if the Collector will convert these into the correct GKE monitored resource types when it writes to Cloud Monitoring.
@nilebox, is there a way to get the OT Collector to send the correct GKE resource when running as an agent?
> @nilebox, is there a way to get the OT Collector to send the correct GKE resource when running as an agent?
There is a Resource Detection Processor which currently supports detection of GCE, and can be extended to detect GKE as well.
@james-bebbington wrote this processor, and he is also the author of the Automatic Resource Detection proposal, so he would be the best person to suggest the recommended approach.
I've opened an issue for adding GKE detection to the Resource Detection Processor in the collector.
@aabmass I will try #937 shortly and give you some feedback.
Still seeing the resource type set as `global` in Cloud Monitoring; this did not seem to change anything. The code I'm using is available in gunicorn-opentelemetry-poc/ot-agent-collector-dev.
I dug into the config a little bit. It looks like the stackdriver exporter internally uses the OpenCensus Go exporter. The resource mapping seems to happen here: https://github.com/census-ecosystem/opencensus-go-exporter-stackdriver/blob/v0.13.2/resource.go#L185. It looks at the generic labels for k8s (which aren't currently sent at all), but Andrew pointed out it is not checking for `"gcp.resource_type": "gke_container"`.
I ran your example to see if the Python exporter is at least sending the resources correctly, and it looks right to me (on my cloudtop, which is just GCE):
```
resource {
  labels {
    key: "cloud.account.id"
    value: "some_account"
  }
  labels {
    key: "cloud.provider"
    value: "gcp"
  }
  labels {
    key: "cloud.zone"
    value: "us-east1-b"
  }
  labels {
    key: "gcp.resource_type"
    value: "gce_instance"
  }
  labels {
    key: "host.id"
    value: "1319926662905081598"
  }
}
```
Rather than updating the OpenCensus exporter, I think the Collector's stackdriver exporter should be switched over to use the Go SDK GCP exporters in GoogleCloudPlatform/opentelemetry-operations-go, which are actively developed. However, even that SDK doesn't currently send the right GKE Monitored Resource from what I can tell.
Yep, I've also gotten it to work. However, on GKE the resource type is set by default to `gce_instance`:
```
>>> from gke_detector import GoogleCloudResourceDetector
>>> c = GoogleCloudResourceDetector()
>>> c
<gke_detector.GoogleCloudResourceDetector object at 0x7fc4282747b8>
>>> c.detect()
<opentelemetry.sdk.resources.Resource object at 0x7fc428ea2358>
>>> c.gcp_resources
{'cloud.account.id': 'rnm-datadog-sd', 'cloud.provider': 'gcp', 'cloud.zone': 'europe-west1-c', 'host.id': 107782034669724043, 'gcp.resource_type': 'gce_instance'}
```
I found out I had to set an env variable `CONTAINER_NAME` in my k8s YAML in order for it to set the right type `gke_container`:
```yaml
- name: CONTAINER_NAME  # required by OpenTelemetry GKE detector
  value: flask-app-tutorial
```
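That matches how a detector along these lines would have to work; a hedged sketch (the function and env-var handling here are illustrative, not the actual `gke_detector` code):

```python
import os


# Illustrative sketch: the detector can only distinguish gke_container from
# gce_instance via values the Downward API injects into the container.
def detect_gke_attributes():
    container_name = os.environ.get("CONTAINER_NAME")
    if not container_name:
        return {}  # nothing to go on; caller falls back to gce_instance
    return {
        "container.name": container_name,
        "k8s.namespace.name": os.environ.get("NAMESPACE", ""),
        "k8s.pod.name": os.environ.get("POD_NAME", ""),
        "gcp.resource_type": "gke_container",
    }
```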
Additionally, it would be great to have the GCE/GKE resource detectors available as part of the SDK rather than in the Cloud Monitoring exporter. I've copied the detector code into my app because the extra dependency is not worth it.
> I found out I had to set an env variable CONTAINER_NAME in my k8s YAML in order for it to set the right type gke_container:
@AndrewAXue is that intended? I thought this was set automatically by k8s.
> Additionally, it would be great to have the GCE/GKE resource detectors available as part of the SDK, rather than in the Cloud Monitoring exporter.
We are waiting for the spec; if the OT community decides this is OK, we will include it!
> I've copied the detector code to my app because the extra dependency is not worth it.
We are planning a single "tools" package that will have the GCP propagator and resource detector. Is it actually that much of a burden to have one extra package?
Yes, that's intended. We rely on the user manually passing in some info via the Downward API, including populating `CONTAINER_NAME`. It'll look something like this:
```yaml
apiVersion: "apps/v1"
kind: "Deployment"
metadata:
  name: "food-fish"
spec:
  replicas: 3
  selector:
    matchLabels:
      app: "food-find"
  template:
    metadata:
      labels:
        app: "food-find"
    spec:
      terminationGracePeriodSeconds: 30
      containers:
      - name: "food-finder"
        image: "gcr.io/aaxue-gke/food-finder:v1"
        imagePullPolicy: "Always"
        env:
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: CONTAINER_NAME
          value: "food-finder"
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        ports:
        - containerPort: 8080
      hostNetwork: true
      dnsPolicy: Default
```
I realized that this was poorly documented and will make a PR adding this info to our docs.
> We are planning a single package for "tools" that will have the GCP propagator and resource detector. Is it actually that much of a burden to have one extra package?
No, this is fine as long as the package just does this and doesn't have too many other dependencies.
> I realized that this was poorly documented and will make a PR adding this info to our docs.
Thanks. Is there no way to get the container name from the container directly?
I've added the `CONTAINER_NAME` variable but my metrics are still tagged with the resource `gce_instance`. Still investigating why it doesn't set it to `gke_container`; will check back later!
You'll also need to populate `NAMESPACE` and `POD_NAME` like in the example above. No, there isn't a way to get the container name from the container directly :/ that's why we rely on the Downward API. If it doesn't work, can you try printing the output of
```python
from gke_detector import get_gke_resources
print(get_gke_resources())
```
@AndrewAXue okay, it's working from the container after setting the `CONTAINER_NAME` variable:
```
>>> print(get_gke_resources())
{'cloud.account.id': 'rnm-datadog-sd', 'cloud.provider': 'gcp', 'cloud.zone': 'europe-west1-c', 'k8s.cluster.name': 'demo-flask', 'k8s.namespace.name': 'default', 'k8s.pod.name': 'flask-app-tutorial-7ccc4d757b-tqqdg', 'host.id': 107782034669724043, 'container.name': 'flask-app-tutorial', 'gcp.resource_type': 'gke_container'}
```
Let's add the needed env variables to our documentation.
Weirdly, the final metric resource type in Stackdriver is still `gce_instance` and I end up losing the GKE labels anyway:

```
'resource': {'labels': {'instance_id': '107782034669724043',
                        'project_id': 'rnm-datadog-sd',
                        'zone': 'europe-west1-c'},
             'type': 'gce_instance'},
```

so I guess there is another problem down the line: the `gke_container` resource is replaced by `gce_instance` in the collector, maybe because of what @aabmass pointed out with the OpenCensus Go package?
OT Collector logs:
```
InstrumentationLibraryMetrics #0
ResourceMetrics #1
Resource labels:
     -> service.name: STRING(flask-app-tutorial)
     -> opencensus.starttime: STRING(2020-07-29T21:35:20.026248448Z)
     -> host.hostname: STRING(flask-app-tutorial-7ccc4d757b-k7dsn)
     -> opencensus.pid: INT(8)
     -> telemetry.sdk.version: STRING(0.12.dev0)
     -> opencensus.exporterversion: STRING(0.11.dev0)
     -> telemetry.sdk.language: STRING(PYTHON)
     -> host.id: STRING(107782034669724043)
     -> cloud.zone: STRING(europe-west1-c)
     -> cloud.account.id: STRING(rnm-datadog-sd)
     -> k8s.pod.name: STRING(flask-app-tutorial-7ccc4d757b-k7dsn)
     -> gcp.resource_type: STRING(k8s_container)
     -> container.name: STRING(flask-app-tutorial)
     -> k8s.cluster.name: STRING(demo-flask)
     -> cloud.provider: STRING(gcp)
     -> k8s.namespace.name: STRING(default)
InstrumentationLibraryMetrics #0
Metric #0
Descriptor:
     -> Name: flask_app_hello_requests
     -> Description: Hello requests count
     -> Unit: 1
     -> Type: MONOTONIC_INT64
Int64DataPoints #0
Data point labels:
     -> environment: staging
     -> pid: 8
StartTime: 1596058520026277000
Timestamp: 1596058910715795712
Value: 3075
```
> Weirdly, the end metrics resource type in Stackdriver is still gce_instance and I end up losing the GKE labels anyway
@ocervell Where do you process `gcp.resource_type: k8s_container`? Note that the resource type should be set to `container` (not `k8s_container`) initially, and the Stackdriver exporter (in the OpenTelemetry Collector) will recognize it and convert it to `k8s_container`.
@nilebox, thanks, I need to update my PR to send `container` in that case.
In that code, though, I think that is still the incorrect monitored resource type. Going off what Andrew has used in the Python exporter, it should be:

```diff
- result.Type = "k8s_container"
+ result.Type = "gke_container"
```

Does it make sense to update the opencensus-go-exporter-stackdriver package with this? @james-bebbington suggested writing a `MapResource` function in the collector instead. Thoughts?
@aabmass Pretty sure that the `container` resource type in Go is being used in quite a few projects using OpenCensus and the OpenTelemetry Collector, so that would be a breaking change. Changing the `MapResource` implementation in the OT Collector would also be a breaking change for the GKE Metrics Agent, for example.

FWIW, the Python implementation for OpenCensus seems to be using `k8s_container`: https://github.com/census-instrumentation/opencensus-python/blob/f4d8ec36cada8b385821a16795f27fe322694a1a/opencensus/common/monitored_resource/monitored_resource.py#L24. I'm surprised that this inconsistency wasn't a problem in OpenCensus before.

For backward compatibility, we may just recognize all variations (`container`, `k8s_container`, and `gke_container`) and translate them to `k8s_container` in the opencensus-go-exporter-stackdriver exporter. @james-bebbington WDYT?
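A tiny sketch of that normalization (illustrative Python for brevity; the exporter itself is Go):

```python
# Illustrative: accept all historical spellings and normalize them to the
# Cloud Monitoring k8s_container type.
CONTAINER_TYPE_ALIASES = {"container", "k8s_container", "gke_container"}


def normalize_resource_type(oc_type: str) -> str:
    return "k8s_container" if oc_type in CONTAINER_TYPE_ALIASES else oc_type
```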
> For backward compatibility, we may just recognize all variations (`container`, `k8s_container` and `gke_container`) and translate them to `k8s_container` in opencensus-go-exporter-stackdriver exporter. @james-bebbington WDYT?
@nilebox Checking the OpenCensus resource semantic conventions for possible values of `type`, it looks like we should be setting `type: k8s`? It appears `container` is for plain Docker images, and the spec says "this resource can be merged with a deployment service resource, a compute instance resource, and an environment resource." Then for merging, the spec says "already set labels or type fields MUST NOT be overwritten unless they are empty string."

Presumably if the Docker container gets detected first, you'd get `type: container`, and merging with the k8s resource would NOT overwrite `type`, even though `type: k8s` would be more descriptive? I will change my PR to send "container" if on k8s for now. But it seems wrong that if you were running just a Docker image, the Collector's exporter would send a `k8s_container` monitored resource.
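For what it's worth, that merge rule boils down to something like this (illustrative sketch, not spec text):

```python
# Illustrative sketch of the OC merge rule: fields already set on the
# first-detected resource MUST NOT be overwritten unless empty.
def merge(base: dict, extra: dict) -> dict:
    merged = dict(base)
    for key, value in extra.items():
        if merged.get(key, "") == "":
            merged[key] = value
    return merged


# Container detected first, then merged with the k8s resource:
# type stays "container" even though "k8s" is more descriptive.
assert merge({"type": "container"}, {"type": "k8s"})["type"] == "container"
```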
> it looks like we should be setting type: k8s?
`k8s` is for a Pod, see https://github.com/census-ecosystem/opencensus-go-exporter-stackdriver/blob/366afe7b1d81a1faaa8c58e725aaeb77344b343e/resource.go#L198-L199
> It appears container is for plain docker images
A single Pod may run multiple Docker containers (see PodSpec), so if a metric is emitted for a specific container, it may make sense to use `container` there even if it's running as part of a Pod. As an example, you may be interested in emitting CPU/RAM utilization metrics for each container separately rather than for the Pod as a whole. In Kubernetes, resource limits are specified per container, but Pods are scheduled based on the sum of all requested resources; see https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#resource-requests-and-limits-of-pod-and-container. So when one of the containers has a memory leak, for example, you would want to track it separately from the other containers in the Pod.
I'll try to summarise the above to hopefully clear things up.
Due to the nature of the progressive changes to the OC/OT data models, we have gotten to the point where there are several ways to represent a Resource:
| No | Model | Example of how to define "Container" resource type |
|---|---|---|
| 1 | OpenCensus resource model (includes OC Type) | Type: `container` |
| 2 | OpenTelemetry resource model (no Type; Type needs to be inferred from which attributes exist¹) | Existence of any attribute with prefix `container.` |
| 3² | OpenTelemetry resource model with attribute `gcp.resource_type` (hack currently added in Python only to represent resource type explicitly in OpenTelemetry) | Attribute: `gcp.resource_type = k8s_container` |
| 4 | OpenTelemetry resource model with attribute `opencensus.resourcetype` (hack to convert between OC & OT models used internally by the Collector) | Attribute: `opencensus.resourcetype = container` |
| 5 | Cloud Monitoring resource model (includes CM Type) | Type: `k8s_container`³ |
¹ This is so OpenTelemetry can have a more expressive hierarchical set of resource info instead of being limited to a specific "type" (e.g. process info + container info + pod info + k8s info + cloud info).
² I would suggest we consider removing (3) to simplify this situation slightly if possible (I recall @AndrewAXue having some reason why this might not be possible).
³ I'm not really sure what the distinction is between `gke_container` & `k8s_container`; it looks like we have agreed to ignore `gke_container` for now since it was not supported by OC (this seems reasonable, but I also recommend confirming with the Cloud Monitoring team).
We then need to support at least the following paths of exporting data to Cloud Monitoring⁴:

- ... (where `type` is not set correctly) -> 5 (after we complete migration of the Collector exporter to OT, this will become the same as the point above)

Points 1 & 2 are already implemented (although note this is inconsistent due to the `gcp.resource_type` hack). We need to add code in the Collector exporter wrapper that converts from the OT type (based on the existence of attributes or on `gcp.resource_type`, depending on which approach we prefer) to the OC type; the OC exporter will then not need to be changed. A similar alternative would be to create a `MapResource` function in the Collector to map directly from the OT resource to the CM resource and bypass the OC exporter's resource mapping logic, as Aaron mentioned above. In either case this would need to be backwards compatible (support the OC type being set directly in OC as the first option, or derived from OT attributes as a fallback).
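To make the `MapResource` alternative concrete, a rough sketch of inferring the CM monitored resource directly from OT attributes (Python pseudocode for brevity; the Collector is Go, and the attribute checks are assumptions, not the actual mapping):

```python
# Rough sketch: infer the Cloud Monitoring monitored resource from which
# OT semantic-convention attributes are present, with no explicit type.
def map_ot_to_cm(attrs: dict):
    if "k8s.pod.name" in attrs and "container.name" in attrs:
        return "k8s_container", {
            "project_id": attrs.get("cloud.account.id", ""),
            "location": attrs.get("cloud.zone", ""),
            "cluster_name": attrs.get("k8s.cluster.name", ""),
            "namespace_name": attrs.get("k8s.namespace.name", ""),
            "pod_name": attrs.get("k8s.pod.name", ""),
            "container_name": attrs.get("container.name", ""),
        }
    if "host.id" in attrs and attrs.get("cloud.provider") == "gcp":
        return "gce_instance", {
            "project_id": attrs.get("cloud.account.id", ""),
            "zone": attrs.get("cloud.zone", ""),
            "instance_id": str(attrs.get("host.id", "")),
        }
    return "global", {}
```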
The other option is to just fast-forward to using the new OT exporter in the Collector (and verify this works in a way that is backwards compatible with OC), but given this code path is used heavily in production, we need to be careful with this rollout and IMO should wait until the OTLP proto is finalized.
⁴ There are other possible pipelines, such as when the Collector translates between the OC & OT models during its internal pipeline, but I'm pretty sure this will be handled without issue.
TL;DR recommendations:

- Fix on the OpenTelemetry Collector side: https://github.com/open-telemetry/opentelemetry-collector/pull/1462
Thanks for the detailed writeup; I think it made the issue clear to everyone, even those without context on the Collector (me). I'm a proponent of Model 3 for two reasons:

So it seems that Models 2 and 3 are only used in the OT SDK -> Cloud Monitoring and OT SDK -> Collector routes. The former supports Model 3, so it seems the only problem is building a good mapping from 3 -> 1 for the Collector. We can keep Model 2 as a backup in case Model 3 isn't there for whatever reason, but I think the risk of ambiguity should make us want to avoid it as much as possible.
> A single Pod may run multiple Docker containers (see PodSpec), so if a metric is emitted for a specific container, it may make sense to use `container` there even if it's running as part of a Pod.
@nilebox that makes sense, but I don't see how type `container` necessarily means k8s. Wouldn't the Stackdriver exporter set `result.Type = "k8s_container"` in several cases that aren't GKE but still send type `container`? E.g. containers on GCE, or maybe other GCP environments using Docker.
Also, looking at open-telemetry/opentelemetry-collector#1462, should I just continue to send a blank `type` field in this PR and let the Collector infer it instead?
> The stackdriver exporter would set result.Type = "k8s_container" in several cases that aren't GKE, but still send type container?
It will try and fail to do so, because it also requires all `k8s_container` labels to be present, including e.g. `k8s.pod.name`, which will surely be missing for a GCE container. So once it fails to treat the resource as `k8s_container`, the Stackdriver exporter will just fall back to the default `global` type.
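In other words, the fallback behaves roughly like this (illustrative sketch; the required-label set is an assumption based on the resource.go code linked above):

```python
# Illustrative: map to k8s_container only when every required label is
# present; otherwise fall back to the "global" monitored resource.
REQUIRED_K8S_CONTAINER_LABELS = {
    "k8s.cluster.name",
    "k8s.namespace.name",
    "k8s.pod.name",
    "container.name",
}


def monitored_resource_type(labels: dict) -> str:
    if REQUIRED_K8S_CONTAINER_LABELS <= labels.keys():
        return "k8s_container"
    return "global"
```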
> should I just continue to send a blank type field
Left a comment; you should still set the resource type for the OpenCensus exporter, but it's fine to omit it in the OTLP exporter thanks to the change in the Collector.
> So it seems that Models 2 and 3 are only used in the OT SDK -> Cloud Monitoring and OT SDK -> Collector routes. The former supports Model 3, so it seems the only problem is building a good mapping from 3 -> 1 for the Collector. We can keep Model 2 as a backup in case Model 3 isn't there for whatever reason, but I think the risk of ambiguity should make us want to avoid it as much as possible.
The problem, from my point of view, is that Model 3 is not defined in the OT spec. We could consider proposing a change to the spec so there is a standard resource attribute key called `type` to support vendors with a resource model similar to Cloud Monitoring, and then use that. But otherwise we are adding potential confusion: consider, for example, users who want to set the resource type manually using the OT conventions and have this work across vendors (that is kind of the goal of OT).

I also don't think there is as much ambiguity as you think. I don't believe there are any cases where we cannot infer the correct Cloud Monitoring resource type from the OT semantic conventions, and if there were, it would just imply we need to add additional semantic conventions (it doesn't have to be a 'type' property).
I'm not sure if this will help, but here is what the stackdriver-prometheus-sidecar uses: https://github.com/Stackdriver/stackdriver-prometheus-sidecar/blob/master/retrieval/resource_map.go. I haven't had issues ingesting metrics using Prometheus + the sidecar (the resource type is set correctly all the time). Maybe we could take inspiration from its detection logic?
When running on GKE, the metrics exported with the `OpenCensusMetricsExporter` do not have any GKE resource labels (`pod_name`, `container_name`, `node_name`, etc.). Should those be added to the resource labels before shipping the metrics with the exporters?