signalfx / splunk-otel-collector


how to fetch kubernetes volumes metrics #322

Closed xp-1000 closed 3 years ago

xp-1000 commented 3 years ago

Hello,

I am trying to replace the SignalFx Smart Agent with the OTel Collector since its deprecation.

I started with a simple configuration for host, kubernetes and nginx metrics.

exporters:
  sapm:
    access_token: ${SPLUNK_ACCESS_TOKEN}
    endpoint: https://ingest.eu0.signalfx.com/v2/trace
  signalfx:
    access_token: ${SPLUNK_ACCESS_TOKEN}
    api_url: https://api.eu0.signalfx.com
    correlation: null
    ingest_url: https://ingest.eu0.signalfx.com
    sync_host_metadata: true
extensions:
  health_check: null
  k8s_observer:
    auth_type: serviceAccount
    node: ${K8S_NODE_NAME}
  zpages: null
processors:
  batch: null
  k8s_tagger:
    extract:
      metadata:
      - namespace
      - node
      - podName
      - podUID
    filter:
      node_from_env_var: K8S_NODE_NAME
  memory_limiter:
    ballast_size_mib: ${SPLUNK_BALLAST_SIZE_MIB}
    check_interval: 5s
    limit_mib: ${SPLUNK_MEMORY_LIMIT_MIB}
  resource:
    attributes:
    - action: insert
      key: host.name
      value: ${K8S_NODE_NAME}
    - action: insert
      key: k8s.node.name
      value: ${K8S_NODE_NAME}
    - action: insert
      key: k8s.cluster.name
      value: kubeoteltest
  resource/add_agent_k8s:
    attributes:
    - action: insert
      key: k8s.pod.name
      value: ${K8S_POD_NAME}
    - action: insert
      key: k8s.pod.uid
      value: ${K8S_POD_UID}
    - action: insert
      key: k8s.namespace.name
      value: ${K8S_NAMESPACE}
  resourcedetection:
    detectors:
    - system
    - env
    override: false
    timeout: 10s
receivers:
  hostmetrics:
    collection_interval: 10s
    scrapers:
      cpu: null
      disk: null
      filesystem: null
      load: null
      memory: null
      network: null
      paging: null
      processes: null
  jaeger:
    protocols:
      grpc:
        endpoint: 0.0.0.0:14250
      thrift_http:
        endpoint: 0.0.0.0:14268
  kubeletstats:
    auth_type: serviceAccount
    collection_interval: 10s
    endpoint: ${K8S_NODE_IP}:10250
    extra_metadata_labels:
    - container.id
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:55681
  prometheus/agent:
    config:
      scrape_configs:
      - job_name: otel-agent
        scrape_interval: 10s
        static_configs:
        - targets:
          - ${K8S_POD_IP}:8888
          - ${K8S_POD_IP}:24231
  receiver_creator:
    receivers:
      prometheus_simple:
        config:
          endpoint: '`endpoint`:`"prometheus.io/port" in annotations ? annotations["prometheus.io/port"] : 9090`'
          metrics_path: '`"prometheus.io/path" in annotations ? annotations["prometheus.io/path"] : "/metrics"`'
        rule: type == "pod" && annotations["prometheus.io/scrape"] == "true"
      smartagent/nginx:
        config:
          type: collectd/nginx
        rule: type == "port" && pod.name matches "nginx" && port == 80
    watch_observers:
    - k8s_observer
  sapm:
    endpoint: 0.0.0.0:7276
  smartagent/kubernetes_volumes:
    kubeletAPI:
      url: https://${K8S_NODE_IP}:10250
    type: kubernetes-volumes
  smartagent/signalfx-forwarder:
    listenAddress: 0.0.0.0:9080
    type: signalfx-forwarder
  zipkin:
    endpoint: 0.0.0.0:9411
service:
  extensions:
  - health_check
  - k8s_observer
  - zpages
  pipelines:
    metrics:
      exporters:
      - signalfx
      processors:
      - memory_limiter
      - batch
      - resource
      - resourcedetection
      receivers:
      - receiver_creator
      - hostmetrics
      - kubeletstats
      - smartagent/kubernetes_volumes
    metrics/agent:
      exporters:
      - signalfx
      processors:
      - memory_limiter
      - batch
      - resource
      - resource/add_agent_k8s
      - resourcedetection
      receivers:
      - prometheus/agent
    traces:
      exporters:
      - sapm
      - signalfx
      processors:
      - memory_limiter
      - k8s_tagger
      - batch
      - resource
      - resourcedetection
      receivers:
      - otlp
      - jaeger
      - smartagent/signalfx-forwarder
      - zipkin


It is not clear whether the native kubeletstatsreceiver and k8sclusterreceiver from the otel collector fully replace all the kube* smart agent monitors (e.g. kubernetes-events or kubernetes-volumes).

For volumes, according to the documentation (https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/kubeletstatsreceiver#metric-groups), the kubeletstats receiver supports volume metrics.

I see most of the metrics from kubernetes-cluster and kubelet-metrics coming into Splunk Observability except for the volume metrics: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/kubeletstatsreceiver/kubelet/volume.go#L27. However, as you can see in the configuration, metric_groups is not defined, so it should fetch all metrics, including volumes.

Not a big deal given that this receiver is still in beta, and the goal of https://github.com/signalfx/splunk-otel-collector/tree/main/internal/receiver/smartagentreceiver is precisely to keep existing smart agent monitors working while waiting for full replacement by the opentelemetry collector. So I tried to configure the original https://docs.signalfx.com/en/latest/integrations/agent/monitors/kubernetes-volumes.html monitor.

Sadly, with the previously shared configuration I got the following error:

2021-04-22T17:33:37.356Z    ERROR    volumes/volumes.go:65    Could not get volume metrics    {"kind": "receiver", "name": "smartagent/kubernetes_volumes", "monitorType": "kubernetes-volumes", "error": "failed to get summary stats from Kubelet URL \"https://10.0.2.4:10250/stats/summary/\": kubelet request failed - \"401 Unauthorized\", response: \"Unauthorized\""}

So I updated the kubernetes-volumes configuration fragment to:

      smartagent/kubernetes_volumes:
        type: kubernetes-volumes
        kubeletAPI:
          authType: serviceAccount
          url: https://${K8S_NODE_IP}:10250

But now I get the following errors:

Error: cannot load configuration: error reading receivers configuration for smartagent/kubernetes_volumes: failed creating Smart Agent Monitor custom config: yaml: unmarshal errors:      
   line 2: field authtype not found in type kubelet.APIConfig                                                                                                                               
 2021/04/22 17:51:23 main.go:74: application run finished with error: cannot load configuration: error reading receivers configuration for smartagent/kubernetes_volumes: failed creating
   line 2: field authtype not found in type kubelet.APIConfig                                                                                                                               

According to https://docs.signalfx.com/en/latest/integrations/agent/monitors/kubernetes-volumes.html#configuration, authType is a valid option in the kubeletAPI nested block, so my configuration seems correct.

So maybe this monitor is an "exception" that is not compatible with the smartagent receiver?

thanks for your help

asuresh4 commented 3 years ago

Hi @xp-1000, as stated in https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/kubeletstatsreceiver#metric-groups , volume metrics are not collected by default by the kubeletstats receiver (only container, pod and node metrics are collected). You'll need to explicitly override the metric_groups option with

    metric_groups:
      - node
      - pod
      - container
      - volume

Can you try making the above change to your kubeletstats receiver config?
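
For context, a minimal sketch of how that could look in your existing receiver block (keeping the collection_interval, endpoint and extra_metadata_labels from the config you shared above):

    receivers:
      kubeletstats:
        auth_type: serviceAccount
        collection_interval: 10s
        endpoint: ${K8S_NODE_IP}:10250
        extra_metadata_labels:
          - container.id
        metric_groups:
          - node
          - pod
          - container
          - volume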

If you tried the https://github.com/signalfx/splunk-otel-collector-chart, I believe we currently do not have an option to turn these metrics on in the chart. I've created https://github.com/signalfx/splunk-otel-collector-chart/issues/111 for this.

In terms of parity, the k8scluster receiver collects the metrics collected by the kubernetes-cluster monitor and the kubeletstats receiver collects metrics collected by the kubelet-metrics and kubernetes-volumes monitors.

xp-1000 commented 3 years ago

Hello @asuresh4 thanks for your answer.

Adding volume to metric_groups was the first thing I tried, sorry, I forgot to mention that.

xp-1000 commented 3 years ago

oh my bad, I just checked again in the signalfx metric explorer and indeed some metrics are present in my otel test env, they just changed their names :

But I still don't see the inode-related metrics (under either the new or old name) when I use the otel collector, even though they seem to be supported: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/kubeletstatsreceiver/kubelet/volume.go#L29 (I don't see any errors in the logs, but I confess I don't know how to troubleshoot metric gathering the way it was possible with signalfx-agent tap-dps, for example).

In any case I am still worried about the fact that we cannot configure authType on kubeletAPI; this makes the use of smart agent monitors impossible, and I am pretty sure we will run into other cases / metrics without full parity between the smart agent and the otel collector.

For example, after volumes, which are critical, I would like to configure kubernetes-events, which will have the same problem (and this time there is no equivalent in the otel collector).

asuresh4 commented 3 years ago

@xp-1000 - you should be able to use the old or new names interchangeably, i.e., even though the collector sends new metrics, you should be able to search for old metrics in the UI. If this is not the case please open a support ticket with information such as realm/org and we will be able to help address that.

By default the OTel collector only emits metrics that are classified as default by the respective Smart Agent monitor. For monitors that have already been ported to OpenTelemetry such as hostmetrics, k8scluster, kubeletstats receivers, this is currently controlled by the signalfx exporter. To include metrics that are non-default you can make use of the include_metrics option on the exporter.

    include_metrics:
      - metric_names: [k8s.volume.inodes, k8s.volume.inodes.free, k8s.volume.inodes.used] 

I will get back to you on configuring authType using the smartagent receiver after some more investigation.
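
Also, on your note about not having an equivalent of the agent's tap-dps to see what is being collected: one possible approach (just a sketch, assuming the logging exporter is available in your build of the collector) is to temporarily add it to the metrics pipeline and inspect the collector logs:

    exporters:
      logging:
        loglevel: debug
    service:
      pipelines:
        metrics:
          exporters:
          - signalfx
          - logging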

xp-1000 commented 3 years ago

Hello @asuresh4 thank you very much for your help.

I am sorry, I must have missed this information about extraMetrics (but this was one of my questions). I will test now and come back to confirm, but it makes sense!

If a newbie like me reads this and tries to replace the smart agent with the otel collector, I think the documentation available at https://docs.splunk.com/Observability/get-started/migrate/migrate-to-otel.html#nav-Replace-the-SignalFx-Smart-Agent-with-the-Splunk-Distribution-of-OpenTelemetry-Collector is maybe not enough (at least it wasn't for me), so I recommend checking the following good docs on github:

The last one contains everything I missed until now (thanks @asuresh4).

xp-1000 commented 3 years ago

OK so I confirm it works fine with the following configuration:

  config:
    receivers:
      receiver_creator:
        receivers:
          smartagent/nginx:
            rule: type == "port" && pod.name matches "nginx" && port == 80
            config:
              type: collectd/nginx
      kubeletstats:
        metric_groups:
          - node
          - pod
          - volume
    exporters:
      signalfx:
        include_metrics:
          - metric_names:
            - k8s.job.desired_successful_pods # kubernetes.job.completions
            - k8s.job.active_pods # kubernetes.job.active
            - k8s.job.successful_pods # kubernetes.job.succeeded
            - k8s.statefulset.ready_pods # kubernetes.stateful_set.ready
            - k8s.statefulset.desired_pods # kubernetes.stateful_set.desired
            - k8s.hpa.max_replicas # kubernetes.hpa.spec.max_replicas
            - k8s.hpa.desired_replicas # kubernetes.hpa.status.desired_replicas
            - k8s.volume.inodes.free # kubernetes.volume_inodes_free
            - k8s.volume.inodes # kubernetes.volume_inodes
    service:
      pipelines:
        metrics:
          receivers:
            - kubeletstats
            - receiver_creator
    extensions:
      zpages:
        endpoint: 0.0.0.0:55679

I see the inode-related metrics in the metric finder under the new opentelemetry naming. The finder cannot find the metrics under the old signalfx naming (which is why I did not find them in the first place), BUT you are right @asuresh4, the old signalfx metric names still work if you set them manually in signalflow (they are just not listed in the metric finder).

Now, thanks to your help, I think I have only one remaining problem, about dimension parity:

(screenshots: dimension lists shown in the metric finder for a kubernetes metric and for a volume metric)

As you can see, some kubernetes metrics have the "old" signalfx dimension kubernetes_cluster in addition to the new one, k8s.cluster.name, but this is not the case for the volume metrics, for example.

This is a problem for us because we maintain a set of terraform modules of signalfx "template" detectors, and we rely on metric names but also on their dimensions. Since the metric names are translated, we can hope to keep this base of detectors working with both the smart agent and the otel collector to avoid a disruptive migration, but we also need dimension parity.

Should I try to use dimensionClients to keep the old dimensions? I confess this part is not fully clear to me.

thanks !

xp-1000 commented 3 years ago

Ok, I added:

          - action: rename_dimension_keys
            mapping:
              k8s.cluster.name: kubernetes_cluster

in the signalfx exporter configuration, and now I get my metrics with the "old" dimension kubernetes_cluster.
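
For reference, a sketch of where this fragment ends up, under the exporter's translation_rules (note: setting translation_rules explicitly may replace the exporter's default rules, so check the defaults for your collector version):

    exporters:
      signalfx:
        translation_rules:
        - action: rename_dimension_keys
          mapping:
            k8s.cluster.name: kubernetes_cluster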

With this setup, all of our detectors will work for both the smart agent AND the otel collector without any change to them (e.g. for kubernetes: https://github.com/claranet/terraform-signalfx-detectors/tree/master/modules/smart-agent_kubernetes-common).

I think the other modules will be easier as long as we use the original smart agent monitors (and not a new otel receiver like for hostmetrics or kubernetes).

Now I will try to make kubernetes-events work, but I am afraid it will have the same problem as kubernetes-volumes with the kube API configuration (that said, this is less critical given that it is not a dependency for detectors/alerts).
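
In case it helps others, here is a rough sketch of what I plan to try for events through the smartagent receiver, based on the kubernetes-events monitor documentation (the whitelistedEvents entries below are only illustrative examples):

    receivers:
      smartagent/kubernetes-events:
        type: kubernetes-events
        whitelistedEvents:
        - reason: Created
          involvedObjectKind: Pod
        - reason: Unhealthy
          involvedObjectKind: Pod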

asuresh4 commented 3 years ago

@xp-1000 - I did a quick test and it appears both the new and old dimensions are available. However, the old dimension does not surface in the suggestions, similar to the issue you saw with metrics in the metric finder (renamed metrics not showing up), but you should be able to provide the filter using signalflow. You shouldn't need the additional dimension remapping on the collector. On checking with the relevant team, I've learnt that this is a known limitation. I would recommend you also open a support ticket highlighting that you've run into this.

I've opened #345, which I believe will fix the issue you're seeing with the kubeletAPI config.

@xp-1000 thank you for the feedback on the gaps in docs for migration. I realize that we don't have information around monitors that have already been ported over to OTel (such as hostmetrics, k8scluster, kubeletstats) for advanced configurations that give parity with the respective agent monitors. I'll work on adding some of this information. Please also let us know of other difficulties you've run into while trying to replace the Smart Agent. cc @rmfitzpatrick

xp-1000 commented 3 years ago

I did a quick test and it appears, both the new and old dimensions are available. However, it does not surface in the suggestions like the issue you saw with metrics in metric finder

Oh indeed, again my bad ^^ it seems applying my "transformation rule" fixes the suggestion filter, but in truth the dimension is there, you are right.

I've opened #345, which I believe will fix the issue you're seeing with the kubeletAPI config.

Awesome, thanks, I will test it (it will probably unblock me for the kubernetes-events monitor).

Please also let us know of other difficulties you've run into while trying to replace the Smart Agent.

It will be a pleasure. I am afraid I will not be able to make pull request contributions like I did on the smart agent until I get the hang of this otel collector, which is still pretty new for me, but I can at least report issues ;)

Your precious help is highly appreciated, thanks! I think you can close this issue.