open-telemetry / semantic-conventions

Defines standards for generating consistent, accessible telemetry across a variety of domains
Apache License 2.0

Add missing ECS cloud fields to Semantic Conventions Cloud Resource attributes #761

Open mlunadia opened 9 months ago

mlunadia commented 9 months ago

What

This issue proposes adding cloud-related fields from the Elastic Common Schema (ECS) which are not in the OpenTelemetry Semantic Conventions specification for Cloud Resource Attributes.

Why

These fields provide valuable context for understanding and analysing application performance and behaviour across cloud environments. They make it possible to analyse performance differences based on cloud configuration (e.g., account name for companies using multiple accounts, or machine type to help understand related performance and cost) and to better understand the impact of cloud infrastructure on application behaviour.

List of fields proposed for addition

| Attribute | Type | Description | Examples |
| --- | --- | --- | --- |
| cloud.account.name | string | Cloud account name/alias | elastic-dev |
| cloud.instance.id | string | Instance ID | i-1234567890abcdef0 |
| cloud.instance.name | string | Instance name | jenkins-1 |
| cloud.machine.type | string | Machine type | t2.medium |
| cloud.project.id | string | Cloud project identifier | my-project |
| cloud.project.name | string | Cloud project name | project |
| cloud.service.name | string | Cloud service name | ec2 |

This PR (currently closed) implements this issue.

pyohannes commented 9 months ago

How would cloud.instance.id, cloud.instance.name, and cloud.machine.type relate to host.id, host.name, and host.type?

The description of host.type currently says:

For Cloud, this must be the machine type.

mlunadia commented 9 months ago

Good points, @pyohannes. Given the flat field structure in ECS, it made sense to add them all, but since the pairs below might be mutually exclusive, we can consider removing these cloud.* fields from the PR.

cloud.instance.id - host.id
cloud.instance.name - host.name
cloud.machine.type - host.type
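
The pairs above can be sketched as a small lookup. This is only an illustration: the names `OVERLAPPING`, `PROPOSED`, and `to_otel_attribute` are hypothetical and not part of ECS or the OpenTelemetry specification, and the four remaining fields are assumed to keep their ECS names, as proposed in this issue.

```python
# Hypothetical mapping: ECS cloud fields that overlap existing host.* attributes.
OVERLAPPING = {
    "cloud.instance.id": "host.id",
    "cloud.instance.name": "host.name",
    "cloud.machine.type": "host.type",
}

# Fields from the proposal with no host.* counterpart raised in this thread.
PROPOSED = {
    "cloud.account.name",
    "cloud.project.id",
    "cloud.project.name",
    "cloud.service.name",
}


def to_otel_attribute(ecs_field: str) -> str:
    """Return the attribute name an ECS cloud field would map to (sketch)."""
    if ecs_field in OVERLAPPING:
        return OVERLAPPING[ecs_field]
    if ecs_field in PROPOSED:
        return ecs_field
    raise KeyError(f"not part of this proposal: {ecs_field}")
```

With this mapping, `cloud.machine.type` resolves to `host.type`, while `cloud.account.name` stays under `cloud.*`.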

cc: @mx-psi @ChrsMark @frzifus @dineshg13 @braydonk who worked on the system semantic conventions for comment.

kaiyan-sheng commented 8 months ago

Agree! Thanks @pyohannes! We are already using host.id and host.name in our cloud provider monitoring solutions. host.type also makes total sense in this case!

mx-psi commented 8 months ago

I think removing cloud.instance.id, cloud.instance.name, and cloud.machine.type from the list makes sense

ChrsMark commented 8 months ago

This seems to also be relevant to https://github.com/open-telemetry/semantic-conventions/pull/576, https://github.com/open-telemetry/semantic-conventions/issues/739 and https://github.com/open-telemetry/semantic-conventions/pull/600

In general I like the idea of re-using the host.* attributes but on the other hand I find it difficult to control this overloading approach.

For example, we already have gcp.gce.instance.name, but how is this different from host.name? If there is a need for provider-specific attributes, then it would be more future-proof to have a unified one called cloud.instance.name, right?

Also, we should be very specific about how we leverage the resource hierarchy here. For example, in a Kubernetes environment, host.name can take 3 different values depending on the Collector config:

  processors:
    resourcedetection/system:
      detectors: [ "system" ]
      system:
        hostname_sources: [ "lookup", "cname", "dns", "os" ]
        resource_attributes:
          host.name:
            enabled: true
    resourcedetection/gcp:
      detectors: [ env, gcp ]
      timeout: 2s
      override: false

a) If we add the gcp resource detector with override: true, it will be the name of the GCP machine.
b) If we run the Collector as a Pod with hostNetwork: true, then the value is the name of the k8s Node (== GCP node) plus the dnsdomainname.
c) If we run the Collector as a Pod with hostNetwork: false, then the value is the name of the Pod.
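
The three cases can be modelled as a small decision function. This is a sketch of the outcomes described above, not Collector code: the function `expected_host_name` and all of its parameters are hypothetical, and joining the node name and DNS domain with a dot is an assumption about what "plus the dnsdomainname" looks like in practice.

```python
def expected_host_name(
    gcp_override: bool,
    host_network: bool,
    gcp_instance: str,
    node: str,
    pod: str,
    dns_domain: str = "",
) -> str:
    """Model which value host.name ends up with in cases a), b) and c)."""
    if gcp_override:
        # a) the gcp detector runs with override: true and replaces the value
        return gcp_instance
    if host_network:
        # b) hostNetwork: true, so the Pod sees the node's hostname
        #    (assumed here to be "<node>.<dns domain>" when a domain is set)
        return f"{node}.{dns_domain}" if dns_domain else node
    # c) hostNetwork: false, so the Pod's own hostname is its Pod name
    return pod
```

The same Collector binary therefore reports three different host.name values depending only on how it is deployed and configured.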


This can be very confusing for users, especially in multi-tenant infrastructures where multiple teams run multiple Collector instances per org/team/namespace.

So to my mind we should be very specific here and either:
1) introduce the cloud.*-specific values to ensure we don't mix things up and leave our users with misleading outcomes. This is mostly based on the idea @jsuereth proposed at https://github.com/open-telemetry/semantic-conventions/pull/600#discussion_r1506377547, if I'm not mistaken.
2) or re-use the host.* attributes with very strict guidance on what they mean for implementations in cloud environments.
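
The difference between the two options can be sketched as follows. This is illustrative only: `resource_attributes` is a hypothetical helper, and option 2 assumes that the "strict guidance" would require host.name to carry the cloud instance name, which this thread has not decided.

```python
def resource_attributes(option: int, instance_name: str, os_hostname: str) -> dict:
    """Build the resource attributes each option would produce (sketch)."""
    if option == 1:
        # Option 1: keep host.name as detected and add a separate cloud.* value
        return {
            "host.name": os_hostname,
            "cloud.instance.name": instance_name,
        }
    if option == 2:
        # Option 2 (assumed guidance): on cloud, host.name is the instance name
        return {"host.name": instance_name}
    raise ValueError(f"unknown option: {option}")
```

Under option 1 both names survive side by side; under option 2 the OS-level hostname is lost unless it is recorded elsewhere.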

Let me know what you folks think, or if I'm missing something here.