open-telemetry / semantic-conventions

Defines standards for generating consistent, accessible telemetry across a variety of domains
Apache License 2.0
253 stars 165 forks source link

FaaS metrics - Not clear which attributes should be added to each metric #86

Open joaopgrassi opened 1 year ago

joaopgrassi commented 1 year ago

The current FaaS metric semantic conventions list several metrics, and then lists the attributes .

I'm working on moving the metrics to YAML, and it is not clear which attributes should be added to each metric.

Before the attribute table, it is stated:

Below is a table of the attributes to be included on FaaS metric events

then after, there's this statement:

Outgoing FaaS invocations are identified using the faas.invoked* attributes above. faas.trigger SHOULD be included in all metric events while faas.invoked* attributes apply on outgoing FaaS invocation events only.

It says the attributes faas.invoked_* should be included in "outgoing FaaS invocation events only" but then it says "Outgoing FaaS invocations are identified using the faas.invoked_*". The metric table does not list what attributes are to be added, so it's unclear which metric is "outgoing" and which are not.

In the FaaS span semconv we have more info:

Incoming invocations

This section describes incoming FaaS invocations as they are reported by the FaaS instance itself. For incoming FaaS spans, the span kind MUST be Server.

Outgoing invocations

This section describes outgoing FaaS invocations as they are reported by a client calling a FaaS instance. For outgoing FaaS spans, the span kind MUST be Client

I looked at the existing metrics and after reading all, my "gut feeling" is:

CC @skonto since you were the original author, could you shed some light on this?

Oberon00 commented 1 year ago

faas.invoke_duration : the client invoking the function knows how long it took

I think purely from the naming I would agree, but then the important duration of the server-side execution (which was renamed to invocation to reduce confusion https://github.com/open-telemetry/opentelemetry-specification/pull/3209) is missing. OTOH, it does seem like a bad idea to have server and client execution/invocation duration under the same metric key, and one of them is definitely missing

Oberon00 commented 1 year ago

It may be sensible to drop "outgoing FaaS" entirely from metrics. AFAIK this concept is only really supported by AWS Lambda which has a Invoke REST API as the lowest level public API to invoke a Lambda. Other cloud vendors seem to not offer such an API, or only offer it in a very limited way for debugging only (GCP). On other providers (Azure, GCF), you would make an ordinary HTTP request to invoke the function

lmolkova commented 1 year ago

faas.timeouts, faas.errors

seems relevant to a status attribute discussion (and then the timeout/error is reported as a status code and not an individual metric) - https://github.com/open-telemetry/opentelemetry-specification/issues/3243

faas.invocations

It seems it can be derived from faas.invoke_duration histogram, is this metric is necessary at all?

joaopgrassi commented 1 year ago

faas.invocations

It seems it can be derived from faas.invoke_duration histogram, is this metric is necessary at all?

Good question. faas.invocations says: Number of successful invocations. If faas.invoke_duration also counts the number of successful ones (which I think it does/should) then I think you are right, it can be derived.