Semantic Conventions for File Attributes

djaglowski commented 3 years ago

When consuming logs from a file, information about the file is commonly included as metadata on the log record.

1. Should this information be standardized?

If so:

2. What is the schema for this metadata?
3. Does the metadata belong in Resource or Attributes?

As a starting point for discussion, I propose:

1. Yes.
2. Establish a file.* namespace:
   - file.name - The basename of the file (e.g. mylog.log).
   - file.path - The absolute path of the file (e.g. /var/log/mylog.log).
   - file.name.resolved - Same as file.name, but with symlinks resolved.
   - file.path.resolved - Same as file.path, but with symlinks resolved.
   - file.stream - When relevant, stdout or stderr.
3. In the case of log ingestion, Attributes is appropriate. The file from which logs are being consumed is not the component that emitted the logs. Rather, it is a medium for conveying logs.
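
As a rough illustration of how an ingesting agent might derive these values, here is a minimal Python sketch. The function name `file_attributes` and its optional `stream` parameter are hypothetical and not part of any OpenTelemetry API; only `os.path` from the standard library is used.

```python
import os


def file_attributes(path, stream=None):
    """Build the proposed file.* attributes for a log file (illustrative only)."""
    # Absolute path as given, without following symlinks.
    absolute = os.path.abspath(path)
    # realpath follows every symlink along the path.
    resolved = os.path.realpath(absolute)
    attrs = {
        "file.name": os.path.basename(absolute),
        "file.path": absolute,
        "file.name.resolved": os.path.basename(resolved),
        "file.path.resolved": resolved,
    }
    # file.stream is only set when relevant (stdout or stderr).
    if stream is not None:
        attrs["file.stream"] = stream
    return attrs
```

For a symlink such as mylog.log pointing at a rotated file, the two `.resolved` values would describe the target while `file.name` and `file.path` keep describing the link itself.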

arminru commented 3 years ago

Hey @djaglowski!

1. I'd also say yes.
2. file.* is ambiguous. Here we'll have to discuss if this is only about log files or if this could be any kind of file and the "role" of the file would depend on the context. If it was a trace attribute for an operation reading from or writing to a file, this would be a different semantic than for a log line that was read from that file.
   2a. file.name and file.path overlap. Is this intended or should we rather use the directory instead of full path? These details can, however, be best discussed on a PR.
3. We don't have any dedicated set of semantic conventions for log attributes yet but I assume it makes sense to have that. What do the other @open-telemetry/technical-committee members think? We'll also have to extend the semconv generator support to metrics anyway. Currently they're all hand-crafted markdown files with no YAML representation which works for now since they're only a subset of trace attributes but depending on the direction we're going here we might want to have separate definitions for logs and metrics attributes that aren't trace attributes or generally depart from that distinction.

tigrannajaryan commented 3 years ago

3. We don't have any dedicated set of semantic conventions for log attributes yet but I assume it makes sense to have that. What do the other @open-telemetry/technical-committee members think? We'll also have to extend the semconv generator support to metrics anyway.

I think we need to have one "semantic conventions" set of documents and label each individual semantic convention with up to 4 possible labels that indicate which signal/data type the convention applies to: resource, traces, metrics, logs. As of today, most of the logs conventions also apply to traces and vice versa.

tigrannajaryan commented 3 years ago

2. file.* is ambiguous. Here we'll have to discuss if this is only about log files or if this could be any kind of file and the "role" of the file would depend on the context.

Can we come up with a better name? FYI, Elastic Common Schema has a file.* namespace https://www.elastic.co/guide/en/ecs/current/ecs-file.html#ecs-file that has things like path, name, target_path which seem to serve a similar purpose.

If it was a trace attribute for an operation reading from or writing to a file, this would be a different semantic than for a log line that was read from that file.

Why would it be a different semantic? For example if I do some sort of file processing and want to report it as a span wouldn't it be a good fit to specify file.name as an attribute of the span?

Oberon00 commented 3 years ago

If these file conventions can be designed to be used in span and log attributes with the same semantics, that would be ideal IMHO, and using the file namespace seems best.

tigrannajaryan commented 3 years ago

file.* is ambiguous. Here we'll have to discuss if this is only about log files or if this could be any kind of file and the "role" of the file would depend on the context.

Can we come up with a better name? FYI, Elastic Common Schema has a file.* namespace https://www.elastic.co/guide/en/ecs/current/ecs-file.html#ecs-file that has things like path, name, target_path which seem to serve a similar purpose.

If it was a trace attribute for an operation reading from or writing to a file, this would be a different semantic than for a log line that was read from that file.

Why would it be a different semantic? For example if I do some sort of file processing and want to report it as a span wouldn't it be a good fit to specify file.name as an attribute of the span?

@arminru can you please comment on this ^^^?

arminru commented 3 years ago

@tigrannajaryan

Can we come up with a better name? FYI, Elastic Common Schema has a file.* namespace https://www.elastic.co/guide/en/ecs/current/ecs-file.html#ecs-file that has things like path, name, target_path which seem to serve a similar purpose.

Why would it be a different semantic? For example if I do some sort of file processing and want to report it as a span wouldn't it be a good fit to specify file.name as an attribute of the span?

If we want to add this as a generic attribute which always describes the file that a certain span, log or metric is about, then file.* should be fine indeed. I thought that we might want to distinguish between a file being operated on and the log file from which a log is coming. If we, for example, had a structured log message about a failed file read operation and we extracted this log message from a log file, then we'd have two files in question and it would be ambiguous which one is described by the file.* attribute. Hence I thought we should probably add a separate attribute dedicated to a log file as the source of a log record. WDYT?
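
The two-file ambiguity can be made concrete with a small Python sketch. The example paths, the dictionary names, the helper `with_prefix`, and the `logsource` prefix used to separate the two files are all illustrative assumptions, not a convention agreed at this point in the thread.

```python
# The file the logged operation failed to read (hypothetical path).
operated_file = {"file.name": "config.yaml", "file.path": "/etc/app/config.yaml"}
# The log file the record itself was read from (hypothetical path).
source_file = {"file.name": "app.log", "file.path": "/var/log/app.log"}

# Sharing one file.* namespace: the later dict overwrites the earlier one,
# so only one of the two files survives on the record.
merged = {**operated_file, **source_file}


def with_prefix(prefix, attrs):
    """Return a copy of attrs with every key moved under the given prefix."""
    return {f"{prefix}.{key}": value for key, value in attrs.items()}


# A dedicated prefix for the log source keeps both files distinguishable.
separated = {**operated_file, **with_prefix("logsource", source_file)}
```

In `merged` the operated-on file is lost, while `separated` carries four attributes and leaves plain file.* free to describe the file the log message is about.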

djaglowski commented 3 years ago

Based on discussion in the Spec SIG, I'm suggesting the following approach:

First, the immediate problem can be considered as a logs-specific attribute set. This would mean prefixing the above attributes, such that we would have:

logsource.file.name - The basename of the file (e.g. mylog.log).
logsource.file.path - The absolute path of the file (e.g. /var/log/mylog.log).
logsource.file.name.resolved - Same as file.name, but with symlinks resolved.
logsource.file.path.resolved - Same as file.path, but with symlinks resolved.
logsource.file.stream - When relevant, stdout or stderr.

The choice of logsource is an initial proposal, but of course is open to feedback.

Second, as a possible separate proposal, the notion of establishing a "structured value" should be explored. The general idea would be that a common structure could be established that is reusable in multiple contexts within the project's semantic conventions. In the context of a log source, the semantic convention would establish that the logsource attribute should have a structured value that is essentially:

fileType {
    file.name: string,
    file.path: string,
    file.name.resolved: string,
    file.path.resolved: string,
    file.stream: string,
}

A mechanism for defining and referring to structured values would then facilitate commonality across the semantic conventions by allowing reuse. Referencing this type in a specification would then logically produce a corresponding set of attributes. Logically:

some_context: fileType

would effectively define

some_context.file.name
some_context.file.path
some_context.file.name.resolved
some_context.file.path.resolved
some_context.file.stream
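
A minimal Python sketch of this expansion, assuming the structured type is nothing more than a list of relative attribute names. The names `FILE_TYPE` and `expand` are hypothetical and only illustrate the idea; no such mechanism exists in the specification.

```python
# Relative attribute names that make up the proposed file type.
FILE_TYPE = (
    "file.name",
    "file.path",
    "file.name.resolved",
    "file.path.resolved",
    "file.stream",
)


def expand(context, field_names=FILE_TYPE):
    """Return the attribute names defined by referencing a type from a context."""
    return [f"{context}.{name}" for name in field_names]
```

Calling `expand("some_context")` yields the five names listed above, and `expand("logsource")` yields the logs-specific set from the first point.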

With this second point in mind, the logs-specific attribute set would ideally be defined in such a way that it is broadly applicable, such that an eventual structured value for describing a file could replace the initial logs-specific attributes in a non-breaking way. Of course this cannot be guaranteed, but we can make a point to consider whether these attributes would be broadly useful. I am satisfied that they are, but I'm calling this out in case anyone has further thoughts on this before we move forward.

tigrannajaryan commented 3 years ago

Second, as a possible separate proposal, the notion of establishing a "structured value" should be explored.

This likely requires an OTEP since it has significant implications (the Trace API currently disallows structured values, except the most simple ones - homogeneous arrays).

tigrannajaryan commented 3 years ago

I may have misunderstood what you wrote. Do you suggest that we do this (in JS notation for simplicity):

attributes["some_context"] = {
  "file.name": "abc",
  "file.path": "/var/lib/abc",
  ...
}

or this:

attributes["some_context.file.name"] = "abc"
attributes["some_context.file.path"] = "/var/lib/abc"
...

djaglowski commented 3 years ago

@tigrannajaryan I was imagining the latter, but really only because the existing conventions use string keys, as far as I've seen.

Having thought about it more (and a good thing - see below), I am even more in favor of the string-only approach. The rules for attribute naming, specifically the third and fourth, ensure that the two formats are effectively interchangeable. If I'm correct on that point then it seems we might as well decide based on simplicity and compatibility.
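
To illustrate what "effectively interchangeable" could mean here, the following Python sketch converts between the nested and the dotted-string forms. The helpers `flatten` and `unflatten` are hypothetical, and the round trip is only assumed to hold when no attribute name is also used as a namespace.

```python
def flatten(nested, prefix=""):
    """Turn nested maps into a flat map with dot-joined string keys."""
    flat = {}
    for key, value in nested.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def unflatten(flat):
    """Rebuild nested maps by splitting each key on dots.

    Assumes no key is also a prefix of another key.
    """
    nested = {}
    for name, value in flat.items():
        *namespaces, leaf = name.split(".")
        node = nested
        for part in namespaces:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested
```

Both notations from the question above flatten to the same string keys, which is why choosing between them on simplicity and compatibility seems reasonable.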

I would also say that I'm less convinced of the need for any kind of code-level support of structured values, though I still think it would be good to formalize some notion of what I would call "relative namespaces" (e.g. .file).

I'm really glad you asked this, because thinking about the extent to which the two representations are almost interchangeable without any rules helped me recognize that my own proposal seemed to demonstrate a case in which they are not. Then I found the attribute naming rules, and particularly that names SHOULD NOT coincide with namespaces.

Adjusting for this, I now propose the following strings:

logsource.file.name
logsource.file.path
logsource.file.name_resolved
logsource.file.path_resolved
logsource.file.stream
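
The rule that names should not coincide with namespaces can be checked mechanically. The Python sketch below uses a hypothetical helper, `names_coinciding_with_namespaces`, which treats every dot-separated prefix of a name as a namespace; it is an illustration, not tooling from the specification.

```python
def names_coinciding_with_namespaces(names):
    """Return the names that are also used as a namespace by another name."""
    names = set(names)
    namespaces = set()
    for name in names:
        parts = name.split(".")
        # Every proper dot-separated prefix of a name is a namespace.
        for i in range(1, len(parts)):
            namespaces.add(".".join(parts[:i]))
    return sorted(names & namespaces)


earlier = [
    "logsource.file.name",
    "logsource.file.path",
    "logsource.file.name.resolved",
    "logsource.file.path.resolved",
    "logsource.file.stream",
]
adjusted = [
    "logsource.file.name",
    "logsource.file.path",
    "logsource.file.name_resolved",
    "logsource.file.path_resolved",
    "logsource.file.stream",
]
```

With the earlier dotted names, `logsource.file.name` and `logsource.file.path` are each both a name and a namespace; with the underscore variants the check returns an empty list.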

arminru commented 3 years ago

The proposal in your last comment sounds great to me, @djaglowski. Thanks for iterating on this. Do you intend to go ahead with a PR to add these attributes?

djaglowski commented 3 years ago

@arminru, I'll make the PR

open-telemetry / opentelemetry-specification

Semantic Conventions for File Attributes #1772