open-telemetry / semantic-conventions

Defines standards for generating consistent, accessible telemetry across a variety of domains
Apache License 2.0
244 stars 158 forks source link

Embed functionality v0.2 #1264

Open trisch-me opened 1 month ago

trisch-me commented 1 month ago

After several discussions in comments and during our semconv and tooling meetings we came to conclusion to rewrite embed functionality to make embed more prominent and visible because it has potential to influence and change not only the schema but also other parts of Open Telemetry such as APIs etc.

So the proposal is to make an embed a part of the group as opposed to the part of attributes before. Both embedded namespaces and embed attributes will be visible in separate block, so the developer would know that those are treated differently

groups:
   - id: registry.client
     prefix: client
     type: attribute_group
     brief: "Describes client"

     embed:
     - attribute_groups: 
        - ref: geo
          as: location (if omitted, the name is equal as namespace prefix, in this example geo)
          brief: "My own embedded geo space"

          attributes: 
          - ref: geo.lat (override only those attributes from the group which needs to be overridden. 
                             Other attributes are added AS IS)
            brief: "This attribute is required"

      - attributes: 
         - ref: geo.lat
           as: my.latitude
           brief: "My own embedded latitude"
         - ref: geo.lon

Embedded namespaces


### Embedded singular attributes

- Embedded singular attributes will be listed under `attributes`. They might have an optional parameter `as`. If omitted, name is equal to the embedded attribute, for example resolved schema will be `client.geo.lat`. If provided, the name of the embedded attribute will be equal to the value of `as`, such as `client.my.latitude` where `my.latitude` is the alias for `geo.lat`.
- Embedded singular attributes might have it's own brief.
- Example of embedded singular attributes which resolves to `file.owner` and `file.creator`. It might be important if embedded attribute has some restrictions or requirements, so we don't redefine them everywhere multiple times but only reference them in embed section:
```yaml
groups:
   - id: registry.file
     prefix: file
     type: attribute_group
     brief: "Describes file"

     embed:
     - attributes: 
         - ref: user.name <-- embedded attribute
           as: owner
           brief: "Owner of the file"
         - ref: user.name <-- embedded attribute
           as: creator
           brief: "Creator of the file"

Questions:

implementation details

This issue is replacing https://github.com/open-telemetry/build-tools/issues/240 due to number of comments there which makes it a bit harder to follow up on final solution. I will update this message if there will be changes required.

trisch-me commented 1 month ago

@jsuereth @lmolkova @lquerel please take a look. I might have missed something but wanted to have discussion started so I will start with implementation sooner

lmolkova commented 1 month ago

Thank you!

Some questions/comments:

  1. what would be the scenario for as (i.e. embedding geo as location) ? Do we really need it? Would it replace the first segment or all segments? One of the reasons for embeddings is to have reusable concepts and renaming attributes would defeat this purpose. I'd prefer to avoid having this feature until we know if/why it's needed.

  2. embed the whole namespace: I don't see any candidate for embed-namespace today and see plenty of cases for embedding a single attribute. By embedding the whole namespace:

    • I'm auto-subscribing to all future changes in that namespace whether they are applicable or useful in my namespace or not.
    • I'm defaulting to generic attribute descriptions and notes for attributes I don't mention explicitly. It might be ok for experimental specs, but really any high-quality semconv should update briefs/notes and specify what this attribute means in the context of this semconv.

    E.g. geo defines a lot of things such as continent name and also continent code, same with url or user - one might be applicable in one place, but not in the other. I believe explicitly declaring each attribute to be embedded would result in clearer spec and fewer attributes we don't have a use-case for.

  3. If we don't need embed-all, things for common case (embed-a-few) become much easier:

     groups:
      - id: registry.foo
        prefix: foo
        ....
    
        attributes:
          - embed: user.id # or `embed_one`
            brief: Foo-specific definition of user.id
          - id: bar # regular declaration   

    It we need to introduce embed-all, it can still be done with

     groups:
       - id: registry.foo
         prefix: foo
         ....
         attributes:
           - embed_namespaces: [user, url] # you can't redefine briefs/notes/etc when embedding all. If you want to redefine, import attributes one-by-one
           - id: bar # regular declaration   

    or

     groups:
       - id: registry.foo
         prefix: foo
         ....
         embed_namespaces: [user, url] # you can't redefine briefs/notes/etc
         attributes:
           - id: bar # regular declaration   

    It seems the limitation of not being able to update briefs/etc of individual attributes when importing namespace is inevitable (i.e. you either need to mention every attribute explicitly or the fact that all unmentioned attributes are embedded too becomes implicit).

  4. Re "should we restrict embed to only registry?"

    Embedding (in my view) is a new attribute definition (with some restrictions - same type and name suffix). It can only be done in the registry. Also, there should be no requirement level properties as we don't provide requirement levels in the registry.

    Somebody should be able to reference such attributes in a non-registry group, e.g.

    groups:
     - id: registry.foo
       prefix: foo
       ....
    
      attributes:
        - embed: user.id # or `embed_one`
          brief: Foo-specific definition of user.id
        - id: bar # regular declaration   
    
     - id: traces.foo
       ....
    
       attributes:
         - ref: foo.user.id
           requirement_level: required  

TL;DR: I think semantic conventions have to explicitly mention all attributes they want to embed and need to override notes/briefs to explain what these attributes mean in the context of that group. If we wanted to reuse attributes as is, we could always reference/extend a group instead of embedding it.

I believe blindly embedding all is dangerous and would result in a lot of unused/misused attributes.

I.e. I don't think embedding a namespace is something we should implement and without it we can design things to be much easier and more readable.

trisch-me commented 1 month ago

what would be the scenario for as (i.e. embedding geo as location) ? Do we really need it? Would it replace the first segment or all segments?

it's described in the proposal. process.parent field is of type process itself.

I don't see any candidate for embed-namespace today

We do have geo namespace waiting for this feature to be implemented in order to be used inside client and server namespace. As well as mentioned process.parent. And those are not only usecases, but most prominent and needed right now

If we don't need embed-all...

We do need embed all, this is the main reason for this functionality. Embedding of dedicated attributes was just a nice add-on for this feature.

MadVikingGod commented 1 month ago

For the single attributes we already have ref and after first look it seems the only feature added here is the as field. Could this rename feature be strong enough on its own to be added as a feature to the current attribute model, e.g. add as to all attributes?

trisch-me commented 1 month ago

@MadVikingGod ref has another meaning. Embed not only reference attribute but also adds groups prefix to it. So during embedding of geo.lat into client the resolved attribute name will be client.geo.lat while during referencing it it will stay the same geo.lat

trisch-me commented 1 month ago

To reiterate some points from todays discussion here in written form:

On additional note I feel also we are mixing things within the registry and outside of it.

As of now embed will be used only to define things within the registry, for example client.geo.* means that client might use all the fields from geo. But we will use those fields outside of the registry through the usual ref, it means one can use only those attributes we have defined already in registry, maybe all of them, maybe only some.

It means one might use only ref: client.geo.lat or only ref: process.parent.id in yaml definitions outside of the registry. I think this might answer some questions @lmolkova had.

We might also decide to use embed outside of the registry but this might bring even more confusion so I propose to refrain from it until we will have a use case.

Let's continue a discussion here to find the solution @jsuereth @lmolkova @joaopgrassi @lquerel

lmolkova commented 1 month ago

Extracting the main requirement from the comment https://github.com/open-telemetry/semantic-conventions/issues/1264#issuecomment-2244141016

we should not define attributes that are not going to be used. In other words any changes to embeddable namespace need to be compatible to and meaningful in all the groups that embed them.

I don't see good examples in existing semconv where we have namespaces satisfying those requirements. Our existing approach is to "describe everything about {namespace}". To leverage typed embedding, we'd need to change how we structure things.

E.g.

The ways I see that can satisfy these requirements:

  1. do not allow namespace embedding (i.e. typed flattened structure represented with a set of attributes).
  2. use namespace embedding only for small, well-scoped, complete, clear definitions of something
  3. maybe it's possible to leverage requirement levels and say that certain properties of embeddable group are optional and are not embedded

The solution I'm suggesting (p1) is a two-way door. It does not block p2 in the future.

I'm suggesting the following litmus test before we introduce p2:

I.e. I want to have a real example where we embed the whole namespace which is merged and part of the semconv first before we introduce p2.

trisch-me commented 1 month ago

I think we need to continue discussion on embedded namespaces but I would like to start with embedded attributes because I think there is no strong objection against it.

I have a following usecase - an application needs to access foo.csv and may write a log describing this interaction to a log file which is called something else, say application.log

Let's discuss 2 options to describe the embed attributes

A. inside the group on the top level 🎉

groups:
   - id: registry.log
     prefix: log
     type: attribute_group
     brief: "Describes log file"

     embed:
        - attributes: <-- embedded attributes
          - ref: file.name
            as: file.name <-- this is can/should be omitted because it's the same name as ref
            brief: 'Filename of the log file'
            example: 'application.log'
          - ref: file.name
            as: event.file.name
            brief: "The file that the action was performed on."
            example: 'foo.csv'

    attributes:  <-- normal attributes
      - id: record.uid ...

resolving into

log.file.name - Filename of the log file
log.event.file.name - The file that the action was performed on.
log.record.uid - ...

another example with process

groups:
   - id: registry.process
     prefix: process
     type: attribute_group
     brief: "Describes process"

     embed:
        - attributes: <-- embedded attributes
          - ref: process.id
            as: parent.id
            brief: 'Id of parent process also know as ppid'
            example: '123'

    attributes:  <-- normal attributes
      - id: id
        brief: Process id
        note: Additional info <-- this will be used for both parent and process id

resolving into

process.id - Process id
process.parent.id - Id of parent process also know as ppid

Pros:

Cons:

B. inside the attributes section 🚀

Resolutions for these examples will be the same as above

groups:
   - id: registry.log
     prefix: log
     type: attribute_group
     brief: "Describes log file"

    attributes:  <-- attributes section
      - id: record.uid <-- usual attribute
      - embed: file.name <-- embedded
        as: file.name <-- this is can/should be omitted because it's the same name as ref
        brief: 'Filename of the log file'
        example: 'application.log'
      - embed: file.name
        as: event.file.name
        brief: "The file that the action was performed on."
        example: 'foo.csv'
groups:
   - id: registry.process
     prefix: process
     type: attribute_group
     brief: "Describes process"

    attributes:  <-- attributes section
      - id: id
        brief: Process id
        note: Additional info <-- this will be used for both parent and process id
      - embed: process.id
        as: parent.id
        brief: 'Id of parent process also know as ppid'
        example: '123'

Pros:

Cons:

You can simply vote for the proposed option through emoji or put your alternatives in comments.

lmolkova commented 1 month ago

If we don't pursue embedding namespaces, then option B seems super-clean and I'd be happy with it. I still have concerns about as and would suggest to remove it from the initial scope.

I believe the reason to consider option A as to make embedding namespaces and attributes consistent with each other. It'd be hard to decide on a single attributes without having some basic design for namespaces.

lmolkova commented 1 month ago

Need to put attention to embed attributes as they could be mixed with other id attributes

could you elaborate? what's the problem?

One suggestion is to do

prefix: client
attributes:
  - embed: client.geo.location.longitute
    from: geo.location.longitute
    brief: ...

(parent is tricky - renaming attributes as a part of embedding seems controversial, so make be we can take it separately?).

We can validate that client.geo.location.longitute is a prefix + from. This approach (using fully qualified attribute name in the embed) would also help with issues we already have today with id and prefix https://github.com/open-telemetry/semantic-conventions/issues/1292

lmolkova commented 1 month ago

the middle-ground:

prefix: foo
embeded_namespaces:
  - id: foo.file # the name of resulting namespace to generate
    from: file # either a group id or a whole namespace which can in theory be defined across multiple groups
    attributes: 
      - embed: foo.file.name
        # we can try to figure it out, but it's better to be explicit and 
        # then we can validate if it's the same as `embed` without `prefix`
        from: file.name 
        ...
      ...
embeded_attributes: # we can as well do it inside regular `attributes`, I have no strong opinion here
  - embed: foo.geo.location.longitute
    from: geo.location.longitute
    brief: ...

produces:

Pros:

Cons

trisch-me commented 1 month ago

I still have https://github.com/open-telemetry/semantic-conventions/issues/1264#issuecomment-2244141016 and would suggest to remove it from the initial scope.

We can't remove it from initial scope, we need to differentiate different embeddings of the same object within the same namespace. I have provided 2 examples for this case, how do you propose to implement them without using alias?

Need to put attention to embed attributes as they could be mixed with other id attributes could you elaborate? what's the problem?

This was a request from @jsuereth, he wanted to have embed attributes separate because of their meaning. I don't have strong opinion on it, I'm ok with both options, where A is a bit better in the long run, and B is better for only attributes.

I am ok with the idea of using fully qualified name in embed as well. It aligns with idea to be explicit

In your example I see almost the same structure that was already proposed but with different names, let me rename it for the same group.

groups:
   - id: registry.process
     prefix: process
     type: attribute_group
     brief: "Describes process"

    attributes:  <-- attributes section
      - id: id
        brief: Process id
        note: Additional info
      - embed: process.parent.id
        from: process.id
        brief: 'Id of parent process also know as ppid'
        example: '123'

But in this example I feel that as gives better meaning

groups:
   - id: registry.process
     prefix: process
     type: attribute_group
     brief: "Describes process"

    attributes:  <-- attributes section
      - id: id
        brief: Process id
        note: Additional info <-- this will be used for both parent and process id
      - embed: process.id
        as: process.parent.id <-- I have added here prefix for better clarity
        brief: 'Id of parent process also know as ppid'
        example: '123'
jsuereth commented 1 month ago

I'd like to add consideration for the inverse relationship for embed.

E.g. Instead of embed living on the consumer, we do something like:

groups:
  id: registry.types.geo
  type: embeddable_group
  brief: 'The set of attributes which describe a geographcial location'
  prefix: geo
  attributes:
     - id: lat
        ...
     - id: lon
       ....

And then used like the following:

groups:
  id: some_user_of_this
  type: attribtue_group
  prefix: some_user_of_this
  attributes:
    - ref_group: registry.types.geo
      name: location

Leads to to the attributes some_user_of_this.location.lat and some_user_of_this.location.lon.

trisch-me commented 1 month ago

@jsuereth in your example do you want to emphasise a need for embedded groups to have a special type i.e. embeddable_group?

Because otherwise your example looks pretty similar to what I propose, just different keywords ref_group for embed and name for as

And we also can have instead of new type a new property allow_embedding: true

lmolkova commented 1 month ago

We can't remove it from initial scope, we need to differentiate different embeddings of the same object within the same namespace. I have provided 2 examples for this case, how do you propose to implement them without using alias?

we don't before we discuss it in addition to embedding. I.e. it's a feature on top, not a core part.

  - embed: process.id
   as: process.parent.id <-- I have added here prefix for better clarity
   brief: 'Id of parent process also know as ppid'

I'd like to have full qualified name in the declaration. In this case, as should be in the declaration.

- as: process.parent.id 
  from: parent.id

If you're ok renaming, we can combine things into

groups:
  - id: registry.types.geo.location
    type: embeddable_attribute_group
    brief: 'The set of attributes which describe a geographcial location'
    prefix: geo.location
    attributes:
       - id: lat
          ...
       - id: lon
         ....

  - id: registry.foo
    prefix: foo
    embeded_namespaces:
      - as: foo.geo.location # id seems ok here too
        from: embeddable_attribute_group
        attributes: 
          - as: foo.geo.location.lon
            from: geo.location.lon
            ...
   # or
    embeded_attributes: 
      - as: foo.geo.location.lon
        from: geo.location.lon
        brief: ...
      - as: foo.geo.location.lat
        from: geo.location.lat
        brief: ...

Any objections?

trisch-me commented 1 month ago

Oh I see your point, as with id declaration you want embed also to have full name in the beginning to emphasise it? I think it's ok too.

So you propose then to have different groups to support embedded namespaces and embedded attributes. Do you think it gives better clarity to the reader then to have embed block with namespaces and attributes inside?

Can I say that we agree on moving on with embedded attributes and we just discussing syntax?

I see a couple of options proposed here:

    embeded_attributes: 
      - as: process.parent.id
        from: process.id
        brief: ...
    attributes:
      - id: ...
    embeded_attributes: 
      - id: client.geo.lon
        from: geo.lon
        brief: ...
    attributes:
      - id: ...
    embeded_attributes: 
      - name: client.geo.lon
        from: geo.lon
        brief: ...
    attributes:
      - id: ...

I feel that when using it in other yaml files outside of registry id seems to be better option because it is similar to what we have used before with ref

groups:
  id: mygroup
  type: attribute_group
  brief: ...
  attributes:
    - ref: process.id <------------} both names are taken from `id` declaration. One from normal attributes
    - ref: process.parent.id < -----}                     another from embedded
lmolkova commented 1 month ago

So you propose then to have different groups to support embedded namespaces and embedded attributes.

no strong opinion here, but I'd like to minimize nesting for what I consider to be the common case.

groups:
  - id: registry.foo
    prefix: foo
    embed:
      namespaces:
        - as: foo.geo.location # id seems ok here too
          from: embeddable_attribute_group
          attributes: 
            - as: foo.geo.location.lon
              from: geo.location.lon
              ...
      attributes: 
        - as: foo.geo.location.lon
          from: geo.location.lon
          brief: ...
        - as: foo.geo.location.lat
          from: geo.location.lat
          brief: ...

seems fine too.

I see a couple of options proposed here:

do you mean as vs id vs name? I feel strongly against id (or ref) because it would conflict with existing id on attributes. name looks great.

Can I say that we agree on moving on with embedded attributes and we just discussing syntax?

Let's figure out the syntax. Also, once we finalize the syntax, I still suggest to not go forward with namespaces right away - see https://github.com/open-telemetry/semantic-conventions/issues/1264#issuecomment-2256453485. We shoud be able to add it incrementally after we have a namespace that's embeddable.

trisch-me commented 1 month ago

Yes, I'm ok with moving only with embed attributes for now and discuss/implement namespaces later.

do you mean as vs id vs name? I feel strongly against id (or ref) because it would conflict with existing id on attributes.

I thought a comment from your side - as: foo.geo.location # id seems ok here too means you propose to have here id instead of as

I do like id, because it is the definition of the attributes inside the embed but other than this it should be very similar to id in normal attributes. So while referencing later this attribute a developer can just take always id. I'm ok with name as well, but id seems to be more applicable here.

lmolkova commented 1 month ago

attribute:id means attribute definition. Someone can define a type for this attribute. Embedding an attribute means defining an attribute, but reusing some properties of another attribute (type, stability?, deprecation?, part of the name?).

I'm strongly opposed to reuse id to embed an attribute, in the same way as I opposed to reuse ref for it. My comment on allowing id was for the namespace, not attribute. We can polish the details for the namespace later once we get to it. It seems we have a general design figured out.

trisch-me commented 1 month ago

Before moving to final syntax I have one issue with current approach I want to discuss: alias to the attribute i.e. process.parent.id for process.id should be possible, but will be most probably rare case. With current approach we will need to define it every time which leads to repetition i.e. geo.lon will be used as geo.lon. I know we try to be explicit, just want to raise this concern.

Besides of it we have now 2 options discussed above for external structure

groups:
  - id: registry.foo
    embedded_namespaces:
    ....
    embeded_attributes: 
    ...

vs

groups:
  - id: registry.foo
    embed:
      namespaces:
      ....
      attributes: 
      ....

And 2 options for internal structure of attributes:

groups:
  - id: registry.foo
    embeded_attributes: 
      - as: process.parent.id
        from: process.id
        brief: ...
groups:
  - id: registry.foo
    embeded_attributes: 
      - name: process.parent.id
        from: process.id
        brief: ...
MadVikingGod commented 1 month ago

@MadVikingGod ref has another meaning. Embed not only reference attribute but also adds groups prefix to it. So during embedding of geo.lat into client the resolved attribute name will be client.geo.lat while during referencing it it will stay the same geo.lat

Yes, but what I'm saying is if we were to add as to the current attribute struct, it could interact like so:

groups:
  - id: client
    prefix: client
    attributes:
    - ref: geo.lat
      as: client.geo.lat

which should be rendered as client.geo.lat

That keeps all of the current overrides, and it doesn't need some other field that does the same thing but different.

trisch-me commented 1 month ago

@MadVikingGod we already have this option to use ref as embed (I have implemented it in weaver) but this is exactly why we were having new discussion about it. Because semantically reference and embedding are different and they should also be treated differently: not to mix usual ref and embed ref, make embed more visible, currently proposition to have embed not in the regular attributes but on the top level of the group.

This PR is using ref with additional attribute prefix: true and it generates from ref: user.id a fully qualified name process.user.id

MadVikingGod commented 1 month ago

What I'm saying is that attribute embedding, not the type discussion we had in the SIG, seems to me to be a more complicated ref+rename.

It seems like the requirements for renaming fall under:

My point of view is that if the functionality is so close we probably should add the feature to ref not create another new config type in the semconv.

I also think separating attributes into two different areas (attributes and embed.attributes) is also a usable problem, but that's a minor concern.

This analysis is totally different for embedded types/namespaces. For that the concept is so different from referencing an attribute that I think it deserves it's own top level directive.

trisch-me commented 1 month ago

@MadVikingGod I understand you and as I told before this was original idea that I have even implemented into weaver, you can use it now in semconv because it's not reverted yet

But this is where @lmolkova requested changes and the whole discussion started from the scratch.

lquerel commented 3 weeks ago

@trisch-me Regarding the syntax options, I have a slight preference for embed -> namespaces and attributes. As for as vs name, I don’t really have a preference.