open-telemetry / semantic-conventions

Defines standards for generating consistent, accessible telemetry across a variety of domains
Apache License 2.0

User.id for authenticated user id #1104

Open heyams opened 4 months ago

heyams commented 4 months ago

Area(s)

area:user

Is your change request related to a problem? Please describe.

enduser.id has been deprecated and replaced with user.id. #731

enduser.id had this old description: image

user.id has this new description: image

The new description is confusing now. Is it for authenticated user id or anonymous user id? What are your thoughts on creating a new attribute called user.anonymous_id?

Our telemetry solution tracks both authenticated user id and anonymous user id.

Describe the solution you'd like

  1. update the user.id description to make it clear that it's intended for authenticated user id.
  2. create a new attribute called user.anonymous_id for the anonymous user id.
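
A minimal sketch of how the two requested attributes might be populated, assuming the proposal were adopted as written. `user.anonymous_id` is only a proposed name at this point in the thread, and `build_user_attributes` and the `u-1234` value are hypothetical, not part of any OpenTelemetry SDK.

```python
from typing import Dict, Optional


def build_user_attributes(
    anonymous_id: str, authenticated_id: Optional[str] = None
) -> Dict[str, str]:
    """Return telemetry attributes for the current user (hypothetical helper).

    `user.anonymous_id` is the attribute name proposed in this issue,
    not an existing semantic convention.
    """
    attributes = {"user.anonymous_id": anonymous_id}
    if authenticated_id is not None:
        # Proposed meaning: user.id is set only once the user has authenticated.
        attributes["user.id"] = authenticated_id
    return attributes


# Before sign-in only the anonymous id is known.
print(build_user_attributes("686f96e7-23d9-4c13-b5c0-7bc249d3f058"))
# After sign-in both ids are recorded on the same telemetry.
print(build_user_attributes("686f96e7-23d9-4c13-b5c0-7bc249d3f058", "u-1234"))
```

Keeping both ids on the same record is what would let a backend join pre-login and post-login activity.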

Describe alternatives you've considered

n/a

Additional context

n/a

trisch-me commented 4 months ago

hey @heyams - the user namespace is not bound to the auth domain, it could be anything - a user in an operating system, a user in a database, etc. How did you use the enduser.id field before for authenticated and anonymous users? Also, previously enduser.id had both client_id and username in it; now the user namespace has dedicated fields for them, i.e. id and name

thompson-tomo commented 4 months ago

What I would propose is that we introduce an additional property user.authenticated which describes whether the user is logged in or not. Alternatively we could add in user.authenticationscheme which could be anonymous, basicauth, openid

We could also add a user.authenticationprovider, e.g. Facebook, local, domain

Potentially we could also add in user.authorized which would be useful in the case where an action fails due to user lacking the authorization to complete the task/activity.

trisch-me commented 4 months ago

there was a discussion about an additional sub-namespace for user such as user.auth.* and adding appropriate fields there. We have discussed a user.auth.domain field. Other fields could also be added under this sub-namespace if needed

thompson-tomo commented 4 months ago

Ok with the idea of sub-namespaces. How about we use this issue to track the discussion about implementing user.auth.authenticated to address the gap raised here, and we/I create a separate issue to track extending user.auth.* to include other useful aspects from the OIDC JWT.

MSNev commented 4 months ago

Previous discussions (from the client rum sig) https://github.com/open-telemetry/semantic-conventions/pull/443

thompson-tomo commented 4 months ago

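Ok I see valid points in that discussion how about We introduce the following:

  • User.auth.Authenticated so that we can know if the user is authenticated
  • User.session so we can track an anonymous user and continue the session when they become authenticated.

For the device attributes how about we also introduce a session attribute?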

heyams commented 4 months ago

Ok I see valid points in that discussion how about We introduce the following:

  • User.auth.Authenticated so that we can know if the user is authenticated
  • User.session so we can track an anonymous user and continue the session when they become authenticated.

For the device attributes how about we also introduce a session attribute?

Sessions only last as long as the browser is open, and it's a different concept. What do you think about @trisch-me's suggestion of having a sub-namespace under user

user.auth.* => user.auth.authenticated can be used to track if it's authenticated, and then user.auth.id can be used for the authenticated user id, if user.auth.authenticated is false, same id can be used for anonymous user id?

alternatively, we can add user.authenticated boolean attribute, when this is true, user.id can be used for authenticated user id; otherwise, anonymous user id?

@trisch-me @MSNev thoughts on this?

thompson-tomo commented 4 months ago

not a fan of user.authenticated as it will be too limiting, especially if you want to track an anonymous user becoming an authenticated one. So I fully support implementing user.auth.authenticated, especially as it would enable us to see a user who has failed authentication.

to be honest I don't see an issue with tracking the user's session via a user.session attribute rather than a user.id attribute. Scenarios would be:

atreat commented 3 months ago

I think having a separate attribute for an anonymous user id makes sense. I'd keep user.id to be as concise as possible and formalize the description of this attribute to be specific to an authenticated user.

For an unauthenticated user, I'd recommend a separate attribute that we can try to name (user.anon_id, user.anonymous_id, user.transient_id, etc.). I'm not too particular on what this is called but prefer to keep it flat instead of a sub-namespace.


I think anonymous users are a feature that should be considered separate from user sessions because:

  1. It's possible that an anonymous user identifier is generated and is sticky, so we can follow that same user across multiple sessions.
  2. It's possible that an anonymous user eventually authenticates and identifies themselves. In this case a single session has telemetry that contains the anonymous identifier and the authenticated identifier. It'd be possible to create a new session when the authentication occurs, but that would prevent the opportunity to understand what led the user to authenticate.
  3. Depending on the application, you may have multiple anonymous users within a single session. If an application is running on a kiosk, it's possible that the application remains open while multiple users walk up, interact, and then walk away. It'd be up to the application developer to decide if they create a new session or create a new anonymous user id. This is a subjective decision by the developer, but I like that the convention of the anonymous identifier provides that flexibility.

thompson-tomo commented 3 months ago

In the case of following a "user" across multiple sessions how can we be fairly certain it is the same user? For instance a new user at the kiosk. Propose tracking the actual device as we are using some identifier stored on the device. The key thing for me is the context is bound to a device.

I feel that we should leave it to developers to decide what triggers the start of a new session, be it time based or a user clicking start on a kiosk. Perhaps we look at adding a convention such as device.session so we can track all traces which have occurred since the app was launched.

I think the approach of being able to trace all the activity coming from a device, drilling it down to a session & then taking it that step further to see the data related to an individual user. The final drill down is to see what was done while authenticated including who they are.

atreat commented 3 months ago

I think the approach of being able to trace all the activity coming from a device, drilling it down to a session & then taking it that step further to see the data related to an individual user.

Agree with this completely. I think having separate identifiers for the device and session provide value. I was just trying to provide examples where an unauthenticated user may not match up 1:1 with a device or a session. Hoping that a convention specific to an "unauthenticated user" gives app developers flexibility to model their telemetry to fit their use case.


In the case of following a "user" across multiple sessions how can we be fairly certain it is the same user? For instance a new user at the kiosk.

These should be thought of as separate examples. In the kiosk use case you would likely not make your unauthenticated user id sticky. In applications where it's more safe to assume that the user is the same (an app on a mobile phone), it would be more useful to persist a longer lived value for their identifier.

I would recommend a session identifier alongside an anonymous user id in both these examples. In kiosk mode, the application may decide to keep a session open for multiple customers. When in pocket mode, an application may decide to identify a potential customer for multiple sessions.
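
To illustrate the two examples above, here is a rough sketch in which the session id and the anonymous user id rotate independently. The `ClientIdentity` class and the attribute name `user.anonymous_id` are assumptions made for illustration; the thread has not settled on a name.

```python
import uuid
from dataclasses import dataclass, field


def _new_id() -> str:
    return str(uuid.uuid4())


@dataclass
class ClientIdentity:
    """Tracks a session id and an anonymous user id independently (illustrative)."""

    session_id: str = field(default_factory=_new_id)
    anonymous_id: str = field(default_factory=_new_id)

    def start_new_session(self) -> None:
        # Mobile-style: the session rotates, the sticky anonymous id survives.
        self.session_id = _new_id()

    def start_new_visitor(self) -> None:
        # Kiosk-style: a new person walks up while the session stays open.
        self.anonymous_id = _new_id()

    def attributes(self) -> dict:
        # Attribute names are placeholders, not agreed conventions.
        return {"session.id": self.session_id, "user.anonymous_id": self.anonymous_id}


kiosk = ClientIdentity()
kiosk.start_new_visitor()   # same session, different anonymous user

phone = ClientIdentity()
phone.start_new_session()   # same anonymous user, different session
```

Which of the two methods an application calls, and when, stays a decision for the application developer.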

thompson-tomo commented 3 months ago

So let me try and summarise the current state of where we are at, as I see it, in short form:

Open question

I am of the latter thought, especially if we can also release guidance on session tracking which includes the following examples at a minimum

MSNev commented 3 months ago

@thompson-tomo

Do we need to add an anonymous user id as proposed or is introducing a sticky device.id a better approach given the lack of certainty about the user remaining the same.

No, the device.id is not the same as an anonymous user id, they are and need to be kept separate. The device.id is specific for the device that (one or more) users are using

thompson-tomo commented 3 months ago

No, the device.id is not the same as an anonymous user id, they are and need to be kept separate. The device.id is specific for the device that (one or more) users are using

Yes I am aware that multiple unauthenticated users could use the one device.

What I am questioning is why we call it an anonymous user id when the user can't come back and multiple people could be involved, given that we have no reliable way of knowing when the user switches. Instead, I propose that we refer to it as user.session and/or device.session, depending on the use case.

The key thing is allowing a combination of fields to be used, depending on the use case, to achieve maximum coverage of the seven scenarios described.

heyams commented 3 months ago

No, the device.id is not the same as an anonymous user id, they are and need to be kept separate. The device.id is specific for the device that (one or more) users are using

Yes I am aware that multiple unauthenticated users could use the one device.

What I am questioning is why we call it an anonymous user id when the user can't come back and multiple people could be involved, given that we have no reliable way of knowing when the user switches. Instead, I propose that we refer to it as user.session and/or device.session, depending on the use case.

The key thing is allowing a combination of fields to be used, depending on the use case, to achieve maximum coverage of the seven scenarios described.

As I mentioned earlier, a session only lasts as long as the browser is open, and it's a different concept.

thompson-tomo commented 3 months ago

Yes I am aware that a session lasts only as long as the browser/app is open. What I am failing to see is how an anonymous user id can be safely reused.

For me the tests should be:

Based on the above logic all the ids become complementary and have a defined scope. Most importantly for me, it enables us to see all activity coming from a device during a session, which can be split based on the user.session, with those sessions in turn being split based on the authenticated user id

MSNev commented 3 months ago

In the client space you can have

So when there is a single user of a computer then (if provided)

And for when multiple users are using the same (shared) hardware

So we SHOULD NOT confuse the concept of "users" and "sessions" as for client environments they can and are often different.

So the "user" attributes identify "who" is doing something (both anonymously and explicitly identified), while the "session" identifies "what" is occurring. It's therefore technically possible to identify, across a sequence of requests, how an end user is using a system, so it's possible to answer questions like

trisch-me commented 2 months ago

related discussion about user.id https://github.com/open-telemetry/semantic-conventions/issues/1172

trisch-me commented 2 months ago

After reading all the discussions I am in favor of just having user.anonymous_id in addition to user.id. We should update the user.id description to say that it also represents the authenticated user if there is an auth context. Sometimes a user is just a user; for example, file.owner is just the user who created that file. It doesn't have a direct auth connotation but might have an indirect one, i.e. the user was logged in while creating that file.

jsuereth commented 2 months ago

To add my discussion here from the meeting:

My concern is around the use of anonymous being ambiguous and possibly misleading.

This attribute means: "We don't know the identity of the user, so we invented an ID to track behavior, e.g. for RUM".

What this does NOT mean is "We have an anonymous identifier (removed personally identifying information)".

I'd prefer to phrase this in some way that makes it clear what's happening. E.g. "user.unknown_id", "anonymous_user.id", "user.unauth_id". I believe @MSNev had a good recommendation.

heyams commented 2 months ago

I recognize the potential for confusion with the term anonymous; it might not be clear and could lead to misunderstandings. @trask has proposed holding a vote to help decide on a new name for this attribute.

The following options were proposed in today's semantic conventions SIG and after the discussion with my team at Microsoft:

  • ❤️ user.pseudo_id
  • 🎉 user.tracking_id
  • 👍 user.unauth_id
  • 🚀 user.auth_id for authenticated user id and then use user.id as any other user id including this anonymous user id
  • 👀 user.anonymous_user_id
  • 😕 user.anonymous_id

description: a consistent id to track a best-effort unique user regardless of the authentication state.

Note: I didn't add user.unknown_id because it can be a known user.

It's ok to vote for multiple options. Please vote.

lmolkova commented 2 months ago

I think there are two problems leading to confusion:

  1. user is wider than website user. Attempt to add a generic attribute in user that's only applicable to browsers would be confusing for other types of users
  2. user.id|name|hash are not specific enough.

From the browser perspective, it sounds like the user login should be populated in user.name (?). The anonymized (hashed) value should be populated in user.hash (?) and then it's not clear how user.id would be used.

E.g. we can do:

user.name = lmolkova
user.hash = 864342fc7c9b552c2bea0513c9a47942 // md5("lmolkova")
user.id = 686f96e7-23d9-4c13-b5c0-7bc249d3f058 // guid recorded in my cookies

would it be helpful if we did this instead?

user.name = lmolkova
user.hash = 864342fc7c9b552c2bea0513c9a47942 // md5("lmolkova")
user.anonymous_id = 686f96e7-23d9-4c13-b5c0-7bc249d3f058 // guid recorded in my cookies
user.id = ? // nobody knows, probably same as user.name?

Yes, it'd make it more obvious for browser-specific case, but it would make things in user namespace even more confusing in general.


TL;DR: do we really need a new attribute? Can we reuse user.id for an anonymous user id in browsers?

I think it's the same option as "🚀 user.auth_id", but without introducing an attribute for authenticated user id - we have user.name for it.

It'd be great to have a md file for user in the context of browser/website that describes which user properties are applicable and how they should be populated.

MSNev commented 2 months ago

@lmolkova It's not just the browser space, it's clients in general. And there are scenarios where an application may wish to record both authenticated and unauthenticated id's of the system.

Generally, for the browser scenario (specifically Azure Monitor), there is the user.id (anonymous / random guid -- always present) and an optional user.auth_id (string) populated by whatever the application wants; sometimes it's their email, sometimes it's their object id. I don't believe I've ever seen this as their name.

lmolkova commented 2 months ago

do we need to record login and some other id for authenticated user? i.e. why do we need user.auth_id if login was recorded in user.name ?

MSNev commented 2 months ago

do we need to record login and some other id for authenticated user? i.e. why do we need user.auth_id if login was recorded in user.name ?

Yes, some companies WANT to record the actual person who did the work for their internal auditing. And it's not always their name.

Which is why I voted to "reclaim" user.id as the "random" identifier and introduce a user.auth_id so this could be used as required by the application. Failing that option, keeping the user.id as the authenticated one and having a user.uaid for the random one would work.

lmolkova commented 2 months ago

Yes, some companies WANT to record the actual person who did the work for their internal auditing. And it's not always their name.

Would it be better if it was called user.login instead of name ? I.e. unique, but human-readable identifier

MSNev commented 2 months ago

Would it be better if it was called user.login instead of name ?

While the example I gave was "associated" with the authenticated details (object id or email), it's not necessarily (100 %) the "login" it could be anything, and just as @jsuereth doesn't like calling it anonymous calling "some" (potentially) user identifying id the "login" is also not correct...

What should be recorded in a field called "login"? Should it be the username they entered during initialization, their associated (primary) email address (what happens when they sign in with a phone number?) or some random OTP via a secondary (multifactor) device... Or even worse, they sign in with some 3rd party integration (Facebook, Google, Microsoft, etc.) and the app internally associates that id with an application "id" (like just a number)... So NO I don't like using login as a term for this.

lmolkova commented 2 months ago

ok, so from the browser perspective:

The point I'm making is that adding a new attribute to this namespace will make things even more confusing.


The things we need to record for browser users:

If we reuse user namespace, I think the least confusing option would be to

  • remove user.id
  • add user.anonymous|guest|visitor.id and user.authenticated.id

An alternative would be to define attributes in a new/different namespace. E.g.:

trisch-me commented 2 months ago

The proposal from @heyams where we always have user.id and additionally user.auth_id seems more straightforward and applicable to different use cases. I would also like to propose the idea of creating a sub-namespace auth and putting fields related to authentication there, so use user.auth.id instead of the original user.auth_id

trask commented 2 months ago

If we reuse user namespace, I think the least confusing option would be to

  • remove user.id
  • add user.anonymous|guest|visitor.id and user.authenticated.id

I like this.

I think there was some concern about the term anonymous, so maybe

mjwolf commented 2 months ago

If we reuse user namespace, I think the least confusing option would be to

  • remove user.id
  • add user.anonymous|guest|visitor.id and user.authenticated.id

I think user.id needs to be kept for the OS user use case. User ID is a well-defined concept, without any other qualifiers. For example, from the POSIX specification for getpwuid: https://pubs.opengroup.org/onlinepubs/9699919799/functions/getpwuid.html, this refers to "user id"/uid, many times without any further qualification on user.

For a more concrete example of a security use case, Falco alerts can have a field user.uid, defined as just "user ID". I think it would make sense to map this to user.id in the registry, there's no qualifier or other namespace that would really make sense.

lmolkova commented 2 months ago

let's separate user.id conversation so we can make progress on user.authenticated.id since it seems we have a consensus there.

I believe my concerns on user.id are captured in https://github.com/open-telemetry/semantic-conventions/issues/1172 - it has a very limited scope (OS user id), but a very generic name.

heyams commented 2 months ago

@lmolkova if we agree on using user.authenticated.id, which new attribute should we use for the anonymous ID, considering the potential removal of user.id? @trisch-me raised a good point here.

What do you think about user.auth.id or having a sub-namespace under user, such as user.auth?

It seems we have reached a consensus via poll to have an authenticated user ID along with another attribute for a different ID:

image image

Now, it's just a matter of naming it.

lmolkova commented 2 months ago

let's use user.id for unauthenticated for the time being. It may change as an outcome of #1172.

I suggest user.authenticated.id because we don't recommend abbreviations

https://github.com/open-telemetry/semantic-conventions/blob/aea69f203fd442c1dbdbb1479875f4de58d184d2/docs/general/attribute-naming.md?plain=1#L74-L82

auth is an abbreviation, it's ambiguous and can be read as authenticated, authorized, authentication, etc. Using authenticated is explicit and follows the guidelines.

trisch-me commented 2 months ago

I would propose using authentication instead of authenticated. The latter implies an activity, as in user has been authenticated. But I would like to introduce other static attributes related to the authentication, such as user.authentication.domain, which was skipped because we need an auth sub-namespace for it

Zenithar commented 1 month ago

👋 - I was shimming into this thread while looking for standard authentication span tag conventions. What about client authentication (workload authentication) vs user authentication (workforce authentication)?

Do you also register a client.authentication.* namespace? For example, the trust model in OAuth is based on client and user identities. Or I can have a workload authentication based on mTLS, transporting a user authentication context, such as it represents an on-behalf-of intent.

I would extend the authentication namespace to this. :

# String identifier to describe the authentication methods associated to the context
# Example: 
# user.authentication.methods = "pwd,mfa" (ref - https://www.rfc-editor.org/rfc/rfc8176#section-2) 
# client.authentication.methods = "mtls"
<identifiable>.authentication.methods = <string list> 

# String identifier to describe the subject identifier (aka user_id)
# Example: 
# user.authentication.identity.subject = "arn:aws:iam::123456789012:user/johndoe"
# client.authentication.identity.subject = "spiffe://example.org/ns/default/sa/default"
<identifiable>.authentication.identity.subject = <string>

# Pseudo-anonymised subject for privacy 
# Example: 
# user.authentication.identity.subject_hash = "5b8491046bd5db5e945654dcc60343b367f181cc642a449c150ddd42e1e4b880" # HEX(HMAC-SHA256($key, $subject)) 
<identifiable>.authentication.identity.subject_hash = <string>

With <identifiable> as a user or a client.

By the way, I would not recommend using user.authentication.id as id is too generic and lets the convention users store anything that could match their understanding. By doing this, convention users would be invited to use the identifiable object datastore ID (PK, Mongo ID, etc.) as a span tag, which will propagate technical implementation information. At the same time, the subject is not a datastore-dependent value holding all the necessary information to look up the associated identity.

Secondly, using an identity sub-namespace offers extension points that could be used according to the associated authentication.methods (mtls => public key fingerprint, client certificate fingerprint, client certificate SANs; private_jwt => public key fingerprint; etc.).
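
For illustration, the `subject_hash` construction above, HEX(HMAC-SHA256($key, $subject)), could be computed as in the sketch below. The helper names and the placeholder key are assumptions; the attribute names are the ones proposed in this comment, not adopted conventions.

```python
import hashlib
import hmac
from typing import Dict, List


def subject_hash(key: bytes, subject: str) -> str:
    """Hex-encoded HMAC-SHA256 of the subject under a secret key."""
    return hmac.new(key, subject.encode("utf-8"), hashlib.sha256).hexdigest()


def authentication_attributes(
    identifiable: str, subject: str, methods: List[str], key: bytes
) -> Dict[str, str]:
    """Build the proposed <identifiable>.authentication.* attributes (hypothetical)."""
    prefix = f"{identifiable}.authentication"
    return {
        f"{prefix}.methods": ",".join(methods),
        f"{prefix}.identity.subject": subject,
        f"{prefix}.identity.subject_hash": subject_hash(key, subject),
    }


attrs = authentication_attributes(
    "user",
    "arn:aws:iam::123456789012:user/johndoe",
    ["pwd", "mfa"],
    key=b"placeholder-secret",  # assumption: a real key would come from a secret store
)
```

An instrumentation concerned about privacy would presumably emit only the hash and omit the raw subject; both are shown here to mirror the listing.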

thompson-tomo commented 1 month ago

I would propose using authentication instead of authenticated. Latter implies an activity, as in user has been authenticated.

I agree with authentication as if the authentication has failed this needs to be captured as well.

heyams commented 1 month ago

👋 - I was shimming into this thread while looking for standard authentication span tag conventions. What about client authentication (workload authentication) vs user authentication (workforce authentication)?

Do you also register a client.authentication.* namespace? For example, the trust model in OAuth is based on client and user identities. Or I can have a workload authentication based on mTLS, transporting a user authentication context, such as it represents an on-behalf-of intent.

I would extend the authentication namespace to this. :

# String identifier to describe the authentication methods associated to the context
# Example: 
# user.authentication.methods = "pwd,mfa" (ref - https://www.rfc-editor.org/rfc/rfc8176#section-2) 
# client.authentication.methods = "mtls"
<identifiable>.authentication.methods = <string list> 

# String identifier to describe the subject identifier (aka user_id)
# Example: 
# user.authentication.identity.subject = "arn:aws:iam::123456789012:user/johndoe"
# client.authentication.identity.subject = "spiffe://example.org/ns/default/sa/default"
<identifiable>.authentication.identity.subject = <string>

# Pseudo-anonymised subject for privacy 
# Example: 
# user.authentication.identity.subject_hash = "5b8491046bd5db5e945654dcc60343b367f181cc642a449c150ddd42e1e4b880" # HEX(HMAC-SHA256($key, $subject)) 
<identifiable>.authentication.identity.subject_hash = <string>

With <identifiable> as a user or a client.

By the way, I would not recommend using user.authentication.id as id is too generic and lets the convention users store anything that could match their understanding. By doing this, convention users would be invited to use the identifiable object datastore ID (PK, Mongo ID, etc.) as a span tag, which will propagate technical implementation information. At the same time, the subject is not a datastore-dependent value holding all the necessary information to look up the associated identity.

Secondly, using an identity sub-namespace offers extension points that could be used according to the associated authentication.methods (mtls => public key fingerprint, client certificate fingerprint, client certificate SANs; private_jwt => public key fingerprint; etc.).

👍

(Next Monday is a holiday in the U.S.A.) I will share this at the Semconv SIG on the Monday after next. Here are my findings so far:

existing user namespaces in semantic-conventions repo: 

======================================
**What do we have currently**:

User: 
    -id
    -name
    -hash
    ...

Enduser (deprecated due to ECS https://www.elastic.co/guide/en/ecs/current/ecs-user.html)
    -id
    -name

process.real_user.id
process.saved_user.id
process.user.id

======================================
**What do we want to accomplish**:

1. clarify `User` namespace, is it too broad? should user namespace be used with nesting, 
    e.g. `os.user, client.user, service.user, server.user, browser.user, db.user`

2. capture end user in a different namespace:
    * app.user (app can be a process, service, or client, mobile app, web app, what is an app)
    * enduser (it's clear that this is for the end user, not for db, process, or service)

    app
        - name
        - user
            - id // maybe PII
            - name // not PII
            - anonymous_id // not PII
            - hash // not PII

    or 

    Enduser
        - id
        - name
        - anonymous_id
        - hash
3. `authentication` would be a sub-namespace under `<parent_namespace>.user`, e.g. `db.user.authentication`

Feel free to offer feedback or discuss it in the SIG meeting.

trisch-me commented 6 days ago

Hey @heyams, thanks for the info. To answer your questions: the user namespace should definitely be used under other namespaces. We do this already in ECS. The question for me is whether we want to allow the user namespace to be independent and be used as a root namespace. In order to use this nesting we need the embed feature, which will be implemented in tooling. I think the main problem here is whether we have generic enough fields for multiple use cases, and this is what we should solve in the first place

Regarding authentication I'm in favor of having it under user namespace and move all related fields there.

lmolkova commented 6 days ago

Based on the discussion in Semconv SIG on 9/30:

Action items:

We'll have another discussion on enduser attributes naming for tracking/anonymous id and authenticated id.

trisch-me commented 5 days ago

I have checked for ECS - within Elastic our usual case is actually to use user in the root level, without additional parent namespace. It also makes querying the data easier - you can just search for user.* fields

We do use user with a parent namespace in those cases where it's ambiguous - for example, a process has multiple types of users, therefore every user has its own name, i.e. process.real_user, process.saved_user etc.

Or in cases where one user (actor) performs an operation on another user (target), we need to namespace the users to distinguish them.

In most cases the context provides enough understanding of the type of user being referenced, making additional namespacing unnecessary. Multiple usages of the same user namespace, or of any other namespace, will be supported in embed by using an alias field, such as `as`, during embedding of the namespace.
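
As a rough illustration of the idea (not the actual tooling, which the thread says is still to be implemented), embedding with an alias could behave like the hypothetical `embed` function below, which re-roots the fields of a namespace under a parent.

```python
from typing import Dict


def embed(fields: Dict[str, str], parent: str, alias: str = "user") -> Dict[str, str]:
    """Re-root the fields of a namespace under `parent`, renamed to `alias`.

    Hypothetical sketch of the embed-with-alias idea; the real semantic
    conventions tooling may work differently.
    """
    return {f"{parent}.{alias}.{name}": value for name, value in fields.items()}


user_fields = {"id": "1000", "name": "alice"}  # illustrative values

# The same user namespace embedded twice under process, with different aliases.
real = embed(user_fields, "process", alias="real_user")
saved = embed(user_fields, "process", alias="saved_user")
```

With this shape, `real` carries keys such as `process.real_user.id`, matching the existing process attributes listed earlier in the thread.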

I was also thinking about the differences between the multiple domains. My suggestion is to make a comparison to understand where we have differences or unclear field usage. This would help us determine whether, in fact, they are not as different as we initially thought. As discussed during the meeting, it might turn out that users/instrumentation could simply skip fields that are not applicable to their specific use case.