isKnown state - Githubissues

AramZS commented 4 years ago

This builds off of the conversation around the isLoggedIn proposal by @johnwilander at https://github.com/privacycg/is-logged-in

This is a rough sketch, but I think we would like a more private way to recognize a user as a repeat visitor to a site without invading their privacy by requiring that they identify themselves.

I will explain with the use case I am most familiar with, paywalls. This is one case, but I do not think it is the only case. It is a distinct case from the one I see being discussed at https://github.com/privacycg/is-logged-in/issues/9 which is a reaction to a logged in state. The isKnown state, as I am currently proposing it, would not require a user to log in.

Many publishers use paywalls, which are important to staying profitable. The way this works right now is that a site sets a cookie that records the number of visits; when that number exceeds the number we wish to allow for free, we can trigger a paywall and block access to content.
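As a rough illustration of the cookie-based metering described above, here is a minimal server-side sketch. The cookie name `meter`, the `FREE_LIMIT` value, the 30-day lifetime, and the function name `meterVisit` are all assumptions made for the example, not details of any particular paywall vendor.

```typescript
// Hypothetical names and values: "meter" cookie, FREE_LIMIT, meterVisit.
const FREE_LIMIT = 5; // assumed number of free pageviews

interface MeterResult {
  count: number;        // visits recorded so far, including this one
  showPaywall: boolean; // true once the free allowance is exceeded
  setCookie: string;    // value for the Set-Cookie response header
}

function meterVisit(cookieHeader: string): MeterResult {
  // Look for "meter=<digits>" inside a "k=v; k2=v2" Cookie header.
  const match = /(?:^|;\s*)meter=(\d+)/.exec(cookieHeader);
  const previous = match ? Number(match[1]) : 0;
  const count = previous + 1;
  return {
    count,
    showPaywall: count > FREE_LIMIT,
    // Assumed 30-day lifetime, expressed in seconds.
    setCookie: `meter=${count}; Max-Age=${30 * 24 * 60 * 60}; Path=/; Secure; SameSite=Lax`,
  };
}
```

Clearing the cookie resets `count` to zero, and nothing stops the cookie from carrying an identifier instead of a count; both points come up later in this thread.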

This process suffers from potentially the same issues that maintaining a logged in state does (which is why the isLoggedIn is a useful reference):

It is often maintained via a fingerprinting process (in the same way that sites attempt to retain the user's login outside of the cookie by profiling and recording information about their device) which is privacy invasive.
It is dependent on cookies or storage in such a way that it could potentially leak additional data about the users to third parties.

It poses an additional threat to privacy: maintaining a connection with the user outside of login requires tracking the user without their explicit consent (in UI terms, not diving into the legal side of this right now), which opens up problems that don't gel well with the future privacy-first web.

What I would like to have instead is a very simple non-invasive way to handle this, one not dependent on cookies or storage. This is important because, as we are already starting to see in anticipation of the post-cookie world, sites that feel they do not have some ability to meter access to content will push users to annoying UI on their first visit, asking them to log in to read even the first page they arrive on.

In a best-case scenario I'd imagine the isKnown state to be composed of two parts.

The first property is a single number which can be either incremented or reset one time per pageview by the site the user has landed on. This will prevent the site from trying to use the value for tracking or fingerprinting, but allow the site to own its definition of "known user" in a way consistent with how it is currently being done in the wild.

The second property is a single number that represents the period of hours (not seconds, not milliseconds, to avoid fingerprinting) since the last Known access.
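To make the two properties concrete, here is a small sketch of how a user agent might hold this state internally. The class and method names (`IsKnownState`, `beginPageview`, and so on) are hypothetical. Returning `null` for a never-seen user and having `reset` leave the count at 1 are assumptions on my part; the latter follows the long-period flow in this proposal.

```typescript
const MS_PER_HOUR = 60 * 60 * 1000;

// Hypothetical UA-side model of the proposed isKnown state for one site.
class IsKnownState {
  private count = 0;
  private lastAccessMs: number | null = null;
  private mutatedThisPageview = false;

  // The UA would call this once at the start of every pageview.
  beginPageview(): void {
    this.mutatedThisPageview = false;
  }

  getCount(): number {
    return this.count;
  }

  // Whole hours since the last known access (assumption: null if never seen).
  // Rounding down to hours keeps the value coarse.
  hoursSinceLastAccess(nowMs: number): number | null {
    if (this.lastAccessMs === null) return null;
    return Math.floor((nowMs - this.lastAccessMs) / MS_PER_HOUR);
  }

  // Each returns false if the site already changed the value on this pageview.
  increment(nowMs: number): boolean {
    return this.mutate(this.count + 1, nowMs);
  }

  reset(nowMs: number): boolean {
    return this.mutate(1, nowMs);
  }

  private mutate(newCount: number, nowMs: number): boolean {
    if (this.mutatedThisPageview) return false; // one change per pageview
    this.count = newCount;
    this.lastAccessMs = nowMs;
    this.mutatedThisPageview = true;
    return true;
  }
}
```

The one-change-per-pageview guard is what keeps a site from writing an arbitrary number into the counter during a single visit.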

Some potential flows:

New users:

User arrives at a site
Site sees isKnown.count == 0
User's isKnown is incremented
User's isKnown.count == 1

Returning users (short time period):

User arrives at a site for the second time
Site sees isKnown.count > 0
Site sees isKnown.time < 720
- This means it is the second time the user has visited in less than 30 days
User's isKnown is incremented
User's isKnown.count == 2

Returning user (long period):

User arrives at a site for the second time
Site sees isKnown.count > 0
Site sees isKnown.time >= 720
- This means it has been 30 days or more since the last visit
User's isKnown is reset
User's isKnown.count == 1
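The three flows above boil down to one decision on the site's side. The sketch below assumes the site can read a `count` and a `time` in whole hours; the names `chooseAction`, `countAfter`, and `WINDOW_HOURS` are made up for illustration. The 720 threshold is 30 days × 24 hours.

```typescript
const WINDOW_HOURS = 720; // 30 days * 24 hours

type Action = "increment" | "reset";

interface IsKnownReading {
  count: number; // visits recorded so far
  time: number;  // whole hours since the last known access
}

// Decide which single operation the site performs on this pageview.
function chooseAction(reading: IsKnownReading): Action {
  // New user: nothing recorded yet, so start counting.
  if (reading.count === 0) return "increment";
  // Returning inside the window keeps counting; otherwise start over.
  return reading.time < WINDOW_HOURS ? "increment" : "reset";
}

// The count the site would see on the next read, following the flows above.
function countAfter(reading: IsKnownReading, action: Action): number {
  return action === "reset" ? 1 : reading.count + 1;
}
```

A paywall check would then compare `countAfter(...)` with whatever free allowance the site has chosen.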

By having a way to measure access reliably while not relying on privacy invasive methods like fingerprinting we can better support an open web and legitimate publishers who would like to maintain users' privacy while also remaining profitable.

We've discussed this in calls a few times and I keep promising to write something up but have generally been low on time to do so, so I'll keep this brief and try to hash it out via Q&A to see if it makes sense to advance this idea further.

gffletch commented 4 years ago

For publishers with pay-walls I can see where this would be extremely helpful. However, from a security perspective for identity providers this isn't sufficient as the Identity Provider needs to know that a specific identity (user) has used this browser to login in the past. The identity data can be encrypted so that only the identity provider can access the information and the data can be written on the fully qualified identity provider domain so as to not leak to other sites within the root domain.

jkarlin commented 4 years ago

Paywalls are an interesting case. I think the challenge for them is that I'd imagine sites with paywalls would want said paywall counters to carry over to incognito/private mode as well. Is that right?

Edit: removing bikeshedding comment as it's non-productive at this point.

samuelweiler commented 4 years ago

Why not just do this with a non-identifying cookie or pair of cookies? You write of a 'post-cookie world', but I'm still seeing interest in having some (perhaps limited) persistence for first-party cookies. Is this solving a problem that doesn't need new primitives to solve?

AramZS commented 4 years ago

> For publishers with pay-walls I can see where this would be extremely helpful. However, from a security perspective for identity providers this isn't sufficient as the Identity Provider needs to know that a specific identity (user) has used this browser to login in the past. The identity data can be encrypted so that only the identity provider can access the information and the data can be written on the fully qualified identity provider domain so as to not leak to other sites within the root domain.

@gffletch I agree, that isn't the use case I intend this for. That use case is being discussed in https://github.com/privacycg/is-logged-in/issues/9. This is not a case intended to be used to identify the user, but instead an alternative used to understand patterns of access without requiring a lot of knowledge about the user to be stored.

AramZS commented 4 years ago

> Paywalls are an interesting case. I think the challenge for them is that I'd imagine sites with paywalls would want said paywall counters to carry over to incognito/private mode as well. Is that right?
>
> Edit: removing bikeshedding comment as it's non-productive at this point.

@jkarlin I mean... haha yeah, we'd love to have the paywall counters carry over to incognito/private mode! There are a lot of publishers out there now who actually block on incognito mode when they can detect it (which is infrequently). If an increased level of anonymity would allow browsers to create a cross-standard/incognito counter to understand a user as a returning visitor within X time (and no other data about them), that would be amazing. I'm having a hard time imagining users being ok with that though?

AramZS commented 4 years ago

> Why not just do this with a non-identifying cookie or pair of cookies? You write of a 'post-cookie world', but I'm still seeing interest in having some (perhaps limited) persistence for first-party cookies. Is this solving a problem that doesn't need new primitives to solve?

@samuelweiler A lot of paywall management is done by vendors via third-party scripts, something that seems very likely to end up blocked as we increase browser privacy. Beyond that, what forces a cookie to be non-identifying? If the goal is to give users an increasingly private experience over time with something like isLoggedIn (which is also a case that could be stored to a persistent cookie), then I think we might need this as well?

melanierichards commented 4 years ago

(My comment pertains to API shape + bits of entropy vs an opinion on the need for this proposed API, so feel free to defer until later; just wanted to get my thoughts on paper, so to speak.)

> It is often maintained via a fingerprinting process (in the same way that sites attempt to retain the user's login outside of the cookie by profiling and recording information about their device) which is privacy invasive.

In the interest of avoiding fingerprinting behaviors, I wonder if it's possible to further minimize the data this proposal exposes and still address the motivating use case. Reading back a counter and a measure of recency could be pretty helpful in building up the uniqueness of a given user.

Instead of reading back a counter:

The site specifies a maximum count
The UA stores an internal counter and the maximum value
The site tells the UA when to increment the counter
In order to determine whether the user has hit the paywall limit, the site could read back a Boolean value (did the user hit the limit or not?)

We'd probably need to design this such that the site isn't setting unique maximum counts, or incrementing the max count continuously.

The time value could be similar. Instead of reading back a number of hours:

The site could set an IsKnown timeout
The UA could reset the internal IsKnown counter when the timeout is reached
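A sketch of the Boolean-only shape described above might look like the following. Everything here is hypothetical, including the names `BooleanMeter` and `ALLOWED_MAX_COUNTS`. Two choices are assumptions of the sketch rather than part of the suggestion: restricting the maximum to a small fixed menu, and measuring the timeout from the first increment in a window.

```typescript
const MS_PER_HOUR = 60 * 60 * 1000;

// Assumed fixed menu of limits, so a site cannot encode an ID in the maximum.
const ALLOWED_MAX_COUNTS = [1, 3, 5, 10];

// Hypothetical UA-side meter: the site never reads the count itself.
class BooleanMeter {
  private readonly maxCount: number;
  private readonly timeoutHours: number;
  private count = 0;
  private windowStartMs: number | null = null;

  constructor(maxCount: number, timeoutHours: number) {
    if (ALLOWED_MAX_COUNTS.indexOf(maxCount) === -1) {
      throw new Error("unsupported maximum count");
    }
    this.maxCount = maxCount; // fixed for the lifetime of the meter
    this.timeoutHours = timeoutHours;
  }

  // Called when the site asks the UA to count a pageview.
  increment(nowMs: number): void {
    this.expireIfNeeded(nowMs);
    if (this.windowStartMs === null) this.windowStartMs = nowMs;
    // Saturate at the maximum; nothing beyond it is ever observable.
    this.count = Math.min(this.count + 1, this.maxCount);
  }

  // The only value the site can read back.
  limitReached(nowMs: number): boolean {
    this.expireIfNeeded(nowMs);
    return this.count >= this.maxCount;
  }

  // Reset the internal counter once the timeout has elapsed.
  private expireIfNeeded(nowMs: number): void {
    if (
      this.windowStartMs !== null &&
      nowMs - this.windowStartMs >= this.timeoutHours * MS_PER_HOUR
    ) {
      this.count = 0;
      this.windowStartMs = null;
    }
  }
}
```

With this shape the site learns one bit per read, and the timeout is enforced by the UA rather than reported to the site.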

arthuredelstein commented 4 years ago

This counter seems easily spoofable -- the UA just needs to reset the counter to zero and the user can read more free articles, right? So publishers will still have an incentive to fingerprint the user.

AramZS commented 4 years ago

> The site specifies a maximum count
> The UA stores an internal counter and the maximum value
> The site tells the UA when to increment the counter
> In order to determine whether the user has hit the paywall limit, the site could read back a Boolean value (did the user hit the limit or not?)

The more I think about it, the more sense this makes. I agree: avoiding any exposure of the counter's numeric value is a more private design.

> The UA could reset the internal IsKnown counter when the timeout is reached

This would seem to me to be the simplest approach!

And I agree we would have to think about the design of how the counter is handled in order to avoid a situation where it becomes a fingerprinting vector.

AramZS commented 4 years ago

> This counter seems easily spoofable -- the UA just needs to reset the counter to zero and the user can read more free articles, right? So publishers will still have an incentive to fingerprint the user.

Users already clear cookies, switch browsers, go incognito. Generally the efficacy of trying to fingerprint the user is going down now and will continue to go down. I think that trying to design a methodology that removes the capability of a limited number of motivated people to avoid pay/reg walls is pretty much impossible. However, clean, non-identifying recognition of a returning viewer is a useful potential signal and could be used for a lot more than just paywalls; sites can create other incentives as well. Further, we could examine this as a signal to potentially unlock API features in a limited way, in the same way isLoggedIn does, but that is a larger conversation.

AramZS commented 4 years ago

An additional factor we should likely discuss is scope: I would prefer to avoid a situation where multiple entities on a page could create different values for isKnown, potentially creating a way to identify a user through that method, but I can also see the argument for embedded entities wanting their own isKnown value to handle their own actions in cases like video or embedded articles, etc...

Not sure what the answer is yet, but wanted to mark this as an issue to discuss if we wish to progress.
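To frame the two scoping options, here is a tiny sketch of how a UA might key the stored value under each. The `Scope` names, the `storageKey` function, and the `^` separator are invented for the example and are not a recommendation for either option.

```typescript
type Scope = "top-level-only" | "per-embed";

// Hypothetical: compute the key under which the UA stores an isKnown value.
function storageKey(scope: Scope, topLevelSite: string, callerOrigin: string): string {
  // Every script on the page shares one value, so embedded parties
  // cannot contribute additional distinct values of their own.
  if (scope === "top-level-only") return topLevelSite;
  // Each embedded origin gets its own value, partitioned by the
  // top-level site so it cannot be joined across different sites.
  return `${topLevelSite}^${callerOrigin}`;
}
```

Under `"per-embed"`, a page with N embedded origins exposes up to N + 1 separate values, which is the identification risk raised above.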

hober commented 2 years ago

Hi @AramZS, are you still interested in pursuing this? @johnwilander, I wonder if this could be rolled up into the Login Status API?

privacycg / proposals

isKnown state #20