w3ctag / design-reviews

W3C specs and API reviews

Partial freezing of the User-Agent string #467

Closed: yoavweiss closed this issue 4 years ago

yoavweiss commented 4 years ago

Good evening TAG!

This is not your typical spec review, and is highly related to https://github.com/w3ctag/design-reviews/issues/320. But, because @torgo asked nicely, I'm opening up a review for a specific application of UA-CH as a replacement for the User-Agent string.

We've had a lot of feedback on the intent, which resulted in changes to the API we want to ship. It also resulted in many open issues. Most either have pending PRs or will have them shortly.

The latest summary is:


We'd prefer the TAG provide feedback as:

🐛 open issues in our GitHub repo for each point of feedback

jwrosewell commented 4 years ago

Dear Yoav,

Over the past two weeks we have sought evidence to justify this change. We have found none. None of the industry stakeholders we have spoken to were previously aware of the proposal. This means that the usual and necessary protocols of public review are lacking.

In its current form this could easily be interpreted as a partisan gerrymandering attempt by the incumbent dominant player in the field, to the disadvantage of other players.

In our conversations with other players, most recently at the Westminster Policy Forum, we found that they thought that this was all part of a cookie discussion. Solicitation of feedback from a wide global cross section of stakeholders via a neutral party is now required. The W3C comes to mind as fulfilling exactly that role.

This proposal is too radical and has too much potential for disruption to be pushed through quickly, even if you accept the privacy arguments, which we remain unconvinced by.

Considering purely the macro issues associated with good governance and controlled change we observe the following:

  1. The proposal has not been reviewed beyond a small group of dedicated, focused and elite engineers who in the main add – not take away – excellent features. This change impacts global industry sectors as diverse as publishing, marketing, advertising, technology, charities and more in yet undetermined ways. The risk is equivalent to the Y2K bug.

  2. Online platforms – and Google in particular – are the subject of a wide-ranging Competition and Markets Authority (CMA) review covering the subjects of this proposal. The full report is expected to be published on 2nd July 2020. An interim report was published on 18th December 2019. The interim report establishes a balance between the needs of individuals for privacy and for markets and technology to function efficiently. Its conclusions should inform this proposal.

  3. This proposal will disproportionately benefit Google as in practice it will remove data for smaller platform operators and millions of others. Paragraph 60 of appendix E of the CMA review states:

“Google is the platform with the largest dataset collected from its leading consumer-facing services such as YouTube, Google Maps, Gmail, Android, Google Chrome and from partner sites using Google pixel tags, analytical and advertising services. A Google internal document recognises this advantage saying that ‘Google has more data, of more types, from more sources than anyone else’.”

The appendix includes the following diagram to illustrate the point.

[Diagram from Appendix E of the CMA interim report omitted.]

  4. The CMA are yet to comment on Google's role in relation to influence over web standards via the Chromium project and other means. Microsoft's decision to adopt Chromium and the apparent decline of Firefox are likely to be topics they comment on in July 2020.

  5. We need more time to discuss the impact with our users. We believe this is true for others.

As just one example, the AdCom specification needs to be updated. Only once this is done can all publishers, SSPs, exchanges and DSPs adopt the new schema. If any of these parties do not make the modifications, all are disadvantaged. The change needs to be made in lockstep.

Many trade bodies and organisations are focused on the implications associated with the publicity concerning 3rd party cookies. They are only just becoming aware of this proposal. We are encouraging them to engage publicly but respect the demands on their time, limited resources and the sensitivities concerning the topic of privacy.

There are many more arguments concerning assumptions, insufficient evidence, implementation, control over privacy (who decides?), and the technical impacts of the proposal yet to be resolved.

In summary, this is a change that requires careful and widespread consideration, and a significant effort to socialise for it to be recognised by all as legitimately in the public interest. Without mature reflection and appropriate implementation delay it will be perceived as market manipulation by the incumbent player.

Regards,

James Rosewell - for self and 51Degrees

torgo commented 4 years ago

One specific concern I have about this proposal has to do with how "minority browsers" are impacted. Let's consider non-Chrome browsers that are also based on Chromium (as one aspect of browser diversity). It's not clear to me from reading the explainer how you intend non-Chrome browsers based on Chromium to make themselves known through Client Hints. If a hypothetical Chromium-based browser, let's call it Zamzung Zinternet, sends Sec-CH-UA: "Chrome"; v="70" (for example) that might match up with the Chromium engine they are shipping, but it won't line up with the feature set (since their feature set may not exactly match the Chromium engine number) and web site owners will lose the analytics needed to understand which browsers their users are using. However, if they send Sec-CH-UA: "Zamzung Zinternet"; v="17.6" then it's very likely that many web sites will give their users a bad user experience, or flash up a message encouraging people to download Chrome. You mentioned on this thread that "The UA-CH design is trying to tackle this by enabling browsers to define an easily parsable set of values, that will enable minority browsers to be visible, while claiming they are equivalent to other browsers", however I don't see that reflected in the explainer. Can you be more explicit about this? Secondly, regarding analytics, have you validated this approach with your own Google Analytics team, who currently use the UA to extract this information?

Please note, the TAG's ethical web principles argue that there is inherent value in having multiple browsers. We should not be introducing a change to the web platform that could result in making browser diversity less apparent / less measurable, as this could negatively impact browser diversity.

jyasskin commented 4 years ago

@torgo, do the examples in https://github.com/WICG/ua-client-hints/blob/master/README.md#should-the-ua-string-really-be-a-set help with your first concern?

yoavweiss commented 4 years ago

As @jyasskin pointed out, the examples there should clarify what we had in mind on that front.

As for the Zamzung Zinternet case, I'd expect it to send out a set that looks something like Sec-CH-UA: "Chrome"; v="70", "Chromium"; v="70", "Zamzung Zinternet"; v="10". That would enable sites that haven't bothered testing on it to treat it the same as other Chromium browsers, enable sites that want to target it specifically to do so, and enable analytics (that are aware of it) to understand which specific browsers those users are coming from.
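
To make that concrete, here's a toy sketch (hand-rolled parsing and a hypothetical brand list, not anything from the spec) of how a consumer of such a set could treat it as an equivalence class while still attributing the specific brand:

// Parsed Sec-CH-UA list for the hypothetical Zamzung Zinternet browser.
const uaList = [
  { brand: "Chrome", version: "70" },
  { brand: "Chromium", version: "70" },
  { brand: "Zamzung Zinternet", version: "10" },
];

// Sites that only care about engine compatibility check the class...
const chromiumCompatible = uaList.some((e) => e.brand === "Chromium");

// ...while analytics that know about the browser can attribute it precisely.
const knownBrands = ["Zamzung Zinternet", "Chrome"]; // most specific first
const attributed = knownBrands.find((b) => uaList.some((e) => e.brand === b)) ?? "unknown";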

Does that help alleviate your concerns?

Also, note that there's a discussion on maybe putting more emphasis on the engine vs. the UA brand: https://github.com/WICG/ua-client-hints/issues/52

To me, the Zamzung Zinternet case sounds like a good example in which we should prefer the current spec over switching to sending only the engine by default.

Please note, the TAG's ethical web principles argue that there is inherent value in having multiple browsers. We should not be introducing a change to the web platform that could result in making browser diversity less apparent / less measurable, as this could negatively impact browser diversity.

Beyond the privacy benefits of this change, it has an explicit goal of discouraging unreliable UA sniffing, as well as problematic UA sniffing patterns such as allow and block lists. So its intent is to discourage patterns that harm browser diversity.

mh0478025 commented 4 years ago

Hi Yoav,

First and foremost, thank you for giving the opportunity to members of the community to engage in this discussion.

I'm very concerned regarding the reasoning behind this change: bits of entropy.

As a small ad network, we use IP and User-Agent data to combat ad fraud. These same bits of entropy are used when we detect ad fraud. This is virtually impossible to do if all we are getting is a rotating VPN-based IP address and "Chrome 74". At best, we have to wait for another request to get the rest of the UA data (significantly reducing our ad serving speed); at worst, we will "exceed the user's privacy budget" and be denied this information altogether.

Who decides how "the user agent can make reasonable decisions about when to honor requests for detailed user agent hints"? There is absolutely no doubt that your own properties will be ranked high in a hypothetical "trust/privacy" rating. There is nothing stopping you or your successors from abusing this power against smaller players like in the case with Yelp.

Because your organization has hundreds of millions of logged-in active users across your various web properties and devices (Search, Chrome, Android, Chrome OS, YouTube, Gmail, Pixel, etc.), it is significantly easier for you to run ad fraud analysis and protect your own ad network while the rest of us bite the dust.

The GitHub repo states: "Top-level sites a user visits frequently (or installs!) might get more granular data than cross-origin, nested sites, for example". What about smaller sites just starting out? With all the large players (that already have top-level sites a user visits frequently) remaining untouched, you are effectively crippling competition from smaller players.

The vast majority of internet users simply don't care about this change, and the handful that do are probably underestimating the anti-trust issues that this change brings. One would have to be naive to think that there is absolutely no conflict of interest when a company that collects the most data on earth is limiting what other, less frequently visited sites are allowed to see.

Reducing bits of entropy is simply not a good enough reason to proceed with the change. I adore Chrome and personally use it every day; however, I'd like to point out to the community that this change is not in everyone's best interest.

TL;DR: We need the full OS and browser version to survive as a small ad network. And more importantly, we need this data as part of every first HTTP request, without being discriminated against for being a less frequently visited / smaller website.

Regards, Andy

torgo commented 4 years ago

@yoavweiss @jyasskin

Sec-CH-UA: "Chrome"; v="70", "Chromium"; v="70", "Zamzung Zinternet"; v="10" Does that help alleviate your concerns?

Sort of...? But couldn't this just revert over time to being just as messy as the current UA String?

And yes, from the PoV of non-dominant browsers, who often need to maintain support for their development by citing usage numbers, I don't think it would be a good thing to suppress the browser name.

I do not think you can engineer a system to 100% eradicate browser sniffing and targeting. Some of this will always have to be tackled through best practice sharing and community pressure.

yoavweiss commented 4 years ago

But couldn't this just revert over time to being just as messy as the current UA String?

It could. I'm hoping a decent application of GREASE can help prevent it.

I do not think you can engineer a system to 100% eradicate browser sniffing and targeting.

I agree. But I want us to try and disincentivize negative behavior related to that (e.g. block and allow lists).

kiwibrowser commented 4 years ago

Hi @yoavweiss

"I want us to try and disincentivize negative behavior related to that"

Did you consider removing the installation-specific Google tracking header (x-client-data) that Google Chrome sends to Google properties?

Example: https://www.youtube.com - in network headers, look for x-client-data

Now, go to https://ad.doubleclick.net/abc - and your browser also sends this magic x-client-data.

It's a unique ID to track a specific Chrome instance across all Google properties.

Really curious about your opinion, especially since the GDPR explicitly forbids such tracking. Moreover, it doesn't make sense to anonymise the User-Agent if you have such a backdoor.

gjsman commented 4 years ago

From HN related to @kiwibrowser 's post: https://www.google.com/chrome/privacy/whitepaper.html

"We want to build features that users want, so a subset of users may get a sneak peek at new functionality being tested before it’s launched to the world at large. A list of field trials that are currently active on your installation of Chrome will be included in all requests sent to Google. This Chrome-Variations header (X-Client-Data) will not contain any personally identifiable information, and will only describe the state of the installation of Chrome itself, including active variations, as well as server-side experiments that may affect the installation."

"The variations active for a given installation are determined by a seed number which is randomly selected on first run. If usage statistics and crash reports are disabled, this number is chosen between 0 and 7999 (13 bits of entropy). If you would like to reset your variations seed, run Chrome with the command line flag “--reset-variation-state”. Experiments may be further limited by country (determined by your IP address), operating system, Chrome version and other parameters."
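
(For context, the quoted figure is just the size of the seed range: a uniformly random value in [0, 7999] carries log2(8000) ≈ 12.97 bits, which rounds to the "13 bits" above. A quick check:)

// Entropy of a uniformly random seed drawn from [0, 7999]:
const bits = Math.log2(8000); // ≈ 12.97 bits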

chris-griffin commented 4 years ago

If usage statistics and crash reports are disabled, this number is chosen between 0 and 7999 (13 bits of entropy)

This is a misdirect. First, according to the same cited whitepaper, usage statistics are "enabled by default for Chrome installations of version 54 or later". This means that nearly all Chrome installs will have very high entropy.

And even if a user disables usage statistics, a low entropy seed will very likely still yield a high entropy string since it includes "the state of the installation of Chrome itself, including active variations, as well as server-side experiments that may affect the installation."

If you want to use this argument, the equivalent would be to allow users to disable their User-Agent, but to send it by default. That seems like a much saner approach.

jwrosewell commented 4 years ago

This thread is already highlighting a set of issues which can be summarised as:

  1. Consultation – ensure a multitude of stakeholder needs from "minority browser" vendors, fraud, advertising, publishing, technology and marketing industry sectors – among others - are all considered.

  2. Design – ensure the perceived problems with the existing User-Agent field value are not recreated in the replacement. Turn the "what we had in mind" musings of exceptional engineers into fully defined engineering specifications suitable for all to work with.

  3. Breaking the web – a full study of how the User-Agent [and other similar fields and practices] is used in practice – not in theory or based on individual bias – to inform an impact assessment. Understand the migration approach for each scenario.

There is no burning problem or "innovation impairment" to justify incomplete engineering, poor governance or risky implementation. National regulators do not require this change – and in fact are actively balancing the needs of privacy, business and fair competition.

pluma commented 4 years ago

Just as a reminder for people who think an "anonymous" ID on a request is fine because the GDPR is only about personally identifiable information: if you attach that random ID to the (most likely fairly unique) configuration of the individual installation and then apply the resulting ID to various requests made by the user, the "anonymous" ID becomes "pseudonymous", since you can infer the user's identity from it. That makes the resulting data personally identifiable.

Additionally, regardless of the legality and compliance of this, it is clearly a violation of the spirit of information scarcity present in the GDPR and shows a complete disregard for the idea of personal ownership of data and the right to privacy by default.

fredgrott commented 4 years ago

Google, I think you can do better than this, and it might harm you in certain US state AG meetings about Google anti-trust issues. Please rethink!

joneslloyd commented 4 years ago

This is scary stuff for those living in oppressive countries who aren’t tech-savvy enough to use a proxy and change these settings...

fightborn commented 4 years ago

I'm just here to read advanced excuses from people who think that the smart guys reporting the issue here are idiots. You do realize that Chrome is becoming a worse plague than Internet Explorer ever was?

markentingh commented 4 years ago

You could easily install a Chrome extension for modifying request headers and block the x-client-data header.
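
For instance, here is a rough sketch using Chrome's Manifest V3 declarativeNetRequest API (untested; at the time of this discussion the blocking webRequest API was the usual route, and whether Chrome actually permits stripping this particular header is worth verifying):

// Extension background script: strip X-Client-Data from outgoing requests.
// Requires the "declarativeNetRequest" permission plus host permissions.
chrome.declarativeNetRequest.updateDynamicRules({
  removeRuleIds: [1], // makes re-registration idempotent
  addRules: [{
    id: 1,
    priority: 1,
    action: {
      type: "modifyHeaders",
      requestHeaders: [{ header: "X-Client-Data", operation: "remove" }],
    },
    condition: {
      urlFilter: "*",
      resourceTypes: ["main_frame", "sub_frame", "script", "xmlhttprequest"],
    },
  }],
});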

AdamMurray commented 4 years ago

@markentingh Privacy shouldn't be reserved for those with the knowledge to modify request headers.

bhartvigsen commented 4 years ago

You could easily install a Chrome extension for modifying request headers and block the x-client-data header.

So smart people get privacy, and the grandmothers of the world don't?

dbaron commented 4 years ago

While the subject of the X-Client-Data header seems peripherally relevant to this issue, detailed discussion and advocacy focusing on that header (rather than the subject of freezing User-Agent and making the equivalent information available via Client Hints) seems to me to be off-topic for this issue, which is a request to the W3C's Technical Architecture Group to review the latter subject.

torgo commented 4 years ago

A reminder to anyone posting on this issue: we do encourage public participation in our TAG review discussions, but please keep on topic and please ensure that you adhere to the W3C code of conduct.

annadane commented 4 years ago

I really hope you understand that as a company, what you say to your users publicly versus what gets discussed internally, development-wise, within your team or here on GitHub, are two very different things. Don't lie to your customers. Don't pretend it's ok.

awilfox commented 4 years ago

I fail to see how this would significantly impact privacy or fingerprinting, since Client Hints will enable fingerprinting equivalent to or even deeper than what User-Agent currently provides. However, I can easily see how this would significantly strengthen the monoculture that Chrome / the Blink engine enjoys on the modern Web at the expense of all other browsers/engines. I think that this would be an extremely negative change for user choice and diversity.

bhartvigsen commented 4 years ago

A reminder to anyone posting on this issue: we do encourage public participation in our TAG review discussions, but please keep on topic and please ensure that you adhere to the W3C code of conduct.

The guy is pushing for this change that benefits his trillion-dollar company at the expense of his competition. This IS on-topic. Google's tracking activities are directly, specifically relevant to this desired change. If you guys don't want to take into account the totality of facts in your discussions and decisions then that's fine but don't pretend you're doing otherwise.

dbaron commented 4 years ago

I fail to see how this would significantly impact privacy or fingerprinting since Client Hints will enable fingerprinting equivalent or even deeper than what User-Agent currently provides

I think the key point in terms of effect on fingerprinting is that User-Agent is passive fingerprinting surface while Client Hints is active fingerprinting surface. Moving the same information from passive to active fingerprinting serves the goal of making fingerprinting detectable (see the "Detectable Fingerprinting" item), which I think is a worthy one when the use cases for exposing the data are strong enough that it doesn't make sense to remove the data exposure completely. (I'd also note that most of the information in the User-Agent string is detectable in other ways, e.g., through browser feature detection or other mechanisms, but those ways are less reliable, especially when applied to unknown future browsers or to smaller-share browsers that the author of the detection didn't consider.)
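
To sketch the difference (header names here follow the UA-CH draft and may change): by default a request carries only the low-entropy set, and a server that wants more has to announce that interest on the wire, where it can be observed and audited.

GET / HTTP/1.1
Sec-CH-UA: "Chrome"; v="80"
Sec-CH-UA-Mobile: ?0

HTTP/1.1 200 OK
Accept-CH: Sec-CH-UA-Platform, Sec-CH-UA-Full-Version

GET /next HTTP/1.1
Sec-CH-UA: "Chrome"; v="80"
Sec-CH-UA-Platform: "Linux"
Sec-CH-UA-Full-Version: "80.0.3987.87"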

That said, there are other concerns raised here that I share: it is clearly somewhat disruptive to existing practices (although it doesn't seem likely to break existing content directly), it's unclear what the effects on minority browsers will be (although I think it could be either positive or negative), and in the past I've expressed concerns with other aspects of Client Hints (although mostly focusing on whether particular features should or shouldn't be detectable through Client Hints, rather than the mechanism itself).

kralos commented 4 years ago

This is scary stuff for those living in oppressive countries who aren’t tech-savvy enough to use a proxy and change these settings...

To clarify: the installation ID being sent by the browser circumvents any privacy gained from a proxy / VPN, or even routing over Tor.

baybal commented 4 years ago

@yoavweiss you are being severely downvoted

JasSra commented 4 years ago

This is not the first time this has happened, and it surely won't be the last. Trust is a two-way street, I suppose; it doesn't work one way.

Resolve: do not support technologies which do not respect you. Ditch Chrome!

nt1m commented 4 years ago

As @jyasskin pointed out, the examples there should clarify what we had in mind on that front.

As for the Zamzung Zinternet case, I'd expect it to send out a set that looks something like Sec-CH-UA: "Chrome"; v="70", "Chromium"; v="70", "Zamzung Zinternet"; v="10". That would enable sites that haven't bothered testing on it to treat it the same as other Chromium browsers, enable sites that want to target it specifically to do so, and enable analytics (that are aware of it) to understand which specific browsers those users are coming from.

Does that help alleviate your concerns?

Also, note that there's a discussion on maybe putting more emphasis on the engine vs. the UA brand: WICG/ua-client-hints#52

To me, the Zamzung Zinternet case sounds like a good example in which we should prefer the current spec over switching to sending only the engine by default.

I think this will just lead to the point where all browsers ship with Sec-CH-UA: "Chrome"; v="70", "Chromium"; v="70" in the set, including browsers based on other engines like Gecko, due to poor industry practices. This may become as meaningless as Mozilla/5.0 is in the current UA string... So honestly, no, it does not address this concern for me at all.

An equivalent API will enable access to that information on the client. Access to low-entropy information will be synchronous, while access to high-entropy one will be through a Promise. (to enable browsers to take their time when considering if the site should really be granted to potentially fingerprintable info)

Can all of these "low-entropy" APIs be async? It provides more flexibility for browser vendors (not just for privacy, but also for implementation) and doesn't necessarily add much effort for websites.
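
For concreteness, a sketch of the split being discussed, using the navigator.userAgentData shape from later revisions of the explainer (names illustrative, not normative):

// Low-entropy values: available synchronously.
const { brands, mobile } = navigator.userAgentData;

// High-entropy values: behind a Promise, so the browser can defer or decline.
navigator.userAgentData
  .getHighEntropyValues(["platform", "platformVersion", "uaFullVersion"])
  .then((hints) => console.log(hints.uaFullVersion));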

By default, the browser will send Sec-CH-UA and Sec-CH-UA-Mobile headers to enable most cases of content negotiation. As those headers are low-entropy, we can afford that trade-off, privacy-wise.

I also don't see a compelling reason to send those out by default; can't they be opt-in like all the others? Websites would receive less information by default, but could still opt into it, and it doesn't necessarily have to degrade the user experience, as browser vendors may decide to grant that permission without asking the user.

withinboredom commented 4 years ago

I don’t see why we need a string at all. Why can’t a bit mask work for this: say 128 bits, the first 32 represent the “version” of the remaining 96 bits. An ideal browser would return a solid set of 96 1s for a given version of features. You probably don’t need this many bits, but I’m just using it as an example.

Say version 1 might include 32 bits of proposed spec, 32 bits of experimental spec, and 32 "recent" features.
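
A rough sketch of the idea (all bit assignments hypothetical):

// 128-bit mask via BigInt: bits 96..127 hold the "version" of the layout,
// bits 0..95 are feature flags whose meaning that version defines.
const FEATURE_BIT = { proposedSpecA: 0n, experimentalB: 32n, recentC: 64n };

function buildMask(layoutVersion, features) {
  let mask = BigInt(layoutVersion) << 96n;
  for (const f of features) mask |= 1n << FEATURE_BIT[f];
  return mask;
}

const mask = buildMask(1, ["proposedSpecA", "recentC"]);
const hasRecentC = (mask >> FEATURE_BIT.recentC) & 1n; // 1n if supported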

mcatanzaro commented 4 years ago

One specific concern I have about this proposal has to do with how "minority browsers" are impacted.

I have some experience with constructing user agent strings for a minority browser (Epiphany, using WebKitGTK). I see a lot of concern in this thread as to how the proposal could impact minor browsers, but not much explanation of why that impact might be negative.

it's unclear what the effects on minority browsers will be (although I think it could be either positive or negative)

I have no doubt: the impact will be positive. Very positive. In particular, Yoav's proposed change will make it harder for Google to screw over small browsers, something it has done to us many times in the past. (And continues to do to this day. I have a pending TODO to try to figure out a new user agent quirk for Google Docs. It's not easy.) There are some details of the UA-CH proposal that seem problematic to me (details below), but overall it seems like a big step in the right direction. This change will help small browsers deal with the compatibility issues caused by websites' abuse of the user agent string. As the maintainer of the Epiphany web browser, I'm confident Yoav's proposal will help us far more than it helps Chrome.

And yes, from the PoV of non-dominant browsers, who often need to maintain support for their development by citing usage numbers, I don't think it would be a good thing to suppress the browser name.

For a non-dominant browser, it is absolutely essential to send a user agent that matches a dominant browser as closely as possible. A small browser cannot be web compatible otherwise. Constructing a viable user agent string for a small browser is hard. The user agent header is extremely fragile and difficult for a non-dominant browser to get right. I have spent far too long experimenting with fairly small changes to the user agent, and also experimenting with WebKit's user agent quirks list (which is absolutely required, because it is impossible for a non-dominant browser to select a user agent that will work for all major websites). Even appending the browser's name to the end of the user agent -- which is relatively safe -- creates risks; I have a pending task to investigate whether we should stop doing that.

Yoav's proposal is designed to give small browsers a better chance, not to favor Chrome. The status quo of the user agent string favors Chrome, and is in fact absolutely brutal for small browsers. Please read that linked thread in full if you have any doubt that the status quo is awful for small browsers. Years ago, I wrote:

User agent is an extremely demotivating, never-ending game, and it's by far our biggest web compatibility problem. It almost feels as if Google is deliberately trying to break WebKit, which I know is not true as they don't care either way about us... but they do know full well that basing logic off of user agent checks serves to harm less-popular browsers, so it's hardly unintentional. I cannot think of any aspect of WebKit development less gratifying than maintaining our user agent quirk list, nor any bigger user agent offender than Google.

The situation has not improved since then.

Now, we can debate the details of how exactly Sec-CH-UA would work. The current spec actually exposes far too much IMO; eventually, it could become just as problematic as user agent strings currently are. (I've had to hardcode fake user agents for websites that blacklist FreeBSD users, or websites that treat ARM laptops like smartphones just because they see "arm" in the user agent. Creating a standardized mechanism for websites to do this is a mistake.) But that's orthogonal to this issue.

In addition to what Yoav has already proposed, I would also very much like to see the string "Chrome" completely removed from Chrome's user agent, despite the short-term web compat issues that would cause. If that's out of the question, I'd love to see the version number frozen at least. Playing catch-up with Chrome user agents is very frustrating.

TL;DR: We need the full OS and browser version to survive as a small ad network. And more importantly, we need this data as part of every first HTTP request, without being discriminated against for being a less frequently visited / smaller website.

The user agent is very difficult due to such competing interests. Small browsers need servers to either not have access to this information, or to have access only to fake frozen versions of this information, and we need collaboration from large browsers to make this possible. As far as we're concerned, any web server that so much as looks at the user agent string is evil. Revealing OS, browser, or architecture information allows websites to block us and makes it very hard to compete with Chrome.

My TL;DR: thank you Yoav, and thank you Chrome developers, for this very serious proposal to make things easier for small competing browsers.

mcatanzaro commented 4 years ago

The user agent is very difficult due to such competing interests.

I suppose a wild proposal would be to allow most websites to receive accurate information, except websites that abuse this privilege by blocking small browsers. In this fantasy proposal, when Epiphany users discover a website blocking them or otherwise degrading the user experience, we would report the issue to Chrome, and Chrome would update its own quirks list to send a fake Epiphany user agent just to the affected website in order to intentionally break that website for all Chrome users, to force the website to fix the issue. That way, small ad networks and websites that want to use the user agent for non-abusive purposes could continue to do so, without harming small browsers. It's not a very serious proposal, because it would require Chrome devs to intentionally break major websites for Chrome users, but I don't know what else would satisfy both web developers and small browsers.

ocram commented 4 years ago

I really fail to see the advantages of this proposal (outweighing the downsides):

It seems the most reasonable (and by all means simplest) solution may be freezing more and more parts of the UA string, and relying on explicit feature detection otherwise.

And the only two upsides here may be the added dimensions in a structured format that website operators could use for content negotiation, and perhaps the turning of passive fingerprinting into active fingerprinting – for those who depend on fingerprinting based on UA strings because they don't yet have vast amounts of other information and activity records. The first is only an upside for usability, and yet again a clear negative for privacy – which was one of the original goals, and which can only be guaranteed through added complexity and an unlevel playing field.

The fundamental incentives and dynamics are that website operators will always want to know what browser and version a client is using exactly, and will therefore reverse-engineer browser vendors’ implementations, while browser vendors will always try to prevent this to defend the UX of their users while browsing the web. This is by definition a game of cat and mouse that is bound to repeat with a different implementation, which is proposed here. The new concept would probably reset expectations (and implementations) for a little while – and then let the same things happen again. But not without weakening competition at the same time – between those who already have everything they need to fingerprint devices or will receive higher-entropy information as trusted sites or already-visited hosts, and those who do not.

Steve51D commented 4 years ago

I have some experience with constructing user agent strings for a minority browser (Epiphany, using WebKitGTK)

Thanks, it's great to get a viewpoint from that side after all the discussion on here.

I suppose a wild proposal would be to allow most websites to receive accurate information, except websites that abuse this privilege by blocking small browsers. In this fantasy proposal, when Epiphany users discover a website blocking them or otherwise degrading the user experience, we would report the issue to Chrome, and Chrome would update its own quirks list to send a fake Epiphany user agent just to the affected website in order to intentionally break that website for all Chrome users, to force the website to fix the issue.

I like it in theory, but I doubt that's ever going to happen in practice. This proposal highlights the fundamental problem, though: that problem is not the User-Agent string itself but websites' misuse of it.

I think CH could alleviate some of the unintentional misuse that occurs, but the only way you can really stop that misuse is not to send the data in the first place. That opens up a whole new can of worms, as there are many use-cases that rely on that data. Removing the viability of those use-cases just to try and solve an issue with what are, essentially, poorly coded websites seems misguided.

mcatanzaro commented 4 years ago

  • Are you convinced that GREASE will help? Perhaps in the very short term, but that’s it. Do you expect website operators to not recognize that only NotBrowser and Foo are being mixed in randomly, while Epiphany is indeed a safe sign that the browser at hand is Not Real Chrome? Will Chrome actually start mixing in real browser names and versions to give teeth to GREASE? No, Chrome won’t. That would defeat the purpose of the whole feature after all, because now you can never be sure what browser the client is using.

Your comment is thoughtful and well-considered. This is the only point where I don't agree, but it's a critical point. GREASE will help a lot, and in fact I'd say it's key to the entire proposal. I'm not worried about websites searching for "Epiphany" and blacklisting it; my worry is sites searching for anything non-Chrome and blacklisting that. (Even if websites actually specifically try to blacklist us, removing "Epiphany" from the UA-CH list would be no big deal.) If Chrome greases with some truly random values, that will be a huge benefit to other browsers. If Chrome is randomly removed from the list, as proposed in the spec, that will help a ton as well. (Say it includes "Chrome" for half of page loads; then you can still gather accurate usage statistics by multiplying by two.) Chromium browsers would benefit hugely from the switch from "Chrome" to "Chromium" (although I'd love to see that randomly removed as well, because Epiphany isn't based on Chromium and we don't really want to wind up in a world where small browsers are OK only if based on Chromium).

The current user agent string cannot be GREASEd.

ImaCrea commented 4 years ago

This situation is maybe one of the greatest recent examples of why it is absolutely necessary that all software be free or, at least, open source.

ocram commented 4 years ago

Let me clarify why GREASE will not help at all, @mcatanzaro and @jyasskin:

We all agree that website operators want to identify browsers for feature detection, compatibility checks and statistics. Browser vendors want to have their product identified for a known market share and for statistics. So website operators will find ways to detect the actual browser name and version (if possible in any way), and browser vendors will include the real name of their own product (at least for most requests).

That’s why we’re here, and this is certain to happen again with a different implementation – simply because of the incentives and interests. Look what we have done as a community (i.e. the overall web community) [1] [2] [3] [4]. It will happen again. Hundreds of libraries will help do it.

Today (User-Agent string)

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36

might be identified (in part) via

isChrome = /Chrome\/[.0-9]+ /.test(uaString) || /CriOS\/[.0-9]+ /.test(uaString);
degradedExperience = !isChrome;

Sites like Google Docs might be doing something like this today.

Perhaps soon (Sec-CH-UA set)

"Chrome"; v="80"
"Chrome"; v="80", "Chromium"; v="80"

might be identified (in part) via

isChrome = uaSet.has("Chrome");
greatExperience = uaSet.has("Chromium") || isChrome;
degradedExperience = !greatExperience;

Perhaps soon (Sec-CH-UA set + GREASE)

GREASE is “likely” to be applied, but optional. It shouldn’t be optional.

Some parts say the plan is to only “[a]ppend additional items” or “[r]andomize the order”.

"Chrome"; v="80", "NotBrowser"; v="12"
"Foo"; v="10", "Chrome"; v="80"
"Chrome"; v="81", "Bar"; v="64", "Chromium"; v="81"

That can be identified in the same way as above. It only solves the problem of websites blocking unknown browsers. But who does that? Websites can still block:

block = !uaSet.has("Chrome") && !uaSet.has("Firefox") && !uaSet.has("Edge") && ...;

Perhaps soon (Sec-CH-UA set + GREASE + drop self)

The Brave and Firefox teams, for example, might want to put in “Chrome” as well, because website operators have added conditions on the presence of “Chrome” again.

Chrome can now prevent this, though. “Chrome might remove itself from the set entirely”. Might. I doubt it will happen, for the reasons outlined above. But it might.

So now all website operators must be responsible citizens of the web and build equivalence classes based solely on the rendering engine:

greatExperience = uaSet.has("Chromium");
degradedExperience = !greatExperience;

Now “Chromium” is the new “Chrome” and all browser vendors will add it as well. It will defeat a large part of the purpose of Sec-CH-UA once again: Statistics are less meaningful, and not every “Chromium” is like the other “Chromium”. Website operators want to know if it’s Real Chrome, not just something similar with different feature flags or defaults.

And how will website operators detect Real Chrome with these hints?

"Foo Browser"; v="23", "Firefox"; v="72", "Gecko"; v="72"

"Chromium"; v="80", "Bar Browser"; v="35", "Brave"; v="104"

It’s simple:

isChrome = !uaSet.has("Firefox") && !uaSet.has("Brave") && !uaSet.has("Edge") && ...;

Others see the same problems or even further problems [5] [6] [7] [8] [9].

All in all, this variant of GREASE needs to solve different problems from the original GREASE’s. It can’t solve these, and thus it will fail. We’ll be back to where we started, even with GREASE (which will only make things more complicated).

yoavweiss commented 4 years ago

@ocram

GREASEing by adding non-existent browser names would avoid blocking of unknown browsers (which, as @mcatanzaro indicated, is a major problem today).

One can also imagine sending invalid headers that would still be correctly parsed by valid Structured Headers parsers, to avoid error-prone regex-based "parsing" (e.g. "Chrome"; v="73", "GibberishFirefox 66 dfgdfg").
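
A toy illustration (the splitter below is not a compliant Structured Fields parser, just enough to make the point):

const header = '"Chrome"; v="73", "GibberishFirefox 66 dfgdfg"';

// Regex "parsing" is fooled by the GREASE entry...
const regexSaysFirefox = /Firefox/.test(header); // true (wrong)

// ...while list-aware parsing surfaces it as a distinct, ignorable brand.
const brands = header.split(",").map((item) => item.trim().split(";")[0].replace(/^"|"$/g, ""));
const listSaysFirefox = brands.includes("Firefox"); // false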

GREASEing by pretending to be other browsers (to avoid their explicit, intentional blocking) would indeed carry compat risk, so would require further experimentation.

It seems like WebKit today has a list of websites to which its UA string lies, for its own compat benefit. One could imagine a distant future where browsers keep a similar list on which they perform targeted GREASEing (somewhat similar to what @mcatanzaro suggested) in order to dissuade known compatibility offenders from their practices. Enabling GREASEing in the first place seems like a good first step in that direction.

torgo commented 4 years ago

@yoavweiss you wrote above:

Beyond the privacy benefits of this change, it has an explicit goal of discouraging unreliable UA sniffing, as well as problematic UA sniffing patterns such as allow and block lists. So its intent is to discourage patterns that harm browser diversity.

You've stated that, indirectly, the goal is to increase browser diversity. If we agree that browser diversity is a good thing, could an explicit goal be to increase browser diversity, not indirectly but directly?

I remain concerned about the ability of non-mainstream browsers to measure their reach, and the ability of web sites to measure traffic by browser. Two years ago, we (Samsung) worked with the Google Analytics team to get them to split out Samsung Internet traffic from Chrome traffic in the analysis and reports they provide their clients. This had the effect of making it more visible to web site owners, such as the UK Government Digital Service, when our browser was being used. This had the impact (see the link) of GDS adding our browser to their testing recommendations – which in the end benefits end users of UK government services. (Also see their note on that blog post in support of browser diversity.) How would this same story play out in a post-UA world?

jyasskin commented 4 years ago

@torgo you'll want to include a Samsung string in the always-sent UA set. I liked @mcatanzaro's idea to send it a fixed fraction of time to reduce how much that helps sites fingerprint your users.

jyasskin commented 4 years ago

Like today, you'll have to choose between getting credit for your users vs stopping sites from sending your browser special content. There's nothing magic on that front about this proposal, although @mcatanzaro's idea might help some.

immanuelfodor commented 4 years ago

Sites sending your browser special content can be a desired feature sometimes, for example, Emby transcodes the AC3 audio for Firefox which is ultimately broken when streamed if you fake the UA string to appear as Chrome (personal experience). If any of the above proposals can solve such cases with e.g. an opt-in feature to "unfreeze" the UA string for better compatibility on a user-edited whitelist of sites that are not yet ported to the new method(s), I'm all in for better privacy.

mgol commented 4 years ago

@yoavweiss

GREASEing by adding non-existent browser names would avoid blocking of unknown browsers (which as @mcatanzaro indicated, is a major problem today).

It's true this would avoid blocking unknown browsers. What browsers are unknown, though, depends on the tools you use. As @ocram rightly noticed, there are incentives to detect even minor browsers (e.g. for analytics purposes), which means tools are created that make such detection possible; these same tools are then used to apply some fixes, enable some features, etc., only for some browsers.

When you use a browser-detecting library that knows about minor browsers, this library will ignore tokens like "NotBrowser" but it will take tokens like "Vivaldi" into account.

Vivaldi has recently stopped using its own token in the user agent string for this very reason. They wrote a blog post showing a few examples of how Google sites were serving a degraded experience to Vivaldi when the Vivaldi token was present in the UA. One example is google.com, where input text appears outside of the input frame. I repeated the test locally and the site was broken with the Vivaldi/2.10.1745.27 token, but it worked correctly when I changed it to Vivald/2.10.1745.27 (i.e. I just removed the "i"). It's clear, then, that it's not just that any extra UA suffix would break the site; it was specifically singling out Vivaldi.

These issues wouldn't exist if sites were targeting engines by default instead of browser names when applying changes related to engines' APIs. Since even Google often does it by browser, it's hard to expect companies with less cash to spend time on making sure they're not singling out minority browsers.

One can also imagine sending invalid headers that would also be correctly parsed by valid Structured Headers parsers, to avoid error-prone regex based "parsing". (e.g. "Chrome"; v="73", "GibberishFirefox 66 dfgdfg")

This won't solve the case I described above.

One could imagine a distant future where browsers would keep a similar list on which they perform targeted GREASEing (somewhat similar to what @mcatanzaro suggested) in order to dissuade known compatibility offenders from their practices. Enabling GREASEing in the first place seems like a good first step in that direction.

If such a strategy were applied against known offenders, it'd have to be done carefully, as I can't imagine browser makers willfully breaking these offenders' existing sites. Also, note that this list of offenders would have to include Google today, so Chromium would have to fight against the company that governs the project. It's hard to imagine that happening.

ocram commented 4 years ago

So we seem to agree that preventing problems with unexpected entries is the only thing that GREASE solves.

Therefore, the absurd accumulation of complexity and size, and the lies about browser identities, are things that will quickly happen again, as described in detail above – because the proposal has nothing in it to stop this and the incentives all remain the same.

Finally, as @mgol said, I can't see popular browsers starting to intentionally lie about their identity for the greater good, especially not Google Chrome lying to Google Search, Google Docs, YouTube, etc. If popular browsers wanted to lie about their identity for the greater good, they could have been doing so for a long time already.

Steve51D commented 4 years ago

If GREASE creates more problems than it solves, then you are left with the question of what to do about the underlying problem it is trying to solve. This primarily seems to be an issue for browser developers, some of whom advocate removing the User-Agent or Sec-CH-UA entirely. There are also privacy campaigners who want it removed.

There are several issues down that road, but I think that one of the most critical is that it puts far more power into the hands of the dominant browser, i.e. Google.

The fact that Google themselves have added additional tracking into Chrome to go beyond what User-Agent allows shows the value of this kind of information. This x-client-data header is only sent to Google websites so only Google have access to that data.

If the browser were not identifiable in the request, then Google would be the only ones with a picture of the browser landscape, rather than one of many, because they are in the unique position of having a huge share of the browser market as well as enough big website properties to funnel data through.

I think that browser developers are just going to have to continue dealing with this problem of incompatible websites as they come up. I'm sure that's a very frustrating position to be in but the alternatives seem far worse for everyone else.

ghost commented 4 years ago

You should not have to be tech-savvy to prevent tracking or data gathering. Companies should not be able to gather data or track you without CLEAR and EXPLICIT opt-in, and provision to ensure that you can at any time request the removal of all your data they hold. It is unfortunate that companies take the approach that, once you allow them to gather data, the gathered data belongs to them. At best, companies have permitted use of a user's data from the moment of opt-in; once the user cancels, so goes the permission to use it. If companies want to keep that data when a user opts out, there should be an explicit request from the company to the user asking whether they may keep using the already-captured data.

All of this should be implemented with the intended user being a non-technical person, with information displayed simply and correctly.

scottlow commented 4 years ago

As @yoavweiss mentioned above, issue #52 in the UA Client Hints repository attempts to summarize much of the ongoing debate here, particularly around GREASE.

My concern with browsers pretending to be other browsers some fixed percentage of the time is twofold:

While we certainly ran into a few sites that blocked the new Edge based on the fact that it had an unknown "Edg" token (web.whatsapp.com was one example), the far more common cause of breakage that we encountered was from sites that started detecting our "Edg" token as a unique browser, but failed to update their per-browser allow lists to include the new Edge. As @mgol mentioned above:

These issues wouldn't exist if sites were targeting engines by default instead of browser names

While I admit that exposing engine by default and letting sites opt into receiving brand information using Accept-CH: UA does not address the issue of allow/block lists being created (at least not without some discouragement from opting into additional client hints via something like Privacy Budget), my hypothesis is that it would encourage site developers to build allow lists off of well-defined equivalence classes, thus reducing the number of compatibility issues caused by allow lists constructed from per-browser identifiers.

ocram commented 4 years ago

Do you think that with engines instead of browser brands, website operators will suddenly all become responsible citizens of the web? That is, engines will not just be the new browser brands when it comes to browser identification?

If my browser has CustomEngine, but sites restrict certain features or serve a degraded experience due to that information, my browser will either send CustomEngine (Chromium) or "CustomEngine", "Chromium" or "Chromium"; version=80, "CustomEngine"; version=28.

Again, website operators may exclusively rely on true equivalence classes and everything may be great. But why should anything be different with Sec-CH-UA and Sec-CH-UA-Engine instead of User-Agent? With regard to the incentives and underlying problems, nothing has changed.

By the way, as for randomly returning different values (e.g. in 25% of all cases), I think it’s obvious that this won’t work for the use cases that make User-Agent something that people rely on today. It’s the same situation as with including fake brands or dropping oneself from the set.

scottlow commented 4 years ago

Do you think that with engines instead of browser brands, website operators will suddenly all become responsible citizens of the web?

Nope. I will readily admit that both the Sec-CH-UA-Engine and Sec-CH-UA proposals suffer from the fact that there are no technical provisions in place to prevent allow/block lists from being created as they can be from the User-Agent today.

My main point is that exposing both brand and engine in a single hint doesn't encourage developers to change their behavior in any way for the better of compatibility. We can provide guidance encouraging them to target true equivalence classes by default; however, providing both brand and engine in a single hint feels an awful lot like providing per-browser identifiers in the User-Agent header today, but recommending that feature detection be used instead.

By only exposing Sec-CH-UA-Engine by default, we are at least adding a hurdle (in the form of having to opt in to receiving brand information) between sites and per-browser identifiers.
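
Sketched as headers (Sec-CH-UA-Engine is hypothetical here, following the proposal in WICG/ua-client-hints#52; the opt-in token matches the Accept-CH: UA form mentioned above). By default, a request would expose only the equivalence class:

Sec-CH-UA-Engine: "Chromium"; v="80"

A site that wants brand information opts in, visibly:

Accept-CH: UA

and subsequent requests then add the brand list:

Sec-CH-UA: "Edge"; v="80"
Sec-CH-UA-Engine: "Chromium"; v="80"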

ocram commented 4 years ago

I agree that the separation of brand and engine is reasonable. It’s just that the hope for better usage by the community in the future is not a strong argument, and responsible developers could already do today what they should do in the future, i.e. rely on engines instead of brands where possible.

Turning passive fingerprinting into (detectable) active fingerprinting and offering information selectively is good as well. While most sites will request similar information and there won’t be much variation that could allow you to detect bad actors, this is still the strongest point of the proposal, I’d say.

But I really don’t think it will change anything about the complexity and length of strings (or sets), so maybe we should not put too much hope into that and avoid making the proposal more complex to make those dreams possible. It will not work.

All in all, it doesn’t appear to be a strong case for this new proposal replacing the current string where both have similar power and will suffer from similar problems. In the end, you will either have to support frozen old values forever or ultimately break backward compatibility.

ronancremin commented 4 years ago

I have some comments on the proposal in a few different areas. Some of these points have been made already but I nonetheless want to restate them.

Lack of industry consultation

The HTTP protocol has become deeply embedded globally over its lifetime. As envisaged by the authors of the HTTP protocol, the User-Agent string has been used in the ensuing decades for “statistical purposes, the tracing of protocol violations, and automated recognition of user agents for the sake of tailoring responses to avoid particular user agent limitations”.

The User-Agent header has been part of the web since its inception. It has been a stable element of the HTTP protocol through all its versions, from HTTP 1.0 in 1996 all the way to HTTP/2 in 2015, and thus has inevitably come to be relied upon, even if particular use cases are not apparent, or have been forgotten about, or its practitioners are not participants in standards groups. The User-Agent string is also likely being used in new ways not contemplated by the original authors of the specification.

There was a salutary example of the longevity of standards in a recent Tweet from the author of Envoy, a web proxy server. He has been forced to add elements of HTTP 1.0 to ensure it works in the real world, despite Envoy’s development starting 23 years after HTTP/1.1 was ratified and deliberately opting not to support HTTP 1.0. This is the reality of the web—legacy is forever.

Despite this reality, there is no public evidence of any attempt to consult with industry groups to understand the breadth and severity of the impact of this proposed change to HTTP. It is a testament to its original design that the HTTP protocol has endured so well despite enormous changes in the internet landscape. Such designs should not be changed lightly.

Issues with the stated aim of the proposal

The problem with the User-Agent string and the reason to propose Client Hints, per the explainer, is that “there's a lot of entropy wrapped up in the UA string” and that “this makes it an important part of fingerprinting schemes of all sorts.”

In subsequent discussions in the HTTP WG the privacy issues focused on passive fingerprinting, where the User-Agent string could potentially be used by entities for tracking users without their knowledge.

What is missing from the discussion is any concrete evidence of the extent or severity of this supposed tracking. Making changes to an open standard that has been in place for over 24 years should require a careful and transparent weighing of the benefits and costs of doing so, not the opinion of some individuals. In this case the benefits are unclear and the central argument is disputed by experts in the field. The costs on the other hand are significant. The burden of proof for making the case that this truly is a problem worth fixing clearly falls on the proposer of the change.

If active tracking is the main issue that this proposal seeks to address there are far richer sources of entropy than the User-Agent string. Google themselves have published a paper on a canvas-based tracking technique that can uniquely identify 52M client types with 100% accuracy. Audio fingerprinting, time skew fingerprinting and font-list fingerprinting can be combined to give very high entropy tracking.

Timeline of change

This proposed change is proceeding more quickly than the industry can keep up with. In January 2020 alone there were some important changes made to the proposal (e.g. sending the mobileness hint by default). It is difficult to fully consider the proposal and understand its impact until it is stable for a while. The community needs time to 1) notice the proposal and 2) consider its impact. There has not been enough time.

Move fast and break things is not the correct approach for making changes to an open standard.

Narrow review group

It’s difficult to be objective about this but the group discussing this proposal feels narrow and mostly comes from the web browser constituency, where the change would initially be enacted, but the impact not necessarily felt. It would be good to see more people from the following constituencies in the discussion:

All of these constituencies make use of the User-Agent string and must be involved in the discussion for a meaningful consensus to be reached.

Obviously you can’t force people to contribute, but my sense is that this proposal is not widely known about amongst these impacted parties.

Diversity of web monetisation

Ads are the micropayments system of the web. Nobody likes them but they serve a crucial role in the web ecosystem.

The proposed change hurts web diversity by disproportionately harming smaller advertising networks that use the OpenRTB protocol. This essentially means most networks outside of Google and Facebook. Why? The User-Agent string is part of the OpenRTB BidRequest object, where it is used to help inform bidding decisions, format ads and targeting. Why does it hurt Google less? Because Google is able to maintain a richer set of user data across its dominant web properties (90% market share in search), Chrome browser (69% market share) and Android operating system (74% market share).

The web needs diversity of monetisation just as much as it needs diversity in browsers.

Dismissive tone in discussions

Some of the commentary from the proposers has been dismissive in nature, e.g. the following comments on the "Intent to Deprecate and Freeze: The User-Agent string" post, in response to a set of questions:

Entire constituencies of the web should not be dismissed out of hand. This tone has no place in standards setting.

Entangling Chrome releases with an open standards process

In the review request, Chrome release dates are mentioned. It doesn’t feel appropriate to link a commercial organisation’s internal dates to a proposed standard. There are mentions of shipping code and the Chrome intent.

Overstated support

This point has been made by others here but it is worth restating. It feels like there is an attempt to make this proposal sound as if it has broader support than it really does, in particular on the Chrome intent, linked explicitly by the requester.

Unresolved issues

The review states “Major unresolved issues with or opposition to this specification: ”, i.e. no unresolved issues or opposition. This is true only if you consider unilaterally closed issues to be truly closed. Here are a couple of issues that were closed rather abruptly, coinciding with a Chrome intent.

Some closed HTTPWG issues: