google / sxg-rs

A set of tools for generating signed exchanges at serve time.
Apache License 2.0

Add User-Agent sniffing #395

Closed twifkak closed 2 years ago

twifkak commented 2 years ago

Chromium M73 enabled SXG, but until this commit landed in M79, it sent an Accept header that matches the PrefersSxg logic meant to differentiate SXG crawlers from SXG browsers, per this recommendation.

Without User-Agent sniffing to rule out Chromium M73-78, sxg-rs may produce broken experiences in those browsers on sites that use a service-worker caching strategy, because those Chromium versions do not interpret SXG served from a service worker. Those browsers represent ~0.19% of users.

(This doesn't happen on Cloudflare Automatic Signed Exchanges because they've implemented this UA sniffing downstream.)

twifkak commented 2 years ago

This should be a "denylist" approach -- allowing all user agents except those that match Chromium 73-78.

twifkak commented 2 years ago

I found a few UA parsers on crates.io, in descending popularity:

We should rule out the huge ones, since we have serverless bundle size constraints.

Perhaps a regex will suffice for our needs? Famous last words.

quangIO commented 2 years ago

I think a small heuristic without regex should be the way to go. For example, find `Chrome/` or `Chromium/`, then read the next 2 digits and check whether they are in the [73, 78] range.
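The proposed heuristic could be sketched roughly as follows (function names are illustrative, not from sxg-rs):

```rust
/// Sketch of the proposed heuristic: find "Chrome/" or "Chromium/", then
/// read the next two digits as the major version.
fn chromium_major_two_digits(user_agent: &str) -> Option<u32> {
    let pos = user_agent
        .find("Chrome/")
        .map(|p| p + "Chrome/".len())
        .or_else(|| user_agent.find("Chromium/").map(|p| p + "Chromium/".len()))?;
    // Reading exactly two digits would misread a hypothetical "Chrome/730"
    // as major version 73.
    user_agent.get(pos..pos + 2)?.parse().ok()
}

fn is_affected(user_agent: &str) -> bool {
    matches!(chromium_major_two_digits(user_agent), Some(73..=78))
}
```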

twifkak commented 2 years ago

Perhaps I'm a bit too forward-looking, but there may eventually be a Chrome 730. :wink:

Are you concerned about new dependencies because of bundle size, security, performance, or something else?

In general, I prefer accurate parsers because inaccurate ones lead to weird ecosystem effects:

  1. sites begin to target bugs in the parsers, either deliberately (e.g. the IE star hack) or accidentally, and then
  2. changes to the standard need to keep in mind backwards-compatibility with these hacks, and then
  3. the semantics get progressively more complex (e.g. the HTML adoption agency algorithm, among other rules that lead to cases where you can construct invalid DOMs)

That said, I'm also a pragmatist. In particular, sxg-rs is a drop in the bucket -- it probably won't materially affect the destiny of User-Agent. :grin: So if a small heuristic is preferable, that's fine.

antiphoton commented 2 years ago

user-agent-parser, uap-rust and fast_uaparser all internally use the regular expressions defined in uap-core.

Picking the Chrome-related regex from uap-core suffices to parse the Chrome major version.

However, adding the regex crate as a dependency increases the gzipped WASM size from 737 kB to 954 kB.

twifkak commented 2 years ago

Oh, thanks for the investigation @antiphoton. I overestimated the degree to which UA strings can be accurately parsed. I'll withdraw my earlier comment (except to be mindful of more than two digits).

Maybe something like `" Chrome/"` or `" Chromium/"` (note the leading space) followed by any number of digits? No need for regex then, I suppose? Plain `str` methods or nom will probably suffice?
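A denylist along these lines could be sketched with plain `str` methods, no regex (names are illustrative, not the committed sxg-rs code): allow every User-Agent except those where the marker, with its leading space, is followed by a digit run that parses to a major version in 73..=78.

```rust
/// Sketch: deny only Chromium majors 73-78; everything else is allowed.
fn browser_prefers_sxg(user_agent: &str) -> bool {
    ![" Chrome/", " Chromium/"].iter().any(|marker| {
        user_agent.find(*marker).map_or(false, |pos| {
            // Take the whole run of digits after the marker, so a future
            // "Chrome/730" stays out of the 73..=78 denylist.
            let digits: String = user_agent[pos + marker.len()..]
                .chars()
                .take_while(|c| c.is_ascii_digit())
                .collect();
            digits
                .parse::<u32>()
                .map_or(false, |major| (73..=78).contains(&major))
        })
    })
}
```

The leading space in the markers keeps the check from matching inside unrelated tokens, and parsing the full digit run addresses the multi-digit-version caveat raised earlier in the thread.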