Privacy Review: handle start_url tracking

mounirlamouri commented 9 years ago

To summarize @npdoty's comments in https://lists.w3.org/Archives/Public/public-privacy/2015JanMar/0117.html: there are concerns about start_url containing special IDs or simply something that hints that the user is coming from a homescreen application. This is fingerprinting/privacy-sensitive information that the user might not be aware of.

I think the issue of people doing start_url: 'index.html?from_homescreen' is something we might want to mention in the spec but I don't think we should encourage browsers to prevent this because it is clearly something websites want for various reasons (mostly statistics).

However, I am concerned about having start_url: 'index.html?$GUUID' because it is a way to track the user without them being aware of it. I'm not sure what the spec should say or the browsers could do. Maybe we could recommend showing the start_url to the user and allow them to edit it?

mounirlamouri commented 9 years ago

/CC @paulkinlan

kenchris commented 9 years ago

I think you want to avoid showing the URL to users by default. I would assume that it might frighten some and most don't know what to do with it.

What you could do is that if a URL contains ?, you might show a small information text and an edit button for super users. But I wouldn't show that by default if the url doesn't even contain '?'
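A minimal sketch of the check suggested above, assuming the browser only needs to know whether the start URL carries a non-empty query string before deciding to show an informational note. The function name `needs_start_url_notice` is hypothetical and not part of any browser or spec.

```python
from urllib.parse import urlsplit


def needs_start_url_notice(start_url: str) -> bool:
    """Return True when the start URL carries a non-empty query string.

    Hypothetical helper: a UI could use this to decide whether to show
    the informational text and the edit button.
    """
    return bool(urlsplit(start_url).query)
```

With this sketch, `index.html?from_homescreen` would trigger the note while a plain `index.html` would not.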

lknik commented 5 years ago

Correct me if I'm mistaken, but is pushing the problem onto users the recommended solution to the security/privacy issues here? :)

marcoscaceres commented 5 years ago

We will look at mitigation strategies on the Firefox side and make some suggestions:

https://bugzilla.mozilla.org/show_bug.cgi?id=1542898

aarongustafson commented 5 years ago

It's possible the recommendation could be that UAs strip any query string (or fragment identifier) from the URL when launching, but there are likely legit, non-privacy-invasive uses for these as well (e.g., language preference).
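As an illustration of what such stripping would look like, here is a minimal sketch using Python's standard `urllib.parse`. The helper name `strip_query_and_fragment` is hypothetical, and this is not something the spec prescribes.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query_and_fragment(start_url: str) -> str:
    """Return the URL with its query string and fragment removed."""
    parts = urlsplit(start_url)
    # Keep scheme, host and path; drop the query and fragment components.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
```

Note that the sketch removes a benign parameter such as `lang=fr` just as readily as a tracking ID, which is exactly the trade-off raised above.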

I guess my question would be whether this particular potential abuse vector—a dynamic start_url—creates a unique opportunity to gain information about a particular user that cookies, localStorage, indexed DB, and the cache API—many of which PWAs are already likely to use—don't already provide. If it does, then let's absolutely address it. If not, any mitigations we do would be relatively easy to circumvent via other means. For instance, if you want to know if the site is being viewed as a PWA or a browser tab, which would be a relatively good indication you're coming from a home screen or start menu, you can test the display-mode media query. And there are other APIs being discussed that might only become available if/when you are installed.

To be clear, I'm not dismissing this as a concern; it very well may be a big privacy hole. I would just like to know if the privacy concerns we identify would be unique to this particular case.

dominickng commented 5 years ago

The meaningful difference with existing vectors is that they can all be explicitly cleared by the user (e.g. by clearing cookies and site data in Chrome and equivalents in other browsers). A query parameter on the start_url could hold a unique ID that survives such a clearing.

For instance, one way to solve this is to clear query parameters from start_urls when users clear site data.

reillyeon commented 5 years ago

Can't a site embed a tracking ID in the path as easily as the query parameters?

lknik commented 5 years ago

It's possible the recommendation could be that UAs strip any query string (or fragment identifier) from the URL when launching, but there are likely legit, non-privacy-invasive uses for these as well (e.g., language preference).

I ran a study, and indeed most uses of parameters are legit:

1672 pages include a manifest.json
828 use a dedicated start_url
274 use parameters
None appear to use randomly generated identifiers
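A tally like the one above could be produced with a sketch along these lines. It assumes each manifest is available as a JSON string, and it treats "uses a dedicated start_url" as "has a non-empty `start_url` member", which is an interpretation rather than something stated in the thread. The function name is hypothetical.

```python
import json
from urllib.parse import urlsplit


def summarize_manifests(manifest_texts):
    """Count manifests, those with a start_url, and those whose start_url has a query."""
    summary = {"manifests": 0, "with_start_url": 0, "with_query": 0}
    for text in manifest_texts:
        try:
            manifest = json.loads(text)
        except json.JSONDecodeError:
            continue  # skip documents that are not valid JSON
        if not isinstance(manifest, dict):
            continue  # a manifest must be a JSON object
        summary["manifests"] += 1
        start_url = manifest.get("start_url")
        if not isinstance(start_url, str) or not start_url:
            continue
        summary["with_start_url"] += 1
        if urlsplit(start_url).query:
            summary["with_query"] += 1
    return summary
```

Deciding whether a parameter is a randomly generated identifier still needs manual review or a separate heuristic; this sketch only counts.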

I guess my question would be whether this particular potential abuse vector—a dynamic start_url—creates a unique opportunity to gain information about a particular user that cookies, localStorage, indexed DB, and the cache API—many of which PWAs are already likely to use—don't already provide.

The points I raise are mostly: there is no way to manage these identifiers, their use is not transparent, and they allow other identifiers to be respawned (i.e. if the user removes cookies, they can be brought back to life).

lknik commented 5 years ago

Can't a site embed a tracking ID in the path as easily as the query parameters?

Yep, you can absolutely generate a per-user page in the start_url. That would be functionally equivalent (so stripping parameters is not a 100% solution), but I did not use this particular thing.

aarongustafson commented 5 years ago

@lknik I believe we have data from our Bing crawler around manifest usage (over a million). Let me ask if we can do a little deeper digging as well.

lknik commented 5 years ago

@lknik I believe we have data from our Bing crawler around manifest usage (over a million). Let me ask if we can do a little deeper digging as well.

Would be interesting. But do you have data on the actual start_url's used? If so, I would be happy to see how it looks at this scale.

aarongustafson commented 5 years ago

Would be interesting. But do you have data on the actual start_url's used? If so, I would be happy to see how it looks at this scale.

I believe we have the full manifests. I need to verify though. It may be a week or so before I get word.

lknik commented 5 years ago

@aarongustafson well a package with loads of manifests and their url's would be a nice present ;)

npdoty commented 5 years ago

To build on @dominickng's suggestion, I think one option is to explicitly consider the start_url to be like any other local state. We know from our experience with evercookies that all local state needs to be cleared simultaneously in order to provide the user anything like what they're trying to ask for. That would suggest that when you "clear local state" on an "installed" web app, that you re-load the app entirely. This could be a UX challenge for implementers, but it shouldn't be entirely impossible in the Web context: you'd effectively send the user back to the page in the browser (with a clean cookie jar) and trigger the 'installation' again, which could be pretty seamless. If the PWA wants the user signed in before they 'install', then they'd get back to the sign-in page, which is what the user should be seeing if they tried to clear local state and the app contains an authentication cookie or similar.

Alternatively, we could tell sites that they shouldn't use manifest data that is customized to the user in any way, and start work on the challenging problem of automatically identifying sites that are customizing start_url (or perhaps other parameters) and reporting them / blocking them so that users can be warned.

See this guidance on identifying local state mechanisms so that they can be cleared: https://www.w3.org/TR/fingerprinting-guidance/#clearing-all-local-state

marcoscaceres commented 5 years ago

To build on @dominickng's suggestion, I think one option is to explicitly consider the start_url to be like any other local state.

Depends on a few things:

if the browser creates a shortcut on the desktop or whatever, then the browser might no longer be in control of the short cut (e.g., the user moves the shortcut to another folder).
The case of adding to home screen is akin to bookmarking: basically when you bookmark something, a user may be inevitably capturing their own unique identifier (e.g., "https://example.com/article?userid=123")... I know, this is a "what-about-ism", but it holds because most of us have these kinds of bookmarks in our browsers.

We know from our experience with evercookies that all local state needs to be cleared simultaneously in order to provide the user anything like what they're trying to ask for.

I agree - but this is different from a malicious supercookie (e.g., HSTS). There is an explicit opt-in to install a web application, and it includes the possibility to inspect the URL. Granted, examining the URL is useless for 99% of people. The mitigation strategy is really to just delete the shortcut to the PWA.

That would suggest that when you "clear local state" on an "installed" web app, that you re-load the app entirely.

Generally yes, I agree - and the data purge should be supported... however, going back to the supercookie attack, I don't see how it helps when the start URL is: "http://example.com?user=123"... you can just restore user123's cookies/state from the server when they open the app.

Alternatively, we could tell sites that they shouldn't use manifest data that is customized to the user in any way,

We can amend: https://www.w3.org/TR/appmanifest/#privacy-consideration-start_url-tracking

and start work on the challenging problem of automatically identifying sites that are customizing start_url (or perhaps other parameters) and reporting them / blocking them so that users can be warned.

Sure.

npdoty commented 5 years ago

Depends on a few things:

if the browser creates a shortcut on the desktop or whatever, then the browser might no longer be in control of the short cut (e.g., the user moves the shortcut to another folder).

Well, even if the user has moved the shortcut, if the user initiates clearing local state when they are active in that browser-run app, then there would need to be some control over it, right? I could see that that might involve coordination between the browser and OS. (Is uninstall functionality included in the spec?)

The case of adding to home screen is akin to bookmarking: basically when you bookmark something, a user may be inevitably capturing their own unique identifier (e.g., "https://example.com/article?userid=123")... I know, this is a "what-about-ism", but it holds because most of us have these kinds of bookmarks in our browsers.

I don't think that's what-about-ism at all; that seems like a totally plausible use case. I personally would like to be able to log in to my email, then click the 'make an app' button (which stores a bearer token or something) and then be logged in whenever I click on my new 'app'. The challenge, in that situation, is either to indicate to the user that clearing local state isn't possible or, when the user does choose to clear state, to get them back to the site with state cleared in such a way that they have to choose to re-create the state themselves (by logging in again, say) before 'installing'.

We know from our experience with evercookies that all local state needs to be cleared simultaneously in order to provide the user anything like what they're trying to ask for.

I agree - but this is different from a malicious supercookie (e.g., HSTS). There is an explicit opt-in to install a web application, and it includes the possibility to inspect the URL. Granted, examining the URL is useless for 99% of people. The mitigation strategy is really to just delete the shortcut to the PWA.

That would suggest that when you "clear local state" on an "installed" web app, that you re-load the app entirely.

Generally yes, I agree - and the data purge should be supported... however, going back to the supercookie attack, I don't see how it helps when the start URL is: "http://example.com?user=123"... you can just restore user123's cookies/state from the server when they open the app.

Right, that's exactly the attack that we're talking about. If you re-load from the same start_url, then it isn't like re-setting the state. If the user goes back to the bare domain with an empty cookie store and gets the install app workflow again, then that wouldn't be a problem.
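The respawn scenario can be made concrete with a small simulation. This is only a sketch: `TrackingServer` is a hypothetical stand-in for a site's backend, the `user` parameter comes from the example URL quoted above, and the cookie jar is modeled as a plain dictionary.

```python
from urllib.parse import parse_qs, urlsplit


class TrackingServer:
    """Hypothetical server that keys stored sessions by the ID in the URL."""

    def __init__(self):
        self.sessions = {}

    def handle_launch(self, url, cookie_jar):
        user_ids = parse_qs(urlsplit(url).query).get("user", [])
        if not user_ids:
            return  # no ID in the URL, so there is nothing to key on
        user_id = user_ids[0]
        if "session" in cookie_jar:
            # Remember the session under the ID embedded in the start_url.
            self.sessions[user_id] = cookie_jar["session"]
        elif user_id in self.sessions:
            # Cookies were cleared, but the URL still identifies the user.
            cookie_jar["session"] = self.sessions[user_id]


server = TrackingServer()
jar = {"session": "abc"}
server.handle_launch("http://example.com?user=123", jar)
jar.clear()  # the user clears site data
server.handle_launch("http://example.com?user=123", jar)  # same start_url
```

After the second launch the jar holds the old session again, whereas launching from the bare `http://example.com` with an empty jar would leave it empty.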

Alternatively, we could tell sites that they shouldn't use manifest data that is customized to the user in any way,

We can amend: https://www.w3.org/TR/appmanifest/#privacy-consideration-start_url-tracking

In that case, we'd be saying that the feature doesn't support a bookmark/manifest-install from example.com/article?user=123 and we would want browsers, researchers and others to try to detect that and block it. But is that what we're hoping for? Is it one manifest/app per domain with no customized user state? Or do apps themselves have customized user state in start_url and we need to make that clear to users when they ask to clear local state? I think you could still have the logged-in-webmail experience even if the URL doesn't include user state, if the cookie jar just gets frozen with the app when it's installed, and then log-in and log-out and clear state functionality would all work as users expect.

lknik commented 5 years ago

suggest that when you "clear local state" on an "installed" web app, that you re-load the app entirely. This could be a UX challenge for implementers, but it shouldn't be entirely impossible in the Web context: you'd effectively send the user back to the page in the browser (with a clean cookie jar) and trigger the 'installation' again, which could be pretty seamless. If the PWA wants the user signed in before they 'install', then they'd get back to the sign-in page, which is what the user should be seeing if they tried to clear local state and the app contains an authentication cookie or similar.

Devil's advocate here. Let's assume the user is an avid PWA browser and has, like, 50-100 of these. Then he/she chooses "clear all private data" in the browser. Would that mean removing 50-100 apps, and requiring reinstalling/logging in, possibly reconfiguring? That would significantly degrade today's experience of clearing data.

Alternatively, we could tell sites that they shouldn't use manifest data that is customized to the user in any way, and start work on the challenging problem of automatically identifying sites that are customizing start_url (or perhaps other parameters) and reporting them / blocking them so that users can be warned.

Thanks for the lengthy reply. I wonder if in the end we won't end up in merging the two anyway (some browser/UI change; indication; researchers/browsers working on identifying misuses)

marcoscaceres commented 5 years ago

@npdoty wrote:

Is uninstall functionality included in the spec?

Yes, and it recommends purging storage, permissions, etc. https://www.w3.org/TR/appmanifest/#uninstallation

Right, that's exactly the attack that we're talking about. If you re-load from the same start_url, then it isn't like re-setting the state. If the user goes back to the bare domain with an empty cookie store and gets the install app workflow again, then that wouldn't be a problem.

I guess the core question is: is the start_url any more of a super cookie than creating a bookmark? Both require a user gesture to be saved/installed, both are inspectable, and both can be deleted.

I agree that there is a possibility for a browser to classify and treat a start_url as a tracker, but I don't feel this rises to the level of a super cookie. So, I'm not saying we shouldn't do anything here - but I don't think it's a dire situation.

@lknik wrote:

Devil's advocate here. Let's assume the user is an avid PWA browser and has, like, 50-100 of these. Then he/she chooses "clear all private data" in the browser. Would that mean removing 50-100 apps, and requiring reinstalling/logging in, possibly reconfiguring? That would significantly degrade today's experience of clearing data.

Sounds like a UX problem, tbh. I could "select all" apps and dump them in the trash... or select a bunch and dump them in the trash. Compare how Firefox and Chrome have "bookmark managers" that provide for sophisticated UIs for managing this problem. One could imagine the same for PWAs.

lknik commented 5 years ago

I guess the core question is: is the start_url any more of a super cookie than creating a bookmark? Both require a user gesture to be saved/installed, both are inspectable, and both can be deleted.

Can current pages create unique to-be-bookmarked pages and are they opened without displaying a URL?

I agree that there is a possibility for a browser to classify and treat a start_url as a tracker, but I don't feel this rises to the level of a super cookie. So, I'm not saying we shouldn't do anything here - but I don't think it's a dire situation.

Well it does allow cookie respawn.

marcoscaceres commented 5 years ago

Can current pages create unique to-be-bookmarked pages and are they opened without displaying a URL?

No, as the Fullscreen API requires a user gesture.

Well it does allow cookie respawn.

Yeah. 🤔

dominickng commented 5 years ago

I guess the core question is: is the start_url any more of a super cookie than creating a bookmark? Both require a user gesture to be saved/installed, both are inspectable, and both can be deleted.

Can current pages create unique to-be-bookmarked pages and are they opened without displaying a URL?

I agree that there is a possibility for a browser to classify and treat a start_url as a tracker, but I don't feel this rises to the level of a super cookie. So, I'm not saying we shouldn't do anything here - but I don't think it's a dire situation.

Well it does allow cookie respawn.

The bookmarks case is an interesting corollary - they offer pretty much the same capability to embed some identifier that's always present even after site data deletion.

To me, the only meaningful difference between bookmarks and installed web apps for this particular case is that installed web apps don't show the URL bar when they're opened from their shortcut. In the bookmarks case, relying on users noticing that there's a unique tracking token in the URL bar seems to effectively reduce to exactly the same problem here - relying on users to inspect the start URL to notice there's a unique tracking token. In both, clearing site data then using the shortcut to reopen the site could allow cookie respawn, and bookmarks have been around for a very long time with this.

We certainly could provide easier ways to inspect the start URL. Perhaps, for instance, we could show the location bar the first time you open an installed web app after clearing data. That seems to reduce back to precisely the guarantees offered by bookmarks in this situation?

lknik commented 5 years ago

To me, the only meaningful difference between bookmarks and installed web apps for this particular case is that installed web apps don't show the URL bar when they're opened from their shortcut. In the bookmarks case, relying on users noticing that there's a unique tracking token in the URL bar seems to effectively reduce to exactly the same problem here - relying on users to inspect the start URL to notice there's a unique tracking token. In both, clearing site data then using the shortcut to reopen the site could allow cookie respawn, and bookmarks have been around for a very long time with this.

I'm not quite sure if PWAs (installed apps) will end up being used in the same ways as bookmarked pages. I certainly use the two in different ways (and yes, I am biased); also of note: I do not think that opening in fullscreen is the normal way bookmarked pages operate. And as I said, I have not so far seen any site auto-generating tracking pages that I would feel compelled to install while perhaps also finding it useful to enable notifications/push.

We certainly could provide easier ways to inspect the start URL. Perhaps, for instance, we could show the location bar the first time you open an installed web app after clearing data. That seems to reduce back to precisely the guarantees offered by bookmarks in this situation?

That's the current recommendation, but: (1) it's not being followed [well, FF does it, somewhat], (2) I'm not sure users will get it, and I am not convinced there really is a 1:1 mapping with bookmarks, in principle.

g-ortuno commented 5 years ago

Curious why you think there is not a 1:1 mapping with bookmarks. Both seem vulnerable to this type of tracking in the same way.

npdoty commented 5 years ago

I think bookmarks are a fine analogy for the capability, but I think user confusion is likely to be very present here. If you're used to "apps" being self-contained software that you download from an app store, or from a web site, it might seem very surprising that state about you at the time that you indicated an interest in installing them is embedded in the app -- even after clicking a button like "clear local state".

Related: deleting a bookmark might be understood as a non-destructive action in a way that deleting an app might not be. One option is to frame these installed web apps as bookmark-like things, and UAs can tell users that to clear cookies for this app, we'll just delete it and open up the site's home page in your browser so you can access it again. Or, we can advise sites that apps shouldn't embed user-specific identifiers this way and try to take measures (probably measures outside of a single browser, and they'd be tricky, but possible through some observatory-like thing) to try to detect it and discourage it. But I get the impression we need to decide which is the goal.

I also wonder if some isolation of local state is important here. If I install an app that has an identifier embedded, maybe I can be taught that it's like going to a bookmark where they already know it's me, and the privacy surprise can be limited to those instances, since I'm specifically opening that app. But if the cookie jar is shared, then the risk is that cookies can be re-spawned in such a way that I'm tracked whenever an embedded resource from the same domain is included in any page that I subsequently browse. That's a common experience of Web browsing today, but it would be an extension if items presented to the user as different "apps" also had a shared state, and one that could be persistent just from having "installed" one of them. I think isolating state would be an advantage we could embed in to the design, and it would also substantially limit the risk of surprises from start_url identifiers.
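The isolation idea can be sketched as a cookie store keyed by browsing context as well as by domain. Everything here is illustrative: the class name, the context labels, and the dictionary-based storage are assumptions, not a description of how any browser implements partitioning.

```python
class PartitionedCookieStore:
    """Cookie store keyed by (context, domain) rather than by domain alone."""

    def __init__(self):
        self._jars = {}

    def set(self, context, domain, name, value):
        # Each (context, domain) pair gets its own independent jar.
        self._jars.setdefault((context, domain), {})[name] = value

    def get(self, context, domain, name):
        return self._jars.get((context, domain), {}).get(name)


store = PartitionedCookieStore()
# A cookie respawned inside the installed app stays inside that app.
store.set("app:example", "example.com", "uid", "123")
```

In this model, a lookup for `uid` on `example.com` from the regular browser context returns nothing, so an identifier respawned inside the app cannot follow the user into ordinary browsing.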

lknik commented 5 years ago

@g-ortuno

This is a matter of security/privacy UX, though heavy on the technical side.

While both seem to be vulnerable to this kind of tracking (whether there is a standard way of triggering add-bookmark in a site-controlled manner that is streamlined with the browser UI is secondary to this reply), and there are similarities between bookmarks and start_url, start_url is in my view part of something bigger (otherwise we would not need it and bookmarks would suffice). If I understand it right, this bigger thing (PWA) is a new experience of web browsing, and I wonder whether current users are accustomed to it. So it boils down to a qualitative change that touches the browsing experience. On a more technical level:

bookmarks and manifests are consumed (added) differently
manifests can deliver a packaged site that is full screen, can well mimic a locally installed application, and so makes the user perceive it differently from a site that was added as a bookmark
while I can foresee how bookmarks will develop in future (I expect no changes to happen), I am not so sure about PWAs, as they seem to be in motion and really benefit from the new platform additions (Push, Notification, to say the least)

@npdoty

I think isolating state would be an advantage we could embed in to the design, and it would also substantially limit the risk of surprises from start_url identifiers.

Thanks for the lengthy response; agreed and +1'd. My 5 cents: iOS currently isolates PWAs, so the attack/technique/trick I describe above does not work on iOS (i.e. the UID works, but there is no cookie respawn). Whether it's due to deliberate planning (@othermaciej?) or sheer luck is, again, secondary here. But that's quite interesting.

aarongustafson commented 5 years ago

@aarongustafson well a package with loads of manifests and their url's would be a nice present ;)

@lknik We are testing this against our data now. It takes a bit of time, but I hope to have results in the next week or two.

aarongustafson commented 5 years ago

I agree that there is a possibility for a browser to classify and treat a start_url as a tracker, but I don't feel this rises to the level of a super cookie. So, I'm not saying we shouldn't do anything here - but I don't think it's a dire situation.

We finished our crawl and I have data, @lknik. We crawled 65k+ URLs and collected 27k+ manifest files. Of those, < 2.5k included a query string in the manifest. I did a manual run-through of that list and to my eyes every instance was tracking the source for analytics purposes (e.g., utm_source=homescreen or similar). This is not to say that this isn’t a potential abuse vector (it is every bit as much as bookmarks are), but this does not appear to be an issue currently.
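A manual review like this could be pre-sorted with a crude heuristic such as the sketch below. The rules are assumptions made for illustration, not what was used in the crawl: parameters whose keys start with `utm_` are treated as analytics, and any value of 16 or more hex digits or hyphens (which covers UUIDs) is flagged as ID-like. It will produce both false positives and false negatives.

```python
import re
from urllib.parse import parse_qsl, urlsplit

# Assumption: long hex-and-hyphen strings (including UUIDs) look like identifiers.
ID_LIKE = re.compile(r"^[0-9a-fA-F-]{16,}$")


def classify_start_url(start_url: str) -> str:
    """Roughly bucket a start_url by what its query string appears to carry."""
    pairs = parse_qsl(urlsplit(start_url).query, keep_blank_values=True)
    if not pairs:
        return "no-query"
    if any(ID_LIKE.match(value) for _, value in pairs):
        return "id-like"
    if all(key.startswith("utm_") for key, _ in pairs):
        return "analytics"
    return "other"
```

For example, `/?utm_source=homescreen` lands in the analytics bucket, while a start_url carrying a UUID is flagged for a closer look.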

Here is the data as CSV if you’re interested: https://drive.google.com/file/d/1xM5781ufP7kwB_kX6tGzQ-cd71ubnm4U/view?usp=sharing

Perhaps we can find some way to strike a balance between useful analytics tracking and privacy-violation? I had considered the possibility of disallowing manifests to be requested (or installed) with a query string (thereby disallowing dynamic manifest generation which would likely be a common implementation for tracking), but there are valid reasons you might want/need that. Here are two off the top of my head:

It’s currently one of the only ways to offer dynamic language support within a manifest and
SaaS apps can use it to customize the app icon and other features per tenant.

I’m not sure what the right answer is here, but it seems we do have some time to continue to consider different options.

Another complication (which I don’t remember being discussed above) is that some implementations of installed PWAs share data (cookies, cache, etc.) between PWA instances and the browser that installed them. Windows Store-installed PWAs are sandboxed, but I think every Chromium-based browser—at least currently—has a shared data pool. Would clearing all temporary and persistent data from their browser when they uninstall/reset an "app" be what users would/should expect? It seems if we go that route, we would need to include some strong language to implementors that they need to make users aware of such implications.

lknik commented 5 years ago

I agree that there is a possibility for a browser to classify and treat a start_url as a tracker, but I don't feel this rises to the level of a super cookie. So, I'm not saying we shouldn't do anything here - but I don't think it's a dire situation.

We finished our crawl and I have data, @lknik. We crawled 65k+ URLs and collected 27k+ manifest files. Of those, < 2.5k included a query string in the manifest. I did a manual run-through of that list and to my eyes every instance was tracking the source for analytics purposes (e.g., utm_source=homescreen or similar). This is not to say that this isn’t a potential abuse vector (it is every bit as much as bookmarks are), but this does not appear to be an issue currently.

It's consistent with my tests then (only one case had a quasi-ID thing, but not for tracking).

I’m not sure what the right answer is here, but it seems we do have some time to continue to consider different options.

I agree there's time to work it out.

Would clearing all temporary and persistent data from their browser when they uninstall/reset an "app" be what users would/should expect? It seems if we go that route, we would need to include some strong language to implementors that they need to make users aware of such implications.

Yes, looks like something that needs to be written down in the spec.

mgiuca commented 5 years ago

Just caught up on this: I don't think it's feasible to do any manipulation of the query for the start_url. It is completely legitimate for the content of the page to be based on the query, and reasonable (though inadvisable) for the "home screen" of an app to be at a particular query, where deleting the query string takes you to a different page.

In fact we are having a parallel discussion about extending service worker and manifest scope to allow query parameter matching, because some websites actually distinguish different "apps" based on the query string. If at some point in the future, the manifest scope will be distinguishable by query parameter, it will be necessary for start_url to not ignore the query string (since start_url must be within scope of scope).

From a privacy standpoint, there is no advantage to removing the query string, since tracking information can easily be encoded in the path. It is inherent in giving sites the ability to (at the user's choice) save a URL on the user's machine that they can re-open later, that they can encode user-identifying information in the URL.
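For reference, the "within scope" relationship mentioned above can be sketched as a same-origin check plus a path-prefix check, which is how I understand the manifest spec to define it at the time of this discussion. The helper below is a simplification: it compares scheme and authority literally and does not normalize default ports or percent-encoding.

```python
from urllib.parse import urlsplit


def is_within_scope(target_url: str, scope_url: str) -> bool:
    """Return True if target is same-origin with scope and its path starts with the scope path."""
    target, scope = urlsplit(target_url), urlsplit(scope_url)
    same_origin = (target.scheme, target.netloc) == (scope.scheme, scope.netloc)
    # The query string plays no part in the comparison.
    return same_origin and target.path.startswith(scope.path)
```

Because the query is ignored, `https://example.github.io/app/index.html?tab=1` is within the scope `https://example.github.io/app/`. Matching on query parameters would have to change this check.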

aarongustafson commented 5 years ago

@mgiuca Agreed. Excellent summary.

aarongustafson commented 5 years ago

I vote we close this issue. Anyone disagree?

othermaciej commented 5 years ago

start_url is a potential tracking vector whether or not it contains a query string. It should at least be mentioned in Privacy Considerations if it isn't already.

It is distinct from bookmarks, because there's an easy opportunity to create a unique ID each time manifest.json is requested. To achieve the same for bookmarks would require redirects to ensure the user ID is visibly present in every URL on the site. It's much easier for a problem like that to be noticed by at least some thoughtful users and to be raised to public awareness.

There are possible mitigations other than entirely removing start_url. For example, manifest.json could be fetched on a caching proxy server to prevent stuffing a unique ID in it. (This specific solution creates a new privacy risk that the proxy operator could see the user's browsing; there are likely privacy-preserving solutions to this using crypto or bucketing, similar to the way safe browsing databases work.)
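One way to read the coalescing idea is as an agreement check: a start_url is accepted only if enough independent fetches of the same manifest URL returned the same value. The sketch below is an interpretation made for illustration, not the design proposed above. The function name and the `min_agreement` threshold are hypothetical, and it leaves the proxy-privacy question untouched.

```python
from collections import Counter


def coalesced_start_url(observed_start_urls, min_agreement):
    """Return the start_url seen in at least min_agreement fetches, else None."""
    if not observed_start_urls:
        return None
    url, count = Counter(observed_start_urls).most_common(1)[0]
    return url if count >= min_agreement else None
```

A manifest that hands every fetch a different `?u=...` value never reaches the threshold, while a shared `?utm_source=homescreen` does. The sketch does nothing about a site that makes the manifest URL itself unique per user.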

I think this problem should be taken seriously. Tracking via URL parameters is an increasingly common technique on the web in general, to the point that WebKit deployed active mitigations for it. If this technique hasn't made it to PWAs yet, that is only good fortune, not a trait to be relied on.

Further, just because there are valid use cases for start_url does not mean that the privacy issues should be ignored. After all, cookies have valid use cases too.

dominickng commented 5 years ago

Thanks for the considered thoughts @othermaciej. Note that this issue is already explicitly called out in the spec with recommendations to implementers - it certainly isn't being ignored.

However, I'm not really sure we can say that "tracking via bookmarks" is more likely to be noticed or easier than using the start_url given the flexibility of URL processing. Specifically, the identifying token need not be consistent - on each navigation, new, random, but still uniquely identifying tokens could be generated for every single in-scope link to create obfuscation - and these IDs don't have to just be in the query string or fragment identifier, as @mgiuca pointed out.

Concrete example: Stack Overflow's canonical question URLs are of the form https://stackoverflow.com/questions/15476907/recommended-usage-of-stdunique-ptr. However, the URL still resolves correctly with any arbitrary text after the question ID, e.g. https://stackoverflow.com/questions/15476907/id-goes-here. On each navigation, that text can be changed in a way that still uniquely identifies the user.
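To make the point concrete, the sketch below resolves a question from the numeric path segment alone and ignores whatever follows it. The helper is hypothetical and only mirrors the URL shape shown above; it is not Stack Overflow's actual routing code.

```python
from urllib.parse import urlsplit


def question_id(url: str):
    """Return the numeric question ID from a /questions/<id>/<slug> path, else None."""
    segments = urlsplit(url).path.strip("/").split("/")
    if len(segments) >= 2 and segments[0] == "questions" and segments[1].isdigit():
        return segments[1]
    return None
```

Both example URLs above map to the same ID, so the trailing segment is free to carry any per-user token without affecting which page is served.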

Indeed, Stack Overflow also generates links where the user ID is appended to the end of the question ID (see https://meta.stackexchange.com/questions/164194/is-there-a-way-to-shorten-stack-overflow-urls), which is a case where the identifying token is even more difficult to discern through observation.

In general, one challenge with mitigations here is that they would be straightforward to overcome for sufficiently motivated parties - at the expense of eliminating very legitimate use cases.

othermaciej commented 5 years ago

In general, one challenge with mitigations here is that they would be straightforward to overcome for sufficiently motivated parties - at the expense of eliminating very legitimate use cases.

@dominickng I am very much aware that parts of the URL other than query can be used to smuggle a tracking ID. Note that the two mitigations I mentioned (remove start_url entirely, coalesced loads w/ caching proxy or bucketing or the like) do not leave the path attack open, and one doesn’t even block any use cases.

mgiuca commented 5 years ago

Neither of those mitigations is satisfactory or sufficient.

remove start_url entirely

What would be the start page for the app? /? That prevents having multiple apps on an origin (so you couldn't serve a PWA on github.io, for instance, because you can't directly host content at the root directory of the origin).

Even if we were willing to remove start_url, I believe a determined site could still stuff user-identifying information into a suborigin (creating a unique origin for each user).

For example, manifest.json could be fetched on a caching proxy server to prevent stuffing a unique ID in it.

This requires a huge investment in infrastructure by browser manufacturers, and adds new privacy problems with even more complex solutions required (as you touched upon previously). Even ignoring that fact, how would it help avoid fingerprinting? Instead of putting the user-identifying token in the start_url, you now put it in the manifest URL. Now the caching proxy is serving a unique manifest URL to each user.

This also prevents legitimate server-side customization of the manifest based on the user's request. For example, some sites deliver a different manifest based on the user agent (sometimes necessary because the manifest is declarative, so client-side customization can't be done). And it also creates the usual headaches associated with caching and server-side changes. If the cache has too long of an expiry time, updates to the manifest may be delayed from reaching users. The only practical solution is for the proxy to respect HTTP caching headers to expire the cache and go back to the server, but then the server is in control and can circumvent the cache by having an extremely short cache time.

Basically, this is fairly infeasible as it would break basic assumptions of how requests work, and as far as I can see, it wouldn't be able to stop fingerprinting anyway.
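
A rough sketch of the manifest-URL point, assuming a proxy that caches one response per manifest URL. The `ManifestCache` class, the `toy_origin` function, and the URLs are all hypothetical; real proxies and sites would look different.

```python
from typing import Callable, Dict


class ManifestCache:
    """Toy stand-in for a caching proxy: one stored response per manifest URL."""

    def __init__(self, fetch_from_origin: Callable[[str], str]) -> None:
        self._fetch = fetch_from_origin
        self._store: Dict[str, str] = {}
        self.origin_fetches = 0

    def get(self, manifest_url: str) -> str:
        # Only go back to the origin for a URL that has not been seen before.
        if manifest_url not in self._store:
            self.origin_fetches += 1
            self._store[manifest_url] = self._fetch(manifest_url)
        return self._store[manifest_url]


def toy_origin(manifest_url: str) -> str:
    # Hypothetical site: copies the token from the manifest file name into start_url.
    file_name = manifest_url.rsplit("/", 1)[-1]
    token = file_name.split(".")[0]
    return '{"start_url": "/?id=%s"}' % token


cache = ManifestCache(toy_origin)
manifest_a = cache.get("https://example.com/manifests/user-a.json")
manifest_b = cache.get("https://example.com/manifests/user-b.json")
manifest_a_again = cache.get("https://example.com/manifests/user-a.json")
```

The cache works as designed (the repeated request is served from the store), yet each user still ends up with a distinct manifest because each was handed a distinct manifest URL in the first place.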

npdoty commented 5 years ago

@aarongustafson I would oppose closing this issue as it doesn't seem like we've addressed any of the concerns raised in 2015. How should clearing local state interact with installed PWAs?

It seems like either:

1. we indicate in the spec that start_url should not be used for user-specific data storage, and then browsers and researchers can work on means to detect and block such usage (proxies, URL modification, removing start_url, observatories being some brainstormed suggestions raised in this thread); or,
2. we indicate that start_url and similar mechanisms can have user-specific data, and that clearing local storage should clear that URL, and provide means for re-triggering installation of apps when data is cleared, or require isolating all data between each installed PWA and the browser; or,
3. we accept that virtually every user who installs a PWA from a service that also provides online tracking will have a permanent re-spawning cookie from that service whenever they look at their email or social network (unless every such user has more advanced heuristics than browser vendors and visually inspects all start URLs and determines whether they contain tracking identifiers and manually changes them!), and that that cookie will re-appear even after the user tries to clear all local state on their browser or device. This looks like the status quo, and I don't think it's compatible with users having either transparency or control.

Not all mitigations need to be absolute solutions. Throwing up our hands and accepting permanent, unclearable identifiers is not a sound response to some mitigations being imperfect.

lknik commented 5 years ago

we accept that virtually every user who installs a PWA from a service that also provides online tracking will have a permanent re-spawning cookie from that service whenever they look at their email or social network

I don't think point 3 is a particularly sustainable solution, even if this is the case now.

I would prefer that the issue remain open until we make an actual choice, following a review (perhaps even an incremental TAG design review?). Thanks Nick for distilling it so well.

mgiuca commented 5 years ago

Thanks @npdoty for the clear breakdown of options.

I think 2 is reasonable, and something we've been thinking about in a wider context (e.g. on Web Share Target) is that the "installedness" of a PWA should be tied to the local storage of the origin, just like service workers, indexed DB, cookies, etc. If you install an app, you should expect its data to remain persisted on the device. By corollary, if you clear the data associated with an app, you would expect the app to be uninstalled.

This should happen in both directions: if you uninstall an app, we should ask if you want to clear origin storage (I believe we're starting to do this in Chrome).

I am happy to add language to the manifest recommending that user agents tie installedness together with the app. This can't be a requirement, though, since clearing of user data is a UI feature of browsers, not something directly speccable. Also, it may not be possible to do so programmatically on some platforms (I think on Android, if you clear site data from inside the browser, we don't actually have the ability to delete home screen shortcuts).
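
As a rough illustration of tying installedness to origin storage in both directions, here is a toy model. The `ToyUserAgent` class and the app.example URLs are invented, and keying installs by origin is a simplification (it ignores multiple apps per origin); no browser is claimed to work this way.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ToyUserAgent:
    """Toy model only; not how any real browser stores this state."""

    site_data: Dict[str, Dict[str, str]] = field(default_factory=dict)
    installed_apps: Dict[str, str] = field(default_factory=dict)  # origin -> start_url

    def install(self, origin: str, start_url: str) -> None:
        self.installed_apps[origin] = start_url
        self.site_data.setdefault(origin, {})

    def clear_site_data(self, origin: str) -> None:
        # Clearing the origin's data also drops the install record and its start_url.
        self.site_data.pop(origin, None)
        self.installed_apps.pop(origin, None)

    def uninstall(self, origin: str, also_clear_site_data: bool) -> None:
        # The flag stands in for the user's answer to a "clear storage too?" prompt.
        self.installed_apps.pop(origin, None)
        if also_clear_site_data:
            self.site_data.pop(origin, None)


ua = ToyUserAgent()
ua.install("https://app.example", "https://app.example/?id=u-8f3a2c")
ua.clear_site_data("https://app.example")

other_ua = ToyUserAgent()
other_ua.install("https://app.example", "https://app.example/")
other_ua.uninstall("https://app.example", also_clear_site_data=False)
```

In this sketch, clearing site data leaves no stored start_url behind, while uninstalling without clearing leaves the origin's data in place, matching the two directions discussed above.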

Number 1 is not feasible. We can't say "should not be used for user-specific data storage" because the specification controls the behaviour of user agents, not sites. And as I said previously, interfering with the transmission of the manifest is a minefield.

alancutter commented 5 years ago

clearing local storage should clear that URL, and provide means for re-triggering installation of apps

I'm not sure how re-installation would work if we clear all site provided URLs. Where would we get the start_url from?

aarongustafson commented 5 years ago

This should happen in both directions: if you uninstall an app, we should ask if you want to clear origin storage (I believe we're starting to do this in Chrome).

100% on this (something we’ve been discussing as well). Where it gets tricky, however, is if a user is browsing an installed PWA within the context of a browser tab and then decides to clear the site data. I feel like there should be some sort of prompt to ask if they wish to uninstall the app as well. They may not want to. For instance, sometimes cookie issues can cause a site to get into a broken state and resetting everything is more desirable than a re-install.

npdoty commented 5 years ago

I feel like there should be some sort of prompt to ask if they wish to uninstall the app as well. They may not want to. For instance, sometimes cookie issues can cause a site to get into a broken state and resetting everything is more desirable than a re-install.

That is indeed tricky. Does a user understand the difference between "resetting everything" and "a re-install"? I'm literally not sure I do, especially if (as in Option 2) we are going to give explicit guidance that the installation process is likely to include personalized state. I think user options to "clear local state" that don't clear known state mechanisms are a dangerous direction. That's why I think we should be prepared to give an answer for how an implementer can clear all local state, including uninstalling or triggering the re-install for an app.

npdoty commented 5 years ago

@mgiuca

Number 1 is not feasible. We can't say "should not be used for user-specific data storage" because the specification controls the behaviour of user agents, not sites. And as I said previously, interfering with the transmission of the manifest is a minefield.

Just to be clear, I wasn't suggesting in Option 1 that including a spec requirement that sites not keep personalized data in start_url was a sufficient mechanism; I agree that spec requirements are neither read nor followed by sites. I meant instead that sites not keeping personalized data in the URL/manifest would be a shared stated goal, and then UAs (and others -- researchers could help too) would use other mechanisms to encourage or enforce that goal.

(We still might not prefer Option 1 because we might conclude that UAs and others can't feasibly enforce it at all. I don't have a strong opinion on whether that's feasible. I just mean that it would be on UAs and other members of the community to try to make Option 1 happen, if that's the preference.)

npdoty commented 5 years ago

@alancutter

clearing local storage should clear that URL, and provide means for re-triggering installation of apps

I'm not sure how re-installation would work if we clear all site provided URLs. Where would we get the start_url from?

Yeah, I don't think it's easy, but better to figure that out now. Presumably send the user back to the bare origin (maybe even with a well-known parameter regarding re-install), if we think the bare origin is unlikely to include user-specific state.

In the analogy to app stores, for example, a user can uninstall and then re-install an app and not assume it's already customized to them because an app in an app store is typically limited so that it's the same URL and same code for all users. Or at least, that's our typical assumption, because of the costs of going through app store review and the incentives for having many users of a single app. If Progressive Web Apps accrue reputation and reviews as well, then there would be a disincentive for app owners to make every one on its own domain (or even own URL) because it won't be able to aggregate reviews and installation counts.

dominickng commented 5 years ago

Unfortunately, the PWA may be scoped to a path (e.g. example.com/pwa), or it may be that multiple PWAs are hosted on different sub paths on the same bare origin, making a single well-known parameter insufficient for informing the site what to do if we've thrown away the original start_url.

I've often thought that we should have required one-PWA-per-origin, but this would be a very challenging requirement to impose now.
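
A small sketch of why the bare origin is ambiguous when apps are scoped to paths. It uses a simplified within-scope check (same scheme and host, path prefix match); the `within_scope` helper, the example.com scopes, and the `reinstall` parameter are assumptions for illustration only.

```python
from typing import Dict, List
from urllib.parse import urlsplit


def within_scope(url: str, scope: str) -> bool:
    """Simplified check: same scheme and host, and the path starts with the scope path."""
    u, s = urlsplit(url), urlsplit(scope)
    return (u.scheme, u.netloc) == (s.scheme, s.netloc) and u.path.startswith(s.path)


# Two hypothetical PWAs hosted on different sub-paths of one origin (scope -> name).
installed_scopes: Dict[str, str] = {
    "https://example.com/mail/": "Mail",
    "https://example.com/calendar/": "Calendar",
}

# Hypothetical "re-install" landing URL on the bare origin.
bare_origin_url = "https://example.com/?reinstall=1"

apps_matching_bare_origin: List[str] = [
    name for scope, name in installed_scopes.items()
    if within_scope(bare_origin_url, scope)
]
apps_matching_mail_url: List[str] = [
    name for scope, name in installed_scopes.items()
    if within_scope("https://example.com/mail/inbox", scope)
]
```

Under these assumptions the bare-origin URL falls inside neither app's scope, so the site cannot tell from it which app the user agent wants to re-install.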

marcoscaceres commented 4 years ago

It occurred to me that if an application is "installed", and so long as we consider an "installed" application MUST have separate storage from regular web pages, then the problem may be more manageable: once an application is "installed", clearing storage from the browser does not affect the storage of the installed application.

Also, then navigating the installed web application doesn't affect the storage of the browser: i.e., no cookies are set in the browser.

Further, opening a site in the browser does not open the installed application (hence, the "supercookie" problem doesn't apply to the browser, only to the installed application).

Like with native applications, the way to clear the storage of an "installed" web application is to either:

1. uninstall it.
2. use the OS-level storage management to clear its stored data.

2 above would recreate the super-cookie for the installed application via the start_url (or even url of a shortcut object), but at least it doesn't affect the browser's storage.

dominickng commented 4 years ago

Mandating separate storage at the spec level seems a pretty heavy-handed way of addressing this issue. It would cause a number of other (not necessarily desirable) ramifications that greatly affect the utility of web apps. For instance:

- let's say you use a web app for a while in the browser, and then you install it. After installation, the web app loses all of its existing local state, including cookies, local storage, service workers, offline cache, etc.
  - there's no sensible way to migrate everything to the new storage unless you copy the entire ETLD+1's cookies and the whole origin's worth of data, which may include way more than the web app actually owns
  - assuming you've figured out that copy manoeuvre, do you then delete the stuff you copied?
- OAuth is broken (or otherwise severely crippled) as the 3rd party OAuth providers won't be logged-in within the new storage partition.
- what backing storage should off-scope browsing within the installed web app use? The browser storage? The web app's storage? Both have undesirable consequences for reasoning about where the user's data is. If you use the browser storage, you then break OAuth even more (you'll be logged into the browser storage but not in the web app storage).

This isn't to say that separate storage for installed PWAs is a bad idea - we've debated applying this policy in Chrome for years now as a way of addressing privacy concerns. However, the questionable user and developer consequences are why we think separate storage for browser-installed web apps is not really viable.

benfrancis commented 4 years ago

Mandating separate storage at the spec level seems a pretty heavy-handed way of addressing this issue. It would cause a number of other (not necessarily desirable) ramifications that greatly affect the utility of web apps.

+1. Firefox OS had a separate "data jar" per installed application and to cut a long story short it broke the user experience of web content in lots of ways, some unexpected. The authentication use case mentioned above being one of them. I would recommend against enforcing that in the specification.

If implementations supported "deep linking" then the problem wouldn't be quite so bad, but currently:

- Navigating to out-of-scope content from within an application context usually stays within the application context (or a special popover style window)
- Navigating to in-scope content from an external browsing context does not get redirected to the installed application context

If every application context has its own data jar, both of the above serve to fragment local storage across multiple jars. This has the side effect that the user is repeatedly forced to re-authenticate to access the same content in different contexts for example.
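
A toy sketch of that fragmentation, assuming one data jar per context. The `DataJars` class, the context names, and the accounts.example origin are invented for illustration and do not describe Firefox OS internals.

```python
from typing import Dict


class DataJars:
    """Toy model: each context (browser, or one installed app) has its own cookie jar."""

    def __init__(self) -> None:
        self._jars: Dict[str, Dict[str, str]] = {}

    def log_in(self, context: str, origin: str, session: str) -> None:
        # Store the session only in the jar of the context where the login happened.
        self._jars.setdefault(context, {})[origin] = session

    def is_logged_in(self, context: str, origin: str) -> bool:
        return origin in self._jars.get(context, {})


jars = DataJars()
# The user signs in to a hypothetical identity provider in a browser tab.
jars.log_in("browser", "https://accounts.example", "session-1")

logged_in_in_browser = jars.is_logged_in("browser", "https://accounts.example")
logged_in_in_app = jars.is_logged_in("app:calendar", "https://accounts.example")
```

In this model the same identity provider is signed in within the browser jar but not within the app's jar, so the user would be asked to authenticate again there.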

I agree that the fingerprinting problem applies just as much to bookmarks as it does to "installing" a web application. Do any browsers currently try to strip unique identifiers from bookmarked URLs?

npdoty commented 4 years ago

@dominickng @benfrancis Is there documentation or a little more detail on the authentication (or other) use cases that are broken by using a separate cookie jar for an installed web app's origin? I fully believe those problems exist, but I could benefit from learning more. In particular, I thought the OAuth dance was specifically designed so that the user authenticates with the authenticating site in a browser navigated to the authenticating site rather than through shared cookies.

Are implementations of installed web apps currently staying in scope when the user browses to a link outside the app's scope? That seems surprising to me as a user, and I'd be curious to learn the motivation for that design choice. Why not just have an installed web app that operates in its scope and when users click a link to another origin it opens in the user's web browser, like with other installed apps?

Maybe we're getting confused by the definition of separate storage. It seems like one feasible approach that @marcoscaceres may be describing is that at install-time data from the origin is copied (or started fresh) to a new cookie jar for the installed web app, but that if you subsequently clear all data from that origin in your browser then you're still logged in in the app but not in the browser, even after you open the app again. This may be the current implementation in mobile Safari/iOS.

lknik commented 4 years ago

I also believe exploring "separate storage" may be worthwhile. Like @npdoty suggests, why wouldn't a clicked link simply open in a "separate window" (would that be supported on mobiles, though, in the current way PWAs are implemented?).

Additionally, perhaps it makes sense to update the security & privacy document and reflect it with the knowledge in this thread, unless you think it's too early for that?

benfrancis commented 4 years ago

@npdoty wrote:

Is there documentation or a little more detail on the authentication (or other) use cases that are broken by using a separate cookie jar for an installed web app's origin?

There's discussion on this spread over many issues over a number of years, e.g. https://github.com/w3c/manifest/pull/701

The classic example is something like calendar.google.com redirecting to accounts.google.com and then back again.

Are implementations of installed web apps currently staying in scope when the user browses to a link outside the app's scope? That seems surprising to me as a user, and I'd be curious to learn the motivation for that design choice. Why not just have an installed web app that operates in its scope and when users click a link to another origin it opens in the user's web browser, like with other installed apps?

I agree with you, but it's not as simple as it may first appear. I just wrote some comments related to this here.

w3c / manifest

Privacy Review: handle start_url tracking #399