Privacy Review: handle start_url tracking

mounirlamouri commented 8 years ago

To summarize @npdoty comments in https://lists.w3.org/Archives/Public/public-privacy/2015JanMar/0117.html there are concerns about start_url containing special ids or simply something that hints that the user is coming from a homescreen application. This is fingerprinting/privacy sensitive information that the user might not be aware of.

I think the issue of people doing start_url: 'index.html?from_homescreen' is something we might want to mention in the spec but I don't think we should encourage browsers to prevent this because it is clearly something websites want for various reasons (mostly statistics).

However, I am concerned about having start_url: 'index.html?$GUID' because it is a way to track the user without them being aware of it. I'm not sure what the spec should say or what browsers could do. Maybe we could recommend showing the start_url to the user and allowing them to edit it?
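
To make the concern concrete, here is a minimal sketch of how a server could mint a different manifest for every install. The function name `build_manifest` and the `uid` query parameter are hypothetical, chosen only for illustration; nothing here is taken from a real site.

```python
import json
import uuid


def build_manifest(app_name: str) -> str:
    """Return a manifest whose start_url embeds a fresh random identifier."""
    # Hypothetical tracking parameter: every generated manifest gets its own value.
    tracking_id = uuid.uuid4().hex
    manifest = {
        "name": app_name,
        "start_url": f"index.html?uid={tracking_id}",
    }
    return json.dumps(manifest)


first = json.loads(build_manifest("Example App"))
second = json.loads(build_manifest("Example App"))
# Two installs of the "same" app end up launching with different URLs.
print(first["start_url"] != second["start_url"])
```

Because the identifier lives in the installed manifest rather than in cookies, it would survive a cookie clear, which is the tracking risk described above.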

dominickng commented 4 years ago

Why not just have an installed web app that operates in its scope and when users click a link to another origin it opens in the user's web browser, like with other installed apps?

@npdoty @lknik as @benfrancis mentioned, while this seems simple on the surface, in practice, it breaks lots of behaviour in subtle ways because web content fundamentally assumes same-browsing-context navigation unless explicitly requested otherwise. Moving to a separate storage partition is effectively akin to spawning a new browsing context for out-of-scope links.

For OAuth, the auth correctly happens in a browser window, but the original web app either isn't refocused, or isn't passed the token it needs to confirm that auth has happened correctly since it's passed in POST, through XHR, or some other slightly unusual configuration that worked when everything was in the same browsing context (with the end result being that you're not authed in the web app).

This doc collects a number of reported issues when Chrome previously implemented the link-opens-in-new-browsing-context behaviour. A couple of instructive (and lengthy) bug reports are here and here

We tried playing whack-a-bug for a while here in Chrome on Android, but eventually it became clear that the correct thing to do was preserve the expected navigation behaviour - if a link is clicked and the developer or user hasn't requested that it open a new browsing context, stay in the current browsing context.

npdoty commented 4 years ago

Thanks so much for the documentation and examples!

In some cases it looks like the navigation context change is the problem, because that doesn't necessarily allow redirecting back in the traditional OAuth case. It doesn't seem like cookie isolation for the app's origin would by itself be the issue. (It seems like iOS from 12.2 uses a shared navigation context -- a mini in-app browser, but separates storage in the PWA and its in-app browser from the Safari browser?)

dominickng commented 4 years ago

Safari's implementation results in the following scenario that I wrote about above:

let's say you use a web app for a while in the browser, and then you install it. After installation, the web app loses all of its existing local state, including cookies, local storage, service workers, offline cache, etc.

there's no sensible way to migrate everything to the new storage unless you copy the entire ETLD+1's cookies and the whole origin's worth of data, which may include way more than the web app actually owns.

As @benfrancis noted:

If every application context has its own data jar, both of the above serve to fragment local storage across multiple jars. This has the side effect that the user is repeatedly forced to re-authenticate to access the same content in different contexts for example.

If you want to avoid the fragmentation and need for reauthentication (i.e. use the browser's storage for off-origin navigation), you then run into the navigation context switch change problems.

marcoscaceres commented 4 years ago

let's say you use a web app for a while in the browser, and then you install it. After installation, the web app loses all of its existing local state, including cookies, local storage, service workers, offline cache, etc.

This is where implementations will differ. Like, personally, I think it's totally fine to just carve out a totally separate + fresh storage: browsers have gotten pretty good at credential management in general, so restoring state by signing into a website after install seems totally reasonable (to me*).

*probably why they don't let me anywhere near product development 😜

nuxodin commented 4 years ago

I think the question of fingerprints is a serious problem. The only solution I can think of would be a central registry, which would only allow one url per domain and language.

"start_url": "https://pwa-registry.org?url=example.com/app.webmanifest&lang=en"

The url https://pwa-registry.org?url=example.com/en.webmanifest&lang=en then would redirect to example.com/en.webmanifest or deliver its contents.

Someone like mozilla would have to maintain the registry.

I would even assume this is an installability signal.

alancutter commented 4 years ago

This adds a new privacy risk of having a central server see every web app launch on the internet.

marcoscaceres commented 4 years ago

I honestly don't think there is a way to solve this. It's inherent in the design of URLs that you can encode unique identifiers into them by using an unlimited range of patterns and by mixing and matching their structures.

alancutter commented 4 years ago

User agents can solve this by reinstalling the web app from a fixed install URL designated by the user after every session similar to using bookmarks in an incognito window.

marcoscaceres commented 4 years ago

@alancutter, I don't follow... can you give a concrete example of what you mean? I'm looking at my bookmarks in private browsing mode, and I don't see the browser changing them in any way when I click on them?

alancutter commented 4 years ago

Given an install URL decided by the user/admin policy (not the app and not containing tracking data) the browser could do a fresh install of the app for every user session. This resolves the tracking problem by making start_url ephemeral rather than persistent.

mgiuca commented 4 years ago

@alancutter I don't think that's very useful (unless the app is being controlled by an administrator or particularly careful user who is inspecting the URLs of the manifests being installed). You can always encode user-identifying info into one of the many URLs. If we fresh install "the app" every session, we're fresh installing from some manifest URL which could have user IDs in it. Or from a start URL that has IDs in it. At some point, what we consider to be "the app" could in reality be one of millions of different apps, one for each user.

The only way to prevent that is to have the user manually inspect all the URLs to see if any of them have something that might look like an ID. That's not feasible for the majority of users. Even a power user ... well how am I going to know if something is an ID or just something like a content hash?

The problem becomes quite intractable to solve properly even for power-users. I think we should just admit that it's a potential attack.

marcoscaceres commented 4 years ago

Agree. I'm closing this as we acknowledge this problem, but it's not solvable because it's inherent to URLs. We let implementers know this is a problem and provide possibilities to mitigate through the UI. https://www.w3.org/TR/appmanifest/#privacy-consideration-start_url-tracking

npdoty commented 4 years ago

I laid out three possible approaches here: https://github.com/w3c/manifest/issues/399#issuecomment-534274801 Can we document which of these we are pursuing? (It seems like either 2, but maybe 3.)

To repeat that question: How should clearing local state interact with installed PWAs?

The current privacy note in the spec just suggests that maybe users should be able to inspect the URL and hope they find, recognize and realize they can remove identifiers. And if the user clears local state, do we expect the start_url to include data on the user and re-spawn all their cookies?

wwwizzarrdry commented 4 years ago

It's possible the recommendation could be that UAs strip any query string (or fragment identifier) from the URL when launching, but there are likely legit, non-privacy-invasive uses for these as well (e.g., language preference).

I came to post the same solution, but in cases of legit, useful parameters, it's not very hard to expand the manifest params to include these "legit, non-privacy-invasive uses". If language prefs are vital to PWAs, it can have its own key:value pair within the manifest, while still trimming start_url down to just the top level domain.

start_url could even be parameterized from a list of approved key:value pairs within the manifest, and just drop any that don't match the key, or the format of the value isn't recognized.
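
A rough sketch of the allow-list idea, assuming the user agent holds a table of approved parameter names and accepted values. The table `ALLOWED_PARAMS` and the function `sanitize_start_url` are hypothetical names, not part of any spec or browser.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical allow-list: parameter name -> set of accepted values.
ALLOWED_PARAMS = {"lang": {"en", "fr", "de"}}


def sanitize_start_url(url: str, allowed=ALLOWED_PARAMS) -> str:
    """Keep only approved key/value pairs and drop the fragment."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in allowed and value in allowed[key]
    ]
    # Rebuild the URL with the filtered query and an empty fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


print(sanitize_start_url("https://example.com/app?lang=en&uid=8f3a#x"))
```

Note that this sketch leaves the path untouched, so an identifier hidden in a path segment would still get through.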

ssb22 commented 4 years ago

start_url is not the only place that could store a user ID. It could also be embedded into one of the Javascript files of the app itself, if the server is able to send a customised version of the app on every download. The only way to be sure you've cleared everything is to uninstall.

pizzapanther commented 4 years ago

Sounds like what is needed is a watchdog service that checks manifest files for privacy concerns and app stores that accept PWAs should have a check for this also. Also if disallowing randomly generated strings or user state in the URL were mentioned in the spec then maybe manifest validators would check for it. So while this answer is outside of an immediate fix, being in the spec gives better direction into validation of manifests.

ssb22 commented 4 years ago

But it's not just the manifest file that may contain an ID.

What if I modify my server as follows: Whenever any HTML or Javascript file is fetched, the server adds the current time as a string to its contents (in a suitable place). If any browser tries to re-fetch it with the If-Modified-Since HTTP header set, the server returns Not Modified.

Using this server, you can make a Progressive Web App that 'knows' the exact time at which it was downloaded, without needing to put anything special in the Manifest. Provided the download rate is not too high, this timestamp can be used to identify a user across cookie-clear events etc, unless a fresh copy of ALL files is re-downloaded (i.e. app is uninstalled and reinstalled).

Policing the Manifest won't help unless you also police the server that serves the rest of the files (a trusted "app store" server should be OK, but a third-party server can do tricks).
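
The timestamp trick described above can be sketched as a toy request handler. This is a simulation under stated assumptions, not a real server: `handle_request` and the `DOWNLOADED_AT` constant it emits are hypothetical names, and the headers are passed in as a plain dictionary.

```python
from datetime import datetime, timezone
from typing import Tuple


def handle_request(headers: dict, now: datetime) -> Tuple[int, str]:
    """Serve a script stamped with the download time, or 304 on revalidation."""
    if "If-Modified-Since" in headers:
        # Claim nothing changed, so the client keeps its stamped copy.
        return 304, ""
    stamp = now.isoformat()
    body = f'const DOWNLOADED_AT = "{stamp}";\n'
    return 200, body


first_visit = datetime(2020, 7, 1, 12, 0, 0, tzinfo=timezone.utc)
status, cached_body = handle_request({}, first_visit)

# A month later the browser revalidates and is told to reuse its cached copy.
later_status, _ = handle_request(
    {"If-Modified-Since": "Wed, 01 Jul 2020 12:00:00 GMT"},
    datetime(2020, 8, 1, tzinfo=timezone.utc),
)
print(status, later_status)
```

The stamped copy in the cache then acts as the identifier, with nothing unusual in the manifest.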

pizzapanther commented 4 years ago

I think you need to sign the app and have a predefined list of files signed. This is what happens in the native app process. This definitely makes the process of changing code less flexible but also increases security. Right now most apps are open to man in the middle attacks and file tampering. So I could see PWAs going that direction no matter what.
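
As a rough illustration of the "predefined list of files" part only, the sketch below compares served files against a list of SHA-256 digests. It does not implement signing: a real scheme would also need the digest list itself to be signed by a key the user agent trusts, which is not shown. The function names are hypothetical.

```python
import hashlib


def digest(content: bytes) -> str:
    """Return the hex SHA-256 digest of a file's content."""
    return hashlib.sha256(content).hexdigest()


def build_file_list(files: dict) -> dict:
    """Map each path to the digest of its reviewed content."""
    return {path: digest(content) for path, content in files.items()}


def verify(files: dict, expected: dict) -> list:
    """Return the paths that are missing from the list or whose content differs."""
    return sorted(
        path for path, content in files.items()
        if expected.get(path) != digest(content)
    )


reviewed = {"index.html": b"<h1>App</h1>", "app.js": b"console.log('hi');"}
file_list = build_file_list(reviewed)

# The server later appends a per-user value to one script.
served = {**reviewed, "app.js": b"console.log('hi'); // id=8f3a"}
print(verify(served, file_list))
```

A check like this would flag the per-download customisation discussed earlier in the thread, at the cost of making every code change require a new list.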

lknik commented 4 years ago

But it's not just the manifest file that may contain an ID. What if I modify my server as follows: Whenever any HTML or Javascript file is fetched, the server adds the current time as a string to its contents (in a suitable place). If any browser tries to re-fetch it with the If-Modified-Since HTTP header set, the server returns Not Modified. ..

I would be wary of inventing clever schemes, which can always be done one way or another in relation to many web features. I simply feel it would be more useful to limit the focus to the PWA/manifest/start_url. I simply fear that if we continue to expand the view here, we may end up in an undesirable place ;-)

lknik commented 4 years ago

I think you need to sign the app and have a predefined list of files signed. This is what happens in the native app process. This definitely makes the process of changing code less flexible but also increases security. Right now most apps are open to man in the middle attacks and file tampering. So I could see PWAs going that direction no matter what.

Interesting. So something like Certificate Transparency - but intended for dynamic manifest files, like say, Manifest Transparency Extension?

(It would require additional infrastructure, though - I just do not know if we are there today with regard to how seriously privacy/tracking is treated in practice, and so whether there is the motivation for rolling out such a scheme)

pizzapanther commented 4 years ago

Interesting. So something like Certificate Transparency - but intended for dynamic manifest files, like say, Manifest Transparency Extension?

Yes, I think something like that is going in the right direction. And yes, it would require more infrastructure. That's why I think the first step is to put it in the spec and then start a monitoring service. Giving PWAs a privacy "rating" with such a service might be enough of an incentive to discourage the practice without a huge Transparency framework. Although as PWAs gain more traction and higher privileges, I think you'll go in that direction anyway for both privacy and security.

marcoscaceres commented 4 years ago

A proposal made internally was just to use a well-known URL. That would basically solve most things: it strips fragments, queries, and arbitrary paths where identifying information could be stored (doesn't solve for sub domains, only tld+1 would do that but that seems impractical for things like GitHub pages).

That could then be coupled with a hybrid solution: when a user installs an app, partition it into its own storage compartment. Then, for sites that depend on authentication, require the user to log in again using password autofill, webauthn, WebOTP, Credential Management API, or whatever standard authentication mechanism the site depends on. It's a small inconvenience for a big privacy assurance.
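
The well-known URL part of this proposal could look roughly like the sketch below. The thread does not name an actual path, so `WELL_KNOWN_PATH` is a made-up placeholder, and `well_known_start_url` is a hypothetical helper.

```python
from urllib.parse import urlsplit

# Hypothetical path: the proposal does not say which well-known URL would be used.
WELL_KNOWN_PATH = "/.well-known/app-start"


def well_known_start_url(declared_start_url: str) -> str:
    """Ignore the declared path, query and fragment; keep only scheme and host."""
    parts = urlsplit(declared_start_url)
    return f"{parts.scheme}://{parts.netloc}{WELL_KNOWN_PATH}"


print(well_known_start_url("https://user123.example.com/u/8f3a/index.html?uid=8f3a#x"))
```

The path, query and fragment identifiers are gone, but the `user123` subdomain survives, which is the limitation noted above.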

dominickng commented 4 years ago

Using a well known URL creates a large migration problem: all currently installed PWAs that don't already conform to the well-known URL would be broken, and for many sites, fixing that problem would require a site re-architecture that might not be that likely to happen. How would that problem be practically addressed?

Additionally, removing fragments, queries, and arbitrary paths removes a significant amount of positive utility (the classic tradeoff of the design of URLs). How could you replicate that utility?

marcoscaceres commented 4 years ago

Using a well known URL creates a large migration problem: all currently installed PWAs that don't already conform to the well-known URL would be broken, and for many sites, fixing that problem would require a site re-architecture that might not be that likely to happen. How would that problem be practically addressed?

We'd have to start warning and ask developers to migrate over the next N years. Or a browser vendor would need to take the compat hit. Alternatively, we see what percentage would be impacted and make some determination based on that.

Additionally, removing fragments, queries, and arbitrary paths removes a significant amount of positive utility (the classic tradeoff of the design of URLs). How could you replicate that utility?

Those we would need to look at on a case-by-case basis and see if we can provide the same utility in some other way.

alancutter commented 4 years ago

Do the same privacy concerns apply to all URLs in the manifest that get navigated to? E.g. file handlers and shortcuts.

mgiuca commented 4 years ago

Yes. I don't think special-casing start_url really helps here. The fundamental fact that you have an app installed means that there is a potential unique identifier associating your device with that site, stored on your computer, which may be reported back to that site, and used to regenerate cookies.

This isn't specific to PWAs. This is true of bookmarks and any other mechanism that saves URLs to later navigate back to the site. (As discussed much earlier on in this thread.)

The most helpful approach which I'd like to focus on is @npdoty 's thoughts along the lines of clearing storage. In my opinion, we should treat the existence of a PWA installed on the user's device as another form of local storage, like a cookie or indexed DB. If you clear cookies for an origin, but you don't uninstall the PWA, then you haven't completely cleaned out the presence of that origin on your device.

Therefore, I think the best recommendation we can make to browser manufacturers is that any dialogue that offers to clear cookies and other local storage should also offer to uninstall any PWAs or shortcuts (and maybe bookmarks?) whose scope lies in that origin. A "clear all" button (or "select all" checkbox) should include clearing PWAs.

Edit: I filed crbug.com/1112220 to track this in Chromium.
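
One way to read this recommendation is as a lookup from the origin being cleared to the installed apps whose scope falls under it. The sketch below assumes the user agent keeps a list of scope URLs for installed apps; `origin_of` and `apps_to_uninstall` are hypothetical names, and default ports are not normalised.

```python
from urllib.parse import urlsplit


def origin_of(url: str) -> tuple:
    """Return a (scheme, host, port) tuple; default ports are left as None."""
    parts = urlsplit(url)
    return (parts.scheme, parts.hostname, parts.port)


def apps_to_uninstall(installed_scopes: list, cleared_origin: str) -> list:
    """List the installed app scopes that lie in the origin being cleared."""
    target = origin_of(cleared_origin)
    return [scope for scope in installed_scopes if origin_of(scope) == target]


installed = [
    "https://example.com/app/",
    "https://example.com/mail/",
    "https://other.example/app/",
]
print(apps_to_uninstall(installed, "https://example.com"))
```

A clearing dialog could then offer the returned apps for uninstallation alongside cookies and other storage for that origin.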

Trapfether commented 3 years ago

As @mgiuca and others have said, special-casing the start_url would really only move the goalposts. Installing an application is inherently an act of storage. The best solution to this issue would be to allow the user to uninstall the application via the clear data functions of various UAs. I would suggest that any clear-all function allow opting out of uninstallation, especially for an option to "clear all data across multiple origins", as I am sure that removing a number of apps from the user's device by clearing data would confuse many users and lead to unintended data loss. It is also quite possible that users will see installed apps and their website counterparts as separate and not expect clearing data from one to affect the other at all. So maybe during a clearing dialog the user is presented with the option to check the current context (web / app) and the other context.

dmurph commented 2 years ago

@npdoty said:

... It seems like either:

we indicate in the spec that start_url should not be used for user-specific data storage, and then browsers and researchers can work on means to detect and block such usage (proxies, URL modification, removing start_url, observatories being some brainstormed suggestions raised in this thread); or, ...

We (editors) agree that something like 1 would be good guidance. We can include language like that in the spec to dissuade user ids in the start_url. We will send a pull request.

marcoscaceres commented 2 years ago

We've put up some proposed text to address this at https://github.com/w3c/manifest/pull/1029 ... we would appreciate feedback on that to close off this issue.

lknik commented 2 years ago

I think it's fair. It's just that it does not prescribe any solution. Isn't it too soon to close the issue? It seems to be open since 2015 for a good reason.
