usefathom / fathom

Fathom Lite. Simple, privacy-focused website analytics. Built with Golang & Preact.
https://usefathom.com/
MIT License

Don’t assign unique identifiers/fingerprints to visitors by default #14

Closed da2x closed 6 years ago

da2x commented 6 years ago

Disclaimer: I’m not a lawyer and this isn’t legal advice.

It would be useful for General Data Protection Regulation (GDPR) compliance to not store IP addresses, cookie identifiers, or other unique fingerprints. The current unique identifiers can be decoded back to IP addresses. See EU GDPR and personal data in web server logs for context.

I’d like to see this as the default mode, but at least make it an option. This could be a unique selling point.

IP addresses also aren’t all that useful any more for assigning a unique identifier, as mobile devices roam between different networks several times in a normal day (at home, mobile carrier, work, café, etc.). IPv6 reduces the usefulness of this further by assigning new addresses periodically (usually once per reboot or reconnect, or every 48, 24, or 12 hours, depending on operating system and network environment).

Here are some ideas and alternative approaches to get the same data in aggregate without assigning unique identifiers to each user:

Pageviews per session and number of unique users:

  1. Set a cookie in the responses to /collect that keeps an incrementing, short-lived (session) pageview counter. E.g. Set-Cookie: ana_pageviews=1; path=/collect; max-age=3600 (1 hour session). On the second request, you send back the same cookie with a value of 2, etc.
  2. Increment $unique_sessions (unique users) by one per request without this cookie. Increment $sessions_with_atleast_2_pageviews by one, etc.
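
A minimal sketch of the two steps above, written in plain JavaScript rather than Fathom's actual Go backend. The function name `collectPageview`, the `totals` object, and the pre-parsed `cookies` argument are assumptions made for illustration; only the cookie name `ana_pageviews` and the one-hour max-age come from the proposal.

```javascript
// Hypothetical handler for one /collect hit. `cookies` is an already-parsed
// map of cookie names to string values; `totals` holds aggregate counters only.
function collectPageview(cookies, totals) {
  const previous = parseInt(cookies.ana_pageviews, 10);
  // A missing or invalid cookie means this is the first pageview of a session.
  const count = Number.isInteger(previous) && previous > 0 ? previous + 1 : 1;

  totals.pageviews += 1;
  if (count === 1) totals.uniqueSessions += 1;
  // Going from 1 to 2 pageviews is counted exactly once per session.
  if (count === 2) totals.sessionsWithAtLeast2Pageviews += 1;

  // The only per-visitor state is the counter sent back to the browser.
  return `ana_pageviews=${count}; Path=/collect; Max-Age=3600`;
}
```

The server never learns which visitor a hit belongs to; it only sees how far along that session's counter is.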

User-retention/repeat visitors:

  1. Set a cookie in the responses to /collect that includes an imprecise timestamp (e.g. only daily precision to avoid them being too unique). E.g. Set-Cookie: ana_lastvisit=2018-04-23; path=/collect; max-age=7776000 (3 months). Reset on every visit.
  2. Don’t copy the exact timestamps, but find the time since the last visit from now() - $cookie['ana_lastvisit']. Maybe don’t track this within an active session ($cookie['ana_pageviews'] is set)?
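
The last-visit idea could look roughly like this. It is only a sketch: the helper names `daysSinceLastVisit` and `lastVisitCookie` and the pre-parsed `cookies` argument are hypothetical, while the cookie name `ana_lastvisit`, the day precision, and the 7776000-second max-age are taken from the proposal.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns whole days since the date in the ana_lastvisit cookie (YYYY-MM-DD),
// or null when the cookie is missing or malformed (treated as a first visit).
function daysSinceLastVisit(cookies, now) {
  const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(cookies.ana_lastvisit || "");
  if (!match) return null;
  const last = Date.UTC(Number(match[1]), Number(match[2]) - 1, Number(match[3]));
  const today = Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate());
  return Math.round((today - last) / MS_PER_DAY);
}

// Builds the replacement cookie with day precision only; sent on every visit.
function lastVisitCookie(now) {
  const day = now.toISOString().slice(0, 10);
  return `ana_lastvisit=${day}; Path=/collect; Max-Age=7776000`;
}
```

Only the day difference would be added to an aggregate retention histogram; the date itself stays in the visitor's browser.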

What else is needed to track?

On the use of cookies: The cookies are transparent (even self-explanatory), their use is easy to explain in a privacy policy, and in my opinion they should be GDPR-friendly. They’re not used to track the behaviour of individual users, just the movements and trends in the herd.

Disclaimer: I’m not a lawyer and this isn’t legal advice.

dannyvankooten commented 6 years ago

Hey @da2x,

Thanks for the suggestion and the well-thought-out comment. This is great stuff!

We've been contemplating whether to use fingerprinting or a cookie and came to the same conclusion as you: a cookie is actually more privacy-friendly than fingerprinting, because users can delete it in order to be forgotten.

Eventually we want to support "visitor paths" and I have yet to think of a way to do that without assigning a unique identifier to each user. This can and should definitely be anonymous though, imo. Do you have any ideas on how you'd tackle visitor paths without the use of identifiers?

da2x commented 6 years ago

@dannyvankooten, how is information about the visitor path more useful than just storing the referrer information for a given page? If you can tell me more about the specific requirements of this feature, I may be able to come up with another solution to get the same data.

Just use the Referer header. In aggregate, it’s enough data to build a visitor-path without profiling individual users. E.g. 6 % of visitors to /b had /a as referrer and 3 % of visitors to /b were referred to by /c, etc. You can’t tell the journey of one specific visitor, but you can determine that page /a is better at referring users than page /c. If you then look at which pages referred to /a you can build a path:

/b referrers:
- /ddg.co 60 %
- - /a 6 %
- - - /ddg.co 80 %
- - - /d 8 %
- - /c 3 %
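
The aggregate referrer breakdown above can be sketched as follows. The function names and the nested-`Map` layout are assumptions for illustration, not anything Fathom ships.

```javascript
// Counts hits per (page, referrer) pair; no visitor identifier is involved.
function recordReferrer(counts, page, referrer) {
  if (!counts.has(page)) counts.set(page, new Map());
  const perPage = counts.get(page);
  perPage.set(referrer, (perPage.get(referrer) || 0) + 1);
}

// Returns each referrer's percentage share of the hits recorded for `page`,
// sorted from most to least common. Unknown pages yield an empty list.
function referrerShares(counts, page) {
  const perPage = counts.get(page) || new Map();
  let total = 0;
  for (const n of perPage.values()) total += n;
  return [...perPage.entries()]
    .map(([referrer, n]) => ({ referrer, percent: (100 * n) / total }))
    .sort((a, b) => b.percent - a.percent);
}
```

To walk one level up the path, call `referrerShares` again with one of the returned referrers as the page.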

(Speaking of aggregate data and referrer information: unique referrer data should be dropped after a week. E.g. if only 2 clicks came from example.com/webmail, then delete the link after a week. If 600 people came from reddit.com then that should be considered non-unique and you can keep storing it.)
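
A sketch of that pruning rule, assuming aggregate rows of the hypothetical shape `{ referrer, clicks, firstSeen }`. The comment gives only two examples (2 clicks versus 600), so the `minClicks` threshold is left as a parameter rather than guessed.

```javascript
const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

// Keeps a row if it is younger than a week, or if it has enough clicks
// that the referrer no longer singles anyone out.
function pruneRareReferrers(rows, nowMs, minClicks) {
  return rows.filter(
    (row) => nowMs - row.firstSeen < WEEK_MS || row.clicks >= minClicks
  );
}
```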

rosswintle commented 6 years ago

This is an interesting discussion and I hope you don't mind me both chipping in and following. @dannyvankooten - you look like you're WAY more into this than I am. But I'm currently working out this exact same thing on my (much smaller and probably less viable) Kownter project. In fact, one of the reasons I'm building it is to see what can be done without cookies.

Some kind of aggregate flow between pages, as @da2x suggests, is pretty much the thinking that I came to. I don't need to see individuals' paths if I can see the aggregate drop-off as people move through the user journey. In this post I attempted to explain this, saying: "we can still report the ratio of conversions against page views."

OK, so there are probably some advanced cases where you want to know more than that but if you want that then you probably actually want GA, right?

I've also been wondering if it makes sense, is useful, and is performant enough (server-side) to store a cookie, but to recycle it on each visit. And, if you do this, is it any more private than just setting a cookie and leaving it there? I'm not sure it is.

I am not actually convinced that a session ID is personally identifiable as there's no reverse-lookup. If you were storing a session ID alongside an IP address then that would be different. I know GDPR says something about web-tokens, but I think what you/we are doing here is way within the spirit of GDPR.

Interested to see how this progresses.

dannyvankooten commented 6 years ago

Hey @rosswintle,

I definitely do not mind - quite the opposite. Thank you for chiming in! Kownter looks super interesting, I'm glad that there are more people thinking about solving analytics in a better way and providing more options besides "just use GA, even though you don't use 90% of what they're collecting".

I'm going to go through all of your posts as there's most likely a ton to learn - I'm only just getting back into this project and forgot about a lot of the decisions that went into this when I started 18 months ago.

I am not actually convinced that a session ID is personally identifiable as there's no reverse-lookup.

Same for me, although there may be other off-site identifiers that will still give away this particular user (eg cross referencing timestamped actions in app?).

Anyway, at this stage I lack the intricate details to really add anything to this discussion, so I'll get right to reading your posts and experimenting to see if there's something we can do to tackle this. Dropping the unique visitor ID entirely is definitely worth striving for, I'd say!

rosswintle commented 6 years ago

Great! I think with the backing of @pjrvs and whatever design skills you have that are way better than mine, your project will be much less of a toy/experiment than mine. Plus, writing it in Go probably helps with the scaling a LOT (though I think the downside is that self-hosted deployment might be harder for some people).

Great that some of what I've done developing the early stages in the open might help. Always happy to talk about the experience! I'll leave you and @da2x to work out the specific issue listed here.

da2x commented 6 years ago

@rosswintle, put this on repeat and get back to working on Kownter. It’ll be a great alternative (more alternatives is great!) and you just need to stay motivated. Make it your own and it’ll probably turn out great. Nick some stylesheets and graphs from Fathom if you like the visuals and stuff your own data in them.

This isn’t legal advice and I’m not anyone’s lawyer. The following could very well be totally wrong: Fewer magical identifiers means more transparency. It also means people won’t contact the operator of the analytics service to ask for a copy of the data belonging to $magical_token or ask to have the data of $magical_token deleted. I specifically suggested cookie names that were named after the data they contain instead of a single magical cookie containing all the data. Individual cookies are easier for people to inspect. Opting out of this is as simple as disabling cookies, and their use is easy to document in a privacy policy.

The following is provided for context (GDPR sections mentioning identifiers as relevant to this discussion):

GDPR Recital 30:

Natural persons may be associated with online identifiers provided by their devices, applications, tools and protocols, such as internet protocol addresses, cookie identifiers or other identifiers such as radio frequency identification tags. This may leave traces which, in particular when combined with unique identifiers and other information received by the servers, may be used to create profiles of the natural persons and identify them.

GDPR Article 4 Definitions: Point 1:

‘personal data’ means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person;

dannyvankooten commented 6 years ago

Quick progress update: Fathom now relies on storing all visitor-specific data on the client side so that the server only has to keep track of aggregated data. Not only did this allow for much simpler code and better privacy (as discussed), it'll scale a whole lot better too. So thanks @da2x and @rosswintle for lending your brains here. Super helpful!

Right now visitors are still assigned a random ID but it is only stored for a theoretical maximum of 30 minutes (the expiration time of a session). If a visitor visits multiple pages, all pageview hits except the last one are deleted within 5 minutes (the time between aggregation).

The visitor ID is only needed to give an indication of realtime visitors (that is: distinct visitors that did a pageview or performed another event within the last 5 minutes). I haven't yet been able to come up with a way to do that without storing some kind of short-lived identifier on the server, but let me know if you have any suggestions here please.
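
For illustration, counting realtime visitors from short-lived IDs might look like the sketch below. This is not Fathom's actual implementation; the `lastSeen` map and both function names are assumptions, and only the five-minute window comes from the comment.

```javascript
const WINDOW_MS = 5 * 60 * 1000;

// `lastSeen` maps a short-lived random visitor ID to its latest hit time (ms).
function recordHit(lastSeen, visitorId, nowMs) {
  lastSeen.set(visitorId, nowMs);
}

// Drops IDs idle for longer than the window and returns how many remain.
function realtimeVisitors(lastSeen, nowMs) {
  for (const [id, seenAt] of lastSeen) {
    if (nowMs - seenAt > WINDOW_MS) lastSeen.delete(id);
  }
  return lastSeen.size;
}
```

The ID is what lets two hits from the same person count once, which is the part that is hard to replace with pure aggregates.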

Besides the visitor ID (a random string), no other identifiable data is stored anymore. :tada: :champagne:

rosswintle commented 6 years ago

Could you, for the purposes of real-time unique visitors, just ignore any hits with an internal (same site) referring page?

(This was a super-quick thought I wanted to jot down. Will properly read and think another time.)

rosswintle commented 6 years ago

This also gives me the thought that you can set cookies to identify returning users without having an ID in the cookie.

You just set tracked-with-fathom = 1 and pick it up server side.

The PECR rules would still (currently) need the existence of the cookie to be disclosed to users. But it should keep you free of GDPR “personal data” rules.

Hmm. 🤔

da2x commented 6 years ago

Quick progress update [...]

Greatly appreciated.

Could you, for the purposes of real-time unique visitors, just ignore any hits with an internal (same site) referring page?

I was about to suggest the same. The Referer header holds this information. I don't see what a random ID adds here.

This also gives me the thought that you can set cookies to identify returning users without having an ID in the cookie. You just set tracked-with-fathom = 1 and pick it up server side.

That serves the same purpose as the last-visit cookie I suggested earlier.

The PECR rules would still (currently) need the existence of the cookie to be disclosed to users.

(This isn't legal advice, and I'm not a lawyer.) The ePrivacy Directive requirements vary greatly from country to country. Some require consent and opt-in, some require a prompt and opt-out, some just require information, and some require browser settings to be respected (cookies, DNT). The ePrivacy Regulation (late 2018/early 2019?) changes this mess to the last option (browser settings) plus detailed documentation (privacy policy).

I'm personally aligning all my thinking with the ePrivacy Regulation + GDPR. The ePrivacy Directive and GDPR are mutually exclusive as far as I understand it, so the current situation is unclear. However, aiming for transparency (purposeful cookie names and matching privacy policies) and aiming for data minimization and avoiding any kind of user-profiling should be the way to go to be compliant with the intent of European regulations.

dannyvankooten commented 6 years ago

Returning visitors are indeed already tracked using the cookie (and not using the identifier), but the ID is used to get a list of the number of distinct visitors active in the last 5 minutes (but they might have arrived earlier than that).

I don't see how the server would be able to tell how many distinct visitors are online using the Referer header, especially if they have been browsing the site for a little while (so they are not a "new visitor" but they are still "online"). Am I overlooking something here?

rosswintle commented 6 years ago

That’s neat...I’d not thought of using cookie existence to track a return visit. I like that.

And yeah, you’re right. Without IDs you could track NEW live visitors, but not those that had been around for a while...this could be an explained limitation...or just use your temporary IDs!

Great discussion and thanks for the update.

da2x commented 6 years ago

Use the lastvisit cookie? If the last visit was within five minutes, count the hit as a pageview but not as a unique session/person; if it was more than five minutes ago (or the cookie is unset), it's a unique session/person.

Record the lastvisit offset in minutes. E.g. add a plus-one count in a table active_visitors with columns datetime (minute precision), 0min, 1min, 2min, ... 15min. For any minute of the day, you'd be able to see how many active sessions/people there were versus total pageviews by comparing pageview counts to the active_visitors table. This table should be cleared regularly, of course, and the data stored in aggregate in a more practical table.

This should get you the info you want from just a timestamp cookie.
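
A rough sketch of that bookkeeping. It assumes the cookie value has already been parsed into a millisecond timestamp (`lastVisitMs`, or null when unset), and it adds a `newSessions` counter to each row for convenience; the function name and row layout are hypothetical.

```javascript
const SESSION_GAP_MS = 5 * 60 * 1000;
const MAX_OFFSET_MIN = 15;

// `table` maps a minute-precision key (e.g. "2018-05-07T20:12") to a row with
// one counter per offset in minutes (0..15) plus a count of new sessions.
function recordActivity(table, lastVisitMs, nowMs) {
  const key = new Date(nowMs).toISOString().slice(0, 16);
  if (!table[key]) {
    table[key] = { offsets: new Array(MAX_OFFSET_MIN + 1).fill(0), newSessions: 0 };
  }
  const row = table[key];

  if (lastVisitMs != null) {
    const offsetMin = Math.floor((nowMs - lastVisitMs) / 60000);
    if (offsetMin >= 0 && offsetMin <= MAX_OFFSET_MIN) row.offsets[offsetMin] += 1;
  }

  // An unset cookie or a gap of more than five minutes counts as a new session.
  const isNewSession = lastVisitMs == null || nowMs - lastVisitMs > SESSION_GAP_MS;
  if (isNewSession) row.newSessions += 1;
  return isNewSession;
}
```

Nothing in a row points back to an individual hit, so the table can be rolled up and cleared on any schedule.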

ckluis commented 6 years ago

I'm 100% in favor of using no identifiers, but you could still store travel path

url: /          datetime: 1/1/2001 1:00
url: /contact   datetime: 1/1/2001 1:01

dannyvankooten commented 6 years ago

Quick update: current master no longer stores the anonymous session ID but moves knowledge of the previous pageview to the client instead, so that the client can tell us about it instead of us (the server) having to store something to get to that.

For a new visitor, the data sent to the Fathom backend will now look something like this. Given 3 pageviews:

GET /collect?id=abc&previous_id=&page=/about....
GET /collect?id=def&previous_id=abc&page=/about...
GET /collect?id=ghi&previous_id=def&page=/...

The "previous ID" is used to update the previous pageview (not a bounce, update time on page) but is not stored, so that each row in the pageviews table has nothing that leads back to the visitor generating that table entry.

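| id | hostname | page | new_visitor | new_session | unique_pageview | bounce | duration | referrer | timestamp |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| abc | http://site.com | /about | 0 | 0 | 1 | 0 | 0 | | 2018-07-11 12:49:04 |
| def | http://site.com | /about | 0 | 0 | 0 | 0 | 120 | | 2018-07-11 12:51:04 |
| ghi | http://site.com | / | 0 | 0 | 1 | 1 | 120 | | 2018-07-11 12:53:04 |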
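
The update step can be sketched like this. It is a simplified illustration, not the code in Fathom's Go backend, and it does not try to reproduce the exact column values shown above: the `pageviews` map, the `recordPageview` name, and the choice to start every row as a bounce are assumptions.

```javascript
// `pageviews` is a Map from pageview ID to a row. The previous ID is used only
// to find and update the earlier row; it is never written into any row.
function recordPageview(pageviews, hit, nowMs) {
  const previous = hit.previousId ? pageviews.get(hit.previousId) : undefined;
  if (previous) {
    previous.bounce = 0; // the visitor went on to another page
    previous.duration = Math.round((nowMs - previous.timestamp) / 1000); // seconds on page
  }
  pageviews.set(hit.id, {
    page: hit.page,
    bounce: 1, // assumed to be a bounce until a later hit says otherwise
    duration: 0,
    timestamp: nowMs,
  });
}
```

Because the link between two rows exists only in the request, a dump of the stored rows cannot be stitched back into one visitor's journey.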

rosswintle commented 6 years ago

That's very cool/smart.

I guess the ID is stored in the cookie for use in the next page view, right? But what does the ID actually represent and how is it generated? Is it some kind of hash of timestamp and device fingerprint?

rosswintle commented 6 years ago

(I'd read the code, but I'll be AFK very soon)

dannyvankooten commented 6 years ago

The relevant part is in assets/src/js/tracker.js. Mostly:

  const d = {
    id: util.randomString(20),
    pid: data.previousPageviewId || '',
    ....

So it's just a (truly) random string of 20 characters, leaving a small chance of collisions but given that the pageview table is cleared every minute I'd say the chances of that happening are nil. And even if it happens, it won't have any real consequences.
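
To put a rough number on "nil", here is a birthday-bound estimate. The alphabet size is an assumption (the thread does not say which characters `util.randomString` draws from), so treat the result as an order-of-magnitude sketch only.

```javascript
// Birthday-bound approximation: p ≈ n(n-1) / (2N), valid while p is tiny.
// N is the number of possible IDs, i.e. alphabetSize ** idLength.
function collisionProbability(liveIds, alphabetSize, idLength) {
  const space = Math.pow(alphabetSize, idLength);
  return (liveIds * (liveIds - 1)) / (2 * space);
}
```

With a hypothetical 62-character alphabet and a million IDs alive at once, `collisionProbability(1e6, 62, 20)` comes out far below one in a billion.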