matomo-org / matomo


Add an anonymize IP addresses setting #692

Closed anonymous-matomo-user closed 14 years ago

anonymous-matomo-user commented 15 years ago

In Germany, Piwik would be a much better alternative to Google Analytics if there were an option to deactivate the storing of IP addresses. Maybe IP addresses could optionally be saved "hashed" instead of in "cleartext"?

Background: In Germany the situation is not clear - maybe storing of ip-addresses is not allowed because it harms privacy. See: [http://forum.piwik.org/index.php?showtopic=825].

Piwik would be the first software I know of with this "feature", so every German company could use it without risking problems with privacy law.

anonymous-matomo-user commented 14 years ago

Attachment: suggestion.diff

robocoder commented 15 years ago

(Do German companies disable their firewall and web server log files too?)

anonymous-matomo-user commented 15 years ago

Replying to vipsoft:

(Do German companies disable their firewall and web server log files too?)

Well, some court decisions say the IP address is personal information (at least you can track it down to the ISP customer), and some say it is not. And we still have some fragments of data privacy... However, I don't think that most companies switch off the logging of IP addresses, since it is needed for various reasons (intrusion detection, for example). Also, most forums and similar software store the IP address too. Btw, any staff member who has insight into private data of customers has to sign a confidentiality undertaking.

The problem with Google Analytics and any outsourced web analytics solution is that private data is given away to other companies. This may be okay in Germany. This could work in the EU. But it is risky when the data is transferred to other countries with a weaker or quite different privacy policy. Thus, some institutions are of the opinion that the usage of Google Analytics in Germany is illegal (http://www.internetworld.de/Nachrichten/News/Datenschuetzer-halten-Google-Analytics-fuer-rechtswidrig).

All told, the legal situation is vague and unclear. I'd say as long as you don't give the data away and have it stored safely, you have nothing to fear.

Admittedly, I am not a lawyer.

anonymous-matomo-user commented 15 years ago

(Do German companies disable their firewall and web server log files too?)

I think most of them don't do this. But it would be safer for a company to do this if it wants to be sure not to violate privacy.

anonymous-matomo-user commented 15 years ago

If we store a hash of the IP address, I think the entry can still be related back to the IP. If an authority seized a server running Piwik, they would still be able to prove that a certain person has accessed it by just calculating the hash of the suspect's IP and comparing it to the database. So privacy does not just mean protection from the owner of the server.
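The comparison described above can be sketched in a few lines. This is only an illustration: `hashIp()` is a hypothetical helper standing in for whatever an unsalted "hashed IP" column would contain, and the addresses are made up.

```php
<?php
// Hypothetical helper (not Piwik code): an unsalted hash of the IP string.
function hashIp(string $ip): string
{
    return md5($ip);
}

// Value found in the seized database (computed here just for the demo).
$storedHash = hashIp('10.11.12.13');

// An investigator hashes each candidate address and compares the results.
$suspects = ['192.168.0.1', '10.11.12.13', '172.16.5.4'];
$matches = array_values(array_filter(
    $suspects,
    fn (string $ip): bool => hash_equals($storedHash, hashIp($ip))
));

echo implode(', ', $matches), PHP_EOL; // 10.11.12.13
```

Because the IPv4 address space is small, the same loop could in principle be run over every possible address, which is why a full hash alone does not hide the IP.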

My suggestion: Just store a truncated hash of the IP.

Code example:

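$ip = '10.11.12.13';
$ipAsLong = ip2long($ip);                      // IPv4 address as an integer
$ipAsHash = hexdec(md5((string) $ipAsLong));   // MD5 digest as a (very large) float
// Keep only the first 9 decimal digits, so the value still fits into 32 bits
$anonIpAsLong = substr(number_format($ipAsHash, 0, ',', ''), 0, 9);
$backToIp = long2ip((int) $anonIpAsLong); // Result: 25.68.34.54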

This way, Piwik would still have an IP to work with. But it cannot be related back to a real IP, because the hash is not complete. Yes, hash collisions will probably occur. But as the IP address is only one part of the user identification, it will not happen very often that we miss a visitor. Plus, there's no need to change the table to hold the large full hash.

I really don't know anything about Piwik's plugin system, but in #312 it looks pretty easy to provide this kind of functionality.

Is this stupid? Will this break any other features?

anonymous-matomo-user commented 15 years ago

Looking at the code, I'm afraid it is not possible to catch all access to the IP address with plugin hooks. So the change should probably be done in the core, which seems to have been turned down already on #312. Still, this could be a quick fix for paranoid people:

In core/Common.php, change the getIp() function like this

static public function getIp(){
    return sprintf("%u", (int)substr(number_format(hexdec(md5(ip2long(self::getIpString()))),0,',',''),0,9) );
}

robocoder commented 15 years ago

joux: it should be possible to catch all accesses (now that #825 [1344] has been committed to SVN).

As for a truncated hash... In the case where an "authority" seizes a server, then they inherently have the authority to inspect more than just the database, right? (e.g., server logs) Hashing would seem to strike a balance between user privacy and cooperating with law enforcement.

That said... a requirement for this plugin should be to implement a framework for anonymizing IP addresses. Site operators can then customize/extend the implementation to suit their needs, since anonymity and functionality are (roughly) inversely proportional to each other.

anonymous-matomo-user commented 15 years ago

As far as I've seen, the IP from the database is already used in recognizeTheVisitor(), in order to decide whether the user is known. So neither Tracker.newVisitorInformation nor Tracker.knownVisitorInformation have been called so far. In order to keep the (anonymized) IP as a possibility to recognize a user, the matched IP must be hashed+truncated at that point, too.

Maybe yet another hook in getUserSettingsInformation() would allow for a generic filtering possibility before the data is used/saved anywhere. But that's just a quick guess.

anonymous-matomo-user commented 14 years ago

I would like to support the feature request and maybe add some hint for easier implementation:

  1. In Germany, storing of IP addresses is not allowed. I just read of an order by the Berlin data protection commissioner prohibiting a blogger from storing IP addresses. Thus, using Piwik poses the risk of a regulatory offense in Germany.
  2. If the only problem using a hash of the IP-address is the limitation to a BigInt, then you could just make it fit (similar to what joux proposed):
$ip_hash = md5($ip) MOD 2^64; // make $ip_hash fit into a BigInt which is of 8 bytes size.
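As a rough sketch of the `MOD 2^64` idea in actual PHP: taking an MD5 digest modulo 2^64 is the same as keeping its low 64 bits, i.e. the last 16 hex digits. The helper name `ipHashLow64()` is made up for illustration and is not part of Piwik.

```php
<?php
// Hypothetical helper (not Piwik code): md5($ip) mod 2^64 is simply the
// low 64 bits of the digest, i.e. its last 16 hex digits.
function ipHashLow64(string $ip): string
{
    return substr(md5($ip), -16);
}

$hash = ipHashLow64('10.11.12.13');
echo $hash, PHP_EOL; // 16 hex characters = 8 bytes
```

The value is kept as a hex string here because PHP integers are signed 64-bit: `hexdec()` would return a float, and lose precision, for values above `PHP_INT_MAX`.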

Can anyone tell me, where in the core I would have to change this? I want to continue using Piwik but want to comply with German law as well...

anonymous-matomo-user commented 14 years ago

The discussion about the use of Google Analytics in Germany is being continued. There will probably be no legal restrictions against it, but a workgroup of privacy boards seems to be working on a list of recommendations for website owners, that excludes the use of GA. (German source: http://www.zeit.de/digital/datenschutz/2009-11/google-analytics-datenschutz?page=all)

This means that the interest in a self-hosted statistics tool like Piwik could rise soon, if it complies with the recommendations (does not permanently store IP addresses).

Could we move this feature request to an earlier milestone, like 0.6?

mattab commented 14 years ago

There is a lot of interest from German users in making this happen. If Piwik can be the solution for German websites, this would greatly help the German community and would help Piwik. However, I would recommend doing this in core with an enable/disable setting rather than in a plugin.

anonymous-matomo-user commented 14 years ago

Thank you for changing the milestone.

I made a quick try to make a patch out of my above suggestion. Maybe it helps as a suggestion only, as I'm not a programmer.

robocoder commented 14 years ago

The solution is non-trivial.

anonymous-matomo-user commented 14 years ago

Google Analytics is being criticized mostly because the servers are outside the EU and Google will cooperate with legal authorities worldwide, not all of which comply with the laws and regulations of the EU. Additionally, Google gives no guarantee that it will not use your data for its own purposes. Here Piwik is already leading the discussion :-)

But: storing the IP address is not recommended, no matter whether we are talking about Google or any other web tracking service, and whether it is based in Germany or anywhere else (like eTracker or WebTrekk). According to a resolution (see below), it is not even allowed to process IP addresses, let alone store them. This results in

The resolution, issued by the "obersten Aufsichtsbehörden für den Datenschutz im nicht-öffentlichen Bereich" (the supreme supervisory authorities for data protection in the non-public sector), can be found here at [http://www.lfd.m-v.de/dschutz/beschlue/Analyse.pdf].

German web tracking companies like eTracker already offer their customers a setting that conforms to this paper.

Since I am not a lawyer, I have no idea how binding this resolution is, whether you still have a choice, or what will happen if you don't adhere to it.

But I think it would be a good idea to give admins the choice, like some companies do, and I would prefer to be able to do this in the core settings (no plugin).

anonymous-matomo-user commented 14 years ago

Couldn't this be done in the core quite easily by modifying just one critical line of code? I'm not sure if I reference the correct line in my examples, but there must be a line of code in Piwik which accesses the IP address the first time. So I'll continue without knowing whether I'm talking about the right code example. Please read the following just as an informal proposal guessing this would be the correct line in the code :-) .

1) The 100% solution would be to throw the IP completely away the first time Piwik sees it. According to some blog posts in the www this happens in /core/Tracker/Visit.php (this is pure theory, I absolutely have no clue ;-) ):

'location_ip' => $userInfo['location_ip'],

=>

'location_ip' => 0,

So there would be an empty location_ip field in the database. As far as I have read, Piwik would still work except for a few plugins. I think it would be very cool to have an option "Throw IP away" (for hardliner Germans :-) of which there would be quite a few, I guess).

2) A medium approach could be the use of hashes. I don't think, it's necessary to store the complete output of MD5 to avoid all collisions (mathematically there also can be collisions in the 16 byte md5 output of PHP). So I propose to accept some risk of collisions and to store the hash-output in the location_ip:

'location_ip' => hexdec(substr(md5($userInfo['location_ip']), -8)),

The code is completely untested and shall only demonstrate the idea! Taking the last 4 bytes or the first ones shouldn't make a difference: On modification of one bit in the input of a valid hash function every bit in the output flips with the same probability of 50%. This approach could be expressed in an option "Use 4 byte hash".

3) I would take one further step and use 1 of the 4 bytes of the hashed IP to mark that this is no valid IP address. E.g. this could be achieved by using the 0.X.X.X/8 subnet:

'location_ip' => hexdec(substr(md5($userInfo['location_ip']), -6)),

This approach ("Use 3 byte hash") would result in more than 16 million possible hash values, which should be enough for most use cases, I guess.

I really appreciate your work and I think this would be a great feature for many, many German sites, especially in the governmental area. Your approach to logging already matches the requirements of German privacy laws very well. The IP killing feature would be the perfection that no other approach could guarantee that easily.
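To illustrate why a 3-byte hash always lands in 0.X.X.X, here is a small untested-in-Piwik sketch. `threeByteIpHash()` is a hypothetical helper, and it assumes the IP is available as a string at that point.

```php
<?php
// Hypothetical helper (not Piwik code): keep only the last 6 hex digits
// (24 bits) of the MD5 digest of the IP.
function threeByteIpHash(string $ip): int
{
    return (int) hexdec(substr(md5($ip), -6));
}

$value = threeByteIpHash('10.11.12.13');

// The value is at most 0xFFFFFF, so the first octet is always 0 when the
// number is rendered as a dotted-quad address.
echo long2ip($value), PHP_EOL;
```

Since 0.0.0.0/8 is not used for ordinary hosts, any stored value with a leading 0 octet can be recognised as a hash rather than a real address.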

anonymous-matomo-user commented 14 years ago

http://www.braekling.de/web-development/2758-mal-wieder-piwik-und-anonymisierte-ips.html

anonymous-matomo-user commented 14 years ago

I really don't see the point of saving the hash. I wrote a plugin ( #1168 ) that trims the IP right before the data is saved to the database. The hashing over configuration and IP (to recognize users in "recognizeTheVisitor") is done before that, so there is no conflict. I've written a post to explain my arguments (English & German).

robocoder commented 14 years ago

(In [1877]) fixes #692 - plugin (deactivated by default) to anonymize visitor IP addresses; the number of octets to mask is configurable; let me know if I've missed any edge cases in the unit tests

anonymous-matomo-user commented 14 years ago

Great, thanks. Works as expected.

anonymous-matomo-user commented 14 years ago

Great, thank you!

mattab commented 14 years ago

Great, Anthon! This will make the Germans very happy.

anonymous-matomo-user commented 14 years ago

Thank you, but where do I configure the number of octets?

anonymous-matomo-user commented 14 years ago

The Live Visitor-Plugin still shows IPs :-(

anonymous-matomo-user commented 14 years ago

@jimbo Are you sure that you are working with Piwik 0.5.5? The IPs shouldn't be listed in the Live Visitor-Plugin anymore.

You can define the number of octets in your config.ini.php:

[Tracker]
ip_address_mask_length = 2
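For anyone wondering what a mask length of 2 means in practice, here is a minimal sketch. It assumes the masked octets are the trailing ones of an IPv4 address; `maskIpv4()` is a hypothetical helper for illustration, not the plugin's actual code.

```php
<?php
// Hypothetical helper (not the plugin's code): zero the last $maskLength
// octets of a dotted-quad IPv4 address.
function maskIpv4(string $ip, int $maskLength): string
{
    if (filter_var($ip, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) === false) {
        throw new InvalidArgumentException("Not an IPv4 address: $ip");
    }
    $maskLength = max(0, min(4, $maskLength)); // clamp to 0..4 octets

    $octets = explode('.', $ip);
    for ($i = 4 - $maskLength; $i < 4; $i++) {
        $octets[$i] = '0';
    }
    return implode('.', $octets);
}

echo maskIpv4('10.11.12.13', 2), PHP_EOL; // 10.11.0.0
```

A larger mask length gives more anonymity but makes visitors from the same network harder to tell apart.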