Closed anonymous-matomo-user closed 14 years ago
Attachment: suggestion.diff
(Do German companies disable their firewall and web server log files too?)
Replying to vipsoft:
(Do German companies disable their firewall and web server log files too?)
Well, some court decisions say the IP address is personal information (at least you can trace it back to the ISP customer); some say it is not. And still we have some fragments of data privacy... However, I don't think that most companies switch off the logging of IP addresses, since it is needed for various reasons (intrusion detection, e.g.). Also, most forums and similar software store the IP address too. Btw, any staff member who has insight into customers' private data has to sign a confidentiality undertaking.
The problem with Google Analytics and any outsourced web analytics solution is that private data is given away to other companies. This may be okay in Germany. This could work in the EU. But it is risky when the data is transferred to other countries with a weaker or quite different privacy policy. Thus, some institutions hold the opinion that the use of Google Analytics in Germany is illegal (http://www.internetworld.de/Nachrichten/News/Datenschuetzer-halten-Google-Analytics-fuer-rechtswidrig).
All told, the legal situation is vague and unclear. I'd say as long as you don't give the data away and have it stored safely, you have nothing to fear.
Admittedly, I am not a lawyer.
(Do German companies disable their firewall and web server log files too?) I think most of them don't do this. But it would be more secure for a company to do this, if the company wants to be sure not to violate privacy.
If we store a hash of the IP address, I think the entry can still be related back to the IP. If an authority seized a server running Piwik, they would still be able to prove that a certain person has accessed it by just calculating the hash of the suspect's IP and comparing it to the database. So privacy does not just mean protection from the owner of the server.
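The lookup described above can be sketched in a few lines. This is only an illustration: `fullIpHash()` is a hypothetical helper, and the assumption is that the tracker stores an unsalted, untruncated MD5 of the address.

```php
<?php
// Hypothetical helper: the unsalted, untruncated hash a tracker might store.
function fullIpHash(string $ip): string
{
    return md5((string) ip2long($ip));
}

// Value found in the database of a seized server.
$storedHash = fullIpHash('10.11.12.13');

// Anyone who knows the scheme can test a suspect's address against it.
$suspectIp = '10.11.12.13';
$otherIp   = '10.11.12.14';

$suspectMatches = hash_equals($storedHash, fullIpHash($suspectIp));
$otherMatches   = hash_equals($storedHash, fullIpHash($otherIp));
```

Because the IPv4 space is small, the same comparison could in principle be repeated for every possible address, so a full hash offers little protection on its own.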
My suggestion: Just store a truncated hash of the IP.
Code example:
$ip='10.11.12.13';
$ipAsLong=ip2long($ip);
$ipAsHash=hexdec(md5($ipAsLong));
$anonIpAsLong=substr(number_format($ipAsHash,0,',',''),0,9);
$backToIp=long2ip($anonIpAsLong); // Result: 25.68.34.54
This way, Piwik would still have an IP to work with. But it cannot be related back to a real IP, because the hash is not complete. Yes, hash collisions will probably occur. But as the IP address is only one part of the user identification, it will not happen very often that we miss a visitor. Plus there's no need to change the table to hold the large full hash.
I really don't know anything about Piwik's plugin system, but in #312 it looks pretty easy to provide this kind of functionality.
Is this stupid? Will this break any other features?
Looking at the code, I'm afraid it is not possible to catch all access to the IP address with plugin hooks. So the change should probably be done in the core, which seems to have been turned down already on #312. Still, this could be a quick fix for paranoid people:
In core/Common.php, change the getIp() function like this
static public function getIp(){
return sprintf("%u", (int)substr(number_format(hexdec(md5(ip2long(self::getIpString()))),0,',',''),0,9) );
}
joux: it should be possible to catch all accesses (now that #825 [1344] has been committed to SVN).
As for a truncated hash... In the case where an "authority" seizes a server, then they inherently have the authority to inspect more than just the database, right? (e.g., server logs) Hashing would seem to strike a balance between user privacy and cooperating with law enforcement.
That said... a requirement for this plugin should be to implement a framework for anonymizing IP addresses. Site operators can then customize/extend the implementation to suit their needs, since anonymity and functionality are (roughly) inversely proportional to each other.
As far as I've seen, the IP from the database is already used in recognizeTheVisitor(), in order to decide whether the user is known. So neither Tracker.newVisitorInformation nor Tracker.knownVisitorInformation have been called so far. In order to keep the (anonymized) IP as a possibility to recognize a user, the matched IP must be hashed+truncated at that point, too.
Maybe yet another hook in getUserSettingsInformation() would allow for a generic filtering possibility before the data is used/saved anywhere. But that's just a quick guess.
I would like to support the feature request and maybe add some hint for easier implementation:
$ip_hash = md5($ip) MOD 2^64; // make $ip_hash fit into a BigInt which is of 8 bytes size.
Can anyone tell me, where in the core I would have to change this? I want to continue using Piwik but want to comply with German law as well...
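The "MD5 modulo 2^64" hint above could look roughly like this in PHP. This is a sketch under assumptions: `ipHash64()` is a hypothetical name, a 64-bit PHP build is assumed, and the result is a signed integer, so it would suit a signed 8-byte column rather than an unsigned one.

```php
<?php
// Hypothetical helper: reduce the MD5 of an IP string to 64 bits.
// Assumes a 64-bit PHP build.
function ipHash64(string $ip): int
{
    // md5(..., true) returns the 16 raw digest bytes; the last 8 bytes are
    // the low 64 bits, i.e. the digest modulo 2^64.
    $low8 = substr(md5($ip, true), 8, 8);

    // 'J' unpacks a big-endian 64-bit value. PHP integers are signed, so
    // values above PHP_INT_MAX come back negative (same bit pattern).
    return unpack('J', $low8)[1];
}

$hash = ipHash64('10.11.12.13');
```

Taking the raw bytes avoids `hexdec()`, which returns a float and loses precision once the value no longer fits into a PHP integer.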
The discussion about the use of Google Analytics in Germany is being continued. There will probably be no legal restrictions against it, but a workgroup of privacy boards seems to be working on a list of recommendations for website owners, that excludes the use of GA. (German source: http://www.zeit.de/digital/datenschutz/2009-11/google-analytics-datenschutz?page=all)
This means that the interest in a self-hosted statistics tool like Piwik could rise soon, if it complies with the recommendations (does not permanently store IP addresses).
Could we move this feature request to an earlier milestone, like 0.6?
There is a lot of interest from German users in making this happen. If Piwik can be the solution for German websites, this would greatly help the German community and would help Piwik. However, I would recommend doing this in core with an enable/disable setting rather than in a plugin.
Thank you for changing the milestone.
I made a quick try to make a patch out of my above suggestion. Maybe it helps as a suggestion only, as I'm not a programmer.
The solution is non-trivial.
Google Analytics is being criticized mostly because the servers are outside the EU and Google will cooperate with legal authorities worldwide, not all of which comply with the laws and regulations of the EU. Additionally, Google gives no guarantee that it will not use your data for its own purposes. Here Piwik is already leading the discussion :-)
But: Storing the IP address is not recommended, no matter whether we are talking about Google or any other web tracking service, whether based in Germany or anywhere else (like eTracker or WebTrekk). According to a resolution (see below), it is not even allowed to process IP addresses, let alone store them. This results in
The resolution, issued by the "obersten Aufsichtsbehörden für den Datenschutz im nicht-öffentlichen Bereich" (the supreme supervisory authorities for data protection in the non-public sector), can be found here at [http://www.lfd.m-v.de/dschutz/beschlue/Analyse.pdf].
German web tracking companies like eTracker already offer their customers a setting that conforms to this paper.
Since I am not a lawyer, I have no idea how binding this resolution is, whether you still have a choice, or what will happen if you don't adhere to it.
But I think it would be a good idea to give admins the choice, like some companies do, and I would prefer to be able to do this in the core settings (no plugin).
Couldn't this be done in the core quite easily by modifying just one critical line of code? I'm not sure if I reference the correct line in my examples, but there must be a line of code in Piwik which accesses the IP address the first time. So I'll continue without knowing whether I'm talking about the right code example. Please read the following just as an informal proposal guessing this would be the correct line in the code :-) .
1) The 100% solution would be to throw the IP away completely the first time Piwik sees it. According to some blog posts on the web, this happens in /core/Tracker/Visit.php (this is pure theory, I have absolutely no clue ;-) ):
'location_ip' => $userInfo['location_ip'],
=>
'location_ip' => 0,
So there would be an empty location_ip field in the database. As far as I have read, Piwik would still work, except for a few plugins. I think it would be very cool to have an option "Throw IP away" (for hardline Germans :-) of which there would be quite a few, I guess).
2) A middle-ground approach could be the use of hashes. I don't think it's necessary to store the complete output of MD5 to avoid all collisions (mathematically, collisions are also possible in the 16-byte MD5 output of PHP). So I propose to accept some risk of collisions and to store the hash output in location_ip:
'location_ip' => hexdec(substr(md5($userInfo['location_ip']), -8)),
The code is completely untested and shall only demonstrate the idea! Taking the last 4 bytes or the first ones shouldn't make a difference: when one input bit of a good hash function changes, every output bit flips with the same probability of 50%. This approach could be expressed in an option "Use 4 byte hash".
3) I would go one step further and use 1 of the 4 bytes of the hashed IP to mark that this is not a valid IP address. E.g. this could be achieved by using the 0.0.0.0/8 subnet (addresses of the form 0.X.X.X):
'location_ip' => hexdec(substr(md5($userInfo['location_ip']), -6)),
This approach ("Use 3 byte hash") would result in more than 16 million possible hash values (2^24 = 16,777,216), which should be enough for most use cases, I guess.
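As a minimal, standalone sketch of option 3 (not Piwik code): `threeByteIpHash()` is a hypothetical helper that takes the IP as a string, whereas the real `$userInfo['location_ip']` value may be stored in a different form.

```php
<?php
// Hypothetical helper: map an IP to a pseudo-address inside 0.0.0.0/8.
function threeByteIpHash(string $ip): string
{
    // Last 6 hex digits of the MD5 = 3 bytes = a value in [0, 2^24).
    $value = hexdec(substr(md5($ip), -6));

    // The top octet stays 0, which marks the result as "not a real address".
    return long2ip((int) $value);
}

$anonymized = threeByteIpHash('10.11.12.13');
```

The result always has the form 0.X.X.X, so a reader of the database can tell at a glance that the stored value is a hash and not a visitor's address.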
I really appreciate your work and I think this would be a great feature for many, many German sites, especially in the governmental area. Your approach to logging already matches the requirements of German privacy laws very well. The IP killing feature would perfect it in a way no other approach could guarantee so easily.
I really miss the point of saving the hash. I wrote a plugin ( #1168 ) that trims the IP right before the data is saved to the database. The hashing over configuration and IP (to recognize users in "recognizeTheVisitor") is done before that, so there is no conflict. I've written a post to explain my arguments (English & German).
(In [1877]) fixes #692 - plugin (deactivated by default) to anonymize visitor IP addresses; the number of octets to mask is configurable; let me know if I've missed any edge cases in the unit tests
Great, thanks. Works as expected.
Great, thank you!
Great, Anthon! This will make the Germans very happy.
Thank you, but where do I configure the number of octets?
The Live Visitor-Plugin still shows IPs :-(
@jimbo Are you sure that you are working with Piwik 0.5.5? The IPs shouldn't be listed in the Live Visitor-Plugin anymore.
You can define the number of octets in your config.ini.php:
[Tracker]
ip_address_mask_length = 2
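To illustrate what masking a configurable number of octets might mean, here is a sketch that is not the plugin's actual code. `maskIpv4()` is a hypothetical function, and it assumes IPv4 only, a 64-bit PHP build, and that the mask length counts trailing octets to zero out.

```php
<?php
// Hypothetical helper: zero out the last $maskLength octets of an IPv4 address.
// Assumes a 64-bit PHP build.
function maskIpv4(string $ip, int $maskLength): string
{
    $long = ip2long($ip);
    if ($long === false) {
        throw new InvalidArgumentException("Not a valid IPv4 address: $ip");
    }

    // Clamp to the 0..4 octets an IPv4 address has.
    $maskLength = max(0, min(4, $maskLength));

    // Build a 32-bit mask whose low ($maskLength * 8) bits are cleared.
    $mask = (0xFFFFFFFF << (8 * $maskLength)) & 0xFFFFFFFF;

    return long2ip($long & $mask);
}

echo maskIpv4('10.11.12.13', 2), "\n";
```

With a mask length of 2, the address 10.11.12.13 would be stored as 10.11.0.0 under these assumptions.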
In Germany, Piwik would be a much better alternative to Google Analytics if there were an option to deactivate the storing of IP addresses. Maybe IP addresses could optionally be saved "hashed" instead of in "cleartext"?
Background: In Germany the situation is not clear. Storing IP addresses may not be allowed because it harms privacy. See: [http://forum.piwik.org/index.php?showtopic=825].
Piwik would be the first software I know of with this "feature", so every German company could probably use it without running into problems with privacy law.