thegooddata / webapp

TheGoodData web application
http://www.thegooddata.org
GNU General Public License v3.0

Automatically update webtrackers with adblockplus and privacy badger data #152

Open marcosmenendez opened 9 years ago

marcosmenendez commented 9 years ago

We should update the list of trackers to be blocked, while keeping our classification for the existing trackers.

Sources to consider are adblockplus (https://easylist-downloads.adblockplus.org/easyprivacy.txt) and privacybadger (https://github.com/EFForg/privacybadgerchrome/blob/master/doc/sample_cookieblocklist.txt).

The second, as you can read in the FAQ (https://www.eff.org/privacybadger), does not have a blacklist as such, but a yellow list of sites whose cookies it blocks while possibly allowing their content if the site honors the Do Not Track messages sent by the extension.

Please delete the code written for #141 and try to build a routine that identifies trackers that are not on our list based on the above, proposes the most appropriate category (content, advertising, etc.) and lets the admin change that category. Check that the list does not become so big that it uses too much memory.

This routine should also identify trackers that are included in our list but not in ABP, and ask the admin whether we should keep or delete them.
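The comparison described above could be sketched roughly as follows. This is only an illustration in Python, not code from the webapp: the function name `diff_trackers`, the `TrackerDiff` structure and the `"uncategorized"` fallback are all hypothetical, and it assumes both lists have already been reduced to plain host names.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TrackerDiff:
    new: Dict[str, str]   # host -> proposed category, for the admin to review
    missing: List[str]    # hosts on our list that the upstream list lacks


def diff_trackers(
    ours: Dict[str, str],
    upstream: Dict[str, Optional[str]],
    default_category: str = "uncategorized",  # hypothetical fallback label
) -> TrackerDiff:
    """Compare our host -> category list against an upstream list.

    Hosts we already have keep our own classification and are never
    overwritten; only hosts unknown to us get a proposed category.
    """
    new = {
        host: (category or default_category)
        for host, category in upstream.items()
        if host not in ours
    }
    missing = sorted(host for host in ours if host not in upstream)
    return TrackerDiff(new=new, missing=missing)
```

Both results are meant as proposals only: the admin would confirm or change each proposed category in `new`, and decide whether to keep or delete each host in `missing`.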

Other lists we may look at are:

JorgelieHD commented 8 years ago

@marcosmenendez, @atrandafir and I have been looking carefully into this issue.

We reached some conclusions:

  1. The sources you pointed out are not very useful. They are not purely host-based; they also match on regular expressions and files. Since the TGD extension is only host-based, detecting new domains from them would need different code, and that represents an important workload.
  2. Some entries use only a domain, but they do not provide any information to identify categories.
  3. I've been looking into the disconnect.me list and its size differs from TGD's. The TGD list is almost twice the size of the one at https://github.com/disconnectme/disconnect-tracking-protection. This makes me think that either this list is not up to date or there is another one. If we decide to work with disconnect.me we have to be sure that it is a reliable source, because I haven't found another list that seems more up to date.
  4. Keep in mind that currently the webapp does not handle any of this; the list is a file in the TGD extension. Every time the extension detects a domain that is in this file, it sends the domain through the API controller and it is then stored in the webapp's DB. So full automation is almost impossible; the only way would be to update this file through the webapp API by comparing TGD's list (previously stored in the DB) with the disconnect.me list.
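To illustrate point 1, here is a rough Python sketch of how the purely host-based entries could be pulled out of an Adblock Plus style list such as EasyPrivacy, skipping everything that depends on paths or patterns. It relies on the Adblock Plus filter syntax, where `||` anchors a rule to a domain and `^` is a separator. The function name `host_only_rules` is hypothetical, and treating `$third-party` as the only acceptable option is an assumption made to stay conservative.

```python
import re

# Matches rules of the form ||host^ or ||host^$third-party and nothing else.
HOST_RULE = re.compile(
    r"^\|\|([a-z0-9-]+(?:\.[a-z0-9-]+)+)\^(?:\$third-party)?$"
)


def host_only_rules(lines):
    """Return the set of hosts from rules that block a whole host.

    Comments (!), the list header ([...]), exception rules (@@) and any
    rule containing a path, wildcard or other option are ignored.
    """
    hosts = set()
    for raw in lines:
        line = raw.strip().lower()
        if not line or line.startswith(("!", "[", "@@")):
            continue
        match = HOST_RULE.match(line)
        if match:
            hosts.add(match.group(1))
    return hosts
```

The resulting host set carries no category information, so the lack of categories noted in point 2 still applies to anything extracted this way.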

We have some ideas on how to do this, but looking at the disconnect.me GitHub repository I realized that there haven't been any updates since two days after it was deployed (almost 3 months ago). So, if we decide to update our list through the API by comparing it with the disconnect.me list and that list is not updated regularly, this work might be a waste of time.