the-markup / blacklight-collector

GNU General Public License v3.0

Difference between blacklight website and the results in inspection.json #83

Closed alexnielson closed 6 months ago

alexnielson commented 6 months ago

Hello!

Thank you so much for making this tool and software. It has been really easy to work with, so I wanted to express my gratitude. My team is using it to examine government websites, and we came across a strange issue: the inspection.json and the results shown on the website https://themarkup.org/blacklight do not match. I have read the methodology at https://themarkup.org/blacklight/2020/09/22/how-we-built-a-real-time-privacy-inspector, but cannot find a clear rule that explains the difference on the front-end website.

For example, if you use the website and search for the URL https://www.townoforderville.com/, only one third-party tracker is shown. But if you download the archive, or use this repo to generate an inspection.json, the inspection.json includes 31 elements, and there are 8 unique third-party domains. Is the Blacklight website's front end performing some type of filtering of the inspection.json file?

Thank you,

BatMiles commented 6 months ago

Hey Alex, thanks for reaching out! I'm glad you're getting good use out of Blacklight.

You're correct, there's an extra level of processing between the inspection.json results and what the user sees on The Markup's website. The third_party_trackers array in inspection.json collects all requests the page makes to external sites known to be trackers. Specifically, it includes those scripts that appear on the EasyList and EasyPrivacy lists. The web page reports only ad trackers, which are those trackers in the third_party_trackers list that are categorized as ‘Ad Motivated Tracking’ by DuckDuckGo’s Tracker Radar. Duplicates are then removed.
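For anyone who wants to approximate that filtering locally, here is a minimal Python sketch of the two steps described above (keep only 'Ad Motivated Tracking' domains, then drop duplicates). It is not The Markup's actual front-end code. It assumes each tracker entry is a dict with a `url` field, and that you have already built a `categories_by_domain` mapping from Tracker Radar data yourself. The function names, the mapping, and the `ads.example` / `metrics.example` domains are all hypothetical.

```python
from urllib.parse import urlparse

# Category name quoted in the reply above.
AD_CATEGORY = "Ad Motivated Tracking"


def match_known_domain(hostname, categories_by_domain):
    """Return the longest suffix of `hostname` that is a key in the mapping.

    For "cdn.ads.example" this tries "cdn.ads.example", then "ads.example".
    The bare top-level label is never tried. Returns None if nothing matches.
    """
    labels = hostname.split(".")
    for i in range(len(labels) - 1):
        candidate = ".".join(labels[i:])
        if candidate in categories_by_domain:
            return candidate
    return None


def ad_trackers(tracker_entries, categories_by_domain):
    """Filter tracker entries down to unique ad-motivated tracker domains.

    `tracker_entries` is ASSUMED to be a list of dicts with a "url" key,
    e.g. the contents of the third_party_trackers array.
    `categories_by_domain` is a HYPOTHETICAL mapping of domain -> list of
    category names that you derive from Tracker Radar data.
    """
    seen = set()
    result = []
    for entry in tracker_entries:
        hostname = urlparse(entry["url"]).hostname
        if not hostname:
            continue  # skip entries without a parseable host
        domain = match_known_domain(hostname, categories_by_domain)
        if domain is None or domain in seen:
            continue  # unknown domain, or already reported
        if AD_CATEGORY in categories_by_domain[domain]:
            seen.add(domain)
            result.append(domain)
    return result


# Made-up sample data for illustration only.
sample_entries = [
    {"url": "https://cdn.ads.example/pixel.js"},
    {"url": "https://metrics.example/collect"},
    {"url": "https://ads.example/sync?id=1"},
]
sample_categories = {
    "ads.example": ["Ad Motivated Tracking", "Advertising"],
    "metrics.example": ["Analytics"],
}

print(ad_trackers(sample_entries, sample_categories))
```

With the made-up data above, three raw entries collapse to a single reported domain, which is the same kind of gap as the 31 entries versus one tracker in the original question.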

I'm happy to answer any other questions about Blacklight's workings! Feel free to reach out to blacklight@themarkup.org, which I monitor daily. We also offer an informal API for folks using Blacklight for research or public project purposes. If you're interested, reach out at the Blacklight email.

alexnielson commented 6 months ago

@BatMiles You rock! Thank you for getting back to me so quickly. That makes a lot of sense regarding the filter on the EasyPrivacy and EasyList files. I will definitely reach out via email regarding API access.