tweaselORG / ReportHAR

Library for generating technical reports, notices, and GDPR complaints concerning tracking by mobile apps for the tweasel project.
MIT License

Complaint request filter: Match against endpoint URLs from adapter instead of hostname from HAR #6

Open baltpeter opened 6 months ago

baltpeter commented 6 months ago

In e9e52478e3572445357488ff81964b2f56a52c1f, I implemented a filter that only includes requests to servers that the user's device also provably (through Tracker Control/the App Privacy Report) contacted.

I am currently doing that by checking the request's hostname from the HAR against the hostnames in the TC/APR export.

Instead of the HAR hostname, I think we should be checking against all endpoint URLs that the corresponding adapter accepts.

Imagine a tracking endpoint https://api\d.tracker.tld/ingest. If during our analysis, we happened to find requests to https://api2.tracker.tld/ingest but the user's device happened to use https://api5.tracker.tld/ingest instead, we would currently exclude those requests.
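
To make the problem concrete, here is a minimal sketch of a hostname-based filter like the one described above. The `HarRequest` type and the `filterByHarHostname` function are hypothetical names for illustration and are not taken from the actual commit.

```typescript
// Hypothetical request shape; the real types in ReportHAR may differ.
type HarRequest = { url: string };

// Keep a request only if its exact hostname appears in the hostnames
// exported from Tracker Control/the App Privacy Report.
const filterByHarHostname = (requests: HarRequest[], exportedHostnames: string[]): HarRequest[] => {
    const known = new Set(exportedHostnames.map((h) => h.toLowerCase()));
    return requests.filter((r) => known.has(new URL(r.url).hostname));
};

// Our analysis saw api2, the user's device contacted api5.
const analysisRequests: HarRequest[] = [{ url: 'https://api2.tracker.tld/ingest' }];
const userExport = ['api5.tracker.tld'];

const kept = filterByHarHostname(analysisRequests, userExport);
console.log(kept.length); // 0, even though both hosts belong to the same endpoint
```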

However, implementing it this way is surprisingly hard. We only get a hostname from the TC/APR export. Meanwhile, our adapters' endpoint URLs can be strings or regexes of full URLs.

How would we check whether android2-ads.adcolony.com matches /^https:\/\/(android|ios)?ads\d-?\d\.adcolony\.com\/configure$/? Maybe I'm missing something, but I really can't see an automated way that isn't hacky and error-prone.

I feel like the only (proper) way to implement this change would be to also manually add a hosts array to each adapter in TrackHAR.
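
A rough sketch of what that could look like, assuming a `hosts` property that accepts strings or regexes. The `AdapterWithHosts` type, the `hosts` field, and the example adapter are all made up for illustration; TrackHAR adapters do not currently have this property.

```typescript
// Hypothetical: a host is either an exact hostname or a regex over the hostname.
type HostPattern = string | RegExp;
type AdapterWithHosts = { slug: string; hosts: HostPattern[] };

// Regexes are expected to be anchored and to have no `g`/`y` flag,
// since `test()` is stateful with those flags.
const hostMatches = (pattern: HostPattern, hostname: string): boolean => {
    const host = hostname.toLowerCase();
    return typeof pattern === 'string' ? pattern.toLowerCase() === host : pattern.test(host);
};

// Did the user's device contact any host that this adapter covers?
const adapterWasContacted = (adapter: AdapterWithHosts, exportedHostnames: string[]): boolean =>
    exportedHostnames.some((h) => adapter.hosts.some((p) => hostMatches(p, h)));

// Made-up adapter for the example endpoint from above.
const exampleAdapter: AdapterWithHosts = { slug: 'tracker-ingest', hosts: [/^api\d\.tracker\.tld$/] };

console.log(adapterWasContacted(exampleAdapter, ['api5.tracker.tld'])); // true
console.log(adapterWasContacted(exampleAdapter, ['api5.tracker.tld.example.org'])); // false
```

The filter would then keep all requests handled by an adapter for which `adapterWasContacted` is true, regardless of which concrete hostname showed up in our HAR.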

baltpeter commented 2 weeks ago

@zner0L What do you think?

zner0L commented 6 days ago

I just thought of creating a kind of Endpoint object that would let us decompose the endpoint URL, similar to JavaScript's URL object. For regexes, we could combine a host regex with a path and protocol component (like this: https://stackoverflow.com/questions/9213237/combining-regular-expressions-in-javascript).
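
Something along these lines might work as a starting point. This is only a sketch: the `Endpoint` type and the helper names are hypothetical, regex flags on the parts are ignored, and the parts are assumed not to contain their own `^`/`$` anchors.

```typescript
// Hypothetical decomposed endpoint: each part is a literal string or a regex.
type Part = string | RegExp;
type Endpoint = { protocol: 'http' | 'https'; host: Part; path: Part };

// Escape a literal string so that it can be embedded in a regex.
const escapeRegExp = (s: string): string => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Turn a part into regex source; regex parts are wrapped in a non-capturing group.
const partSource = (p: Part): string => (typeof p === 'string' ? escapeRegExp(p) : `(?:${p.source})`);

// Full-URL regex, for matching requests from the HAR.
const endpointUrlRegex = (e: Endpoint): RegExp =>
    new RegExp(`^${e.protocol}:\\/\\/${partSource(e.host)}${partSource(e.path)}$`);

// Host-only regex, for matching hostnames from the TC/APR export.
const endpointHostRegex = (e: Endpoint): RegExp => new RegExp(`^${partSource(e.host)}$`);

// The example endpoint from the issue description.
const ingest: Endpoint = { protocol: 'https', host: /api\d\.tracker\.tld/, path: '/ingest' };

console.log(endpointUrlRegex(ingest).test('https://api2.tracker.tld/ingest')); // true
console.log(endpointHostRegex(ingest).test('api5.tracker.tld')); // true
```

That way, the adapter would only need to declare the endpoint once, and both the full-URL check and the hostname check could be derived from it.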