tweaselORG / experiments

Smaller one-off experiments/research projects that don't warrant their own repo.

First traffic collection on the web #3

Open baltpeter opened 2 weeks ago

baltpeter commented 2 weeks ago

So far, all of our traffic collections have been about mobile apps on Android and iOS. We are now working on extending Tweasel for the web, so we also need data on tracking requests on the web.

baltpeter commented 1 week ago

The first decision we need to make is which websites we want to visit.

After looking at the lists that Tranco aggregates, I think the Chrome User Experience Report (CrUX) is going to be the best fit for our use case.

The other lists (and thus, Tranco itself) are too biased towards DNS lookups rather than visits to actual websites, which is what we are interested in. For example, at least 12 of the top 25 entries in the current Tranco list are CDNs, DNS servers, etc., but not websites:

amazonaws.com
akamai.net
a-msedge.net
root-servers.net
akamaiedge.net
gstatic.com
tiktokcdn.com
googletagmanager.com
googlevideo.com
gtld-servers.net
akadns.net
windowsupdate.com
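
A rough sketch of how such a count could be reproduced, assuming a Tranco CSV download in its usual headerless `rank,domain` format. The function name `infrastructure_rows` is made up for this example, the set of domains is simply the manually identified list above, and the sample rows use invented ranks rather than real Tranco positions.

```python
import csv
import io

# The twelve non-website domains listed above (identified by hand).
INFRASTRUCTURE_DOMAINS = {
    "amazonaws.com", "akamai.net", "a-msedge.net", "root-servers.net",
    "akamaiedge.net", "gstatic.com", "tiktokcdn.com", "googletagmanager.com",
    "googlevideo.com", "gtld-servers.net", "akadns.net", "windowsupdate.com",
}


def infrastructure_rows(tranco_csv, top_n=25):
    """Return (rank, domain) pairs within the top_n rows that are in the set.

    Expects rows sorted by ascending rank, as in a Tranco download.
    """
    hits = []
    for rank, domain in csv.reader(tranco_csv):
        if int(rank) > top_n:
            break  # everything after this row is outside the top_n
        if domain in INFRASTRUCTURE_DOMAINS:
            hits.append((int(rank), domain))
    return hits


# Invented ranks for illustration only, not the real Tranco positions.
sample = io.StringIO(
    "1,example.com\n2,amazonaws.com\n3,example.org\n4,gstatic.com\n"
)
print(infrastructure_rows(sample, top_n=3))  # [(2, 'amazonaws.com')]
```

This only finds domains that were already classified by hand; it does not detect infrastructure domains on its own.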

As far as accessing the CrUX data goes, there is https://github.com/crissyfield/crux-dumps, which has the very laudable goal of relieving you from having to deal with BigQuery. :D

I looked at the global top 10k as of 2024/08. Unfortunately, I don't think that is going to be helpful for our use case, either, because it contains many country-specific variants of the same site. For example, look at the included entries for www.google.*:

https://www.google.bg
https://www.google.ch
https://www.google.co.il
https://www.google.co.nz
https://www.google.co.za
https://www.google.com.eg
https://www.google.com.my
https://www.google.com.pk
https://www.google.com.sa
https://www.google.com.sg
https://www.google.com.ua
https://www.google.dk
https://www.google.fi
https://www.google.hr
https://www.google.ie
https://www.google.sk
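
Country variants like the ones above can be spotted with a crude grouping, sketched below. This is only a heuristic under a loud assumption: it groups origins by the first hostname label after an optional `www.`, so it would also lump together unrelated sites that share a name and would treat subdomains as separate sites. A proper solution would need the Public Suffix List. The function name `group_by_site_label` is made up for this example.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_site_label(origins):
    """Group origins by the first hostname label after an optional "www."."""
    groups = defaultdict(list)
    for origin in origins:
        host = urlsplit(origin).hostname or ""
        if host.startswith("www."):
            host = host[len("www."):]
        # Heuristic: "google.co.il" and "google.dk" both map to "google".
        label = host.split(".", 1)[0]
        groups[label].append(origin)
    return dict(groups)


# A few of the origins listed above.
origins = [
    "https://www.google.bg",
    "https://www.google.co.il",
    "https://www.google.com.eg",
    "https://www.google.dk",
]
groups = group_by_site_label(origins)
print(len(groups["google"]))  # 4
```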

So, I had to use BigQuery after all to grab the top 10k for Germany using the following query:

SELECT DISTINCT origin, experimental.popularity.rank FROM `chrome-ux-report.country_de.202408` WHERE experimental.popularity.rank <= 10000

Dump: bquxjob_5ded4b14_19209284743.json, bquxjob_5ded4b14_19209284743.csv
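
A minimal sketch of turning the CSV dump into a list of origins to visit. It assumes the export has a header row with the columns `origin` and `rank`, which I have not verified against the attached file. The function name `load_origins` and the sample rows are made up for illustration.

```python
import csv
import io


def load_origins(csv_file, max_rank=10000):
    """Return the sorted, de-duplicated origins with rank <= max_rank."""
    reader = csv.DictReader(csv_file)
    origins = {
        row["origin"]
        for row in reader
        if int(row["rank"]) <= max_rank
    }
    return sorted(origins)


# Invented sample rows; the real input would be the exported CSV file.
sample = io.StringIO(
    "origin,rank\n"
    "https://example.de,1000\n"
    "https://example.org,10000\n"
    "https://example.net,50000\n"
)
print(load_origins(sample))  # ['https://example.de', 'https://example.org']
```

With the real dump, the same function could be called on a file opened with `open(path, newline="")`.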

baltpeter commented 5 days ago

The analysis has been running for a few days now. Code is in: https://github.com/tweaselORG/experiments/tree/b_web-monkey-september-2024