tweaselORG / meta

(Currently) only used for the issue tracker.

Create public request database #33

Closed · baltpeter closed this 10 months ago

baltpeter commented 1 year ago

The barrier to entry for looking into how individual trackers work is currently very high. To reliably reverse-engineer endpoints, you need a large dataset of (mobile app) traffic with hundreds of thousands of requests from thousands of apps. Even though our tools and libraries make it easier to collect such a dataset, that still requires significant prior effort, and newcomers won't have access to historical data either way. That is obviously not in our interest: our goal is to shine a light on how trackers work and what they collect, and as such we of course want as many people as possible researching them.

This is also relevant for TrackHAR. We need to create documentation on why/how we have concluded what certain values transmitted to a tracking endpoint mean (https://github.com/tweaselORG/TrackHAR/issues/9). To be able to do that in a way that is replicable by others, we need to be able to reference the actual requests we have based our conclusions on.

As such, we want to create a public repository/database of tracking requests. We already have multiple large datasets of mobile tracking requests from previous research projects and will be continuously collecting new requests through our platform.

Finally, this will also greatly benefit trackers.tweasel.org. We already have static example values for the existing adapters from one of our datasets (https://github.com/tweaselORG/tracker-wiki/issues/3). With a public request database, we could compute this data on the fly and even provide a long list of real example requests for each adapter.

baltpeter commented 1 year ago

I think Datasette would be a great tool for this. It provides a really nice web interface, exports in common formats, and APIs for any SQLite database.

Plugins we may want to consider:

baltpeter commented 1 year ago

All our current datasets are in PostgreSQL databases. How can I dump those to SQLite? Other than Stack Overflow answers suggesting to modify the pg_dump output to remove unsupported statements, I've found two tools: https://github.com/simonw/db-to-sqlite and https://github.com/caiiiycuk/postgresql-to-sqlite

I tried db-to-sqlite to make an SQLite database out of one view in the database from my master's thesis:

pipx install 'db-to-sqlite[postgresql]'
db-to-sqlite "postgresql://<user>:<password>@<host>:<port>/<database>" ma.db --table filtered_requests

That produced a ~335MB file after 1.5mins. I opened that with Datasette:

datasette ma.db

And everything is working as expected. But queries take ~1s.

EDIT: That's not quite true. For some reason, listing the first page takes almost a second, but subsequent pages take only ~50ms. Even a query like select * from filtered_requests where host like '%adcolony%'; that returns 1000 results only takes 52ms. So, I don't think that this is an issue.

baltpeter commented 1 year ago

One thing we need to keep in mind: In rowid tables, Datasette (expectedly) uses the rowid to identify (and, importantly, link to) individual rows. However, those are not stable:

If the rowid is not aliased by INTEGER PRIMARY KEY then it is not persistent and might change. In particular the VACUUM command will change rowids for tables that do not declare an INTEGER PRIMARY KEY. Therefore, applications should not normally access the rowid directly, but instead use an INTEGER PRIMARY KEY. https://www.sqlite.org/rowidtable.html

There are also WITHOUT ROWID tables for which one needs to define an explicit primary key in the create table statement. I tested that with a composite primary key. Datasette handled that just fine. It then uses all elements of the primary key for the URL.
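For illustration, a minimal sketch of the kind of table I tested (hypothetical table and column names):

create table example (
    "dataset" text,
    "id" text,
    "value" text,
    primary key ("dataset", "id")
) without rowid;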

baltpeter commented 1 year ago

So, now we need to decide on a common schema that we can transform the existing datasets to. Luckily, they all use a very similar schema (e.g. https://github.com/baltpeter/thesis-mobile-consent-dialogs/blob/main/schema.sql, https://github.com/datenanfragen/android-data-safety-label-analysis/blob/main/schema.sql, https://github.com/baltpeter/ios-privacy-violations/blob/main/schema.sql).

I think it's a good idea to use TrackHAR's Request type as a base (that in turn was modeled after the previous schemas :D):

export type Request = {
    /** The time when the request was sent. */
    startTime: Date;
    /** The HTTP method used. */
    method: string;
    /** The host name of the request. */
    host: string;
    /** The full path of the request, including the query string. */
    path: string;
    /** The full URL, but without the query string. This is useful for matching in the adapters. */
    endpointUrl: string;
    /** The request body, if any. */
    content?: string;
    /** The port of the request. */
    port: string;
    /** The scheme of the request. */
    scheme: 'http' | 'https';
    /** The HTTP version of the request. */
    httpVersion: string;
    /** The headers included in the request. */
    headers?: { name: string; value: string }[];
    /** The cookies set through the request. */
    cookies?: { name: string; value: string }[];
};

That would mean listing headers and cookies as JSON arrays in the rows instead of having separate tables for them. But to be honest, I prefer that for this use case anyway. It makes the schema simpler, and otherwise you'd pretty much always be JOINing against those auxiliary tables anyway. Also, I'm pretty sure that we won't lose any query features if we install the jq extension I listed above.
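For example (a sketch against a hypothetical requests table, using SQLite's built-in JSON1 functions, assuming headers are stored as a JSON array of {name, value} objects):

-- Find requests that send a User-Agent header and extract its value.
select requests.id, json_extract(h.value, '$.value') as userAgent
from requests, json_each(requests.headers) as h
where json_extract(h.value, '$.name') = 'User-Agent';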

baltpeter commented 1 year ago

We then need to extend that with the information in my filtered_requests view: app ID, app version, platform. I don't think we care about the concept of runs that the existing schemas have. I'm not sure whether we want to encode information about whether a consent dialog was accepted/rejected. I don't really think that matters here.

I would use a composite primary key: The ID from the original dataset (can be whatever) plus a dataset column (we could even have a separate table that has more metadata about the datasets). That way, we also don't need to worry about ID collisions between datasets.

baltpeter commented 1 year ago

We then need to extend that with the information in my filtered_requests view: app ID, app version, platform.

Actually, on second thought, maybe we want to be more flexible. What if we later want to add requests from websites? platform: web works already, but websites don't have an app ID and version. Maybe we have an initiator column instead that is <app ID>@<version> for apps and the URL for websites?

baltpeter commented 1 year ago

That would leave us with something like this:

CREATE TABLE "requests" (
    "id"    TEXT,
    "dataset"   TEXT,
    "initiator" TEXT,
    "platform"  TEXT NOT NULL,
    "runType"   TEXT,
    "startTime" TEXT NOT NULL,
    "method"    TEXT NOT NULL,
    "httpVersion"   TEXT NOT NULL,
    "scheme"    TEXT NOT NULL,
    "host"  TEXT NOT NULL,
    "port"  INTEGER NOT NULL,
    "path"  TEXT NOT NULL,
    "content"   BLOB,
    "headers"   TEXT NOT NULL,
    "cookies"   TEXT NOT NULL,
    PRIMARY KEY("dataset","id")
) WITHOUT ROWID;

Some notes:

baltpeter commented 1 year ago

I think the best way to get the data from the datasets into this format is to construct a temporary view.

First step for my master's thesis:

select filtered_requests.id                                           as id,
       'informed-consent'                                             as dataset,
       filtered_requests.name || '@' || version                       as initiator,
       platform,
       run_type                                                       as runType,
       start_time                                                     as startTime,
       method,
       http_version                                                   as httpVersion,
       scheme,
       host,
       port,
       content_raw                                                    as content,
       coalesce((select json_agg(json_build_object('name', h.name, 'value', h.values[1]))
                 from headers h
                 where h.request = filtered_requests.id), '[]'::json) as headers,
       coalesce((select json_agg(json_build_object('name', c.name, 'value', c.values[1]))
                 from cookies c
                 where c.request = filtered_requests.id), '[]'::json) as cookies
from filtered_requests;

But I still need to add the background traffic with initiator: null.

baltpeter commented 1 year ago

Here's the query I used to create the view that includes both types of requests:

create view informedConsentRequests as

(select filtered_requests.id                                           as id,
        'informed-consent'                                             as dataset,
        filtered_requests.name || '@' || version                       as initiator,
        platform,
        run_type                                                       as runType,
        start_time                                                     as startTime,
        method,
        http_version                                                   as httpVersion,
        scheme,
        host,
        port,
        content_raw                                                    as content,
        coalesce((select json_agg(json_build_object('name', h.name, 'value', h.values[1]))
                  from headers h
                  where h.request = filtered_requests.id), '[]'::json) as headers,
        coalesce((select json_agg(json_build_object('name', c.name, 'value', c.values[1]))
                  from cookies c
                  where c.request = filtered_requests.id), '[]'::json) as cookies
 from filtered_requests)

union all

(select requests.id                                           as id,
        'informed-consent'                                    as dataset,
        null                                                  as initiator,
        platform,
        run_type                                              as runType,
        requests.start_time                                   as startTime,
        method,
        http_version                                          as httpVersion,
        scheme,
        host,
        port,
        content_raw                                           as content,
        coalesce((select json_agg(json_build_object('name', h.name, 'value', h.values[1]))
                  from headers h
                  where h.request = requests.id), '[]'::json) as headers,
        coalesce((select json_agg(json_build_object('name', c.name, 'value', c.values[1]))
                  from cookies c
                  where c.request = requests.id), '[]'::json) as cookies
 from requests
          join runs r on r.id = requests.run
          join apps a on r.app = a.id
 where not exists(select 1 from filtered_requests where requests.id = filtered_requests.id));

The subqueries for the headers and cookies make this very slow but that's not too much of a problem since I only want to export the full view once and then never touch it again.

baltpeter commented 1 year ago

I severely underestimated how long this was going to take. The db-to-sqlite command ran for 6.5 hours! :o

In the meantime, I tried creating a second materialized view hoping that that would be faster, but I aborted that after almost two hours when db-to-sqlite had finished.

I have no idea what is going on. The two queries making up the union "only" take less than two minutes each to run. How can combining them into a view take this ridiculously long?

baltpeter commented 1 year ago

Okay, I think I've found the source of my confusion. I'm pretty sure DataGrip helpfully appended an automatic limit 501 to my select queries because it only shows 500 rows in the preview anyway (and it didn't show me a row count).

Even if I manually add a limit 1000, the query takes the same time, it only shows me 500 rows, and the row count is displayed as 501+. If I click the "next page" button, loading takes another two minutes. That confirms my theory. Pretty unintuitive design here, DataGrip…

baltpeter commented 1 year ago

Okay. But if computing 500 rows already takes 2 mins, I shouldn't be surprised that 220k rows take a long, long time. :D

So, the subqueries are a lot worse than I figured. But thinking about it more, this does actually make sense. There are 1.6M header rows and 150k cookie rows, and it has to go through all of those for each of the 220k request rows…

I confirmed this by running the query with EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS, FORMAT JSON) (with a limit 501 — didn't want to wait another couple of hours :D) and loading the result into explain.dalibo.com: https://explain.dalibo.com/plan/d6b7gc26a17c54a5
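For reference, that looked roughly like this (sketch, selecting from the view created above):

explain (analyze, costs, verbose, buffers, format json)
select * from informedConsentRequests
limit 501;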

That clearly shows that the headers and cookies (though the latter much less so) are the problem:

[screenshot: query plan visualization from explain.dalibo.com]

baltpeter commented 1 year ago

My first idea for optimisation was to precompute the headers and cookies JSON like this:

with cookiesJson as (select request, json_agg(json_build_object('name', c.name, 'value', c.values[1])) as json
                     from cookies c
                     group by c.request),
     headersJson as (select request, json_agg(json_build_object('name', h.name, 'value', h.values[1])) as json
                     from headers h
                     group by h.request)
select filtered_requests.id                                                                                  as id,
       'informed-consent'                                                                                    as dataset,
       filtered_requests.name || '@' || version                                                              as initiator,
       platform,
       run_type                                                                                              as runType,
       start_time                                                                                            as startTime,
       method,
       http_version                                                                                          as httpVersion,
       scheme,
       host,
       port,
       content_raw                                                                                           as content,
       coalesce((select json from headersJson where headersJson.request = filtered_requests.id), '[]'::json) as headers,
       coalesce((select json from cookiesJson where cookiesJson.request = filtered_requests.id), '[]'::json) as cookies
from filtered_requests;

However, that didn't help at all. The query still takes the exact same amount of time.

baltpeter commented 1 year ago

I don't really understand why that is. Computing cookiesJson and headersJson is super fast and they only have 16k and 220k rows, respectively, so I would have expected at least some speedup.

Looking at the query plan, it's almost the exact same with just an additional "Subquery Scan" for each of the subqueries. Clearly, I'm misunderstanding something here.

baltpeter commented 1 year ago

It appears as though the common table expressions (the with queries at the top) don't work in the way I expected them to. It seems like they don't precompute those values after all (but instead, they are re-executed for each row?).
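(Since PostgreSQL 12, CTEs that are only referenced once are inlined into the main query rather than computed up front, which would explain this.) An untested sketch of how one could force them to be evaluated once with the MATERIALIZED keyword:

with headersJson as materialized (select request, json_agg(json_build_object('name', h.name, 'value', h.values[1])) as json
                                  from headers h
                                  group by h.request),
     cookiesJson as materialized (select request, json_agg(json_build_object('name', c.name, 'value', c.values[1])) as json
                                  from cookies c
                                  group by c.request)
-- Abbreviated to the JSON columns; the remaining columns stay as before.
select filtered_requests.id,
       coalesce((select json from headersJson where headersJson.request = filtered_requests.id), '[]'::json) as headers,
       coalesce((select json from cookiesJson where cookiesJson.request = filtered_requests.id), '[]'::json) as cookies
from filtered_requests;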

If I instead make them into materialized views like this:

create materialized view cookiesJson as (select request, json_agg(json_build_object('name', c.name, 'value', c.values[1])) as json
                     from cookies c
                     group by c.request);
create materialized view headersJson as (select request, json_agg(json_build_object('name', h.name, 'value', h.values[1])) as json
                     from headers h
                     group by h.request);

select filtered_requests.id                                                                                  as id,
       'informed-consent'                                                                                    as dataset,
       filtered_requests.name || '@' || version                                                              as initiator,
       platform,
       run_type                                                                                              as runType,
       start_time                                                                                            as startTime,
       method,
       http_version                                                                                          as httpVersion,
       scheme,
       host,
       port,
       content_raw                                                                                           as content,
       coalesce((select json from headersJson where headersJson.request = filtered_requests.id), '[]'::json) as headers,
       coalesce((select json from cookiesJson where cookiesJson.request = filtered_requests.id), '[]'::json) as cookies
from filtered_requests limit 501;

Now the select only takes 16s instead of 2mins. Still not great, but a heck of a lot better.

15s * (220000 rows / 500 rows) = 6600 s = 110 mins

I'm sure this can be improved a lot more. We should probably be using joins instead of those subqueries and then group by filtered_requests.id. But then Postgres complains about the other columns not appearing in the group by. And if I add all those, the row count differs from what I expect.

Whatever. Considering how few datasets in that legacy format we need to import, I think I'll just take the ~80% improvement. Those also have fewer requests.
And I value my productive time a lot more than my PC's compute time.
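For the record, here's an untested sketch of the join-based variant against the materialized views, which avoids the group by entirely (abbreviated to the JSON columns):

select filtered_requests.id,
       coalesce(h.json, '[]'::json) as headers,
       coalesce(c.json, '[]'::json) as cookies
from filtered_requests
         left join headersJson h on h.request = filtered_requests.id
         left join cookiesJson c on c.request = filtered_requests.id;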

baltpeter commented 1 year ago

One annoyance of the schema we ended up with: Because content is now a blob, Datasette displays it like this even if the value is actually plaintext:

[screenshot: Datasette rendering the content column as binary data]

In the old schema, we had both a content (text) and a content_raw (blob) column. But I really don't think that should be necessary.

DB Browser for SQLite displays this as I would expect:

[screenshot: DB Browser for SQLite rendering the same content as text]

baltpeter commented 1 year ago

I thought that SQLite's weak type system might help us here:

A column with TEXT affinity stores all data using storage classes NULL, TEXT or BLOB. If numerical data is inserted into a column with TEXT affinity it is converted into text form before being stored. https://www.sqlite.org/datatype3.html

But alas, it doesn't. If I alter the content column to be text instead, everything is still fine in DB Browser for SQLite but Datasette unfortunately also still renders everything as a blob.

baltpeter commented 1 year ago

But that's okay. If I install the datasette-render-binary plugin (pipx inject datasette datasette-render-binary), we now get this view:

[screenshot: the content column rendered by datasette-render-binary]

Still not ideal (because now the content column doesn't wrap and is ridiculously long) but we can fix this with CSS or our own custom plugin. Either way, not a priority right now.

baltpeter commented 1 year ago

Here's how I exported the dataset from the data safety label analysis.

First I created the views:

create materialized view cookiesJson as (
    select request, json_agg(json_build_object('name', c.name, 'value', c.values[1])) as json from cookies c group by c.request
);
create materialized view headersJson as (
    select request, json_agg(json_build_object('name', h.name, 'value', h.values[1])) as json from headers h group by h.request
);

create materialized view "worryingConfessionsRequests" as

(select filtered_requests.id                                                                                  as id,
        'worrying-confessions'                                                                                         as dataset,
        filtered_requests.name || '@' || version                                                              as initiator,
        'android'                                                                                             as platform,
        'no-interaction'                                                                                      as "runType",
        start_time                                                                                            as "startTime",
        method,
        http_version                                                                                          as "httpVersion",
        scheme,
        host,
        port,
        path,
        content_raw                                                                                           as content,
        coalesce((select json from headersJson where headersJson.request = filtered_requests.id),
                 '[]'::json)                                                                                  as headers,
        coalesce((select json from cookiesJson where cookiesJson.request = filtered_requests.id), '[]'::json) as cookies
 from filtered_requests)

union all

(select requests.id                                                                                  as id,
        'worrying-confessions'                                                                                as dataset,
        null                                                                                         as initiator,
        'android'                                                                                    as platform,
        'no-interaction'                                                                             as "runType",
        requests.start_time                                                                          as "startTime",
        method,
        http_version                                                                                 as "httpVersion",
        scheme,
        host,
        port,
        path,
        content_raw                                                                                  as content,
        coalesce((select json from headersJson where headersJson.request = requests.id), '[]'::json) as headers,
        coalesce((select json from cookiesJson where cookiesJson.request = requests.id), '[]'::json) as cookies
 from requests
          join runs r on r.id = requests.run
          join apps a on r.app = a.id
 where not exists(select 1 from filtered_requests where requests.id = filtered_requests.id));

This only took 23s (but that dataset also only has 20k requests).

Then I exported that view to an SQLite database:

db-to-sqlite "postgresql://dsl:<pw>@localhost:5432/dsl" worrying-confessions.db --table worryingConfessionsRequests --progress

Finally, I renamed the exported table to requests (the idea is that we'll have one SQLite database per dataset with a common format and then we later combine those into a single database for Datasette).
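The rename itself is a one-liner (sketch):

alter table "worryingConfessionsRequests" rename to requests;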

baltpeter commented 1 year ago

And for the iOS analysis:

create materialized view "iosWatchingYouRequests" as

(select filtered_requests.id                                                                                  as id,
        'ios-watching-you'                                                                                    as dataset,
        a.name || '@' || a.version                                                                            as initiator,
        'ios'                                                                                                 as platform,
        'no-interaction'                                                                                      as "runType",
        filtered_requests.start_time                                                                          as "startTime",
        method,
        http_version                                                                                          as "httpVersion",
        scheme,
        host,
        port,
        path,
        content_raw                                                                                           as content,
        coalesce((select json from headersJson where headersJson.request = filtered_requests.id),
                 '[]'::json)                                                                                  as headers,
        coalesce((select json from cookiesJson where cookiesJson.request = filtered_requests.id), '[]'::json) as cookies
 from filtered_requests
          join runs r on r.id = filtered_requests.run
          join apps a on r.app = a.name)

union all

(select requests.id                                                                                  as id,
        'ios-watching-you'                                                                           as dataset,
        null                                                                                         as initiator,
        'ios'                                                                                        as platform,
        'no-interaction'                                                                             as "runType",
        requests.start_time                                                                          as "startTime",
        method,
        http_version                                                                                 as "httpVersion",
        scheme,
        host,
        port,
        path,
        content_raw                                                                                  as content,
        coalesce((select json from headersJson where headersJson.request = requests.id), '[]'::json) as headers,
        coalesce((select json from cookiesJson where cookiesJson.request = requests.id), '[]'::json) as cookies
 from requests
 where not exists(select 1 from filtered_requests where requests.id = filtered_requests.id));

Followed by:

db-to-sqlite "postgresql://ios:<pw>@localhost:5433/ios" ios-watching-you.db --table iosWatchingYouRequests --progress

Here, creating the view took a bit longer (2.5 mins for 41k requests).

And finally, I again renamed the table to requests.

baltpeter commented 1 year ago

And finally, the first Android analysis.

create materialized view cookiesJson as (
    select request, json_agg(json_build_object('name', c.name, 'value', c.values[1])) as json from cookies c group by c.request
);
create materialized view headersJson as (
    select request, json_agg(json_build_object('name', h.name, 'value', h.values[1])) as json from headers h group by h.request
);

create materialized view "doTheyTrackRequests" as

(select filtered_requests.id                                                                                  as id,
        'do-they-track'                                                                                       as dataset,
        a.name || '@' || a.version                                                                            as initiator,
        'android'                                                                                             as platform,
        'no-interaction'                                                                                      as "runType",
        filtered_requests.start_time                                                                          as "startTime",
        method,
        null                                                                                                  as "httpVersion",
        null                                                                                                  as scheme,
        host,
        null                                                                                                  as port,
        path,
        content_raw                                                                                           as content,
        coalesce((select json from headersJson where headersJson.request = filtered_requests.id),
                 '[]'::json)                                                                                  as headers,
        coalesce((select json from cookiesJson where cookiesJson.request = filtered_requests.id), '[]'::json) as cookies
 from filtered_requests
          join runs r on r.id = filtered_requests.run
          join apps a on r.app = a.name)

union all

(select requests.id                                                                                  as id,
        'do-they-track'                                                                              as dataset,
        null                                                                                         as initiator,
        'android'                                                                                    as platform,
        'no-interaction'                                                                             as "runType",
        requests.start_time                                                                          as "startTime",
        method,
        null                                                                                         as "httpVersion",
        null                                                                                         as scheme,
        host,
        null                                                                                         as port,
        path,
        content_raw                                                                                  as content,
        coalesce((select json from headersJson where headersJson.request = requests.id), '[]'::json) as headers,
        coalesce((select json from cookiesJson where cookiesJson.request = requests.id), '[]'::json) as cookies
 from requests
 where not exists(select 1 from filtered_requests where requests.id = filtered_requests.id));

For that one, we unfortunately hadn't quite settled on the final schema yet and weren't collecting HTTP version, scheme, and port. I had to set those to null.

For 68k requests, creating the view took a little more than 9 mins.

Then:

db-to-sqlite "postgresql://projprakt:<pw>@localhost:5431/projprakt" do-they-track.db --table doTheyTrackRequests --progress

And finally, I again renamed the table to requests.

baltpeter commented 1 year ago

The Postgres views threw away my casing and made the column names lower-case because I didn't put them in double quotes. I manually fixed that in the SQLite tables.
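For the record, that boils down to a few column renames per export, along the lines of (requires SQLite 3.25 or later):

alter table requests rename column "runtype" to "runType";
alter table requests rename column "starttime" to "startTime";
alter table requests rename column "httpversion" to "httpVersion";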

baltpeter commented 1 year ago

Oh no. I forgot to export the path column. :/

baltpeter commented 1 year ago

Okay, I've recreated the databases above and updated the comments. The one for my master's thesis is still exporting. But I'll use the chance to also document the SQL for that one:

create materialized view cookiesJson as (select request, json_agg(json_build_object('name', c.name, 'value', c.values[1])) as json
                     from cookies c
                     group by c.request);
create materialized view headersJson as (select request, json_agg(json_build_object('name', h.name, 'value', h.values[1])) as json
                     from headers h
                     group by h.request);

create materialized view "informedConsentRequests" as

(select filtered_requests.id                                           as id,
        'informed-consent'                                             as dataset,
        filtered_requests.name || '@' || version                       as initiator,
        platform,
        run_type                                                       as "runType",
        start_time                                                     as "startTime",
        method,
        http_version                                                   as "httpVersion",
        scheme,
        host,
        port,
        path,
        content_raw                                                    as content,
        coalesce((select json from headersJson where headersJson.request = filtered_requests.id), '[]'::json) as headers,
        coalesce((select json from cookiesJson where cookiesJson.request = filtered_requests.id), '[]'::json) as cookies
 from filtered_requests)

union all

(select requests.id                                           as id,
        'informed-consent'                                    as dataset,
        null                                                  as initiator,
        platform,
        run_type                                              as "runType",
        requests.start_time                                   as "startTime",
        method,
        http_version                                          as "httpVersion",
        scheme,
        host,
        port,
        path,
        content_raw                                           as content,
        coalesce((select json from headersJson where headersJson.request = requests.id), '[]'::json) as headers,
        coalesce((select json from cookiesJson where cookiesJson.request = requests.id), '[]'::json) as cookies
 from requests
          join runs r on r.id = requests.run
          join apps a on r.app = a.id
 where not exists(select 1 from filtered_requests where requests.id = filtered_requests.id));

And in the SQLite table, I normalized the runType names:

update requests set runType = 'no-interaction' where runType = 'initial';
update requests set runType = 'consent-dialog-accepted' where runType = 'accepted';
update requests set runType = 'consent-dialog-rejected' where runType = 'rejected';

baltpeter commented 1 year ago

We want to combine those databases into one. We start by opening a new database file:

sqlite3 data.db

Load the sqlite-regex extension (wget -O regex0.so https://github.com/asg017/sqlite-regex/releases/download/v0.2.2/linux-x86_64-regex0.so):

.load ./regex0

Then, we can attach all databases:

attach 'do-they-track.db' as doTheyTrack;
attach 'informed-consent.db' as informedConsent;
attach 'ios-watching-you.db' as iosWatchingYou;
attach 'worrying-confessions.db' as worryingConfessions;
attach 'monkey-july-2023.db' as monkeyJuly2023;

Create the table:

create table "requests" (
    "id" text,
    "dataset" text,
    "initiator" text,
    "platform" text not null,
    "runType" text,
    "startTime" text not null,
    "method" text not null,
    "httpVersion" text,
    "endpointUrl" text,
    "scheme" text,
    "host" text not null,
    "port" integer,
    "path" text not null,
    "content" blob,
    "headers" text not null,
    "cookies" text not null,
    primary key("dataset", "id")
) without rowid;

And insert all the data:

insert into requests
    select id, dataset, initiator, platform, runType, startTime, method, httpVersion, endpointUrl, scheme, host, port, path, content, headers, cookies from (
        with vendors as materialized (
            select id, dataset, initiator, platform, runType, startTime, method, httpVersion, endpointUrl, scheme, host, port, path, content, headers, cookies,
            case 
                when initiator is null then null
                when instr(initiator,'.') = 0 then initiator
                else regex_replace('\.[^.]+@.+?$', initiator, '')
              end as vendor,
             -- For the `do-they-track` requests, we don't know the scheme. But it really doesn't make sense to consider that for this, anyway.
             regex_replace('\?.+$', host || path, '')  as _endpointUrl
            from (
                select * from monkeyJuly2023.requests
                union all
                select * from worryingConfessions.requests
                union all
                select * from informedConsent.requests
                union all
                select * from iosWatchingYou.requests
                union all
                select * from doTheyTrack.requests
            )
        )

        select * from vendors where
            -- Only include requests that are made to the same endpointUrl by apps from at least two different vendors (https://github.com/tweaselORG/meta/issues/33#issuecomment-1658348929).
            _endpointUrl in (
                select _endpointUrl from vendors group by _endpointUrl having count(distinct vendor) >= 2
                union
                select _endpointUrl from vendors where vendor is null
            )
            -- Filter out iOS system background traffic as that may contain authentication values (https://github.com/tweaselORG/meta/issues/33#issuecomment-1660099572),
            and (not platform = 'ios' or initiator is not null)
    );

EDIT: Updated according to https://github.com/tweaselORG/meta/issues/33#issuecomment-1618651461
EDIT: Updated with data from https://github.com/tweaselORG/experiments/issues/1
EDIT: Updated according to https://github.com/tweaselORG/meta/issues/33#issuecomment-1658348929
EDIT: Updated according to https://github.com/tweaselORG/meta/issues/33#issuecomment-1660099572

baltpeter commented 1 year ago

In https://github.com/tweaselORG/meta/issues/33#issuecomment-1611339125, I decided to not include the endpoint URL in the exports from the datasets and to rather dynamically generate that through a view. The problem is that this needs a regex_replace(), and SQLite doesn't have such a function built in.

So instead, I'll modify the schema as follows:

create table "requests" (
    "id" text,
    "dataset" text,
    "initiator" text,
    "platform" text not null,
    "runType" text,
    "startTime" text not null,
    "method" text not null,
    "httpVersion" text,
    "endpointUrl" text,
    "scheme" text,
    "host" text not null,
    "port" integer,
    "path" text not null,
    "content" blob,
    "headers" text not null,
    "cookies" text not null,
    primary key("dataset", "id")
) without rowid;

And manually populated that for the existing rows by loading sqlite-regex:

wget -O regex0.so https://github.com/asg017/sqlite-regex/releases/download/v0.2.2/linux-x86_64-regex0.so
sqlite3 out.db
> .load ./regex0

And then:

update requests set endpointUrl = regex_replace('\?.+$', scheme || '://' || host || path, '') where scheme is not null;

baltpeter commented 1 year ago

We should also install the sqlite-regex Datasette extension. That will be very useful.

baltpeter commented 1 year ago

And manually populated that for the existing rows [in out.db]

Actually no, that is not a good idea. I should instead do that in the dataset exports. I want those to be immutable blobs and out.db to be something that we can always throw away and easily regenerate. Having to re-add the endpointUrl column every time is too annoying for that.

Instead, I now added the endpointUrl column to each individual dataset export.
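Per dataset export, that amounts to something like this (sketch, with the sqlite-regex extension loaded):

alter table requests add column "endpointUrl" text;
update requests
set endpointUrl = regex_replace('\?.+$', scheme || '://' || host || path, '')
where scheme is not null;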

baltpeter commented 1 year ago

It might be a good idea to start Datasette with --setting truncate_cells_html 100. The pages for individual rows still display the full data.

baltpeter commented 11 months ago

We'll also use --setting facet_time_limit_ms 1000: calculating the interesting facets now takes longer than the default timeout (200 ms), and I do think that the facets are very helpful for getting an overview of the data.

[screenshot: facets shown in Datasette]

baltpeter commented 11 months ago

As we don't care too much about endpoints for individual apps and want to avoid making companies unnecessarily angry about exposing their internal APIs, we have decided to only publish requests to endpoints that are contacted by apps from at least two different vendors. As we only have app IDs, we use Apple's definition for determining the vendor:

[A]ll components of the bundle except for the last component are used to generate the vendor ID. If the bundle ID only has a single component, then the entire bundle ID is used.

Here's the query I came up with (that was quite fiddly and I ended up needing the sqlite-regex extension):

create view "filteredRequests" as
    with vendors as materialized (
        select id, dataset, initiator, platform, runType, startTime, method, httpVersion, endpointUrl, scheme, host, port, path, content, headers, cookies,
        case 
            when initiator is null then null
            when instr(initiator,'.') = 0 then initiator
            else regex_replace('\.[^.]+@.+?$', initiator, '')
          end as vendor,
         -- For the `do-they-track` requests, we don't know the scheme. But it really doesn't make sense to consider that for this, anyway.
         regex_replace('\?.+$', host || path, '')  as _endpointUrl
        from requests
    )

    select * from vendors where _endpointUrl in (
      select _endpointUrl from vendors group by _endpointUrl having count(distinct vendor) >= 2
      union
      select _endpointUrl from vendors where vendor is null
    )
;

baltpeter commented 11 months ago

We'll also need to link our legal notice and privacy policy, of course. We can do that using a custom footer template.

For that, we need to run with --template-dir=templates/.

baltpeter commented 11 months ago

I'll also add a datasets table with a foreign key relationship to requests.dataset with more details on the datasets.

CREATE TABLE "datasets" (
    "slug"  TEXT,
    "title" TEXT NOT NULL,
    "description"   TEXT NOT NULL,
    "url"   TEXT NOT NULL,
    "sourceCodeUrl" TEXT,
    PRIMARY KEY("slug")
) WITHOUT ROWID;

Content so far:

INSERT INTO "datasets" ("slug", "title", "description", "url", "sourceCodeUrl") VALUES ('do-they-track', 'Do they track? Automated analysis of Android apps for privacy violations', 'To get a grasp of how common and extensive data collection in Android apps really is, Malte and Benni developed a series of scripts to download, install and start Android apps in an emulator and to collect and analyse their network traffic.

The apps were run for 60 seconds in an Android 11 emulator without any user input, meaning that no consent was given.', 'https://benjamin-altpeter.de/doc/presentation-android-privacy.pdf', '');
INSERT INTO "datasets" ("slug", "title", "description", "url", "sourceCodeUrl") VALUES ('informed-consent', 'Informed Consent? A Study of “Consent Dialogs” on Android and iOS', 'Consent dialogs have become ubiquitous with seemingly every website and app pleading users to agree to their personal data being processed and their behaviour being tracked, often with the help of tens or even hundreds of third-party companies. For his master’s thesis, Benni studied consent dialogs on Android and iOS in an automated and dynamic manner, analysing 4,388 popular apps from both platforms. Among other things, he measured the effect of the user’s choice in the consent dialog by comparing the traffic from before any interaction with the traffic after accepting and rejecting the dialog and analysing contacted trackers and transmitted data types.

The apps were initially run for 60 seconds without interaction in an Android 11 emulator and on a physical iPhone 7 running iOS 14.8. In the runs of requests with a `runType` of `no-interaction`, nothing else was done. For the ones with `consent-dialog-accepted`, after 60 seconds, a discovered consent dialog was accepted and then they were left running without interaction for another 60 seconds. The same was done for the ones with `consent-dialog-rejected`, but the consent dialog was rejected here.', 'https://benjamin-altpeter.de/doc/thesis-consent-dialogs.pdf', 'https://github.com/baltpeter/thesis-mobile-consent-dialogs');
INSERT INTO "datasets" ("slug", "title", "description", "url", "sourceCodeUrl") VALUES ('ios-watching-you', 'iOS watching you: Automated analysis of “zero-touch” privacy violations under iOS', 'As a follow-up project to the “Do they track?”, Benni also looked at the iOS ecosystem. He ended up analysing 1,001 apps from the top charts of the German App Store as of May 2021.

The apps were run for 60 seconds without interaction on a physical iPhone 8 running iOS 14.5.1.', 'https://benjamin-altpeter.de/doc/presentation-ios-privacy.pdf', 'https://github.com/baltpeter/ios-privacy-violations');
INSERT INTO "main"."datasets" ("slug", "title", "description", "url", "sourceCodeUrl") VALUES ('monkey-july-2023', 'Traffic collection for TrackHAR adapter work (July 2023)', 'For writing new adapters for TrackHAR and properly documenting the old ones, Benni ran a traffic collection on 804 apps on Android and 1062 apps on iOS from the top charts.

The Android apps were run in an Android 11 emulator for 60 seconds, receiving random input from `adb monkey`, as such it is possible/likely that consent was given when requested. The iOS apps were run on a physical iPhone X running iOS 15.6.1 for 60 seconds without any interaction.', 'https://github.com/tweaselORG/experiments/issues/1', 'https://github.com/tweaselORG/experiments/tree/main/monkey-july-2023');
INSERT INTO "datasets" ("slug", "title", "description", "url", "sourceCodeUrl") VALUES ('worrying-confessions', 'Worrying confessions: A look at data safety labels on Android', 'In 2022, the Google Play Store introduced a data safety section in order to give users accessible insights into apps’ data collection practices. To verify the declarations, Benni recorded the network traffic of 500 apps, finding more than one quarter of them transmitting tracking data not declared in their data safety label.

The apps were run for 60 seconds in an Android 11 emulator without any user input.', 'https://www.datarequests.org/blog/android-data-safety-labels-analysis/', 'https://github.com/datenanfragen/android-data-safety-label-analysis');
baltpeter commented 11 months ago

As we had a title column in datasets, Datasette turned requests.dataset into this:

[screenshot: requests.dataset rendered with the full dataset title]

That's a nice feature but annoyingly space-intensive in this case. We can turn that off using a metadata.json (--metadata metadata.json):

{
    "databases": {
        "data": {
            "tables": {
                "datasets": {
                    "label_column": "slug"
                }
            }
        }
    }
}

Now, we still get a link but not the full title in each row:

[screenshot: requests.dataset rendered as a link showing just the slug]

baltpeter commented 11 months ago

I just realized that the iOS datasets also include requests to Apple endpoints with cookies of our actual Apple IDs… :D

baltpeter commented 11 months ago

A time limit of 1 second for queries stops us from doing interesting things. Let's allow ten seconds instead: --setting sql_time_limit_ms 10000

baltpeter commented 11 months ago

I just realized that the iOS datasets also include requests to Apple endpoints with cookies of our actual Apple IDs… :D

I tried going to https://appleid.apple.com/account/manage and changing my password with the "Sign out of Apple devices and websites associated with your Apple ID." checkbox enabled. And while the iPhones now both do pester me to sign in again and many endpoints do return a 401, laughably I can still do quite a lot with the old authentication tokens, including downloading apps and even buying new (free) ones!

baltpeter commented 11 months ago

Going through the same process again didn't help either.

baltpeter commented 11 months ago

Looks like the only choice we have is to filter out these requests. From looking through the data for a while, I'm pretty confident that using my filtered requests view for that is sufficient.
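Concretely, that means excluding iOS requests without an initiator (i.e. system background traffic). As a standalone filter, that is roughly (sketch):

select * from requests
where not (platform = 'ios' and initiator is null);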

baltpeter commented 11 months ago

I have written a quick script for creating the DB. That seems better than always updating that comment. :D

baltpeter commented 11 months ago

And everything is working as expected. But queries take ~1s.

EDIT: That's not quite true. For some reason, listing the first page takes almost a second, but subsequent pages take only ~50ms. Even a query like select * from filtered_requests where host like '%adcolony%'; that returns 1000 results only takes 52ms. So, I don't think that this is an issue.

I've found what the problem was here: it was the suggested facets. If I set "suggest_facets": false, requests for the first page also only take ~40ms. And the suggested facets really weren't helpful anyway (for requests, it only suggested faceting by startTime…).

baltpeter commented 11 months ago

I've started setting up the instance on an Uberspace. Unfortunately, I ran into a few problems:

For now, I'll just live with that and remove mentions of the broken features from the homepage. @zner0L can decide what (if anything) we do about this.

baltpeter commented 11 months ago
  • We cannot use the datasette-sqlite-regex extension (which is very useful).

That is an even bigger problem than I initially thought. It's not just annoying for interactive use but it also means that we can't query for requests that match an adapter (since the endpoint URLs there can be regexes).
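For example, a query like the following (with a hypothetical adapter endpoint regex, and assuming the extension's regexp() function to enable the REGEXP operator) only works with the extension loaded:

select * from requests
where endpointUrl regexp '^https://[a-z0-9.-]+\.example-tracker\.com/v[0-9]+/track';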

baltpeter commented 11 months ago

Two more useful extensions we could install (but which probably won't work on Uberspace, either):

baltpeter commented 10 months ago

There were a few things that bugged me about how our database was rendered by Datasette that I fixed through two custom plugins:

baltpeter commented 10 months ago

Another thing I noticed while doing adapter work: It is really annoying not to have an endpoint URL for all requests. Thus, I decided to backfill that for the do-they-track requests, after all: https://github.com/tweaselORG/data.tweasel.org/commit/ff502c5662fccdd7d80e532a51d3ee1b9df60853
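A sketch of what that backfill looks like (the actual change is in the linked commit; since do-they-track has no scheme, only host and path can be used):

update requests
set endpointUrl = regex_replace('\?.+$', host || path, '')
where dataset = 'do-they-track';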

baltpeter commented 10 months ago

For now, I'll just live with that and remove mentions of the broken features from the homepage. @zner0L can decide what (if anything) we do about this.

We've decided to move to a Hetzner server.

baltpeter commented 10 months ago

Done. https://data.tweasel.org/ is now running on Hetzner, and all extensions work.