Decide on how we want to document things

zner0L commented 1 year ago

I feel like should start documenting as early as possible, so I would like to find a way in which we can document what we are doing already.

zner0L commented 1 year ago

I have been thinking about how to document the research into how to decode the tracking requests. We wanted a tool that allows for quick note taking without too may complication. GitHub issues seemed like an option, but we do not want to make people transmit data to GitHub just to look a our documentation.

I look at some option and though that HedgeDoc might be what we are looking for. It enables easy note taking, supports Markdown and is meant for documentation.

baltpeter commented 1 year ago

I look at some option and though that HedgeDoc might be what we are looking for. It enables easy note taking, supports Markdown and is meant for documentation.

That's a good idea. I've had a quick look and it seems quite well-suited. Two things I'm a bit sad about:

I'd like an automatic list of all documents/page like in a wiki. I guess we can create something like that manually but that's a bit tedious.
There once was a PDF export but it was removed.

is meant for documentation

Is it?

baltpeter commented 1 year ago

We've since talked a little more about this. To document what we came up with:

Research documentation regarding trackers can live in a separate section in tweaselORG/tracker-wiki. We do research in an issue and then dump the thread into a page there (with minimal editing afterwards). We haven't decided yet where these issues should be (in TrackHAR or a separate new repo?). I'm leaning towards TrackHAR.
For API reference docs, we use Typedoc proper using the setup I described in https://github.com/tweaselORG/meta/issues/13. Not sure yet whether we want this on docs.tweasel.org or a separate subdomain.
For all other documentation, we create a separate Hugo site docs.tweasel.org, using the same theme as for tracker-wiki. The nav should be something like:
- General introduction to the project and components
- Usage
  - Setup instructions (essentially the "Host dependencies" and "Device preparation" sections from the current appstraction README + maybe how to install the components). We've now reached a point where these instructions are essentially identical regardless of which module you want to use, so we can luckily moves this out of the READMEs and just link here.
  - CLI tutorial (essentially the DPA handout I wrote minus setup)
  - Common problems (Missing libcrypt.so.1, installation takes a long time, Node version too old, missing udev rules, cert unpinning fails …)
  - …?
- Background (not sure about the section title yet)
  - Overview of how our tools work and what dependencies we use
  - Alternatives (how to manually reproduce what our tools do with third-party tools)
- API reference
  - [links to the API docs for each module]
- [link to tracker-wiki]

baltpeter commented 1 year ago

I've started work on the docs site in https://github.com/tweaselORG/docs.tweasel.org/pull/1.

baltpeter commented 10 months ago

Research documentation regarding trackers can live in a separate section in tweaselORG/tracker-wiki. We do research in an issue and then dump the thread into a page there (with minimal editing afterwards). We haven't decided yet where these issues should be (in TrackHAR or a separate new repo?). I'm leaning towards TrackHAR.

I'm currently working on https://github.com/tweaselORG/TrackHAR/issues/16 still a bit unsure about the workflow, we might need to rethink this. Some concerns I have:

When working on the adapter, I need to provide the reasoning for each data path immediately. But the "exported" page is only created at the end. For now, I'm working around that by linking to the issue comment and will then later have to go through all those and replace them with the final links. That is quite unergonomic.
What do we even want to document, how in-depth do we want to go? When I visited singular.net, I noticed that they very plainly admit to the purposes of their platform and the data they're collecting/what they're doing with that, and I had the urge to also include that in the documentation. But I'm not sure whether that's actually a good idea. On the one hand, it would be good to establish that this is actually a tracker and what the data is being used for. But on the other hand, that also really feels like scope creep (that's essentially journalism) and like more than we can handle. And in the end, we are always saying that we're only looking at the transmitted data and not at what happens with it afterwards. And legally speaking, the transmitted data should "speak for itself", the authorities don't really have to care about whether the data is being used for purposes we consider "gross".
What should the final documentation look like? Given the last point: Do we even really need/want that to be a fully free-form page? I feel like we pretty much always want to have a very rigid and predefined structure (essentially one section per documented property, plus maybe an optional introductory section with general details about the tracker/adapter).

Another benefit of a rigid predefined structure: It relieves you from trying to try and follow "proper" writing practices. And given the nature of the documentation we're creating, it's practically impossible to make that not be very repetitive anyway and just wastes time and effort. We should just embrace the boilerplateyness.
Don't we want to later be able to "paste" our documentation into the complaint instead of only linking to it (I wouldn't be surprised at all if there are DPAs that refuse to click on links)?

And if that's the case, having the reasoning property be a link to some website would be quite annoying, even if it is our own site. We'd need to write code to always dynamically extract only the relevant section from that page. Yuck.
Since we will definitely also reference third-party documentation (preferredly so, even) , we need to think about how we can archive those links. Link rot is already super common and we're using the documentation "against" the sites' operators. There's nothing stopping them from just changing them at any point.

But archiving isn't exactly easy, either. How do we archive links (only externally, or also our own archive with screenshot/PDF/SinglePage/…?) and how do we ensure that we don't forget to? And of course, the process shouldn't get too much in the way of our actual work, either. If we have to manually save each link with a couple different archiving services, wait for the archiving to finish (can easily take multiple minutes or even longer), and then also paste those archived links somewhere, that will slow us down by a lot.

And we also need to decide which archiving services we want to use. The Wayback Machine is an obvious choice and pretty well trusted, I'd say. But I just wanted to archive https://support.singular.net/hc/en-us/articles/4411780525979-Types-of-Device-IDs and got "The capture failed because Save Page Now does not have access rights for https://support.singular.net/hc/en-us/articles/4411780525979-Types-of-Device-IDs (HTTP status=403)."—the site is blocking archive.org. -.-
I then went with archive.today instead but that service is much less well known/trusted. And I'm not sure whether I would trust them to actually be around and keep all snapshot "indefinitely".
The only other archive I'm aware of is perma.cc, which I would also consider reasonably trusted, especially since they explicitly target legal stuff, but that is quite expensive.
We also can't rely on archiving services to actually successfully archive pages. I regularly encounter sites that break on the WM and when I tried to archive the singular.net page through perma.cc, it only archived a Cloudflare challenge page.
ArchiveBox sounds like a good solution to many of these (archiving) problems but I'm not so sure. For one, it can only (externally) archive to archive.org (not really their fault, archive.today prompts with Recaptcha every time shrug). Also, I have tested it multiple times over the years and it always felt very unreliable, archive requests tend to stay "pending…" forever and need to be manually restarted. Though, in my most recent test, it seemed to me like the CLI was a lot more reliable than the web app, so maybe it's not too bad. But even then, I noticed a huge problem with consent dialogs obscuring the content, making the snapshot useless. There has been a pinned issue for that for ages with little to no actual development happening so far.
And if we do archive links, I guess we'll want to always include both the original and archived link.
Do we ever actually want to only provide a third-party link as reasoning? The more I think about it, the more I think we should always link to our documentation which can then reference external documentation if applicable.
And I guess our documentation should always quote from the linked documentation or even include screenshots of relevant sections. That should be covered by the Zitatrecht and brings a number of advantages (it's an additional way to archive the content, DPAs don't have to click on links in that case, we can make it more clear which part of a—potentially long—documentation site we're actually referring to.
Despite all this, I would really still prefer to have the tracker research documentation in a git repo.

baltpeter commented 10 months ago

Maybe a better solution would be to have the documentation in the TrackHAR repo with a structure like this:

tracker-documentation
- singular-net
  - v.md
  - av.md
  - _index.md
- other-tracker-api-v2
  - prop1.md
  - prop2.md

Such that:

Every page of the final rendered documentation gets its own folder (can either be for a tracker as a whole, e.g. singular-net, or for a specific adapter, e.g. other-tracker-api-2).
Inside that folder, each property we document gets its own file. If we want an introductory section, that gets an _index.md. Actually, we probably always want that for the frontmatter.
In the reasoning field, we then reference the Markdown file (e.g. singular-net/v).
We can write a build script that glues the sections for each page together in tracker-wiki
Also in tracker-wiki, it's easy for us to produce the correct links for the reasoning, e.g. singular-net/v.md becomes <base URL>/singular-net#v.

And in the complaints, we can easily paste the relevant section from the Markdown.
We could write another script that deals with the archiving and turns links into some custom shortcode-like thing that includes both the original and all archived links (which enables the script to tell which links have already been archived).
We should be able to use something like https://frontmatter.codes/ to make editing the Markdown files (and especially inserting screenshots) less of a pain.
We'd also need a precommit+CI check that ensure that all documents referenced in a reasoning actually exists.

baltpeter commented 10 months ago

Another question: If we have a property that has an obvious name or obvious observed values, is it okay to just specify the reasoning accordingly or should we still reference external documentation if it exists?

For example, in https://github.com/tweaselORG/TrackHAR/issues/16 I found https://support.singular.net/hc/en-us/articles/360048588672-Server-to-Server-S2S-API-Endpoint-Reference, which is a super helpful reference of pretty much all properties singular.net uses. So, I went back and updated all properties that I had previously declared as obvious. Now, clearly having explicit documentation from the tracker is better but this also creates a lot more work for us.

And further: If we do decide that we should prefer the official documentation if available, does that mean that we should always try to seek it out? I.e.: If I discover a property in requests and am able to discern what it means based on the name or values, should I still try and find official documentation for (if I haven't otherwise found such documentation yet)? Or is it okay to specify obvious x and only update that if I do happen to find official documentation later?

We should decide on a guideline on how to handle that (and document that in TrackHAR's contrib section).

baltpeter commented 10 months ago

And one more: If I do find a list like in the singular.net case, should I add properties to the adapter that are listed there but that we haven't observed in any of our requests (yet)?

I haven't done that for the singular.net adapter. I'm not sure, but it feels kinda wrong to do that. I definitely appreciate that we're now cross-referencing our adapters with official documentation, but to me it feels better to not rely entirely on that and only include properties that we have actually been able to confirm ourselves.

baltpeter commented 10 months ago

I am having second thoughts about pasting our documentation into the complaints, and always writing our own documentation even if it essentially just references the official documentation:

We will need to translate the complaints. If we want to paste this documentation into them, we will have a ridiculous amount of free-form text to translate.

Having only a few predefined reasonings and a link otherwise is obviously a lot, looot easier.
Producing our own documentation for each property that is already documented is a lot more work than I expected. And it's very boring and repetitive work that feels entirely pointless as a bonus.

I will continue doing that for the adapter in https://github.com/tweaselORG/TrackHAR/issues/16, though until we have decision. I can always change it back later.

zner0L commented 10 months ago

These are just my raw thoughts, I think we should have a call on that and split your comment up into several smaller issues.

Can you not just guess at the URL of the page and use that instead?
I agree that it is out of scope. I think we should hint at additional documentation on the page, though.
No, I don’t think a free form page is good. In most case we need a rather table like structure. I think what I had in my head was, first, a list of all the possible honey data values used in the analysis run. (We should actually annotate the datasets in data.tweasel.org with these as well.) Then a table were we point out which kind of honey data we found or a link to additional documentation.
Can we not just write our own archiving script that takes a screenshot or something? This is only for traceability, it does not have to be fancy or pretty or anything. And I think screenshots are pretty much the easiest solution if we want to also include the content of the pages in the complaint as well. This is already very prevalent in legal documents anyway. I really like the way zotero does it in their browser plugin. But we could also just have a CI script or a watchdog that looks for links and takes a screenshot in headless chrome or smth.
see 5
see 5
I agree!
Yeah, I actually think this is more sensible tbh. Often we will need to provide context anyway.

If I discover a property in requests and am able to discern what it means based on the name or values, should I still try and find official documentation for (if I haven't otherwise found such documentation yet)? Or is it okay to specify obvious x and only update that if I do happen to find official documentation later?

I think it is fine to just change it later if we find it. Though, we need to be clear what our threshold for "obvious" is. We have been working with this data for quite long and things might seem obvious to us which aren’t for others/judges.

If I do find a list like in the singular.net case, should I add properties to the adapter that are listed there but that we haven't observed in any of our requests (yet)?

No. We are explicit in that we only adhere to the data we find. There is too much of a risk that the documentation is outdated. Our approach is data-based.

On pasting the documentation into the complaints: I think we could generate a little attachment that is basically a printout of the relevant sections, so the authorities don#t have to click on links. The translation problems would also apply to the online documentation, but it is just far to wide out of scope. They’ll have to deal with English at some point anyway.

baltpeter commented 10 months ago

I think we should have a call on that

Let's do that, then.

Can you not just guess at the URL of the page and use that instead?

If we're using a free-form page, we definitely can't. But even with a rigid page structure, I've already had cases where I documented multiple properties in the same section.

No, I don’t think a free form page is good. In most case we need a rather table like structure. I think what I had in my head was, first, a list of all the possible honey data values used in the analysis run. (We should actually annotate the datasets in data.tweasel.org with these as well.) Then a table were we point out which kind of honey data we found or a link to additional documentation.

I'm not so sure about using a table for two reasons: 1. We will sometimes have rather long descriptions which just look awkward in a table. 2. While you can link to an individual table row, that's not really common and might confuse people more than linking to a heading.

As for the known honey data: Yes, I've also started collecting that while working on the new adapter but that covers only a very small subset of the data the adapters can detect.

Can we not just write our own archiving script that takes a screenshot or something? This is only for traceability, it does not have to be fancy or pretty or anything. And I think screenshots are pretty much the easiest solution if we want to also include the content of the pages in the complaint as well. This is already very prevalent in legal documents anyway. I really like the way zotero does it in their browser plugin. But we could also just have a CI script or a watchdog that looks for links and takes a screenshot in headless chrome or smth.

Implementing local archiving only might be easy(-ish!) (and we could probably use ArchiveBox for that) but I would definitely much prefer to also have a public archive. For one, we are hosting public documentation of the adapters—we don't want the reasoning links in there to just break. Additionally, German Urheberrecht is very restrictive. We wouldn't even be able to provide our screenshots to our users to include in their complaints.

Yeah, I actually think this is more sensible tbh. Often we will need to provide context anyway.

Well, as I said in a later comment, I'm not so sure about that anymore. Actually doing that significantly the time I spent (by a factor >2). And in the majority of cases, the text I wrote provides no additional information whatsoever (e.g. https://github.com/tweaselORG/TrackHAR/issues/16#issuecomment-1682523347, https://github.com/tweaselORG/TrackHAR/issues/16#issuecomment-1682553354), I just had a template that I replaced various values in. There were some cases where I did add additional helpful context (e.g. https://github.com/tweaselORG/TrackHAR/issues/16#issuecomment-1682535237, https://github.com/tweaselORG/TrackHAR/issues/16#issuecomment-1682543705), but definitely not always.

On pasting the documentation into the complaints: I think we could generate a little attachment that is basically a printout of the relevant sections, so the authorities don#t have to click on links. The translation problems would also apply to the online documentation, but it is just far to wide out of scope. They’ll have to deal with English at some point anyway.

From what we've heard from talking to others, I'd be cautious about including English text in the generated complaints. Linking to external sources (which our site would be for our users) feels different, the complainant doesn't control those.

zner0L commented 10 months ago

We decided:

For archiving, we will use ArchiveBox to have our own local archive and to create public links to the Web Archive at archive.org and archive.today. If other people contribute, we manually archive their links.
We will not copy parts of the documentation into the requests, as we can not translate them easily.
The data structure for modular research pages @baltpeter suggested will be used to document adapters.
External links and referrering to property names as obvious is ok. If we find documentation on these properties later, we add the reference.

zner0L commented 10 months ago

We will document honey data we think is useful to understand the research as JSON in the datasets table of data.tweasel.org and create a nice human-readable view using a plugin.

tweaselORG / meta

Decide on how we want to document things #3