Open zner0L opened 1 year ago
I have been thinking about how to document the research into how to decode the tracking requests. We wanted a tool that allows for quick note taking without too many complications. GitHub issues seemed like an option, but we do not want to make people transmit data to GitHub just to look at our documentation.
I looked at some options and thought that HedgeDoc might be what we are looking for. It enables easy note taking, supports Markdown and is meant for documentation.
That's a good idea. I've had a quick look and it seems quite well-suited. Two things I'm a bit sad about:
> is meant for documentation

Is it?
We've since talked a little more about this. To document what we came up with:
[…] docs.tweasel.org or a separate subdomain.docs.tweasel.org, using the same theme as for tracker-wiki. The nav should be something like:
I've started work on the docs site in https://github.com/tweaselORG/docs.tweasel.org/pull/1.
Research documentation regarding trackers can live in a separate section in tweaselORG/tracker-wiki. We do research in an issue and then dump the thread into a page there (with minimal editing afterwards). We haven't decided yet where these issues should be (in TrackHAR or a separate new repo?). I'm leaning towards TrackHAR.
I'm currently working on https://github.com/tweaselORG/TrackHAR/issues/16 and am still a bit unsure about the workflow; we might need to rethink this. Some concerns I have:
What should the final documentation look like? Given the last point: Do we even really need/want that to be a fully free-form page? I feel like we pretty much always want to have a very rigid and predefined structure (essentially one section per documented property, plus maybe an optional introductory section with general details about the tracker/adapter).
Another benefit of a rigid predefined structure: It relieves you from trying to follow "proper" writing practices. Given the nature of the documentation we're creating, it's practically impossible to make it not be very repetitive anyway, and trying just wastes time and effort. We should just embrace the boilerplateyness.
Don't we want to later be able to "paste" our documentation into the complaint instead of only linking to it (I wouldn't be surprised at all if there are DPAs that refuse to click on links)?
And if that's the case, having the reasoning property be a link to some website would be quite annoying, even if it is our own site. We'd need to write code to always dynamically extract only the relevant section from that page. Yuck.
Since we will definitely also reference third-party documentation (preferably so, even), we need to think about how we can archive those links. Link rot is already super common and we're using the documentation "against" the sites' operators. There's nothing stopping them from just changing it at any point.
But archiving isn't exactly easy, either. How do we archive links (only externally, or also our own archive with screenshot/PDF/SinglePage/…?) and how do we ensure that we don't forget to? And of course, the process shouldn't get too much in the way of our actual work, either. If we have to manually save each link with a couple different archiving services, wait for the archiving to finish (can easily take multiple minutes or even longer), and then also paste those archived links somewhere, that will slow us down by a lot.
And we also need to decide which archiving services we want to use. The Wayback Machine is an obvious choice and pretty well trusted, I'd say. But I just wanted to archive https://support.singular.net/hc/en-us/articles/4411780525979-Types-of-Device-IDs and got "The capture failed because Save Page Now does not have access rights for https://support.singular.net/hc/en-us/articles/4411780525979-Types-of-Device-IDs (HTTP status=403)."—the site is blocking archive.org. -.-
I then went with archive.today instead, but that service is much less well known/trusted. And I'm not sure whether I would trust them to actually be around and keep all snapshots "indefinitely".
The only other archive I'm aware of is perma.cc, which I would also consider reasonably trusted, especially since they explicitly target legal stuff, but that is quite expensive.
[…] reasoning? The more I think about it, the more I think we should always link to our documentation, which can then reference external documentation if applicable.

Maybe a better solution would be to have the documentation in the TrackHAR repo with a structure like this:
    tracker-documentation
        singular-net
            v.md
            av.md
            _index.md
        other-tracker-api-v2
            prop1.md
            prop2.md
Such that:
- […] singular-net, or for a specific adapter, e.g. other-tracker-api-v2).
- […] _index.md. Actually, we probably always want that for the frontmatter.
- […] reasoning field, we then reference the Markdown file (e.g. singular-net/v).
- […] tracker-wiki […] Also in tracker-wiki, it's easy for us to produce the correct links for the reasoning, e.g. singular-net/v.md becomes <base URL>/singular-net#v.
And in the complaints, we can easily paste the relevant section from the Markdown.
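To illustrate the mapping from a file in the proposed folder structure to a link, here is a minimal sketch. The function name `reasoning_link` and the example base URL are hypothetical, and treating `_index.md` as the tracker's landing page (no anchor) is an assumption, not something we have decided.

```python
def reasoning_link(reference: str, base_url: str) -> str:
    """Map a reference like 'singular-net/v.md' (or 'singular-net/v')
    to '<base URL>/singular-net#v'."""
    # Accept both the file name and the extension-less form.
    reference = reference.removesuffix(".md")
    tracker, _, prop = reference.rpartition("/")
    if not tracker or not prop:
        raise ValueError(f"expected '<tracker>/<property>', got {reference!r}")
    page = f"{base_url.rstrip('/')}/{tracker}"
    # Assumption: _index.md is the page's intro, so it gets no anchor.
    if prop == "_index":
        return page
    return f"{page}#{prop}"


if __name__ == "__main__":
    # The base URL is a placeholder, not the real site.
    print(reasoning_link("singular-net/v.md", "https://docs.example.org/trackers"))
```

Since the anchor is derived from the file name alone, this only works as long as every property has its own file and the rendered page uses the property name as the heading ID.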
[…] reasoning actually exists.

Another question: If we have a property that has an obvious name or obvious observed values, is it okay to just specify the reasoning accordingly or should we still reference external documentation if it exists?
For example, in https://github.com/tweaselORG/TrackHAR/issues/16 I found https://support.singular.net/hc/en-us/articles/360048588672-Server-to-Server-S2S-API-Endpoint-Reference, which is a super helpful reference of pretty much all properties singular.net uses. So, I went back and updated all properties that I had previously declared as obvious. Now, clearly having explicit documentation from the tracker is better but this also creates a lot more work for us.
And further: If we do decide that we should prefer the official documentation if available, does that mean that we should always try to seek it out? I.e.: If I discover a property in requests and am able to discern what it means based on the name or values, should I still try and find official documentation for it (if I haven't otherwise found such documentation yet)? Or is it okay to specify obvious x and only update that if I do happen to find official documentation later?
We should decide on a guideline on how to handle that (and document that in TrackHAR's contrib section).
And one more: If I do find a list like in the singular.net case, should I add properties to the adapter that are listed there but that we haven't observed in any of our requests (yet)?
I haven't done that for the singular.net adapter. I'm not sure, but it feels kinda wrong to do that. I definitely appreciate that we're now cross-referencing our adapters with official documentation, but to me it feels better to not rely entirely on that and only include properties that we have actually been able to confirm ourselves.
I am having second thoughts about pasting our documentation into the complaints, and always writing our own documentation even if it essentially just references the official documentation:
We will need to translate the complaints. If we want to paste this documentation into them, we will have a ridiculous amount of free-form text to translate.
Having only a few predefined reasonings and a link otherwise is obviously a lot, looot easier.
I will continue doing that for the adapter in https://github.com/tweaselORG/TrackHAR/issues/16, though, until we have a decision. I can always change it back later.
These are just my raw thoughts, I think we should have a call on that and split your comment up into several smaller issues.
> If I discover a property in requests and am able to discern what it means based on the name or values, should I still try and find official documentation for it (if I haven't otherwise found such documentation yet)? Or is it okay to specify obvious x and only update that if I do happen to find official documentation later?
I think it is fine to just change it later if we find it. Though, we need to be clear about what our threshold for "obvious" is. We have been working with this data for quite a while, and things might seem obvious to us which aren’t for others/judges.
> If I do find a list like in the singular.net case, should I add properties to the adapter that are listed there but that we haven't observed in any of our requests (yet)?
No. We are explicit in that we only adhere to the data we find. There is too much of a risk that the documentation is outdated. Our approach is data-based.
On pasting the documentation into the complaints: I think we could generate a little attachment that is basically a printout of the relevant sections, so the authorities don't have to click on links. The translation problems would also apply to the online documentation, but that is just way too far out of scope. They’ll have to deal with English at some point anyway.
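A rough sketch of how such an attachment could be assembled, assuming the per-property Markdown files proposed earlier in this thread. The names `strip_frontmatter` and `build_attachment` are hypothetical, and so is the assumption that a file may start with a `---` delimited frontmatter block that should not be printed.

```python
from pathlib import Path


def strip_frontmatter(text: str) -> str:
    """Drop a leading '---' delimited frontmatter block, if there is one."""
    lines = text.splitlines()
    if lines and lines[0].strip() == "---":
        for i in range(1, len(lines)):
            if lines[i].strip() == "---":
                return "\n".join(lines[i + 1:]).strip()
    return text.strip()


def build_attachment(doc_root: Path, references: list[str]) -> str:
    """Concatenate the Markdown for each reference (e.g. 'singular-net/v'),
    in order, skipping references that were already included."""
    sections = []
    seen = set()
    for ref in references:
        if ref in seen:
            continue
        seen.add(ref)
        raw = (doc_root / f"{ref}.md").read_text(encoding="utf-8")
        # Use the reference itself as a plain-text section title.
        sections.append(f"{ref}\n\n{strip_frontmatter(raw)}")
    return "\n\n".join(sections)
```

The output is still Markdown, so turning it into something printable (PDF or similar) would be a separate step that this sketch does not cover.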
> I think we should have a call on that
Let's do that, then.
> - Can you not just guess at the URL of the page and use that instead?
If we're using a free-form page, we definitely can't. But even with a rigid page structure, I've already had cases where I documented multiple properties in the same section.
> - No, I don’t think a free-form page is good. In most cases we need a rather table-like structure. I think what I had in my head was, first, a list of all the possible honey data values used in the analysis run. (We should actually annotate the datasets in data.tweasel.org with these as well.) Then a table where we point out which kind of honey data we found or a link to additional documentation.
I'm not so sure about using a table for two reasons: 1. We will sometimes have rather long descriptions which just look awkward in a table. 2. While you can link to an individual table row, that's not really common and might confuse people more than linking to a heading.
As for the known honey data: Yes, I've also started collecting that while working on the new adapter but that covers only a very small subset of the data the adapters can detect.
> - Can we not just write our own archiving script that takes a screenshot or something? This is only for traceability, it does not have to be fancy or pretty or anything. And I think screenshots are pretty much the easiest solution if we also want to include the content of the pages in the complaint. This is already very prevalent in legal documents anyway. I really like the way Zotero does it in their browser plugin. But we could also just have a CI script or a watchdog that looks for links and takes a screenshot in headless Chrome or smth.
Implementing local archiving only might be easy(-ish!) (and we could probably use ArchiveBox for that) but I would definitely much prefer to also have a public archive. For one, we are hosting public documentation of the adapters—we don't want the reasoning links in there to just break. Additionally, German Urheberrecht is very restrictive. We wouldn't even be able to provide our screenshots to our users to include in their complaints.
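Whichever archive we end up using, the "don't forget to archive" part could be a CI check. Below is a minimal sketch of the link-finding half only; it does no archiving itself. The names `extract_links` and `find_unarchived` are hypothetical, as is the idea of keeping a mapping from original URL to archived snapshot URL somewhere in the repo.

```python
import re

# Deliberately simple pattern: stops at whitespace, brackets and quotes, so
# Markdown links like [text](https://…) work, but URLs that themselves
# contain parentheses get cut short.
URL_PATTERN = re.compile(r"https?://[^\s<>()\[\]\"']+")


def extract_links(markdown: str) -> list[str]:
    """Return all http(s) URLs in the text, deduplicated, in order of appearance."""
    links = []
    for match in URL_PATTERN.findall(markdown):
        url = match.rstrip(".,;:!?")  # drop sentence punctuation after a bare URL
        if url not in links:
            links.append(url)
    return links


def find_unarchived(markdown: str, archived: dict[str, str]) -> list[str]:
    """Return the links that have no entry in the (hypothetical) archive mapping."""
    return [url for url in extract_links(markdown) if url not in archived]
```

A CI job could fail if `find_unarchived` returns anything for a documentation file, which would at least make a missing snapshot visible before merging.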
> - Yeah, I actually think this is more sensible tbh. Often we will need to provide context anyway.
Well, as I said in a later comment, I'm not so sure about that anymore. Actually doing that significantly increased the time I spent (by a factor >2). And in the majority of cases, the text I wrote provides no additional information whatsoever (e.g. https://github.com/tweaselORG/TrackHAR/issues/16#issuecomment-1682523347, https://github.com/tweaselORG/TrackHAR/issues/16#issuecomment-1682553354); I just had a template that I replaced various values in. There were some cases where I did add additional helpful context (e.g. https://github.com/tweaselORG/TrackHAR/issues/16#issuecomment-1682535237, https://github.com/tweaselORG/TrackHAR/issues/16#issuecomment-1682543705), but definitely not always.
> On pasting the documentation into the complaints: I think we could generate a little attachment that is basically a printout of the relevant sections, so the authorities don't have to click on links. The translation problems would also apply to the online documentation, but that is just way too far out of scope. They’ll have to deal with English at some point anyway.
From what we've heard from talking to others, I'd be cautious about including English text in the generated complaints. Linking to external sources (which our site would be for our users) feels different: the complainant doesn't control those.
We decided:
We will document honey data we think is useful to understand the research as JSON in the datasets table of data.tweasel.org and create a nice human-readable view using a plugin.
I feel like we should start documenting as early as possible, so I would like to find a way in which we can document what we are doing already.