[Feature] Resolve daily Tranco list on the probes #1604

Open · jdejaegh opened this issue 2 months ago

jdejaegh commented 2 months ago

Context

The Tranco list (https://tranco-list.eu) is “A Research-Oriented Top Sites Ranking Hardened Against Manipulation”. Researchers often use it to determine which domains were the most popular at a given time. The list has been generated daily since 2019.

When doing historical network simulations (e.g., for research around Tor), it is helpful to know which destinations were popular at the simulated time. However, those domains need to be resolved to IP addresses to be meaningful in the simulation. DNS queries can yield different results depending on the geographic location of the client, and DNS records can also change over time.

For those reasons, I feel there is room for a “DNS-resolved Tranco list” dataset, and OONI might be the right place to build and host it. The goal of the dataset is to record how DNS queries for popular domains were resolved over time and across countries.

Feature proposal

Create a new nettest to resolve the top N domains of today's Tranco list. For each domain, the probe would ideally contribute back to OONI the resolved A and AAAA records, together with the probe's network location and the time of the measurement.

The value of N needs to be set to a reasonable number: the Tranco list can be downloaded as the top 1M or as the full list (today's full list contains ~4.5M domains). In my opinion, starting with a smaller value of N (like 1K or 2K) may already provide interesting data.

As the list is designed to be hardened against manipulation, one can expect the same set of domains to appear at the top of the list for multiple days in a row.
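
To make the proposal concrete, here is a minimal Go sketch of what the core of such a nettest could look like. The download URL and the "rank,domain" CSV layout are my assumptions about the Tranco service, and a real nettest would receive its inputs from the OONI backend rather than fetching the list itself:

```go
// Sketch of the proposed nettest core: fetch today's Tranco list, take the
// top N domains, and resolve A/AAAA records for each of them.
package main

import (
	"archive/zip"
	"bytes"
	"context"
	"encoding/csv"
	"fmt"
	"io"
	"net"
	"net/http"
	"time"
)

const topN = 1000 // proposed starting value for N

// fetchTrancoTopN downloads the daily top-1M list and returns the first n
// domains. The endpoint and the "rank,domain" CSV layout are assumptions.
func fetchTrancoTopN(n int) ([]string, error) {
	resp, err := http.Get("https://tranco-list.eu/top-1m.csv.zip")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	raw, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	zr, err := zip.NewReader(bytes.NewReader(raw), int64(len(raw)))
	if err != nil {
		return nil, err
	}
	f, err := zr.File[0].Open() // assumes the archive contains a single CSV file
	if err != nil {
		return nil, err
	}
	defer f.Close()
	r := csv.NewReader(f)
	var domains []string
	for len(domains) < n {
		rec, err := r.Read() // each record is "rank,domain"
		if err != nil {
			break
		}
		domains = append(domains, rec[1])
	}
	return domains, nil
}

func main() {
	domains, err := fetchTrancoTopN(topN)
	if err != nil {
		panic(err)
	}
	for _, d := range domains {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		ips, err := net.DefaultResolver.LookupIP(ctx, "ip", d) // A and AAAA
		cancel()
		fmt.Printf("%s %v %v\n", d, ips, err) // a real probe would submit this to OONI
	}
}
```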

Alternatives

While writing this proposal, I also considered using RIPE Atlas (https://atlas.ripe.net) to run these measurements. However, their credit model seems restrictive: to resolve the top 1K (A and AAAA) once a day on 50 different probes, I would need to run 47 probes 24/7 to earn enough credits.
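
For transparency, the back-of-the-envelope estimate behind that number, assuming RIPE Atlas's pricing of 10 credits per DNS result and roughly 21,600 credits earned per day by a hosted probe (both figures may have changed since I checked):

```
1,000 domains × 2 query types (A + AAAA) × 50 probes × 10 credits = 1,000,000 credits/day
1,000,000 / 21,600 credits earned per hosted probe per day ≈ 46.3 → 47 probes running 24/7
```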

Contributing

If this proposal is accepted by OONI, I am willing to take part in the development of the feature and to submit PRs.

hellais commented 1 month ago

Thanks for writing up this proposal. This does sound like an interesting idea to explore.

I will leave here a few thoughts on things we should consider while designing an experiment to do this.

The biggest concern for sure is user safety. The main things we should assess for their potential impact are:

  1. Can the DNS answer for something inside the Tranco list return an address which we are not able to scrub using our public IP scrubbing functionality?
  2. Are there addresses inside of these lists that would raise suspicions if users were to resolve them? What about the fact that a user is resolving many domains in a short period of time?

One way we could address this is either by making this measurement run only on certain classes of probes (for example, only the ones running on servers or iThena) or by providing an additional opt-in mechanism to turn on these tests.

Another approach would be to see if it's feasible to do some level of sanitisation of the top N Tranco domains and exclude entries that are known to be problematic, though this is likely to be quite challenging.

In any case the risk to end users is probably the first and most important thing that needs to be assessed.

Regarding the implementation, I think we would rather have this be implemented as a general-purpose DNS resolution test for which the provisioning of inputs is handled entirely on the backend side. This is similar to how we run our web_connectivity test, where we have the ability to return different sets of addresses depending on the probe that's requesting them, or even to disable running the test altogether. For example, I think it would be possible to do what you would like to do using the dnscheck experiment. Do you care to use the system resolver configured on the probe or is it OK to make use of a hardcoded resolver address like 8.8.8.8?
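
To make the two resolver options concrete, here is a minimal Go sketch of the difference (this is not how dnscheck is implemented, just an illustration of the two behaviours):

```go
// Minimal sketch: system-resolver vs. hardcoded-resolver lookups in Go.
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// hardcodedResolver sends every query to a fixed public resolver (8.8.8.8),
// regardless of what resolver is configured on the probe's host.
var hardcodedResolver = &net.Resolver{
	PreferGo: true, // use Go's built-in resolver so that Dial is honoured
	Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
		d := net.Dialer{Timeout: 5 * time.Second}
		return d.DialContext(ctx, network, "8.8.8.8:53")
	},
}

func main() {
	ctx := context.Background()
	// Option 1: the resolver configured on the probe's system.
	sys, err := net.DefaultResolver.LookupIP(ctx, "ip", "example.com")
	fmt.Println("system:", sys, err)
	// Option 2: a hardcoded public resolver.
	hard, err := hardcodedResolver.LookupIP(ctx, "ip", "example.com")
	fmt.Println("8.8.8.8:", hard, err)
}
```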

Another aspect to consider is what the potential impact would be of running such a test at scale on all our probes. Do you have a sense for how many domains you would like to have tested and with what frequency?

For example, our backend already has support for performing URL list prioritization in order to maximize coverage (i.e., only run inputs that haven't been run on a given day in a particular network) and spread the load across multiple probes.

The reason for asking is that we need to consider both how much additional network load a potential test like this would add onto our probes (both for measurement execution and for upload to our collector) and what load it would add to our backend infrastructure, which would need to process this additional measurement data.

jdejaegh commented 3 weeks ago

Thanks for the detailed reply.

  1. Can the DNS answer for something inside the Tranco list return an address which we are not able to scrub using our public IP scrubbing functionality?

Could you describe, or provide a link describing, what you mean by “our public IP scrubbing functionality”?

  2. Are there addresses inside of these lists that would raise suspicions if users were to resolve them?

I am not sure what a suspicious domain would be, and it might change depending on the region. From a quick inspection of today's list, multiple domains could be concerning: e.g., prohibited social media websites, newspapers, and adult content websites.

What about the fact that a user is resolving many domains in a short period of time?

Do you have a sense for how many domains you would like to have tested and with what frequency?

Resolving the top 1K (or up to 10K) might be interesting. One resolution per domain, per day, per probe (or per AS) would be an upper bound on the frequency: the Tranco list is updated daily, so checking more often would not bring more value.
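
As a rough, illustrative estimate of the resulting load (the ~1 KB per measurement and the probe count are assumptions on my side, not OONI figures):

```
1,000 domains × 2 query types (A + AAAA) = 2,000 DNS measurements per probe per day
2,000 measurements × ~1 KB of JSON each ≈ 2 MB per probe per day uploaded to the collector
with e.g. 10,000 participating probes   ≈ 20 GB/day of additional measurement data
```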

Since you suggest provisioning the domains to resolve from the backend, we could implement a smarter selection among the top 1K, similar to the URL list prioritization you describe: for example, only resolve domains that have not been resolved in the last 72 hours in the AS where the probe is located.
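
A hypothetical sketch of that selection logic, with an in-memory store standing in for OONI's measurement database (the store, its keys, and the 72-hour window are all illustrative):

```go
// Hypothetical backend-side input selection: given the probe's ASN, return
// only the domains that no probe in that AS has resolved within the window.
package main

import (
	"fmt"
	"time"
)

const window = 72 * time.Hour

// memStore maps (asn, domain) to the time the domain was last resolved
// from that AS; it stands in for the real measurement database.
type memStore map[[2]string]time.Time

func selectInputs(s memStore, asn string, top []string) []string {
	var out []string
	now := time.Now()
	for _, d := range top {
		last, seen := s[[2]string{asn, d}]
		if !seen || now.Sub(last) > window {
			out = append(out, d)
		}
	}
	return out
}

func main() {
	s := memStore{
		{"AS3269", "example.com"}: time.Now().Add(-1 * time.Hour),  // resolved recently
		{"AS3269", "example.org"}: time.Now().Add(-80 * time.Hour), // stale
	}
	fmt.Println(selectInputs(s, "AS3269", []string{"example.com", "example.org", "example.net"}))
	// Output: [example.org example.net]
}
```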

One way we could address this is either by making this measurement run only on certain classes of probes (for example, only the ones running on servers or iThena)

Starting with only servers and iThena probes would already help a lot compared to the current situation, where we do not have the data at all. Providing an opt-in test for user-operated probes is an interesting next step.

Moreover, running the tests on servers and on iThena probes is less likely to put users at risk. It is also less concerning to resolve hundreds of domains in such a setting.

For example I think it would be possible to do what you would like to do using the dnscheck experiment.

From the description of the experiment, the output contains the data we are interested in: the network location of the probe (AS, IP, CC, time) and the result of the DNS resolution for the domain.
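
For clarity, the fields of interest could be summarised along these lines (field names loosely modelled on OONI's measurement format; the actual dnscheck test keys are richer than this sketch):

```go
// Simplified, illustrative shape of the data this proposal needs from each
// measurement; not the authoritative dnscheck schema.
type Measurement struct {
	ProbeASN             string   `json:"probe_asn"`              // e.g. "AS3269"
	ProbeCC              string   `json:"probe_cc"`               // e.g. "BE"
	MeasurementStartTime string   `json:"measurement_start_time"` // UTC timestamp
	Domain               string   `json:"domain"`                 // the Tranco entry
	Answers              []string `json:"answers"`                // resolved A/AAAA records
}
```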

Do you care to use the system resolver configured on the probe or is it OK to make use of a hardcoded resolver address like 8.8.8.8?

Using the resolver configured on the probe is more interesting; however, a hardcoded public resolver may be sufficient if using the probe's resolver would somehow put the user at risk.

I'd like to emphasize that the data collected under this proposal could be of broader use, beyond our own case. Future Internet measurement studies could benefit from such data when dealing with historical domain names: censorship research papers, Shadow Tor simulations, website fingerprinting papers, and traffic confirmation papers, to name a few. It is unclear how the current bias resulting from local resolution of the Tranco list affects all these fields, and answering that question once OONI deploys the proposal could be valuable.