ooni / ooni.org

The ooni.org homepage and all cross organisational issues
https://ooni.org

Update test lists with URLs for specific pages #1242

Open sloncocs opened 2 years ago

sloncocs commented 2 years ago

There are multiple country-specific test lists containing numerous HTTPS URLs that point to specific pages rather than homepages (approximately hundreds across all test lists). Since individual subpages cannot be blocked separately when a website is served over HTTPS, these lists need extensive clean-up.

Questions to discuss:

  1. Can it be an automated process?

  2. Should we clean up all test lists, or only those with more than 100 (?) websites, to avoid leaving too few websites to test in a list?

Examples of the lists:

https://github.com/citizenlab/test-lists/blob/master/lists/ph.csv
https://github.com/citizenlab/test-lists/blob/master/lists/ga.csv
https://github.com/citizenlab/test-lists/blob/master/lists/ml.csv
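For a sense of scale, here is a minimal sketch that finds such URLs in one of these lists; it assumes the standard citizenlab test-lists CSV layout with a url header column:

```python
import csv
import sys
from urllib.parse import urlsplit

# Sketch: list the HTTPS URLs in a citizenlab-style test list whose
# path is not just "/". Assumes the standard CSV layout with a "url"
# header column.
def non_root_https_urls(path):
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            parts = urlsplit(row["url"])
            if parts.scheme == "https" and parts.path not in ("", "/"):
                yield row["url"]

if __name__ == "__main__":
    for url in non_root_https_urls(sys.argv[1]):  # e.g. lists/ph.csv
        print(url)
```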

bassosimone commented 2 years ago

Since individual subpages cannot be blocked separately when a website is served over HTTPS, these lists need extensive clean-up. [...]

The statement quoted above is correct. Yet, I think it's important to reflect on the fact that the test lists grew to include HTTPS URLs added with intents that go beyond checking whether the URL's domain is blocked.

Some pages were added as http and were later upgraded to https without replacing the path with /. Some other pages were added as https, but the intent was to check whether the whole service was blocked. These first two cases map nicely to the possibly-automatic refactoring discussed above (/$path => /), sketched in code below.
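For those two cases the rewrite itself is mechanical. A minimal sketch using only the standard library (the function name is illustrative):

```python
from urllib.parse import urlsplit, urlunsplit

# Sketch of the "/$path => /" refactoring: drop the specific page so
# that the URL only checks whether the domain is blocked.
def rewrite_to_homepage(url: str) -> str:
    parts = urlsplit(url)
    # Drop path, query, and fragment; keep scheme and host.
    return urlunsplit((parts.scheme, parts.netloc, "/", "", ""))

assert rewrite_to_homepage(
    "https://example.com/some/page?x=1"
) == "https://example.com/"
```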

However, other pages were added with a large-enough resource as the path, to detect whether a specific website was being heavily throttled based on the time required to download that resource. Additionally, some other pages were added with a /robots.txt URL path specifically to avoid hitting real resources, thus making the request clearly a measurement for testing purposes as opposed to a fetch of a potentially controversial resource such as the homepage.

Can it be an automated process?

Yes, absolutely! However, the test lists do not tell us what the intent behind adding each URL was. Therefore, blindly rewriting all the HTTPS URLs to have / as their path would break the original intent for some of them.

I think an algorithm to upgrade could roughly look like this (a sketch in code follows the list):

  1. do not touch any URL that fetches /robots.txt

  2. do not touch any URL that fetches a resource from a CDN

  3. if the URL is fetching a resource and the resource is technical (e.g., JavaScript or CSS), we should keep it, because it might be better than fetching the homepage

  4. if the resource we're fetching is comparable in size to the homepage, we can switch to the homepage

(after we first do https://github.com/ooni/ooni.org/issues/1226 where needed)
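A minimal sketch of these heuristics. Note the assumptions: the CDN host list, the technical-extension list, and the 2x size threshold are illustrative placeholders rather than values from this discussion, and the sizes would have to be measured separately (e.g., via HEAD requests):

```python
from urllib.parse import urlsplit

# Illustrative placeholders, not values from the issue.
CDN_HOSTS = {"cdn.jsdelivr.net", "cdnjs.cloudflare.com"}  # hypothetical
TECHNICAL_EXTS = (".js", ".css")

def suggest_rewrite(url, resource_size=None, homepage_size=None):
    """Return the URL to keep in the test list.

    resource_size and homepage_size are byte counts measured
    elsewhere; None means unknown.
    """
    parts = urlsplit(url)
    # Rule 1: /robots.txt URLs were added deliberately; leave them alone.
    if parts.path == "/robots.txt":
        return url
    # Rule 2: resources fetched from a CDN; leave them alone.
    if parts.netloc in CDN_HOSTS:
        return url
    # Rule 3: technical resources (JavaScript, CSS) might be better
    # than fetching the homepage; keep them.
    if parts.path.endswith(TECHNICAL_EXTS):
        return url
    # Rule 4: if the resource is comparable in size to the homepage,
    # switch to the homepage (here: within 2x, an arbitrary choice).
    if resource_size and homepage_size and resource_size <= 2 * homepage_size:
        return f"{parts.scheme}://{parts.netloc}/"
    # Otherwise (e.g., a large resource used for throttling checks),
    # keep the URL as-is.
    return url
```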

I will also comment on the original issue, but I think we should not actually do this for now. (See https://github.com/ooni/ooni.org/issues/1226#issuecomment-1249204025.)

Should we clean up all test lists, or only those with more than 100 (?) websites, to avoid leaving too few websites to test in a list?

If we know that the semantics is just checking whether a URL's domain is blocked, I'd merge them.
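If we go with a threshold, identifying the lists that would be left too small is straightforward to compute. A sketch, reusing the 100 cut-off proposed in the question and the lists/ directory layout of the test-lists repository:

```python
import csv
from pathlib import Path

# Sketch: report the test lists with fewer entries than the proposed
# threshold, i.e. those where aggressive clean-up could leave too few
# websites to test.
def small_lists(lists_dir, threshold=100):
    for path in sorted(Path(lists_dir).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as f:
            count = sum(1 for _ in csv.DictReader(f))
        if count < threshold:
            yield path.name, count

for name, count in small_lists("lists"):
    print(f"{name}: {count} entries")
```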

hellais commented 11 months ago

Yes, this is a good idea, and I agree we should do it at some point. However, it's not high priority imho, so I am bumping it down to low. (Note: high priority is only for stuff which MUST happen very soon or else. The fact that this issue hasn't been updated in the last 9 months makes me believe it's probably fine as low.)