ooni / backend

Everything related to OONI backend infrastructure: ooni/api, ooni/pipeline, ooni/sysadmin, collector, bouncers and test-helpers
BSD 3-Clause "New" or "Revised" License

test-lists: add checks for domain and URL category code inconsistencies #611

Open hellais opened 2 years ago

hellais commented 2 years ago

We need to add checks to the lint-lists script, which is run prior to merging any PR to citizenlab/test-lists, to ensure we don't end up with inconsistent category codes in the lists.

The checks we need to implement are the following:

Some research into the prevalence of this problem is documented here: https://gist.github.com/hellais/fab319ae20b0ccca7b548a060ed66e14.

There are two steps to do this:

  1. Fix up the URLs affected by the above-mentioned issues (possibly manually, by properly re-categorising them)
  2. Add checks to lint-lists that are run on the CI
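
As a rough illustration of what step 2 could look like, here is a minimal sketch in Python. It is not the actual lint-lists code. It assumes each list is a CSV with `url` and `category_code` columns, and it assumes the two inconsistencies of interest are (a) the same URL carrying different category codes and (b) URLs on the same domain carrying different category codes. The function names `read_rows` and `find_inconsistencies` are hypothetical, and the exact rules to enforce would need to be confirmed against the list of checks in this issue.

```python
import csv
import io
from collections import defaultdict
from urllib.parse import urlsplit


def read_rows(list_name, csv_text):
    """Yield (list_name, url, category_code) for each row of one list.

    Assumption: the CSV has at least `url` and `category_code` columns.
    """
    for row in csv.DictReader(io.StringIO(csv_text)):
        yield list_name, row["url"], row["category_code"]


def find_inconsistencies(rows):
    """Group category codes by URL and by domain, and report conflicts.

    Returns two dicts: URL -> sorted codes and domain -> sorted codes,
    each containing only keys that have more than one distinct code.
    """
    by_url = defaultdict(set)
    by_domain = defaultdict(set)
    for _list_name, url, code in rows:
        by_url[url].add(code)
        # hostname is lowercased by urlsplit; it is None for malformed URLs
        domain = urlsplit(url).hostname
        if domain:
            by_domain[domain].add(code)
    url_conflicts = {u: sorted(c) for u, c in by_url.items() if len(c) > 1}
    domain_conflicts = {d: sorted(c) for d, c in by_domain.items() if len(c) > 1}
    return url_conflicts, domain_conflicts
```

In CI, a wrapper could feed every list file through `read_rows`, chain the results into `find_inconsistencies`, and exit non-zero if either dict is non-empty. Whether a domain-level mismatch should be a hard failure or only a warning is a design choice left open here, since some domains may legitimately host content in more than one category.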

This came up as part of: https://github.com/ooni/api/pull/300

FedericoCeratto commented 1 year ago

Related: a basic comparison tool is available at https://jupyter.ooni.org/notebooks/notebooks/2023%20%5Bfederico%5D%20inconsistent%20test-list%20input%20url%20categories.ipynb