openrightsgroup / blocked-org-uk

Template front-end code, markup, style-sheets, images and other assets for the Censorship Monitoring Project (blocked.org.uk)
https://www.blocked.org.uk/
GNU General Public License v3.0

Collapse multiple entries for sites which have been entered with capital letters #375

Closed. alexhaydock closed this issue 5 years ago

alexhaydock commented 5 years ago

Some sites seem to appear more than once (particularly in lists) with capital letters in them:

https://www.blocked.org.uk/site/http://gay.com
https://www.blocked.org.uk/site/http://Gay.com

The "Last Checked" dates shown on the frontend for these two entries are in sync, so the actual backend entry for the site in the database is probably the same one. I'm not sure whether these "ghost" sites with odd capitalisation are actually affecting our statistics in the backend, but it's worth getting rid of them anyway.

I suppose since we're dealing with domains it makes sense to force everything to be lowercase?

dantheta commented 5 years ago

We'd already deduplicated lots of case-different domains.

Behind the scenes, input URLs do get case-folded. The scheme and hostname parts of the URL are case-insensitive, but the path and query string are not.
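As an illustration only (this is not the project's backend code), the case-folding rule above could be sketched in Python as follows. The function name `canonicalise_url` is hypothetical. The sketch lowercases the scheme and the host portion and leaves the path, query string and any userinfo untouched:

```python
from urllib.parse import urlsplit, urlunsplit


def canonicalise_url(url: str) -> str:
    """Lowercase the case-insensitive parts of a URL (scheme and host).

    Hypothetical helper for illustration; the path, query string and
    fragment are case-sensitive and are returned unchanged.
    """
    parts = urlsplit(url)  # urlsplit already lowercases the scheme

    # Keep any "user:password@" prefix as-is, since userinfo is case-sensitive;
    # only the host (and port) after the last "@" is folded to lowercase.
    userinfo, sep, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + sep + hostport.lower()

    return urlunsplit((parts.scheme.lower(), netloc, parts.path,
                       parts.query, parts.fragment))


print(canonicalise_url("http://Gay.com"))
print(canonicalise_url("HTTP://Example.COM/Some/Path?Q=Value"))
```

Under this rule `http://Gay.com` and `http://gay.com` collapse to the same entry, while two URLs that differ only in the case of their path remain distinct.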

I forgot to filter out the old case variants when I was populating the lists.

dantheta commented 5 years ago

I've eliminated a lot of the case-duplicates from the lists. There were 500 or so across all of the lists, so I don't think it will have had a huge statistical impact.

It will take a bit more time to get it completely cleaned up and to make sure that we have test results for all the canonical lower-case URLs.
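For anyone wanting to check a list for remaining case variants, a rough sketch of the idea follows. It is not the script used for the cleanup. The names `canonical_key` and `find_case_duplicates` are made up for this example, and it assumes the URLs carry no `user:password@` part:

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def canonical_key(url: str) -> str:
    """Hypothetical key: lowercase scheme and host, keep path and query as-is."""
    p = urlsplit(url)  # the scheme comes back lowercased already
    return urlunsplit((p.scheme, p.netloc.lower(), p.path, p.query, p.fragment))


def find_case_duplicates(urls):
    """Group URLs by canonical key and return only groups with several variants."""
    groups = defaultdict(list)
    for url in urls:
        groups[canonical_key(url)].append(url)
    return {key: variants for key, variants in groups.items() if len(variants) > 1}


sample = [
    "http://gay.com",
    "http://Gay.com",
    "http://example.com/A",
    "http://example.com/a",  # different path case, so not a duplicate
]
print(find_case_duplicates(sample))
```

Each key in the result is the canonical lower-case URL that should end up with test results, and the listed variants are the entries to collapse into it.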

dantheta commented 5 years ago

That should have sorted it. Looks like we're due for another de-duping session in the main URLs list.

alexhaydock commented 5 years ago

> I've eliminated a lot of the case-duplicates from the lists. There were 500 or so across all of the lists, so I don't think it will have had a huge statistical impact.

Thanks Dan. I'll make sure the stats in the report get updated appropriately anyway, just in case.

edjw commented 5 years ago

Can we close this now?

alexhaydock commented 5 years ago

I believe this can be closed, yes.