mediacloud / web-search

Code that drives the public web-based tools for the Media Cloud Online News Archive and Directory.
https://search.mediacloud.org
Apache License 2.0

URL Search string fixes in the directory #823

Open pgulley opened 3 days ago

pgulley commented 3 days ago

@philbudne Made a summary histogram of the different formats we see present in the url_search_string field in the directory:

1562705 rows where url_search_string is NULL
18 rows where url_search_string is empty string
7 rows where url_search_string starts with "http"
62 rows where url_search_string starts with ""
212 rows where url_search_string doesn't start with http or
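For anyone wanting to re-run a tally like this, here is a minimal sketch in Python. The function names and bucket labels are hypothetical and not taken from web-search, and it only covers the buckets that are fully spelled out above (NULL, empty string, starts with "http"); everything else falls into a catch-all.

```python
from collections import Counter
from typing import Iterable, Optional


def classify_search_string(value: Optional[str]) -> str:
    """Assign one url_search_string value to a coarse format bucket.

    The bucket names are illustrative assumptions, not an existing schema.
    """
    if value is None:
        return "null"
    if value == "":
        return "empty"
    if value.startswith("http"):
        return "starts_with_http"
    # Catch-all: any value not covered by the three buckets above.
    return "other"


def format_histogram(values: Iterable[Optional[str]]) -> Counter:
    """Count how many values fall into each bucket."""
    return Counter(classify_search_string(v) for v in values)
```

The input would be whatever iterable of url_search_string values gets pulled from the directory; how those rows are fetched is left out on purpose.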

We should decide on a standard format for these values (probably scheme/do.ma.in[/path], with a wildcard in some non-zero position of the path), document it somewhere, and enforce it across the directory. That means adding validation in web-search to enforce the format going forward, plus a sweep across the ~300 existing entries to bring them up to date. Thinking now that this is a good 'ticketing' test case.

pgulley commented 1 day ago

Current thinking is to store these values as do.ma.in[/path], omitting the scheme from the database and having web-search interpolate it instead (per #822).
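A rough sketch of what the normalization and validation could look like, in Python. None of this is existing web-search code: the function names, the regexes, and the exact rules are assumptions. In particular, it assumes a bare domain with no path is acceptable, and it reads "non-zero position of path" as "the wildcard `*` may not be the first character after the slash".

```python
import re
from typing import Optional

# Hypothetical patterns, not taken from web-search.
# A leading URL scheme such as "http://" or "https://".
_SCHEME_RE = re.compile(r"^[a-z][a-z0-9+.-]*://", re.IGNORECASE)
# A dotted hostname such as "do.ma.in" (no port, no userinfo).
_DOMAIN_RE = re.compile(
    r"^(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z]{2,}$", re.IGNORECASE
)


def normalize_url_search_string(raw: Optional[str]) -> Optional[str]:
    """Strip whitespace and any leading scheme; map NULL/blank to None."""
    if raw is None:
        return None
    value = raw.strip()
    if not value:
        return None
    return _SCHEME_RE.sub("", value, count=1)


def is_valid_url_search_string(value: str) -> bool:
    """Check a value against the assumed do.ma.in[/path] format."""
    if _SCHEME_RE.match(value):
        return False  # the scheme should not be stored
    domain, slash, path = value.partition("/")
    if not _DOMAIN_RE.match(domain):
        return False
    if not slash:
        return True  # assumption: a bare domain is allowed
    # Assumption: a path must contain "*", and not as its first character.
    return "*" in path and path.index("*") > 0
```

For the sweep, existing rows could be passed through `normalize_url_search_string` first and then checked with `is_valid_url_search_string`, with anything that fails flagged for manual review rather than rewritten automatically.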