ukwa / ukwa-services

Deployment configuration for all UKWA services stacks.
Apache License 2.0

Ensure block list gets updated from W3ACT to the FC #36

Open anjackson opened 3 years ago

anjackson commented 3 years ago

The archivist role can add the problematic URL to W3ACT already, under a Black List field.

Then, we need to pick up the white_list/black_list URLs from targets.csv and include them in the crawl feeds. They should be combined with the in-scope and nevercrawl lists (respectively).

After that, we need to check that the crawler will pick up changes to the scope and block files, and add a w3act_export service to the FC stack that pulls and updates them. This does mean the block list might lag a little behind the launches, so we probably want to update it more often than daily.
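The export step above could look something like the sketch below: read targets.csv, collect the list fields, and append any new entries to the in-scope and nevercrawl files. The column names (`white_list`, `black_list`) and file layout are assumptions, not the actual w3act_export implementation.

```python
# Hypothetical w3act_export step: merge W3ACT list fields from
# targets.csv into the crawl feed files. Column names and paths are
# assumptions about the export format.
import csv

def merge_lists(targets_csv, in_scope_path, nevercrawl_path):
    """Append white_list/black_list URLs from targets.csv to the
    in-scope and nevercrawl lists (respectively), de-duplicating."""
    whites, blacks = set(), set()
    with open(targets_csv, newline="") as f:
        for row in csv.DictReader(f):
            # Fields may be empty, or hold several whitespace-separated URLs.
            whites.update((row.get("white_list") or "").split())
            blacks.update((row.get("black_list") or "").split())
    for path, urls in ((in_scope_path, whites), (nevercrawl_path, blacks)):
        with open(path) as f:
            existing = {line.strip() for line in f if line.strip()}
        with open(path, "a") as f:
            for url in sorted(urls - existing):
                f.write(url + "\n")
```

Appending only new entries keeps the files stable between runs, which matters if the crawler re-reads them on change.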

(Clearly, we should consider wildcard/regex support, but that's more difficult to use. Maybe use plain URLs for URL blocks, but allow #-delimited lines for regexes?)

https://www.bl.uk/?mobile=on
#twitter\.com/.*?lang=#

Hmm.
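The proposed mixed format could be parsed along these lines: plain lines are exact-URL blocks, lines wrapped in `#` are regexes. This is a minimal sketch of the idea, not current crawler behaviour, and the function names are hypothetical.

```python
# Sketch of the proposed block-file format: plain lines block an exact
# URL, '#'-delimited lines are compiled as regexes. Hypothetical format.
import re

def parse_block_list(lines):
    """Return (set of exact URLs, list of compiled regexes)."""
    urls, regexes = set(), []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if len(line) > 1 and line.startswith("#") and line.endswith("#"):
            regexes.append(re.compile(line[1:-1]))
        else:
            urls.add(line)
    return urls, regexes

def is_blocked(url, urls, regexes):
    """True if the URL matches an exact block or any regex block."""
    return url in urls or any(r.search(url) for r in regexes)
```

With the two example lines above, `https://www.bl.uk/?mobile=on` would be blocked by exact match and any twitter.com URL with `lang=` in it by the regex.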

Also, take ukwa/ukwa-heritrix#85 into account

anjackson commented 3 years ago

The current crawl engine keeps the block regexes in a file that is read once, on startup. But also, regex blocking is quite dangerous, in that it's easy to make a mistake that breaks things or blocks too much, i.e. deployment should be somewhat manual rather than fed directly from W3ACT. This means it's not clear how best to manage them at present.

One option would be to have code generate the list from W3ACT, but keep a separate file that gets mapped into the crawler (via ukwa-services) and update it occasionally. We'd need to update it via an API script too.
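Such an API script might fetch the generated file and swap it into the path mapped into the crawler, refusing to apply a change that alters too many lines at once, which keeps deployment "somewhat manual" for large diffs. The URL, path, and threshold here are placeholders.

```python
# Hypothetical update script: fetch the W3ACT-generated block file and
# atomically replace the copy mapped into the crawler, but bail out if
# too many lines changed (regex blocks are easy to get wrong).
import os
import tempfile
import urllib.request

def update_block_file(source_url, dest_path, max_changed=10):
    new_lines = urllib.request.urlopen(source_url).read().decode().splitlines()
    old_lines = []
    if os.path.exists(dest_path):
        with open(dest_path) as f:
            old_lines = f.read().splitlines()
    # Count lines present in one version but not the other.
    changed = len(set(new_lines) ^ set(old_lines))
    if changed > max_changed:
        raise RuntimeError(f"{changed} lines changed; review manually")
    # Write to a temp file and rename, so the crawler never sees a
    # half-written block list.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(dest_path) or ".")
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(new_lines) + "\n")
    os.replace(tmp, dest_path)
```

The atomic rename matters if the crawler re-reads the file while it is being updated.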

anjackson commented 3 years ago

Okay, so this is two questions: managing blocks from W3ACT, and deploying this specific regex for www.bl.uk. The latter can go in quickly. The rest is part of https://trello.com/c/2rtXl07h/29-roll-out-w3act-on-prod-swarm