Open anjackson opened 3 years ago
The current crawl engine has the block regexes in a file that is read once on startup. But also, regex blocking is quite dangerous, in that it's easy to make a mistake that breaks things or blocks too much — i.e. deployment should be somewhat manual rather than direct from W3ACT. This means it's not clear how best to manage them at present.
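Given how easy it is to ship a pattern that blocks too much, a pre-deployment sanity check could catch the worst mistakes before the file reaches the crawler. A minimal sketch (this helper is hypothetical, not part of the current crawl engine): compile every line and flag patterns that would match every URL.

```python
import re

def validate_block_regexes(lines):
    """Return a {line_number: error} map for block patterns that fail to
    compile or are dangerously broad. An empty dict means the file looks
    safe to deploy. (Hypothetical pre-deployment check.)"""
    errors = {}
    for n, raw in enumerate(lines, start=1):
        pattern = raw.strip()
        if not pattern:
            continue  # skip blank lines
        try:
            re.compile(pattern)
        except re.error as exc:
            errors[n] = f"invalid regex: {exc}"
            continue
        # A bare '.*' (optionally anchored) would block everything:
        if re.fullmatch(r"\^?\.\*\$?", pattern):
            errors[n] = "matches every URL"
    return errors
```

Running this as a CI step or manual gate would keep deployment "somewhat manual" while still automating the dangerous part of the review.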
One option would be to have code that generates the list from W3ACT, but keep it in a separate file that gets mapped into the crawler (via ukwa-services) and update that occasionally. We'd need to update it via an API script too.
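That generation step could be a pure function that turns W3ACT target records into the block file, so the API script and any tests share the same logic. A sketch under assumptions — the `black_list` field name and record shape are illustrative, and the real W3ACT export may differ:

```python
import re

def render_block_file(targets):
    """Render the crawler's block file from W3ACT target records.

    `targets` is assumed to be a list of dicts carrying a 'black_list'
    field of whitespace-separated URLs (field name is illustrative).
    Plain URLs are escaped into anchored literal-prefix regexes, so a
    typo entered in W3ACT cannot turn into an over-broad pattern.
    """
    lines = set()
    for t in targets:
        for url in (t.get("black_list") or "").split():
            # Escape so the URL is matched literally, anchored as a prefix.
            lines.add("^" + re.escape(url))
    return "\n".join(sorted(lines)) + "\n"
```

The separate file then only ever changes when someone deliberately re-runs the export and maps the result into the crawler.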
Okay, so this is two questions: managing blocks from W3ACT, and deploying this specific regex for www.bl.uk. The latter can go in quickly. The rest is part of https://trello.com/c/2rtXl07h/29-roll-out-w3act-on-prod-swarm
The archivist role can add the problematic URL to W3ACT already, under a Black List field.
Then, we need to pick up `white_list`/`black_list` URLs from `targets.csv` and include them in the crawl feeds. These should be combined with the in-scope and `nevercrawl` lists (respectively).

After that, we need to check the crawler will pick up changes to the scope and block files, and add a `w3act_export` service to the FC stack that pulls and updates them. This does mean the block list might lag behind the launches a little, so we probably want to update them more often than daily.

(Clearly, we should consider wildcard/regex support, but that's more difficult to use. Maybe use plain URLs for URL blocks, but allow `#`-delimited lines for RegEx? Hmm.)
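The merge step above could be sketched as follows. All names here are assumptions: the `white_list`/`black_list` column names, the whitespace-separated URL format within a cell, and the convention that a `#`-prefixed entry is passed through as a raw regex rather than a plain URL.

```python
import csv
import io

def merge_lists(targets_csv, in_scope, never_crawl):
    """Merge W3ACT's white_list/black_list columns from targets.csv into
    the existing in-scope and nevercrawl lists (column names assumed).
    Entries beginning with '#' are passed through untouched as regexes;
    everything else is treated as a plain URL. Returns new lists."""
    scope, blocks = list(in_scope), list(never_crawl)
    for row in csv.DictReader(io.StringIO(targets_csv)):
        for url in (row.get("white_list") or "").split():
            if url not in scope:
                scope.append(url)
        for url in (row.get("black_list") or "").split():
            if url not in blocks:
                blocks.append(url)
    return scope, blocks
```

A `w3act_export`-style service would run something like this on a schedule and rewrite the scope/block files for the crawler to pick up.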
Also, take ukwa/ukwa-heritrix#85 into account