We want to start automatically indexing FC WARCs for full-text search.
See e.g. website/scripts/run-solr-indexer.sh for the basic operation.
The Solr indexer uses a SurtPrefixSet for the Open Access list, so that list is expected to contain SURTs. It should be provided by the OA SURTs file generated by w3act_export.
The Solr indexer uses a StaticMapExclusionFilterFactory for exclusions, like Open Wayback, so this can be a mixture of URLs and SURTs. The PyWB block files are manually-managed files from the internal GitLab repo.
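To illustrate why the Open Access list needs to be in SURT form, here is a minimal Python sketch of SURT prefix matching. The function names `to_surt` and `is_open_access` are hypothetical, and the conversion is deliberately simplified (no canonicalisation beyond lower-casing the host), so it will not match the indexer's real behaviour in every case.

```python
from urllib.parse import urlsplit


def to_surt(url: str) -> str:
    """Convert a URL to a simplified SURT: scheme://(reversed,host,)/path.

    Assumption: this skips the full canonicalisation a real SURT library
    applies (port handling, www-stripping, query normalisation, etc.).
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Reverse the host labels so that parent domains become string prefixes.
    reversed_host = ",".join(reversed(host.split("."))) + ","
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return f"{parts.scheme}://({reversed_host}){path}"


def is_open_access(url: str, surt_prefixes: list[str]) -> bool:
    """Return True if the URL's SURT starts with any listed prefix."""
    surt = to_surt(url)
    return any(surt.startswith(prefix) for prefix in surt_prefixes)


# Hypothetical prefix covering example.co.uk and all of its subdomains.
oa_prefixes = ["http://(uk,co,example,"]
print(to_surt("http://www.example.co.uk/about"))
print(is_open_access("http://www.example.co.uk/about", oa_prefixes))
```

Because the host labels are reversed, a single prefix covers a domain and its subdomains, which is why a plain URL in the Open Access list would not match as intended.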
Some of this will be done in ukwa-manage rather than here, but we'll need an Airflow runner.
Create a Solr indexer that:
[x] Runs like the CDX Indexer, tracking progress in TrackDB.
commit at the end?
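The "runs like the CDX Indexer, tracking progress in TrackDB" requirement, together with the open question about committing at the end, could be sketched roughly as follows. Everything here is an assumption for illustration: `InMemoryTracker` stands in for TrackDB, the `solr_indexed` flag is a hypothetical status field, and `index_warc` and `commit` are placeholder callables rather than real ukwa-manage or Solr client calls.

```python
from dataclasses import dataclass


@dataclass
class WarcRecord:
    path: str
    solr_indexed: bool = False  # hypothetical status field


class InMemoryTracker:
    """Stand-in for TrackDB; the real store and its API are not shown here."""

    def __init__(self, records):
        self.records = {r.path: r for r in records}

    def pending(self, limit):
        """Return up to `limit` records not yet marked as indexed."""
        todo = [r for r in self.records.values() if not r.solr_indexed]
        return todo[:limit]

    def mark_indexed(self, path):
        self.records[path].solr_indexed = True


def run_batch(tracker, index_warc, commit, batch_size=100):
    """Index one batch of pending WARCs, committing once at the end.

    Records are only marked as indexed after the commit succeeds, so a
    failure part-way through leaves them pending for the next run.
    """
    done = []
    for record in tracker.pending(batch_size):
        index_warc(record.path)
        done.append(record.path)
    if done:
        commit()  # single commit for the whole batch
        for path in done:
            tracker.mark_indexed(path)
    return done
```

An Airflow task could call something like `run_batch` on a schedule. Whether a single commit per batch is the right choice is exactly the open question above, so treat that part as one option rather than a decision.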