netarchivesuite / solrwayback

A search interface and wayback machine for the UKWA Solr based warc-indexer framework.
Apache License 2.0

Rethink export #246

Open tokee opened 1 year ago

tokee commented 1 year ago

The current export options are for

#245 suggests adding ZIP as an option and #233 calls for ways of restricting the resources to export. One could argue that we should also have a CSV-with-resources.

Instead of just adding to the list of export options, we should split the export logic into 3 parts (which could be shown below each other, in tabs or in some other way):

New 3-phase system

Initial corpus selection

Whether the user chooses to perform a plain search, image search, grouped search, geo search (with the map function) or some other kind of search, it should be possible to export "what the user expects". This is not the same as what the user sees; WARC-with-resources is an example of the difference.

Corpus adjustment

Export format
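The three phases listed above can be sketched as three independent callables that are chained lazily. This is only an illustration of the proposed split, written in Python for brevity; `run_export`, `select_demo`, `only_html` and `as_url_lines` are hypothetical names and not part of SolrWayback.

```python
from typing import Callable, Iterable, Iterator

Doc = dict  # one search hit, e.g. {"url": ..., "content_type": ...}


def run_export(
    select: Callable[[], Iterable[Doc]],
    adjust: Callable[[Iterable[Doc]], Iterable[Doc]],
    render: Callable[[Iterable[Doc]], Iterator[str]],
) -> Iterator[str]:
    """Chain the three phases lazily so that large exports can be streamed."""
    return render(adjust(select()))


# Phase 1: hypothetical corpus selection (stands in for any kind of search).
def select_demo() -> Iterable[Doc]:
    yield {"url": "http://example.com/", "content_type": "text/html"}
    yield {"url": "http://example.com/a.png", "content_type": "image/png"}


# Phase 2: hypothetical adjustment that restricts which resources are exported.
def only_html(docs: Iterable[Doc]) -> Iterable[Doc]:
    return (d for d in docs if d["content_type"] == "text/html")


# Phase 3: hypothetical export format, one URL per line.
def as_url_lines(docs: Iterable[Doc]) -> Iterator[str]:
    return (d["url"] + "\n" for d in docs)


exported = "".join(run_export(select_demo, only_html, as_url_lines))
```

With this shape, adding a new search type, a new adjustment or a new format only touches one phase, which is the point of not "just adding to the list of export options".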

tokee commented 1 year ago

Not there yet, but a lot of work has been done on this.

Initial corpus selection

The class SRequest (S is for Stream) is a request builder where the caller only has to set the parameters relevant for the current case. There is support for de-duplication (aka grouping where only the first entry in each group is used).
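To illustrate the idea of a request builder where only the relevant parameters are set, here is a minimal Python sketch. It is not the SRequest API: `ExportRequest`, its fields and its methods are hypothetical, and the de-duplication simply keeps the first document seen per key, assuming the input is already in the desired order.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Optional


@dataclass
class ExportRequest:
    """Hypothetical builder: every parameter has a default, callers override a few."""
    query: str = "*:*"
    filters: List[str] = field(default_factory=list)
    fields: List[str] = field(default_factory=lambda: ["url"])
    dedup_field: Optional[str] = None  # None means no de-duplication

    def with_query(self, query: str) -> "ExportRequest":
        self.query = query
        return self

    def with_filter(self, filter_query: str) -> "ExportRequest":
        self.filters.append(filter_query)
        return self

    def with_dedup(self, field_name: str) -> "ExportRequest":
        self.dedup_field = field_name
        return self


def deduplicate(docs: Iterable[dict], field_name: Optional[str]) -> Iterator[dict]:
    """Group by field_name and yield only the first document of each group."""
    if field_name is None:
        yield from docs
        return
    seen = set()
    for doc in docs:
        key = doc.get(field_name)
        if key in seen:
            continue
        seen.add(key)
        yield doc


request = ExportRequest().with_query("title:test").with_dedup("url")
hits = [
    {"url": "a", "crawl": 1},
    {"url": "b", "crawl": 1},
    {"url": "a", "crawl": 2},  # same URL as the first hit, dropped
]
unique = list(deduplicate(hits, request.dedup_field))
```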

Corpus adjustment

SolrGenericStreaming, which is used with SRequest, allows for resource expansion. It does not handle image search and it is not clear whether that belongs here.
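Resource expansion can be pictured as a streaming step that emits each page followed by the resources it uses. The Python sketch below is an assumption about the general idea, not a description of how SolrGenericStreaming works; `expand_resources`, `lookup` and the `RESOURCES` table are hypothetical.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def expand_resources(
    pages: Iterable[dict],
    lookup: Callable[[dict], Iterable[dict]],
) -> Iterator[dict]:
    """Yield each page followed by its resources, emitting every URL only once."""
    emitted = set()
    for page in pages:
        for doc in [page, *lookup(page)]:
            if doc["url"] not in emitted:
                emitted.add(doc["url"])
                yield doc


# Hypothetical page-to-resources table standing in for an index lookup.
RESOURCES: Dict[str, List[dict]] = {
    "p1": [{"url": "style.css"}, {"url": "logo.png"}],
    "p2": [{"url": "logo.png"}],  # shared with p1, so it is not exported twice
}

pages = [{"url": "p1"}, {"url": "p2"}]
expanded = [
    doc["url"]
    for doc in expand_resources(pages, lambda page: RESOURCES.get(page["url"], []))
]
```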

Export format

CSV, JSON and JSON-Lines have been unified and use SolrGenericStreaming. WARC export has been upgraded to work with SolrGenericStreaming. PWID needs refactoring and work on ZIP has not begun. ZIP is heavy to implement so it will probably not be part of this unification.
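One way to think about unified text formats is that each format is just a different renderer over the same stream of documents. The Python sketch below shows that idea using only the standard library; the function names and the `FORMATS` table are hypothetical and do not mirror the SolrWayback code.

```python
import csv
import io
import json
from typing import Iterable, Iterator, List


def to_csv(docs: Iterable[dict], fields: List[str]) -> Iterator[str]:
    """Render a header line, then one CSV line per document."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    yield buf.getvalue()
    for doc in docs:
        buf.seek(0)
        buf.truncate(0)  # reuse the buffer so only one line is held at a time
        writer.writerow(doc)
        yield buf.getvalue()


def to_jsonl(docs: Iterable[dict], fields: List[str]) -> Iterator[str]:
    """Render one JSON object per line."""
    for doc in docs:
        yield json.dumps({f: doc.get(f) for f in fields}) + "\n"


def to_json(docs: Iterable[dict], fields: List[str]) -> Iterator[str]:
    """Render a single JSON array, still produced piece by piece."""
    yield "["
    for i, doc in enumerate(docs):
        yield ("," if i else "") + json.dumps({f: doc.get(f) for f in fields})
    yield "]"


FORMATS = {"csv": to_csv, "jsonl": to_jsonl, "json": to_json}


def export(docs: Iterable[dict], fields: List[str], fmt: str) -> str:
    """Pick a renderer by name and join its chunks (a server would stream them)."""
    return "".join(FORMATS[fmt](docs, fields))


docs = [{"url": "a", "status": 200}, {"url": "b", "status": 404}]
csv_text = export(docs, ["url", "status"], "csv")
jsonl_text = export(docs, ["url", "status"], "jsonl")
json_text = export(docs, ["url", "status"], "json")
```

Binary or archive formats such as WARC and ZIP need more than a per-document renderer, which fits with ZIP being left out of the unification for now.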

So what remains pending is PWID and how to tie image search to export.