Open tokee opened 1 year ago
Not there yet, but a lot of work has been done on this.
The class SRequest (S is for Stream) is a request builder where the caller only has to set the parameters relevant for the current case. There is support for de-duplication (aka grouping where only the first entry in each group is used).
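To make the de-duplication semantics concrete, here is a minimal sketch of "grouping where only the first entry in each group is used". It is not how SRequest or SolrGenericStreaming implement it: the function name `first_per_group`, the field names and the in-memory set of seen keys are all assumptions made for illustration.

```python
from typing import Any, Dict, Iterable, Iterator

Doc = Dict[str, Any]


def first_per_group(docs: Iterable[Doc], group_field: str) -> Iterator[Doc]:
    """Yield only the first document seen for each distinct value of group_field.

    Hypothetical helper. Documents that lack the field all share the key None,
    so only the first of them is kept.
    """
    seen = set()
    for doc in docs:
        key = doc.get(group_field)
        if key in seen:
            continue  # a later member of an already-seen group is skipped
        seen.add(key)
        yield doc


# Made-up documents: two captures of the same URL and one of another URL.
docs = [
    {"id": "1", "url": "http://example.com/", "crawl_date": "2019-01-01"},
    {"id": "2", "url": "http://example.com/", "crawl_date": "2020-01-01"},
    {"id": "3", "url": "http://example.org/", "crawl_date": "2019-06-01"},
]
unique = list(first_per_group(docs, "url"))
```

Which entry counts as "first" depends entirely on the order of the incoming stream, so in practice the sort order of the request decides which capture survives.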
SolrGenericStreaming, which is used with SRequest, allows for resource expansion. It does not handle image search and it is not clear whether that belongs here.
CSV, JSON and JSON-Lines have been unified and use SolrGenericStreaming. WARC export has been upgraded to work with SolrGenericStreaming. PWID needs refactoring and work on ZIP has not begun. ZIP is heavy to implement so it will probably not be part of this unification.
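The idea of several text formats sharing one document stream can be sketched as below. This only illustrates the pattern, under the assumption that each format is a thin serializer over the same iterator of documents. The function names and fields are hypothetical and do not reflect the actual export code.

```python
import csv
import io
import json
from typing import Any, Dict, Iterable, Iterator, List

Doc = Dict[str, Any]


def to_jsonl(docs: Iterable[Doc]) -> Iterator[str]:
    """JSON-Lines: one JSON object per line."""
    for doc in docs:
        yield json.dumps(doc, sort_keys=True) + "\n"


def to_json(docs: Iterable[Doc]) -> Iterator[str]:
    """A single JSON array, emitted piece by piece."""
    yield "["
    for i, doc in enumerate(docs):
        yield ("," if i else "") + json.dumps(doc, sort_keys=True)
    yield "]"


def to_csv(docs: Iterable[Doc], fields: List[str]) -> Iterator[str]:
    """CSV with a header row; missing fields become empty cells."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(fields)
    yield buf.getvalue()
    for doc in docs:
        buf.seek(0)
        buf.truncate(0)  # reuse the buffer so only one row is held at a time
        writer.writerow([doc.get(f, "") for f in fields])
        yield buf.getvalue()


# Made-up documents standing in for the shared stream.
docs = [
    {"id": "1", "url": "http://example.com/"},
    {"id": "2", "url": "http://example.org/"},
]
csv_text = "".join(to_csv(docs, ["id", "url"]))
json_text = "".join(to_json(docs))
jsonl_text = "".join(to_jsonl(docs))
```

Because every serializer consumes a plain iterator, adding a format does not require touching the code that produces the documents.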
So what remains pending is PWID and how to tie image search to export.
The current export options are for
#245 suggests adding ZIP as an option and #233 calls for ways of restricting the resources to export. One could argue that we should also have a CSV-with-resources.
Instead of just adding to the list of export options, we should split the export logic into 3 parts (which could be shown below each other, in tabs or in some other way):
New 3-phase system
Initial corpus selection
Whether the user chooses to perform plain search, image search, grouped search, geo search (with the map function) or something else, it should be possible to export "what the user expects". This is not the same as what the user sees, with WARC-with-resources as an example of that.
Corpus adjustment
Export format
gz-compression
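The three phases above could be composed roughly as in the following sketch. Everything in it is an assumption made for illustration: the function names, the document fields, the substring match standing in for a real search, and the choice of JSON-Lines as the export format. The export step also builds the whole payload in memory for brevity, whereas a real export would stream.

```python
import gzip
import json
from typing import Any, Callable, Dict, Iterable, Iterator

Doc = Dict[str, Any]


def select_corpus(index: Iterable[Doc], term: str) -> Iterator[Doc]:
    """Phase 1 (hypothetical): stand-in for a search, keeps docs whose text contains term."""
    return (d for d in index if term in d.get("text", ""))


def adjust_corpus(docs: Iterable[Doc], keep: Callable[[Doc], bool]) -> Iterator[Doc]:
    """Phase 2 (hypothetical): restrict the selection with a caller-supplied rule."""
    return (d for d in docs if keep(d))


def export_jsonl(docs: Iterable[Doc], compress: bool = False) -> bytes:
    """Phase 3 (hypothetical): serialize as JSON-Lines, optionally gz-compressed."""
    text = "".join(json.dumps(d, sort_keys=True) + "\n" for d in docs)
    payload = text.encode("utf-8")
    return gzip.compress(payload) if compress else payload


# Made-up index entries.
index = [
    {"id": "1", "text": "solr export", "content_type": "text/html"},
    {"id": "2", "text": "solr logo", "content_type": "image/png"},
    {"id": "3", "text": "unrelated", "content_type": "text/html"},
]
selected = select_corpus(index, "solr")
adjusted = adjust_corpus(selected, lambda d: d["content_type"] == "text/html")
blob = export_jsonl(adjusted, compress=True)
```

Keeping the phases as separate functions means a new search type only touches phase 1 and a new format only touches phase 3.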