Went with option A:
We create a totally separate, dedicated command (easy to do, but very specific and not flexible)
A new index command was added like this:
dhconveyor index -p pipeline.ini
With pipeline.ini:
[Indexer]
plugin = Solr
[plugin_indexer_Solr]
file_name = '/tmp/bulk.json'
request_handler = 'http://datahub.box:8983/solr/blacklight-core/update/json'
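With these two settings, the indexer essentially performs a single bulk HTTP POST of file_name to request_handler. A rough curl equivalent of that call (a sketch only, not the plugin's actual implementation; the commit=true parameter is an assumption about how the commit is triggered):
# Push the prepared bulk file to the Solr JSON update handler in one request.
$ curl 'http://datahub.box:8983/solr/blacklight-core/update/json?commit=true' \
    -H 'Content-Type: application/json' \
    --data-binary @/tmp/bulk.json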
The indexer doesn't create a JSON file because it shouldn't do two things at once: it just pushes an existing flat file to Solr. As such, the JSON file can be created with other tools, including Datahub::Factory::Transport.
$ catmandu convert OAI --url https://biblio.ugent.be/oai/ --fix xml-to-json.fix to JSON > /tmp/bulk.json
$ jsonlint /tmp/bulk.json
$ dhconveyor index -p pipeline.ini
Follow up in separate, specific issues.
At this point, the Solr module pushes records to the index synchronously. These calls are blocking, which means each call has to push, process and return to the factory before the next call is executed.
This approach is usable for small sets of data (e.g. 2,000 records) but not for 100,000 records.
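For illustration only, a synchronous push boils down to one HTTP round trip per record, roughly like the shell sketch below (the records.jsonl file with one JSON document per line is hypothetical, and this is not how the module itself is implemented):
# One blocking request per record; commits omitted for brevity.
# At roughly 50 ms per round trip, 100,000 records would take well over an hour.
while read -r doc; do
  # Wrap the single document in a JSON array, as the /update/json handler expects.
  printf '[%s]' "$doc" | curl -s 'http://datahub.box:8983/solr/blacklight-core/update/json' \
    -H 'Content-Type: application/json' \
    --data-binary @- > /dev/null
done < records.jsonl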
Apache Solr provides import handlers to load a bulk set of data into the index. Instead of pushing discrete JSON objects as individual HTTP messages, a JSON formatted file containing all the records is prepared and pushed to a separate API endpoint. This triggers a direct, fast import and indexing process within Solr: 6,000+ records are easily imported in under 2 seconds this way.
However, this is a two-step process:
1. Generate a JSON file containing all the records.
2. Push that file to the Solr import endpoint.
In our current setup, a pipeline configuration will result in the generation of the JSON file. The architecture doesn't automate the second step: this still needs to be done manually.
So, the question is: should / could / how do we integrate this into the factory?
Several options: