thedatahub / Datahub-Factory

Datahub::Factory - Transport metadata between Collection Management Systems and the Datahub

Bulk import into Solr #42

Closed. netsensei closed this issue 7 years ago.

netsensei commented 7 years ago

At this point, the Solr module pushes records to the index synchronously. These calls are blocking: each call takes a set amount of time to push, process, and return to the factory before the next call is executed.
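
As a rough shell analogy (not the module's actual Perl code; the records/ directory is hypothetical, and the Solr URL is taken from the configuration further down), the current behaviour amounts to one blocking HTTP round trip per record:

# Each record is POSTed and must be processed and acknowledged
# before the loop can move on to the next one.
for doc in records/*.json; do
  curl -s 'http://datahub.box:8983/solr/blacklight-core/update/json/docs' \
    -H 'Content-Type: application/json' \
    --data-binary @"$doc" > /dev/null
done
# Make the documents searchable with a single commit at the end.
curl 'http://datahub.box:8983/solr/blacklight-core/update?commit=true'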

This approach is usable for small sets of data (e.g. 2,000 records) but not for 100,000 records.

Apache Solr provides import handlers for bulk-loading data into the index. Instead of pushing discrete JSON objects as individual HTTP messages, a JSON-formatted file containing all the records is prepared and pushed to a separate API endpoint. This triggers a direct, fast import and indexing process within Solr: 6,000+ records are easily imported in under 2 seconds this way.
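
For comparison with the loop above, the bulk route boils down to a single HTTP request; a minimal sketch, assuming the request handler URL and file path that appear in the configuration further down:

$ curl 'http://datahub.box:8983/solr/blacklight-core/update/json?commit=true' \
    -H 'Content-Type: application/json' \
    --data-binary @/tmp/bulk.json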

However, this is a two-step process:

  1. Generate a JSON file from a data source using a fix; the output adheres to the Solr schema.
  2. Push the JSON file to the Solr index via the import handler.

In our current setup, a pipeline configuration results in the generation of a JSON file. The architecture doesn't automate the second step; it still needs to be done manually.

So, the question is: should we, could we, and how do we integrate this into the factory?

Several options:

netsensei commented 7 years ago

Went with option A:

We create a totally separate, dedicated command (easy to do, but very specific and not flexible)

A new index command was added like this:

dhconveyor index -p pipeline.ini

With pipeline.ini:

[Indexer]
plugin = Solr

[plugin_indexer_Solr]
file_name = '/tmp/bulk.json'
request_handler = 'http://datahub.box:8983/solr/blacklight-core/update/json'

The indexer doesn't create the JSON file itself, because it shouldn't do two things at once: it just pushes an existing flat file to Solr. This means the JSON file can be created with other tools, including Datahub::Factory::Transport.

$ catmandu convert OAI --url https://biblio.ugent.be/oai/ --fix xml-to-json.fix to JSON > /tmp/bulk.json
$ jsonlint /tmp/bulk.json
$ dhconveyor index -p pipeline.ini
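
A quick sanity check after the import is to ask the core for its document count; a minimal sketch, assuming the same core as in pipeline.ini (numFound in the response should match the number of records in /tmp/bulk.json):

$ curl 'http://datahub.box:8983/solr/blacklight-core/select?q=*:*&rows=0'
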
netsensei commented 7 years ago

Follow-up in separate, specific issues.