thedatahub / Datahub-Factory

Datahub::Factory - Transport metadata between Collection Management Systems and the Datahub

Bulk import into Solr #42

Closed. netsensei closed this issue 7 years ago.

netsensei commented 7 years ago

At this point, the Solr module pushes records to the index synchronously. These calls are blocking: each call takes a set amount of time to push, process, and return to the factory before the next call is executed.
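
As a rough shell analogy (not the module's actual Perl code; the records/ directory is hypothetical, and the Solr URL is taken from the configuration further down), the current behaviour amounts to one blocking HTTP round trip per record:

# Each record is POSTed and must be processed and acknowledged
# before the loop can move on to the next one.
for doc in records/*.json; do
  curl -s 'http://datahub.box:8983/solr/blacklight-core/update/json/docs' \
    -H 'Content-Type: application/json' \
    --data-binary @"$doc" > /dev/null
done
# Make the documents searchable with a single commit at the end.
curl 'http://datahub.box:8983/solr/blacklight-core/update?commit=true'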

This approach is usable for small sets of data (e.g. 2,000 records) but not for 100,000 records.

Apache Solr provides import handlers for bulk-loading data into the index. Instead of pushing discrete JSON objects as individual HTTP messages, a JSON-formatted file containing all the records is prepared and pushed to a separate API endpoint. This triggers a direct, fast import and indexing process within Solr: 6,000+ records are easily imported in under 2 seconds this way.
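
For comparison with the loop above, the bulk route boils down to a single HTTP request; a minimal sketch, assuming the request handler URL and file path that appear in the configuration further down:

$ curl 'http://datahub.box:8983/solr/blacklight-core/update/json?commit=true' \
    -H 'Content-Type: application/json' \
    --data-binary @/tmp/bulk.json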

However, this is a two-step process:

  1. Generate a JSON file from a data source using a fix; the output adheres to the Solr schema.
  2. Push the JSON file to the Solr index via the import handler.

In our current setup, a pipeline configuration results in the generation of a JSON file. The architecture doesn't automate the second step; it still needs to be done manually.

So, the question is: should we, could we, and how do we integrate this into the factory?

Several options:

netsensei commented 7 years ago

Went with option A:

We create a totally separate, dedicated command (easy to do, but very specific and not flexible)

A new index command was added like this:

dhconveyor index -p pipeline.ini

With pipeline.ini:

[Indexer]
plugin = Solr

[plugin_indexer_Solr]
file_name = '/tmp/bulk.json'
request_handler = 'http://datahub.box:8983/solr/blacklight-core/update/json'

The indexer doesn't create the JSON file itself, because it shouldn't do two things at once: it just pushes an existing flat file to Solr. This means the JSON file can be created with other tools, including Datahub::Factory::Transport.

$ catmandu convert OAI --url https://biblio.ugent.be/oai/ --fix xml-to-json.fix to JSON > /tmp/bulk.json
$ jsonlint /tmp/bulk.json
$ dhconveyor index -p pipeline.ini
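
A quick sanity check after the import is to ask the core for its document count; a minimal sketch, assuming the same core as in pipeline.ini (numFound in the response should match the number of records in /tmp/bulk.json):

$ curl 'http://datahub.box:8983/solr/blacklight-core/select?q=*:*&rows=0'
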
netsensei commented 7 years ago

Follow-up in separate, specific issues.