Get the DOAJ articles into Solr

petermr / openVirus

aggregation of scholarly publications and extracted knowledge on viruses and epidemics.

The Unlicense

Get the DOAJ articles into Solr #34

Open deadlyvices opened 4 years ago

deadlyvices commented 4 years ago

Download the DOAJ article dump and uncompress it
Do a data-driven indexing into Solr on the Azure box to see how it copes
Investigate the article schema further and write a schema.xml based on it
Re-index with a predefined schema
Document access to the Solr index

deadlyvices commented 4 years ago

The articles are combined into massive JSON files, so each article will need to be extracted from its file. I'll have to think about how to do this, probably using KNIME.

deadlyvices commented 4 years ago

Currently pulling apart the docs using KNIME. However, Solr's nested-document indexing may be an option: https://lucene.apache.org/solr/guide/8_0/indexing-nested-documents.html

anjackson commented 4 years ago

It’s probably too large to POST to Solr in one big chunk. If you have any problems I should be able to split it with a Python script (and a lotta RAM).

anjackson commented 4 years ago

Ah hang on, it’s broken into batches of 100,000. That might work.

anjackson commented 4 years ago

In case it helps, if you can run jq, you can split the single JSON file into JSONLines format so each line is one element of the original array:

 jq -cn --stream 'fromstream(1|truncate_stream(inputs))' doaj_article_data_2020-04-01/article_batch_1.json > doaj_article_data_2020-04-01/article_batch_1.jsonl
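If jq is not available, a rough Python equivalent is sketched below. Unlike the streaming jq command, it loads the whole array into memory with `json.load`, so it needs enough RAM for one batch file. The function name `json_array_to_jsonl` is made up for illustration, and the sketch assumes each batch file is a single top-level JSON array.

```python
import json
from pathlib import Path


def json_array_to_jsonl(src: Path, dst: Path) -> int:
    """Write each element of the top-level JSON array in src as one line of dst.

    Returns the number of records written.
    """
    # json.load reads the entire array into memory (not streaming like jq --stream).
    with src.open(encoding="utf-8") as f:
        records = json.load(f)
    with dst.open("w", encoding="utf-8") as out:
        for record in records:
            # json.dumps without indent emits one line; newlines inside strings are escaped.
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(records)
```

For example, `json_array_to_jsonl(Path("article_batch_1.json"), Path("article_batch_1.jsonl"))` would produce the same kind of one-record-per-line file as the jq command.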

You could then split the JSONL file into smaller chunks and use those. I believe Solr supports the JSONL format, so you should be able to POST the chunks directly into Solr.
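A minimal sketch of that chunk-and-POST step follows. Everything specific in it is an assumption rather than something confirmed in this thread: the helper names `split_jsonl` and `post_chunk`, the chunk size of 10,000 lines, and the use of Solr's `/update/json/docs` handler as the target endpoint. Whether the handler accepts the chunks as-is should be checked against the Solr version on the box.

```python
from __future__ import annotations

import urllib.request
from itertools import islice
from pathlib import Path


def split_jsonl(src: Path, out_dir: Path, lines_per_chunk: int = 10_000) -> list[Path]:
    """Split a JSONL file into numbered chunk files of at most lines_per_chunk lines."""
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks: list[Path] = []
    with src.open(encoding="utf-8") as f:
        while True:
            # islice pulls the next block of lines without reading the whole file.
            lines = list(islice(f, lines_per_chunk))
            if not lines:
                break
            chunk = out_dir / f"{src.stem}_{len(chunks):05d}.jsonl"
            with chunk.open("w", encoding="utf-8") as out:
                out.writelines(lines)
            chunks.append(chunk)
    return chunks


def post_chunk(chunk: Path, solr_url: str) -> int:
    """POST one chunk to Solr and return the HTTP status code.

    solr_url is an assumption, e.g. (hypothetical collection name "doaj"):
    http://localhost:8983/solr/doaj/update/json/docs?commit=true
    """
    req = urllib.request.Request(
        solr_url,
        data=chunk.read_bytes(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Splitting first and posting chunk by chunk keeps each request small, and a failed chunk can be retried without re-sending the whole batch.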