Open deadlyvices opened 4 years ago
The articles are combined into massive JSON files. They will need to be extracted from each file. I'll have to think about how to do this, probably using KNIME.
Currently pulling apart the docs using KNIME. However this may be an option: https://lucene.apache.org/solr/guide/8_0/indexing-nested-documents.html
It’s probably too large to POST to Solr in one big chunk. If you have any problems I should be able to split it with a Python script (and a lotta RAM).
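For reference, a rough sketch of the kind of Python script meant here. It assumes each file is a single top-level JSON array, and it loads the whole array into memory (hence the RAM). The function name `split_json_array`, the output file naming and the default chunk size are just placeholders for illustration.

```python
import json
from pathlib import Path


def split_json_array(src, out_dir, chunk_size=1000):
    """Split one big JSON array file into smaller JSON array files.

    Assumption: the source file is a single top-level JSON array.
    The whole array is loaded into memory at once.
    """
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    with src.open(encoding="utf-8") as fh:
        records = json.load(fh)  # entire file held in RAM

    paths = []
    for start in range(0, len(records), chunk_size):
        # Hypothetical naming scheme: <original stem>_partNNNNN.json
        part = out_dir / f"{src.stem}_part{start // chunk_size:05d}.json"
        with part.open("w", encoding="utf-8") as out:
            json.dump(records[start:start + chunk_size], out, ensure_ascii=False)
        paths.append(part)
    return paths
```

Something like `split_json_array("article_batch_1.json", "chunks", chunk_size=5000)` would then write a series of smaller array files into `chunks/`.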
Ah hang on, it’s broken into batches of 100,000. That might work.
In case it helps, if you can run jq, you can split the single JSON file into JSONLines format so each line is one element of the original array:
jq -cn --stream 'fromstream(1|truncate_stream(inputs))' doaj_article_data_2020-04-01/article_batch_1.json > doaj_article_data_2020-04-01/article_batch_1.jsonl
You could then split the jsonl file into smaller chunks, and then use those. I believe Solr supports jsonl format so you should be able to POST them directly into Solr.
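If it is useful, here is a minimal sketch of the chunking step in Python, assuming the jsonl file has one record per line as produced by the jq command above. The function name `split_jsonl`, the chunk size and the output naming are made up for illustration. It reads the file lazily, so it should not need much memory.

```python
from itertools import islice
from pathlib import Path


def split_jsonl(src, out_dir, lines_per_chunk=10000):
    """Split a JSON Lines file into smaller files of at most lines_per_chunk lines.

    Assumption: one JSON record per line. Lines are copied through
    unchanged and are never parsed.
    """
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    paths = []
    with src.open(encoding="utf-8") as fh:
        index = 0
        while True:
            # Pull the next batch of lines without reading the whole file.
            chunk = list(islice(fh, lines_per_chunk))
            if not chunk:
                break
            # Hypothetical naming scheme: <original stem>_NNNNN.jsonl
            part = out_dir / f"{src.stem}_{index:05d}.jsonl"
            with part.open("w", encoding="utf-8") as out:
                out.writelines(chunk)
            paths.append(part)
            index += 1
    return paths
```

For example, `split_jsonl("doaj_article_data_2020-04-01/article_batch_1.jsonl", "chunks")` would write the chunk files into `chunks/`, and each one could then be sent to Solr separately.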