openva / crump

A parser for the Virginia State Corporation Commission's business registration records.
https://vabusinesses.org/
MIT License

Break up Elasticsearch files into smaller files #56

Closed. waldoj closed this issue 10 years ago.

waldoj commented 10 years ago

Elasticsearch chokes on large files, but is fine with smaller ones. We'll need to experiment to figure out how much data it can take at one time. While hardly difficult to do, this complicates things substantially. I took 2_corporate.json down to 10,000 lines, and that was ingested just fine. Next, try 100,000 lines, and then step up by 100,000 at a time.
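
To make that stepping experiment easy to repeat, a small helper could truncate a bulk file to its first N lines before each ingest attempt. This is a minimal sketch: only `2_corporate.json` and the line counts come from the comment above; the `take_lines()` name and the test filename are hypothetical.

```python
from itertools import islice

def take_lines(src, dest, n):
    """Copy the first n lines of src to dest."""
    with open(src) as infile, open(dest, "w") as outfile:
        outfile.writelines(islice(infile, n))

# Step up by 100,000 lines at a time, per the plan above.
for n in range(100000, 1000001, 100000):
    take_lines("2_corporate.json", "2_corporate_test.json", n)
    # ...attempt an Elasticsearch bulk ingest of the test file here...
```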

waldoj commented 10 years ago

It might also be worth trying to chop up the files after outputting them, rather than as they're output. At the moment, we open and close each file at the head of the loop through file types, and modifying that is going to be a pain. It may be a lot easier to add a new stanza, post-output, that handles this.
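
For illustration, that post-output stanza might look something like the sketch below. It assumes a `split_bulk_file()` helper (outlined in a later comment) and guesses at a `MAX_LINES` threshold pending the experiments above; none of these names exist in crump yet.

```python
import glob

MAX_LINES = 200000  # placeholder ceiling; must be even (metadata/data pairs)

# After the existing loop through file types has closed its output files,
# pass over the finished files and split any that exceed the threshold.
for filename in glob.glob("*.json"):
    split_bulk_file(filename, MAX_LINES)
```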

waldoj commented 10 years ago

I think this is going to require an entirely new function. It should:

waldoj commented 10 years ago

3_lp.json, at 22,091 lines in length, imported just fine. And 7_merger.json, at 119,045 lines, was also fine. Even 4_amendments.json, at 225,777 lines, imported without difficulty. So the ceiling may be pretty high.

waldoj commented 10 years ago

5_officers.json is the biggest file, at 1.3M lines right now. If we can get 225,000 lines per file, that'll require 6 files.

waldoj commented 10 years ago

Note that odd-numbered lines contain Elasticsearch metadata and even-numbered lines contain the corresponding data. So make sure that each output file contains an even number of lines.
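
Here's a sketch of such a splitter, reading two lines at a time so a metadata line never gets separated from its data line. The `split_bulk_file()` name and the chunk-naming scheme are assumptions, not crump's actual code.

```python
import os

def split_bulk_file(filename, max_lines):
    """Split a bulk-format file into numbered chunks of at most max_lines
    lines each, keeping every metadata/data pair on consecutive lines."""
    if max_lines % 2 != 0:
        raise ValueError("max_lines must be even to keep pairs intact")
    base, ext = os.path.splitext(filename)
    chunk, lines = 0, []
    with open(filename) as infile:
        while True:
            metadata = infile.readline()
            if not metadata:
                break
            data = infile.readline()  # the document belonging to metadata
            lines.extend([metadata, data])
            if len(lines) >= max_lines:
                with open("%s_%d%s" % (base, chunk, ext), "w") as outfile:
                    outfile.writelines(lines)
                chunk += 1
                lines = []
    if lines:  # write any remainder as the final chunk
        with open("%s_%d%s" % (base, chunk, ext), "w") as outfile:
            outfile.writelines(lines)
```

Splitting 5_officers.json this way with max_lines=225000 would yield the six files estimated above, since 1,300,000 / 225,000 rounds up to 6.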