It might also be worth chopping up the files after they've been output, rather than at the time that they're written. At the moment, we open and close each file at the head of the loop through file types, and modifying that is going to be a pain. It may be a lot easier to add a new stanza, post-output, that handles this.
I think this is going to require an entirely new function (sketched roughly below). It should:

* iterate through each `*.json` file
* split it every N lines, prepending a `[` to each chunk (skipping the first one) and closing each with `\n]\n`
* save each chunk as `originalname.##.json`
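Something like this might work as a starting point. It's only a sketch: the `chop_files` name, the default chunk size, and the in-memory `readlines()` approach are all placeholders, and it leaves out the `[`/`\n]\n` handling, since that depends on the exact file format.

```python
import glob
import os

def chop_files(max_lines=100000):
    """Split each *.json file into numbered chunks of at most max_lines lines."""
    for filename in glob.glob('*.json'):
        basename, extension = os.path.splitext(filename)
        with open(filename) as source:
            lines = source.readlines()

        # Write chunks as originalname.00.json, originalname.01.json, and so on.
        for count, start in enumerate(range(0, len(lines), max_lines)):
            chunk_name = '{0}.{1:02d}{2}'.format(basename, count, extension)
            with open(chunk_name, 'w') as chunk:
                chunk.writelines(lines[start:start + max_lines])
```

Reading each file into memory should be fine at these sizes, but this could be switched to streaming if not.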
`3_lp.json`, at 22,091 lines in length, imported just fine. And `7_merger.json`, at 119,045 lines, was also fine. Even `4_amendments.json`, at 225,777 lines, imported without difficulty. So the ceiling may be pretty high.
`5_officers.json` is the biggest file, at 1.3M lines right now. If we can get 225,000 lines per file, that'll require 6 files.
Note that odd-numbered lines contain Elasticsearch metadata and even-numbered lines contain the corresponding data, so make sure that each output file contains an even number of lines.
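To enforce that, the splitting could step through the lines a pair at a time. A minimal sketch, which just rounds the chunk size down to the nearest even number:

```python
def chop_pairs(lines, max_lines=100000):
    """Yield chunks of at most max_lines lines without splitting a metadata/data pair.

    max_lines is rounded down to an even number, so every chunk ends on a data
    line rather than leaving an orphaned metadata line at the end.
    """
    step = max_lines - (max_lines % 2)
    for start in range(0, len(lines), step):
        yield lines[start:start + step]
```

That generator could slot into the splitting function sketched above in place of the plain slicing.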
Elasticsearch chokes on large files, but is fine with smaller ones. We'll need to experiment to figure out how much data it can take at one time. While hardly difficult to do, this complicates things substantially. I took `2_corporate.json` down to 10,000 lines, and that was ingested just fine. Next, try 100,000 lines, and then step up by 100,000 at a time.
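For the experimenting, something like this could push each chunk at Elasticsearch's `_bulk` endpoint and report the response. A rough sketch only: it assumes a local Elasticsearch on port 9200, the `requests` library, and chunk files named per the scheme above.

```python
import glob
import requests  # assumes the requests library is installed

def bulk_import(pattern='2_corporate.*.json', host='http://localhost:9200'):
    """POST each chunk file to Elasticsearch's _bulk endpoint, one at a time."""
    for chunk_name in sorted(glob.glob(pattern)):
        with open(chunk_name, 'rb') as chunk:
            response = requests.post(
                host + '/_bulk',
                data=chunk,
                headers={'Content-Type': 'application/x-ndjson'})
        # A non-2xx status, or "errors": true in the response body, is the
        # signal to watch for while stepping up the chunk size.
        print('{0}: {1}'.format(chunk_name, response.status_code))
```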