The some-branch adds support for direct indexing of JSON-Lines from the Twitter API. Retrieval is handled by storing the raw JSON in the Solr field tw_json, but that inflates the index.
Another possibility is to index offsets into the JSON-Lines file, just as is done for WARC files. This seems reasonably simple for uncompressed input, but does not work when the input is a single standard gzip stream, since such a stream can only be decompressed from its beginning. The solution is (again) to look at WARCs and provide support for concatenations of multiple gzip archives, one per tweet, so that each tweet can be decompressed independently from its own offset.
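The one-gzip-member-per-tweet idea can be illustrated with a minimal Python sketch. This is not the branch's implementation: the function names `write_multi_member_gzip` and `read_tweet` and the sample tweets are hypothetical, and the sketch assumes the index would store a byte offset and compressed length per tweet.

```python
import gzip
import io
import json


def write_multi_member_gzip(tweets, out):
    """Write each tweet as its own gzip member (hypothetical helper).

    Returns a list of (offset, length) pairs, one per tweet, which is the
    information an index would need in order to retrieve a tweet later.
    """
    index = []
    for tweet in tweets:
        line = (json.dumps(tweet) + "\n").encode("utf-8")
        member = gzip.compress(line)  # a complete, standalone gzip member
        index.append((out.tell(), len(member)))
        out.write(member)
    return index


def read_tweet(src, offset, length):
    """Seek to one member and decompress only that tweet (hypothetical helper)."""
    src.seek(offset)
    return json.loads(gzip.decompress(src.read(length)).decode("utf-8"))


# Made-up sample data, held in memory instead of a file on disk.
tweets = [{"id": 1, "text": "first"}, {"id": 2, "text": "second"}]
buf = io.BytesIO()
index = write_multi_member_gzip(tweets, buf)

# Random access: fetch the second tweet without touching the first.
second = read_tweet(buf, *index[1])

# The concatenation is still valid gzip, so it can also be read sequentially.
all_lines = gzip.decompress(buf.getvalue()).decode("utf-8").splitlines()
```

The trade-off is compression ratio: each member carries its own gzip header and trailer and cannot share a dictionary with its neighbours, so the file is larger than a single gzip stream over the same tweets.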