netarchivesuite / webarchive-discovery

WARC and ARC indexing and discovery tools.
https://github.com/ukwa/webarchive-discovery/wiki

Support offsets for Twitter JSON-Lines #8

Open tokee opened 2 years ago

tokee commented 2 years ago

The some-branch branch adds support for indexing JSON-Lines from the Twitter API directly. Retrieval is currently handled by storing the raw JSON in the Solr field tw_json, but that inflates the index.

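Another possibility is to index offsets into the JSON-Lines file, as is already done for WARC files. This seems reasonably simple for uncompressed input, but it does not work when the input is gzipped as a single stream, because such a stream cannot be decompressed starting from an arbitrary offset. The solution is (again) to follow the WARC approach and support concatenations of multiple gzip members, one per tweet, so that each stored offset points at the start of an independently decompressible member.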
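To illustrate the one-gzip-member-per-tweet idea, here is a minimal Python sketch. It is not the code from any branch of this repository; the function names `repack` and `read_at` are hypothetical, and the sketch assumes each tweet is one complete JSON line. It writes every line as its own gzip member, records the byte offset of each member, and later decompresses a single tweet by seeking to its offset.

```python
import gzip
import zlib


def repack(lines, out):
    """Write each JSON line as its own gzip member.

    Returns a list of (offset, compressed_length) tuples, one per line.
    `out` is any seekable binary file-like object (hypothetical helper).
    """
    index = []
    for line in lines:
        member = gzip.compress(line.encode("utf-8") + b"\n")
        index.append((out.tell(), len(member)))
        out.write(member)
    return index


def read_at(src, offset):
    """Decompress exactly one gzip member that starts at `offset`."""
    src.seek(offset)
    # wbits=31 tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(wbits=31)
    chunks = []
    while not decomp.eof:
        block = src.read(4096)
        if not block:
            break  # truncated input; return what was decoded so far
        chunks.append(decomp.decompress(block))
    return b"".join(chunks).decode("utf-8")
```

The resulting file is still a valid gzip file as a whole, so standard tools can decompress it in one pass, while an index only needs to store the offset of each member. The cost is a somewhat worse compression ratio, since every tweet is compressed without shared context.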

tokee commented 2 years ago

The branch repack_jsonlines is a work in progress. It attempts to add both a repacker and a reader that tracks offsets in concatenated gzip files.
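As a rough illustration of what offset tracking in a concatenated gzip file can look like, here is a small Python sketch. It is not taken from the repack_jsonlines branch; the name `iter_members` is hypothetical, and for brevity it assumes the whole file fits in memory. It walks the data member by member and uses the bytes zlib leaves unconsumed to work out where the next member begins.

```python
import zlib


def iter_members(data):
    """Yield (offset, payload) for each gzip member in `data`.

    `data` is the full content of a concatenated gzip file as bytes.
    Hypothetical helper; a streaming reader would work on chunks instead.
    """
    view = memoryview(data)
    offset = 0
    while offset < len(view):
        # wbits=31 tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(wbits=31)
        payload = decomp.decompress(view[offset:])
        if not decomp.eof:
            raise ValueError(f"truncated gzip member at offset {offset}")
        # Everything after the member's trailer is left in unused_data.
        consumed = len(view) - offset - len(decomp.unused_data)
        yield offset, payload
        offset += consumed
```

Each yielded offset could be stored in the index next to the tweet's other fields, in the same way that WARC record offsets are stored.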