gwu-libraries / TweetSets

Service for creating Twitter datasets for research and archiving.
MIT License

T152 cat json #152 #154

Closed: dolsysmith closed 2 years ago

dolsysmith commented 2 years ago

Concatenates gzipped JSON extracts that share the same date in their filenames. If grouping by date produces too many files, this enhancement also supports concatenating files up to a specified maximum file size (e.g., 2G), taken from the environment variable already provided for the Spark extracts. (For now, the choice between the two modes is hard-coded; in the future, it could be exposed as a command-line parameter.)
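For reviewers, here is a minimal sketch of the two grouping modes described above. It is not the code in this PR. The `YYYYMMDD` filename pattern, the function names, and the size-string format are all assumptions made for illustration. It relies on the fact that gzip files can be joined byte-for-byte, since a multi-member gzip stream is still valid gzip.

```python
import re
import shutil
from collections import defaultdict
from pathlib import Path

# Assumption: filenames contain a YYYYMMDD date. The real pattern may differ.
DATE_RE = re.compile(r"(\d{8})")


def parse_size(text):
    """Convert a size string such as '2G' or '500M' to a number of bytes."""
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    text = text.strip().upper()
    if text and text[-1] in units:
        return int(float(text[:-1]) * units[text[-1]])
    return int(text)


def group_by_date(paths):
    """Map each date found in a filename to the sorted files carrying it."""
    groups = defaultdict(list)
    for path in sorted(paths):
        match = DATE_RE.search(Path(path).name)
        if match is None:
            # Surface non-matching filenames instead of silently skipping them.
            raise ValueError(f"No date found in filename: {path}")
        groups[match.group(1)].append(Path(path))
    return dict(groups)


def group_by_size(paths, max_bytes):
    """Batch files so that each batch stays at or under max_bytes.

    A single file larger than max_bytes still gets a batch of its own.
    """
    batches, current, total = [], [], 0
    for path in sorted(paths):
        size = Path(path).stat().st_size
        if current and total + size > max_bytes:
            batches.append(current)
            current, total = [], 0
        current.append(Path(path))
        total += size
    if current:
        batches.append(current)
    return batches


def concatenate(paths, out_path):
    """Append the raw bytes of each gzip file to out_path, in order."""
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
```

Because the files are never decompressed, this approach is cheap in memory and CPU. The trade-off is that the size limit applies to compressed bytes, not to the size of the unzipped JSON.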

Testing

Load dataset and ensure that JSON files are visible, downloadable, and unzip properly.

lwrubel commented 2 years ago

I tested three cases: loading files from SFM by date; editing the code to load by size; and loading files by date when the filenames do not match the pattern (the appropriate error was logged and raised).

loader.docker-compose.yml needs the volumes lines commented back out so that it works on prod where the code is not available to be linked in.

With that, looks good to merge.