gwu-libraries / TweetSets

Service for creating Twitter datasets for research and archiving.
MIT License

Update path to full extracts #140

Closed · dolsysmith closed this 2 years ago

dolsysmith commented 3 years ago

Spark always creates a directory when writing files to disk. As a result, extracts created during ingest will be in subdirectories as follows:

- dataset_directory
  - tweet_ids
  - tweet_json
  - tweet_csv

The current implementation stores all extracts in the same directory (under full_datasets). We could handle this in at least two ways:

  1. Have the loader script move the files into a single directory after Spark finishes (see the sketch after this list).
  2. Refactor tweetset_server.py to check for the individual directories.
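A minimal sketch of option 1, as a loader-script helper. The function name, the extract subdirectory names, and the part-file prefixing scheme are assumptions for illustration, not the project's actual code:

```python
import shutil
from pathlib import Path

def consolidate_extracts(dataset_dir, extract_types=("tweet_ids", "tweet_json", "tweet_csv")):
    """Move Spark's part files out of their per-extract subdirectories
    into the top-level dataset directory (hypothetical helper)."""
    dataset_dir = Path(dataset_dir)
    for extract_type in extract_types:
        subdir = dataset_dir / extract_type
        if not subdir.is_dir():
            continue
        # Spark writes each extract as part-* files plus a _SUCCESS marker.
        for part_file in subdir.glob("part-*"):
            # Prefix with the extract type so part files from different
            # extracts don't collide once they share a directory.
            target = dataset_dir / f"{extract_type}-{part_file.name}"
            shutil.move(str(part_file), str(target))
        # Remove the now-emptied Spark output directory (and its _SUCCESS file).
        shutil.rmtree(str(subdir))
```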
dolsysmith commented 3 years ago

Also need to create a shared directory on prod for storing the full extracts (distinct from the shared TS/SFM directory used for loading them).

dolsysmith commented 3 years ago

Copy the JSON files to the full extracts directory instead of creating them anew.
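A minimal sketch of that change, assuming the loader has access to both the ingest output and the full extracts directory (the names here are illustrative; the .gz extension matches the files mentioned later in this thread):

```python
import shutil
from pathlib import Path

def copy_json_extracts(ingest_dir, full_extracts_dir):
    """Copy the gzipped JSON extracts produced during ingest into the
    full extracts directory, rather than regenerating them with Spark."""
    full_extracts_dir = Path(full_extracts_dir)
    full_extracts_dir.mkdir(parents=True, exist_ok=True)
    for json_file in Path(ingest_dir).glob("*.json.gz"):
        # copy2 preserves timestamps, which helps when auditing extracts.
        shutil.copy2(str(json_file), str(full_extracts_dir / json_file.name))
```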

dolsysmith commented 3 years ago

Target number of tweets per extract type to achieve 2GB per file:

| Extract type | Target tweets per file |
| --- | --- |
| json | 511,792 |
| csv | 10,865,173 |
| mentions/edges | 137,920,613 |
| mentions/nodes | 259,158,572 |
| agg-mentions | 348,209,632 |
| ids | 222,774,986 |
| users | 49,982,150 |

The numbers were derived from extracts created on the Congress 115 dataset. To arrive at these estimates, I counted the number of lines in each file, divided each file's size by its line count, and averaged the result (bytes per row) for each extract type. I then divided our target max file size (2GB) by that average to find the number of rows (tweets) that fit into each extract type at the target size.
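A worked version of that arithmetic, as a sketch; the sample file sizes and line counts below are placeholders, not the actual Congress 115 measurements:

```python
TARGET_FILE_SIZE = 2 * 1024**3  # target max file size: 2GB in bytes

def tweets_per_file(file_sizes_bytes, line_counts):
    """Estimate how many rows (tweets) of one extract type fit in a
    file of TARGET_FILE_SIZE, given parallel lists of sample file
    sizes and their line counts."""
    # Average bytes per row across the sample files.
    bytes_per_row = sum(
        size / lines for size, lines in zip(file_sizes_bytes, line_counts)
    ) / len(file_sizes_bytes)
    return int(TARGET_FILE_SIZE / bytes_per_row)

# Example with made-up numbers for two sample JSON extract files:
# ~4,100-4,200 bytes per row yields roughly 520,000 tweets per 2GB file.
print(tweets_per_file([1_500_000_000, 1_600_000_000], [360_000, 390_000]))
```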

lwrubel commented 3 years ago

Possible language to add to the full datasets page:

Windows users may need to use an application such as 7-Zip to unzip and open files with a .gz extension.