gwu-libraries / TweetSets

Service for creating Twitter datasets for research and archiving.
MIT License

Large file downloads are timing out #139

Open · kerchner opened this issue 3 years ago

kerchner commented 3 years ago

Currently on production:

As a result, .jsonl.zip files are around 11 GB in size, which would take roughly 30-60 minutes to download (i.e., at effective speeds of about 3-6 MB/s), but this exceeds the 600-second (10-minute) timeout.

Consistent with this, @dolsysmith notes gunicorn errors like this one in the production log:

[2021-08-13 10:54:40 -0400] [7] [CRITICAL] WORKER TIMEOUT (pid:64)
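
For reference, the 600-second limit is gunicorn's worker timeout. A minimal sketch of where that value lives, assuming a standard gunicorn config file (gunicorn config files are plain Python); the actual TweetSets deployment may set it elsewhere, e.g. on the gunicorn command line:

# gunicorn.conf.py -- illustrative only, not the TweetSets deployment config
bind = "0.0.0.0:8080"   # hypothetical bind address
workers = 4             # hypothetical worker count
timeout = 600           # workers silent longer than this are killed and restarted,
                        # so a response that streams for more than 600 s hits this limit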

Suggested remediations:

dolsysmith commented 3 years ago

For full extracts, we can control the number of rows per file with the maxRecordsPerFile Spark parameter.
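
A minimal sketch of that option, assuming a PySpark DataFrame named tweets_df standing in for whatever the loader actually builds (the paths and the 1M-row cap are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tweetsets-extract-example").getOrCreate()

# Hypothetical input; the real loader constructs this DataFrame from the dataset's tweets.
tweets_df = spark.read.json("/path/to/dataset/*.jsonl")

# Cap each output CSV at 1,000,000 rows so no single file grows unbounded.
(tweets_df.write
    .option("maxRecordsPerFile", 1000000)
    .csv("/path/to/extract/csv", header=True, compression="gzip", mode="overwrite"))

The same limit can also be set session-wide with spark.conf.set("spark.sql.files.maxRecordsPerFile", 1000000).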

dolsysmith commented 1 year ago

The Spark loader compresses the CSV files but does not bundle them into archives. For a large dataset (e.g., Coronavirus), the gzipped CSV files can be chunked and zipped manually after loading with the following bash script (currently saved on production in /opt/TweetSets/chunk_csv.sh).

It yields zip archives of roughly 1.2 GB each; unzipped, each archive contains up to 5 gzipped CSV files of at most 1M rows apiece.

#!/bin/bash
# Zip the loader's .csv.gz output in chunks of 5 files per archive.
# Command-line argument should be the directory containing the files to be zipped.

# List the gzipped CSV files, then split that list into chunks of 5 names each
# (csvfiles00, csvfiles01, ...).
ls "$1"/*.csv.gz > csvfiles
split -d -l 5 csvfiles csvfiles

# Archives are written to the parent of the input directory.
parentdir="$(dirname "$1")"

counter=1
for i in csvfiles[0-9][0-9]; do
  # -j ("junk paths") stores files without their directory prefix;
  # -@ reads the list of files to add from stdin.
  zip -j "$parentdir/tweets-$((counter++)).csv.zip" -@ < "$i"
done

# Clean up the temporary file lists.
rm csvfiles
rm csvfiles[0-9][0-9]
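
Usage sketch (the dataset path is a placeholder): bash /opt/TweetSets/chunk_csv.sh /path/to/full_dataset/csv. The tweets-N.csv.zip archives are written to the parent of the directory passed as the argument, and the temporary csvfiles* lists are created in (and removed from) the current working directory.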