gwu-libraries / TweetSets

Service for creating Twitter datasets for research and archiving.
MIT License

Large file downloads are timing out #139

Open · kerchner opened this issue 3 years ago

kerchner commented 3 years ago

Currently on production:

As a result, .jsonl.zip files are around 11 GB in size, which would take roughly 30-60 minutes to download (i.e., at effective speeds of about 3-6 MB/s), but this exceeds the 600-second (10-minute) timeout.

Consistent with this, @dolsysmith notes gunicorn errors like this one in the production log:

[2021-08-13 10:54:40 -0400] [7] [CRITICAL] WORKER TIMEOUT (pid:64)
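
For reference, the 600-second limit is gunicorn's worker timeout. A minimal sketch of where that value lives, assuming a standard gunicorn config file (gunicorn config files are plain Python); the actual TweetSets deployment may set it elsewhere, e.g. on the gunicorn command line:

# gunicorn.conf.py -- illustrative only, not the TweetSets deployment config
bind = "0.0.0.0:8080"   # hypothetical bind address
workers = 4             # hypothetical worker count
timeout = 600           # workers silent longer than this are killed and restarted,
                        # so a response that streams for more than 600 s hits this limit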

Suggested remediations:

dolsysmith commented 3 years ago

For full extracts, we can control the number of rows per file with the maxRecordsPerFile Spark parameter.
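
A minimal sketch of that option, assuming a PySpark DataFrame named tweets_df standing in for whatever the loader actually builds (the paths and the 1M-row cap are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tweetsets-extract-example").getOrCreate()

# Hypothetical input; the real loader constructs this DataFrame from the dataset's tweets.
tweets_df = spark.read.json("/path/to/dataset/*.jsonl")

# Cap each output CSV at 1,000,000 rows so no single file grows unbounded.
(tweets_df.write
    .option("maxRecordsPerFile", 1000000)
    .csv("/path/to/extract/csv", header=True, compression="gzip", mode="overwrite"))

The same limit can also be set session-wide with spark.conf.set("spark.sql.files.maxRecordsPerFile", 1000000).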

dolsysmith commented 1 year ago

The Spark loader compresses the CSV files but does not bundle them into archives. For a large dataset (e.g., Coronavirus), the gzipped CSV files can be chunked and zipped manually after loading with the following bash script (currently saved on production in /opt/TweetSets/chunk_csv.sh).

It yields zip archives of roughly 1.2 GB each; unzipped, each archive contains up to 5 gzipped CSV files of at most 1M rows apiece.

#!/bin/bash
# Zip the loader's .csv.gz output in chunks of 5 files per archive.
# Command-line argument should be the directory containing the files to be zipped.

# List the gzipped CSV files, then split that list into chunks of 5 names each
# (csvfiles00, csvfiles01, ...).
ls "$1"/*.csv.gz > csvfiles
split -d -l 5 csvfiles csvfiles

# Archives are written to the parent of the input directory.
parentdir="$(dirname "$1")"

counter=1
for i in csvfiles[0-9][0-9]; do
  # -j ("junk paths") stores files without their directory prefix;
  # -@ reads the list of files to add from stdin.
  zip -j "$parentdir/tweets-$((counter++)).csv.zip" -@ < "$i"
done

# Clean up the temporary file lists.
rm csvfiles
rm csvfiles[0-9][0-9]
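
Usage sketch (the dataset path is a placeholder): bash /opt/TweetSets/chunk_csv.sh /path/to/full_dataset/csv. The tweets-N.csv.zip archives are written to the parent of the directory passed as the argument, and the temporary csvfiles* lists are created in (and removed from) the current working directory.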