twitter / cache-trace

A collection of Twitter's anonymized production cache traces.
Creative Commons Attribution 4.0 International

CMU FTP returns 403 Forbidden #1

Closed GeorgeErickson closed 3 years ago

GeorgeErickson commented 3 years ago

Problem: the URL https://ftp.pdl.cmu.edu/pub/datasets/twemcacheWorkload/open_source is no longer accessible (it used to work).

Questions:

  1. What's the best way to download this dataset (in the US northeast)?
  2. Could we get checksums of each file (e.g. a sha256)?

GeorgeErickson commented 3 years ago

A requester-pays S3 bucket would be extra nice (or something like https://registry.opendata.aws).

1a1a11a commented 3 years ago

Hi George, thank you for letting us know the FTP is down. I have reported it to the admin, and it should be fixed in the next few days. For your questions:

  1. Both the CMU FTP and SNIA should be good for US northeast access.
  2. Yes, I will try to generate a sha256 checksum for each workload and will update when it finishes.

A requester-pays S3 bucket is a good idea, but we may not have a continuously active AWS account to pay for the storage cost. SNIA should be able to provide long-term data access.

1a1a11a commented 3 years ago

Hi @GeorgeErickson, the server is back, and the sha256 file is under the same path.
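
For readers who want to use that file, here is a minimal sketch of checking one downloaded trace against it. It assumes the sha256 file follows the usual `sha256sum` text layout (one `<hex digest>  <file name>` pair per line, bare file names with no directory or `*` prefix), which has not been confirmed for this dataset. The function name `verify_trace` is made up for illustration.

verify_trace() {
  # $1: path to a downloaded trace, $2: path to the checksum list
  file="$1"
  sums="$2"
  name=$(basename "$file")

  # Look up the expected digest by file name (assumed to be field 2).
  expected=$(awk -v f="$name" '$2 == f { print $1 }' "$sums")
  if [ -z "$expected" ]; then
    echo "no checksum listed for $name" >&2
    return 2
  fi

  # Compute the actual digest and compare; returns 0 on match, 1 on mismatch.
  actual=$(sha256sum "$file" | awk '{ print $1 }')
  [ "$expected" = "$actual" ]
}

Usage would look like `verify_trace cluster1.0.zst sha256 && echo OK`. A return code of 2 means the file has no entry in the list, which is distinct from a digest mismatch (return code 1).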

GeorgeErickson commented 3 years ago

Awesome, thanks for the quick fix!

GeorgeErickson commented 3 years ago

A minor issue with the sha256 file: cluster1.0.zst is missing.

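get-storj-clusters() {
  # Field 2 of each line in storj_wget.sh is taken to be the download URL.
  # -n 1 runs basename once per URL; with several arguments at once,
  # basename would treat the second one as a suffix or reject the extras.
  curl -sS https://raw.githubusercontent.com/twitter/cache-trace/master/storj_wget.sh \
    | awk '{ print $2 }' \
    | xargs -n 1 basename
}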

get-sha256-clusters() {
  curl -sS https://ftp.pdl.cmu.edu/pub/datasets/twemcacheWorkload/open_source/sha256  \
    | awk '{ print $2 }'
}

diff <(get-storj-clusters | sort -V)  <(get-sha256-clusters | sort -V)

# Output:
#   cluster1.0.zst

1a1a11a commented 3 years ago

Updated.