subugoe / leine

Data Pipelines for @subugoe/wag
https://subugoe.github.io/leine
MIT License

lock down crossref snapshot versions #15

Open maxheld83 opened 3 years ago

maxheld83 commented 3 years ago

So ... it appears that these snapshots may sometimes be changed after the fact.

This could really mess up our reproducibility.

for example:

wget --server-response --spider --verbose \
  https://api.crossref.org/snapshots/monthly/2018/04/all.json.tar.gz
Spider mode enabled. Check if remote file exists.
--2021-04-12 21:55:40--  https://api.crossref.org/snapshots/monthly/2018/04/all.json.tar.gz
Resolving api.crossref.org (api.crossref.org)... 208.254.38.72
Connecting to api.crossref.org (api.crossref.org)|208.254.38.72|:443... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 302 Found
  server: Apache-Coyote/1.1
  crossref-deployment-name: svc1b-1
  location: https://s3.amazonaws.com/org.crossref.snapshots/monthly/2018/04/all.json.tar.gz?Signature=M6eTWtV8BGlFQJUYsM%2BtiFS%2B57c%3D&AWSAccessKeyId=AKIAXKMFHONDMY2XFPDT&Expires=1618258241
  content-length: 0
  date: Mon, 12 Apr 2021 19:55:41 GMT
  connection: close
Location: https://s3.amazonaws.com/org.crossref.snapshots/monthly/2018/04/all.json.tar.gz?Signature=M6eTWtV8BGlFQJUYsM%2BtiFS%2B57c%3D&AWSAccessKeyId=AKIAXKMFHONDMY2XFPDT&Expires=1618258241 [following]
Spider mode enabled. Check if remote file exists.
--2021-04-12 21:55:41--  https://s3.amazonaws.com/org.crossref.snapshots/monthly/2018/04/all.json.tar.gz?Signature=M6eTWtV8BGlFQJUYsM%2BtiFS%2B57c%3D&AWSAccessKeyId=AKIAXKMFHONDMY2XFPDT&Expires=1618258241
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.26.78
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.26.78|:443... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 200 OK
  x-amz-id-2: MFLya0RX2V5qwqRdOBIhwQtHJabaVOD9I+AXIZX5KYbkz6hyWJejqJgmz64GOJXY4VRJ1N1rw2E=
  x-amz-request-id: CMHYY55S0XR4ASZB
  Date: Mon, 12 Apr 2021 19:55:42 GMT
  Last-Modified: Thu, 17 May 2018 14:15:35 GMT
  ETag: "ed118e2ceb8d05d5bcf53f92fbef2511-2881"
  x-amz-tagging-count: 2
  x-amz-version-id: null
  Accept-Ranges: bytes
  Content-Type: application/x-tar
  Content-Length: 48333216290
  Server: AmazonS3
Length: 48333216290 (45G) [application/x-tar]
Remote file exists.

also recently:

wget --server-response --spider --verbose \
  https://api.crossref.org/snapshots/monthly/2020/09/all.json.tar.gz

maxheld83 commented 3 years ago

I think as a (minimal) response we should lock down the ETag values (checksums?) of the snapshots we use.
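A minimal sketch of what such a lock-down could look like, assuming we record the ETag seen in the wget output above and fail whenever the server reports a different one. The names `PINNED_ETAG`, `extract_etag` and `check_etag` are made up for illustration and are not part of leine. One caveat: the `-2881` suffix suggests an S3 multipart upload, in which case the ETag is probably not a plain MD5 of the file, so it can be compared between requests but not against a local `md5sum`.

```shell
# Hypothetical pin: the ETag observed for the 2018/04 snapshot in the log above.
PINNED_ETAG='ed118e2ceb8d05d5bcf53f92fbef2511-2881'

# Read `wget --server-response` output on stdin and print the last ETag value,
# without surrounding quotes or a trailing carriage return.
extract_etag() {
  grep -i '^ *ETag:' | tail -n 1 | sed 's/^[^:]*: *//' | tr -d '"\r'
}

# Compare an observed ETag ($1) with the pinned one ($2).
# Returns non-zero and prints a message to stderr on mismatch.
check_etag() {
  if [ "$1" != "$2" ]; then
    echo "snapshot changed: expected ETag $2, got $1" >&2
    return 1
  fi
}

# Usage (needs network access; wget prints the headers to stderr):
# url='https://api.crossref.org/snapshots/monthly/2018/04/all.json.tar.gz'
# observed="$(wget --server-response --spider "$url" 2>&1 | extract_etag)"
# check_etag "$observed" "$PINNED_ETAG"
```

Taking the last ETag line matters because the first response is the 302 redirect from api.crossref.org, and only the final S3 response carries the header.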