Closed jameshadfield closed 4 months ago
Here are all the days where we had multiple datasets with different filenames and had to choose one to represent the "snapshot" for the day. The links will show the dataset in Auspice. The first filename is the one the indexer will use for the day (as it was the most recently updated). The takeaways:
- `ncov.json` never appears in the index after 2020-04-07, despite it being regularly re-uploaded to S3. Good.
- We switched from `ncov/open/:region` to `ncov/open/:region/:timeWindow` on 2022-04-30 and uploaded both sets that day. The datasets are quite different, but the indexer is taking the 6m version, which is correct.
2022-04-30 | ncov_open_europe_6m.json, ncov_open_europe.json
2022-04-30 | ncov_open_global_6m.json, ncov_open_global.json
2022-04-30 | ncov_open_north-america_6m.json, ncov_open_north-america.json
2022-04-30 | ncov_open_asia_6m.json, ncov_open_asia.json
2022-04-30 | ncov_open_oceania_6m.json, ncov_open_oceania.json
2022-04-30 | ncov_open_africa_6m.json, ncov_open_africa.json
2022-04-30 | ncov_open_south-america_6m.json, ncov_open_south-america.json
2021-02-15 | ncov_global.json, ncov.json
2021-02-08 | ncov_global.json, ncov.json
2021-02-04 | ncov_global.json, ncov.json
2021-02-02 | ncov_global.json, ncov.json
2021-01-29 | ncov_global.json, ncov.json
2021-01-27 | ncov_global.json, ncov.json
2021-01-25 | ncov_global.json, ncov.json
2021-01-21 | ncov_global.json, ncov.json
2021-01-19 | ncov_global.json, ncov.json
2021-01-16 | ncov_global.json, ncov.json
2021-01-13 | ncov_global.json, ncov.json
2021-01-11 | ncov_global.json, ncov.json
2021-01-07 | ncov_global.json, ncov.json
2021-01-05 | ncov_global.json, ncov.json
2021-01-04 | ncov_global.json, ncov.json
2021-01-01 | ncov_global.json, ncov.json
2020-12-28 | ncov_global.json, ncov.json
2020-12-20 | ncov_global.json, ncov.json
2020-12-18 | ncov_global.json, ncov.json
2020-12-16 | ncov_global.json, ncov.json
2020-12-10 | ncov_global.json, ncov.json
2020-12-08 | ncov_global.json, ncov.json
2020-12-04 | ncov_global.json, ncov.json
2020-11-30 | ncov_global.json, ncov.json
2020-11-27 | ncov_global.json, ncov.json
2020-11-26 | ncov_global.json, ncov.json
2020-11-24 | ncov_global.json, ncov.json
2020-11-20 | ncov_global.json, ncov.json
2020-11-18 | ncov_global.json, ncov.json
2020-11-16 | ncov_global.json, ncov.json
2020-11-14 | ncov_global.json, ncov.json
2020-11-12 | ncov_global.json, ncov.json
2020-11-10 | ncov_global.json, ncov.json
2020-11-06 | ncov_global.json, ncov.json
2020-11-04 | ncov_global.json, ncov.json
2020-11-02 | ncov_global.json, ncov.json
2020-10-29 | ncov_global.json, ncov.json
2020-10-27 | ncov_global.json, ncov.json
2020-10-23 | ncov_global.json, ncov.json
2020-10-21 | ncov_global.json, ncov.json
2020-10-19 | ncov_global.json, ncov.json
2020-10-15 | ncov_global.json, ncov.json
2020-10-13 | ncov_global.json, ncov.json
2020-10-09 | ncov_global.json, ncov.json
2020-10-07 | ncov_global.json, ncov.json
2020-10-05 | ncov_global.json, ncov.json
2020-10-01 | ncov_global.json, ncov.json
2020-09-29 | ncov_global.json, ncov.json
2020-09-25 | ncov_global.json, ncov.json
2020-09-23 | ncov_global.json, ncov.json
2020-09-21 | ncov_global.json, ncov.json
2020-09-03 | ncov_global.json, ncov.json
2020-04-23 | ncov_global.json, ncov.json
2020-04-21 | ncov_global.json, ncov.json
2020-04-20 | ncov_global.json, ncov.json
2020-04-19 | ncov_global.json, ncov.json
2020-04-18 | ncov_global.json, ncov.json
2020-04-17 | ncov_global.json, ncov.json
2020-04-16 | ncov_global.json, ncov.json
2020-04-15 | ncov_global.json, ncov.json
2020-04-14 | ncov_global.json, ncov.json
2020-04-12 | ncov_global.json, ncov.json
2020-04-11 | ncov_global.json, ncov.json
2020-04-10 | ncov_global.json, ncov.json
2020-04-09 | ncov_global.json, ncov.json
2020-04-08 | ncov_global.json, ncov.json
These redirects don't consider snapshot URLs, e.g. "http://localhost:5000/monkeypox/mpxv?c=region" redirects appropriately, but "http://localhost:5000/monkeypox/mpxv@2023-01-26?c=region" 404s (despite there being a `monkeypox_mpxv.json` uploaded that day). I'll fix this up.
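One way to picture the missing behaviour is to split the `@YYYY-MM-DD` version descriptor off the path before the redirect lookup and re-attach it afterwards. The sketch below is only an illustration, not the server's code: the `REDIRECTS` table, the function name `redirect_snapshot_url`, and the `mpox/all-clades` target for `monkeypox/mpxv` are assumptions made for the example.

```python
from urllib.parse import urlsplit

# Hypothetical redirect table: old dataset path -> current dataset path.
REDIRECTS = {"monkeypox/mpxv": "mpox/all-clades"}


def redirect_snapshot_url(url: str) -> str:
    """Return the path (plus query) that a request would be redirected to.

    Any "@<version>" suffix is removed before the redirect lookup and put
    back afterwards, so snapshot URLs follow the same redirects as
    unversioned ones.
    """
    parts = urlsplit(url)
    path, sep, version = parts.path.strip("/").partition("@")
    target = REDIRECTS.get(path, path)  # unknown paths pass through unchanged
    new_path = "/" + target + (sep + version if sep else "")
    return new_path + ("?" + parts.query if parts.query else "")


plain = redirect_snapshot_url("http://localhost:5000/monkeypox/mpxv?c=region")
dated = redirect_snapshot_url("http://localhost:5000/monkeypox/mpxv@2023-01-26?c=region")
```

Under these assumptions both URLs resolve to the same dataset path, and the dated one keeps its `@2023-01-26` descriptor and its query string.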
Update: Functionality added in fc330b1
Recent work surfaced previous dataset versions across URL redirects¹ by mirroring the dataset-name resolution process we use for requests on the server. However, it neglected to consider the redirects that the server handles before that resolution step. This commit adds that functionality as well. This situation was recently discussed in Slack².
To use mpox as an example: the dataset (URL) path "mpox/all-clades" now includes previous versions which were named "monkeypox_mpxv.json", thus extending the start of this dataset's snapshot history back from 2023-09-23 to 2022-06-12.
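The effect on the snapshot history can be sketched as a merge of upload dates across every filename a dataset has used. This is a simplified illustration rather than the indexer's implementation: the index shape, the function name `merged_history`, and the current filename `mpox_all-clades.json` are assumptions, while the dates are the ones mentioned in this thread.

```python
from datetime import date

# Assumed shape of the index: filename -> dates on which it was uploaded.
INDEX = {
    "mpox_all-clades.json": [date(2023, 9, 23)],  # filename is an assumption
    "monkeypox_mpxv.json": [date(2022, 6, 12), date(2023, 1, 26)],
}


def merged_history(index, filenames):
    """Return the sorted, de-duplicated snapshot dates across all filenames."""
    days = set()
    for name in filenames:
        days.update(index.get(name, []))
    return sorted(days)


history = merged_history(INDEX, ["mpox_all-clades.json", "monkeypox_mpxv.json"])
earliest = history[0]  # now comes from the old filename
```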
Our usage of `ncov_global.json` (similarly for other regions) is a lot more complex, because we didn't make clean dataset name switches like monkeypox/mpox.
Looking at the current live index (i.e. before this PR), the `ncov.json` dataset stops being uploaded 2020-04-23 and then starts being uploaded again on 2020-09-03, leaving a large gap in the snapshot history. When we do consider `ncov_global.json` (this PR), we fill in this gap with 99 snapshots. Great!
However, `ncov.json` continued to be uploaded through 2021-02-15. In cases such as this the indexer will pick the last uploaded version in a given day. However, looking at the data it's clear `ncov.json` was not being rebuilt, simply re-uploaded. For instance, here are nextstrain.org URLs you can view the data in:
2020-09-03: ncov.json, ncov_global.json
2021-02-15: ncov.json, ncov_global.json
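The per-day choice mentioned above (the indexer keeps the last upload of each day) can be sketched roughly as follows. The function name `pick_daily_snapshots`, the `(filename, last_modified)` input shape, and the times of day are assumptions for illustration; the sketch also ignores time zones.

```python
from datetime import datetime


def pick_daily_snapshots(uploads):
    """Keep, for each calendar day, the filename with the latest upload time.

    `uploads` is an iterable of (filename, last_modified) pairs, where
    last_modified is a datetime.
    """
    latest = {}
    for filename, modified in uploads:
        day = modified.date()
        if day not in latest or modified > latest[day][1]:
            latest[day] = (filename, modified)
    return {day: name for day, (name, _) in latest.items()}


uploads = [
    ("ncov.json", datetime(2021, 2, 15, 9, 0)),          # time of day is made up
    ("ncov_global.json", datetime(2021, 2, 15, 17, 0)),  # time of day is made up
]
chosen = pick_daily_snapshots(uploads)
```

Because only the upload time is compared, a file that is re-uploaded without being rebuilt can still win the day, which is the concern raised here for `ncov.json`.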
Looking at this data it's clear that we should drop any `ncov.json` datasets after 2020-04-23, which ~I'll add to this PR now~ update: no need to programmatically drop, see next message in this PR.
Closes #784
¹ https://github.com/nextstrain/nextstrain.org/pull/783
² https://bedfordlab.slack.com/archives/CSKMU6YUC/p1706483980082939