nextstrain / nextstrain.org

The Nextstrain website
https://nextstrain.org
GNU Affero General Public License v3.0

[resource indexer] include dataset redirects in index #791

Closed jameshadfield closed 4 months ago

jameshadfield commented 5 months ago

Recent work surfaced previous dataset versions across URL redirects¹ by mirroring the dataset-name resolution process we use for requests on the server. However, it neglected the redirects that the server handles before that resolution step. This commit adds that functionality as well. This situation was recently discussed in Slack².

To use mpox as an example: the dataset (URL) path "mpox/all-clades" now includes previous versions which were named "monkeypox_mpxv.json", thus extending the snapshot history of this dataset from 2023-09-23 to 2022-06-12.
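
As a rough illustration of the idea (not the indexer's actual code), the sketch below groups uploaded files under the dataset path they ultimately resolve to once redirects are followed. The `REDIRECTS` table, the filename-to-path rule, and all function names are assumptions made for this example; the two dates are the ones quoted above.

```python
from collections import defaultdict

# Hypothetical redirect table (old URL path -> current URL path).
# The real server defines its redirects elsewhere; this entry is illustrative.
REDIRECTS = {
    "monkeypox/mpxv": "mpox/all-clades",
}

def url_path_from_filename(filename):
    """Turn a filename such as 'monkeypox_mpxv.json' into a URL path.

    Assumes underscores separate path parts, which is a simplification.
    """
    if filename.endswith(".json"):
        filename = filename[: -len(".json")]
    return filename.replace("_", "/")

def canonical_path(path):
    """Follow redirects until no further redirect applies (with a cycle guard)."""
    seen = set()
    while path in REDIRECTS and path not in seen:
        seen.add(path)
        path = REDIRECTS[path]
    return path

def build_index(uploads):
    """Group (filename, date) uploads under their canonical dataset path."""
    index = defaultdict(list)
    for filename, date in uploads:
        path = canonical_path(url_path_from_filename(filename))
        index[path].append((date, filename))
    # ISO dates sort chronologically as plain strings.
    return {path: sorted(versions) for path, versions in index.items()}

uploads = [
    ("mpox_all-clades.json", "2023-09-23"),
    ("monkeypox_mpxv.json", "2022-06-12"),
]
index = build_index(uploads)
earliest = index["mpox/all-clades"][0]
```

With the redirect applied, both filenames land under "mpox/all-clades", so the earliest entry in this toy history comes from the older filename.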

Our usage of ncov_global.json (similarly for other regions) is a lot more complex, because we didn't make clean dataset name switches like monkeypox/mpox.

Looking at the current live index (i.e. before this PR), the ncov.json dataset stops being uploaded on 2020-04-23 and starts being uploaded again on 2020-09-03, leaving a large gap in the snapshot history. When we also consider ncov_global.json (this PR), we fill in this gap with 99 snapshots. Great!

However, ncov.json continued to be uploaded through 2021-02-15. In cases such as this, the indexer picks the last uploaded version in a given day. However, looking at the data, it's clear ncov.json was not being rebuilt, simply re-uploaded. For instance, here are nextstrain.org URLs you can view the data in:

Looking at this data it's clear that we should drop any ncov.json datasets after 2020-04-23, which ~I'll add to this PR now~ update: no need to programmatically drop, see next message in this PR

Closes #784

¹ https://github.com/nextstrain/nextstrain.org/pull/783
² https://bedfordlab.slack.com/archives/CSKMU6YUC/p1706483980082939

jameshadfield commented 5 months ago

Here are all the days where we had multiple datasets with different filenames and have to choose one to represent the "snapshot" for the day. The links will show the dataset in Auspice. The first filename is the one the indexer will use for the day (as it was the most recently updated). The takeaways:

  1. ncov.json never appears in the index after 2020-04-07, despite it being regularly re-uploaded to S3. Good.
  2. 2022-04-30 was the first day we switched from ncov/open/:region to ncov/open/:region/:timeWindow and we uploaded both sets that day. The datasets are quite different, but the indexer is taking the 6m version which is correct.
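
The per-day choice described above (use whichever file was most recently updated that day) can be sketched as follows. This is a minimal sketch, not the indexer's implementation: the function name and the timestamps are invented for illustration, and time zones are ignored.

```python
from datetime import datetime

def daily_snapshots(versions):
    """Keep, for each calendar day, the filename with the latest upload time.

    `versions` is an iterable of (filename, iso_timestamp) pairs.
    Returns {date_string: filename}, ordered by date.
    """
    latest = {}
    for filename, stamp in versions:
        uploaded = datetime.fromisoformat(stamp)
        day = uploaded.date().isoformat()
        if day not in latest or uploaded > latest[day][0]:
            latest[day] = (uploaded, filename)
    return {day: name for day, (_, name) in sorted(latest.items())}

# Invented timestamps, purely for illustration.
versions = [
    ("ncov.json", "2020-05-01T08:00:00"),
    ("ncov_global.json", "2020-05-01T17:30:00"),
    ("ncov_global.json", "2020-05-02T09:15:00"),
]
snapshots = daily_snapshots(versions)
```

Under this rule a file that is merely re-uploaded earlier in the day loses to a later upload of a different filename, which matches the behaviour noted in the first takeaway.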

jameshadfield commented 4 months ago

These redirects don't consider snapshot URLs, e.g. "http://localhost:5000/monkeypox/mpxv?c=region" redirects appropriately, but "http://localhost:5000/monkeypox/mpxv@2023-01-26?c=region" 404s (despite there being a monkeypox_mpxv.json uploaded that day). I'll fix this up.

Update: Functionality added in fc330b1
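
One plausible way to make redirects snapshot-aware is to split off the "@YYYY-MM-DD" part before looking up the redirect and reattach it afterwards. The sketch below assumes that approach; it is not taken from the commit above, and the `REDIRECTS` table and function name are hypothetical. Query strings such as "?c=region" are assumed to be carried over separately.

```python
# Hypothetical redirect table (old URL path -> current URL path).
REDIRECTS = {"monkeypox/mpxv": "mpox/all-clades"}

def redirect_snapshot_path(url_path):
    """Redirect 'dataset@YYYY-MM-DD' paths while preserving the version part."""
    # partition() returns empty strings for sep and version when there is no '@'.
    dataset, sep, version = url_path.partition("@")
    target = REDIRECTS.get(dataset, dataset)
    return f"{target}{sep}{version}"

with_version = redirect_snapshot_path("monkeypox/mpxv@2023-01-26")
without_version = redirect_snapshot_path("monkeypox/mpxv")
```

Paths with no matching redirect pass through unchanged, with or without a version suffix.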