Closed joverlee521 closed 1 year ago
Hmm, can we use a built-in paginator here instead (i.e. from the AWS SDK)?
Is that an acceptable outcome here? ISTR part of the point of the https://nextstrain.org/search/sars-cov-2 page was to cover all those dated builds?
Good question. This brings up the broader question of whether we want to continue maintaining the /search pages. The /search/sars-cov-2 page was very slow when I was testing locally with only the latest build search results. Should we invest time into speeding up the Select component?
I think the consensus is we want functionality like the search pages (and to extend that functionality well beyond where it is today). I see the current implementation, though, as an initial "MVP", a stepping stone that was (is?) useful, but isn't the foundation for more extensive, longer-term search functions.
It's on my to-do list to remind myself what all our various "S3 crawling" scripts are actually doing, and which functionality is still relevant. SARS-CoV-2 strain search is almost certainly not needed anymore: this was created at a time with (multiple?) orders of magnitude fewer sequences, so there was a decent chance your strain of interest was in a recent build (and this page told you which one). We now subsample out ~99.9% of strains, and more importantly Nextclade exists to analyse your sequences of interest. I'd suggest stopping the script from running and adding some text to the page to indicate it's no longer supported.
I still want us to aim for generalised dataset collection with search / filtering, but I don't think this script is a stepping stone to it.
Description of proposed changes
Collect all S3 objects by using the ContinuationToken param and fetching all content until IsTruncated is no longer true.
Related issue(s)
Fixes #670.
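For readers who want to see the shape of the pagination loop described under "Description of proposed changes", here is a minimal sketch in Python. It is not the code from this PR. It assumes a client object exposing a boto3-style `list_objects_v2` method; `list_all_objects`, `FakeS3`, the bucket name, and the keys are all hypothetical and exist only so the example runs without AWS credentials.

```python
from typing import Any, Dict, Iterator, Optional


def list_all_objects(client: Any, bucket: str, prefix: str = "") -> Iterator[Dict[str, Any]]:
    """Yield every object under `prefix`, following ContinuationToken
    until the response reports IsTruncated as false."""
    token: Optional[str] = None
    while True:
        kwargs = {"Bucket": bucket, "Prefix": prefix}
        if token is not None:
            kwargs["ContinuationToken"] = token
        page = client.list_objects_v2(**kwargs)
        # "Contents" may be missing when a page holds no keys.
        yield from page.get("Contents", [])
        if not page.get("IsTruncated"):
            break
        token = page["NextContinuationToken"]


class FakeS3:
    """Hypothetical in-memory stand-in for an S3 client (two keys per page)."""

    def __init__(self, keys, page_size=2):
        self.keys = keys
        self.page_size = page_size

    def list_objects_v2(self, Bucket, Prefix="", ContinuationToken=None):
        matching = [k for k in self.keys if k.startswith(Prefix)]
        # The fake token is simply the index of the next key to return.
        start = int(ContinuationToken) if ContinuationToken else 0
        end = start + self.page_size
        page = {
            "Contents": [{"Key": k} for k in matching[start:end]],
            "IsTruncated": end < len(matching),
        }
        if page["IsTruncated"]:
            page["NextContinuationToken"] = str(end)
        return page


# Hypothetical keys; the real bucket layout is not shown in this PR.
client = FakeS3(["ncov_a.json", "ncov_b.json", "ncov_c.json", "zika.json", "ncov_d.json"])
keys = [obj["Key"] for obj in list_all_objects(client, "example-bucket", prefix="ncov")]
```

On the built-in paginator question raised in review: in boto3, `client.get_paginator("list_objects_v2").paginate(Bucket=...)` performs the same token handling, so a hand-written loop like this one is mainly useful when the SDK in use offers no paginator.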
Testing
Scripts worked locally with:
This now pulls >6000 datasets from staging and all of the dated datasets for ncov. Should we add additional filters to limit the number of datasets? Maybe limit objects by their LastModified date?
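One way the LastModified idea could look, as a rough Python sketch rather than a proposal for the actual scripts. The S3 list call has no server-side date filter as far as I know, so the filtering would happen client-side after listing. `modified_since`, the example keys, and the one-year cutoff are assumptions made purely for illustration; the objects mimic the boto3 response shape, where LastModified is a timezone-aware datetime.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Iterable, List


def modified_since(objects: Iterable[Dict[str, Any]], cutoff: datetime) -> List[Dict[str, Any]]:
    """Keep only objects whose LastModified is at or after `cutoff`."""
    return [obj for obj in objects if obj["LastModified"] >= cutoff]


# A fixed "now" keeps the example reproducible; real code would use datetime.now(timezone.utc).
now = datetime(2023, 6, 1, tzinfo=timezone.utc)

# Hypothetical listing results: one recently updated dataset, one old dated build.
objects = [
    {"Key": "ncov_global.json", "LastModified": now - timedelta(days=3)},
    {"Key": "ncov_global_2020-05-01.json", "LastModified": now - timedelta(days=900)},
]

# Assumed cutoff of one year; the right window is an open question in this PR.
recent = modified_since(objects, cutoff=now - timedelta(days=365))
recent_keys = [obj["Key"] for obj in recent]
```

A cutoff like this would drop old dated builds from the results, which ties back to the question above of whether the search page still needs to cover them.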