nextstrain / nextstrain.org

The Nextstrain website
https://nextstrain.org
GNU Affero General Public License v3.0

scripts: collect all S3 objects when truncated #672

Closed. joverlee521 closed this 1 year ago

joverlee521 commented 1 year ago

Description of proposed changes

Collect all S3 objects by using the ContinuationToken param and fetching all content until IsTruncated is no longer true.
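The loop described above can be sketched as follows. This is a minimal illustration in Python, not the code from this PR: `collect_all_objects` and `FakePagedClient` are hypothetical names, and the client is assumed to expose a boto3-style `list_objects_v2` method whose responses carry `Contents`, `IsTruncated` and `NextContinuationToken`.

```python
from typing import Any, Dict, List


def collect_all_objects(client: Any, bucket: str, prefix: str = "") -> List[Dict[str, Any]]:
    """Collect every object listing for `bucket`, following continuation tokens.

    `client` is assumed to behave like a boto3 S3 client (hypothetical wiring).
    """
    objects: List[Dict[str, Any]] = []
    kwargs: Dict[str, Any] = {"Bucket": bucket}
    if prefix:
        kwargs["Prefix"] = prefix

    while True:
        response = client.list_objects_v2(**kwargs)
        # "Contents" is absent when a page has no objects.
        objects.extend(response.get("Contents", []))
        if not response.get("IsTruncated"):
            break
        # Ask for the next page on the following iteration.
        kwargs["ContinuationToken"] = response["NextContinuationToken"]

    return objects


class FakePagedClient:
    """Hypothetical stand-in for an S3 client; returns `page_size` objects per call."""

    def __init__(self, objects: List[Dict[str, Any]], page_size: int = 2) -> None:
        self._objects = list(objects)
        self._page_size = page_size
        self.calls = 0

    def list_objects_v2(self, **kwargs: Any) -> Dict[str, Any]:
        self.calls += 1
        # The fake encodes the next start index as the continuation token.
        start = int(kwargs.get("ContinuationToken", 0))
        end = start + self._page_size
        response: Dict[str, Any] = {
            "Contents": self._objects[start:end],
            "IsTruncated": end < len(self._objects),
        }
        if response["IsTruncated"]:
            response["NextContinuationToken"] = str(end)
        return response
```

With a real client, the same function would presumably be called as `collect_all_objects(boto3.client("s3"), "some-bucket")`, where the bucket name is a placeholder.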

Related issue(s)

Fixes #670.

Testing

Scripts worked locally with:

./scripts/collect-datasets --keyword flu 
./scripts/collect-datasets --keyword staging

./scripts/collect-search-results --pathogen sars-cov-2
./scripts/collect-search-results --pathogen seasonal-flu

This now pulls >6000 datasets from staging and all of the dated datasets for ncov. Should we add additional filters to limit the number of datasets? Maybe limit objects by their LastModified date?
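One way the LastModified idea could look, as a rough Python sketch rather than anything taken from the scripts: `modified_since` is a hypothetical helper, and the objects are assumed to be dicts with a timezone-aware `LastModified` datetime, as in boto3 listings. Note that the S3 list call has no date filter of its own, so a filter like this would run client-side after all pages are fetched.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List


def modified_since(objects: List[Dict[str, Any]], cutoff: datetime) -> List[Dict[str, Any]]:
    """Keep only objects whose LastModified is at or after `cutoff`.

    Hypothetical helper; assumes timezone-aware datetimes on both sides.
    """
    return [obj for obj in objects if obj["LastModified"] >= cutoff]


if __name__ == "__main__":
    # Made-up keys and dates, purely for illustration.
    listing = [
        {"Key": "old.json", "LastModified": datetime(2020, 6, 1, tzinfo=timezone.utc)},
        {"Key": "new.json", "LastModified": datetime(2023, 1, 15, tzinfo=timezone.utc)},
    ]
    recent = modified_since(listing, datetime(2022, 1, 1, tzinfo=timezone.utc))
    print([obj["Key"] for obj in recent])
```

A fixed cutoff is used here only to keep the sketch deterministic; a rolling window (for example, the last N days) would be an equally plausible choice.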

tsibley commented 1 year ago

Hmm, can we use a built-in paginator here instead (i.e. from the AWS SDK)?

joverlee521 commented 1 year ago

> Is that an acceptable outcome here? ISTR part of the point of the https://nextstrain.org/search/sars-cov-2 page was to cover all those dated builds?

Good question. This brings up the broader question of whether we want to continue maintaining the /search pages. The /search/sars-cov-2 page was very slow when I was testing locally with only the latest build search results. Should we invest time into speeding up the Select component?

tsibley commented 1 year ago

I think the consensus is we want functionality like the search pages (and to extend that functionality well beyond where it is today). I see the current implementation, though, as an initial "MVP", a stepping stone that was (is?) useful, but isn't the foundation for more extensive, longer-term search functions.

jameshadfield commented 1 year ago

It's on my to-do list to remind myself what all our various "S3 crawling" scripts are actually doing, and which functionality is still relevant. SARS-CoV-2 strain search is almost certainly not needed anymore. It was created at a time with (multiple?) orders of magnitude fewer sequences, so there was a decent chance your strain of interest was in a recent build (and this page told you which one). We now subsample out ~99.9% of strains and, more importantly, Nextclade exists to analyse your sequences of interest. I'd suggest we stop running the script and add some text to the page to indicate it's no longer supported.

I still want us to aim for generalised dataset collection with search / filtering, but I don't think this script is a stepping stone to it.