nextstrain / ncov-ingest

A pipeline that ingests SARS-CoV-2 (i.e. nCoV) data from GISAID and GenBank, transforms it, stores it on S3, and triggers Nextstrain nCoV rebuilds.
MIT License

Revisit fetch from GISAID #242

Open joverlee521 opened 2 years ago

joverlee521 commented 2 years ago

Context
On Dec 2, 2021, multiple fetch-and-ingest runs for GISAID failed. The failure pattern was consistent: the download would proceed for a while, then the transfer would be closed before it completed. Subsequent attempts to fetch would hit a 503 error. We manually triggered fetch-and-ingest two more times and saw the same failure pattern.

Possible solution
The scheduled run today had no issues, so this may have just been unfortunate timing of our runs being interrupted by GISAID's reboots. We can revisit the following solutions in anticipation of similar future issues:

  1. Manual downloads from the same API endpoint were able to complete successfully when done without streaming decompression. We can update fetch-from-gisaid to skip decompression during streaming, which would shorten the time the connection stays open. However, decompressing in a separate step would increase the total time to run fetch-and-ingest.
  2. Switch to an endpoint with xz, which has a better compression ratio and faster decompression than bzip2. Regardless of errors, this would be a huge improvement for us and dramatically decrease fetch-and-ingest runtime.
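For option 1, the change amounts to splitting the current "download and decompress in one pipe" into two steps. A minimal sketch, using a local bzip2 file as a stand-in for the GISAID endpoint (in the real script the `cat` would be a `curl` against the provision URL; filenames here are illustrative):

```shell
set -euo pipefail

# Local stand-in for the compressed export served by GISAID.
printf '{"strain": "example"}\n' > provision.json
bzip2 --keep --force provision.json        # produces provision.json.bz2

# Step 1: "download" the compressed bytes as-is, with no streaming
# decompression, so the connection is open only as long as the raw
# transfer takes.
cat provision.json.bz2 > downloaded.bz2

# Step 2: decompress after the transfer is complete, off the connection.
bzip2 --decompress --stdout downloaded.bz2 > downloaded.json
```

The trade-off is exactly the one noted above: the connection closes sooner, but the total wall-clock time grows because decompression no longer overlaps with the download.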
ivan-aksamentov commented 2 years ago

@joverlee521

Switch to an endpoint with xz.

I did not know it exists. Do you know the URL? Does it have the same data in it?

In the meantime we could also try parallel bzip2: https://github.com/nextstrain/ncov-ingest/pull/247
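The parallel-bzip2 idea is a drop-in swap on the decompression side: tools like `pbzip2` and `lbzip2` read standard bz2 streams and spread decompression across cores, using the same `-d -c` flags as `bzip2`. A sketch that picks whichever is installed and falls back to plain `bzip2` (tool availability is an assumption of this example):

```shell
set -euo pipefail

# Prefer a parallel decompressor when one is installed; all three accept
# the same -d (decompress) and -c (write to stdout) flags.
decompressor=bzip2
command -v pbzip2 >/dev/null 2>&1 && decompressor=pbzip2
command -v lbzip2 >/dev/null 2>&1 && decompressor=lbzip2

# Local stand-in for the compressed export; in the pipeline this would be
# the stream coming off curl.
printf 'parallel test\n' | bzip2 > sample.bz2
"$decompressor" -d -c sample.bz2
```

Note this only speeds up the decompression half; it does not shorten the transfer itself, so it helps most when decompression, not bandwidth, is the bottleneck.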

tsibley commented 2 years ago

I did not know it exists. Do you know the URL? Does it have the same data in it?

Ah, it does not exist, as far as we know. This would be asking GISAID to switch to xz for us for the current export we get.