ncbi / pgap

NCBI Prokaryotic Genome Annotation Pipeline

[FEATURE REQUEST] direct S3 access to input datasets #274

Closed jvolkening closed 6 months ago

jvolkening commented 9 months ago

Is your feature request related to a problem? Please describe.

When running PGAP on an EC2 instance, or via AWS Batch, downloading the "input-*version*.tgz" file from the S3 REST endpoint (e.g. "https://s3.amazonaws.com/pgap/input-*version*.tgz") is an order of magnitude slower than fetching it directly from an S3 bucket. Speeds will vary depending on instance type and other setup, but the last time I measured it, it took ~ 16 minutes to download from the REST endpoint over HTTPS and ~ 1 minute to download the same file (which I had copied there) directly from an S3 bucket using the AWS CLI tool.
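For reference, the two download paths being compared look roughly like this (the version placeholder and bucket name are illustrative; at the time of this request the direct S3 copy only works against a bucket the file has been copied into):

```bash
# HTTPS download from the S3 REST endpoint (~16 min in our tests)
curl -O https://s3.amazonaws.com/pgap/input-<version>.tgz

# Direct object copy with the AWS CLI (~1 min in our tests; requires read access to the bucket)
aws s3 cp s3://<bucket>/input-<version>.tgz .
```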

Describe the solution you'd like

Would it be possible to make the S3 bucket itself publicly available read-only? Alternatively, could the input files be made available through the AWS Open Data registry?

Describe alternatives you've considered

Copying each new release file to our own private S3 bucket and using that for pipelines, or else accepting the increased run times for the pipeline.
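For completeness, the mirroring workaround is essentially a one-liner with the AWS CLI (the private bucket name below is hypothetical):

```bash
# Stream the release tarball from the REST endpoint into a private bucket
curl -sL https://s3.amazonaws.com/pgap/input-<version>.tgz \
  | aws s3 cp - s3://my-private-bucket/pgap/input-<version>.tgz
```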

azat-badretdin commented 9 months ago

Thank you for your request, Jeremy! We will have a look at how to optimize the speed of downloading and unpacking.

Did you test both downloading and unpacking speed together? Or just downloading?

jvolkening commented 9 months ago

Did you test both downloading and unpacking speed together? Or just downloading?

Thanks -- originally just the download speed, but I went back and tested with unpacking as well (piped on the fly). For the curl transfer, the network is limiting and unpacking on the fly has no effect on elapsed time. For the aws s3 transfer, decompression is limiting, and it increases the total elapsed time for download + unpacking from ~ 1 min to ~ 2 min -- still significantly faster than over HTTPS.
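For reference, on-the-fly unpacking from S3 looks roughly like this (bucket name is a placeholder):

```bash
# Stream the tarball from S3 and unpack it on the fly, with no intermediate file on disk
aws s3 cp s3://<bucket>/input-<version>.tgz - | tar -xzf -
```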

I had a thought that maybe the difference was due to within-region vs inter-region transfer, so I tested a few additional files both within and between regions. The speed between regions (~ 170 MB/s) was slower than within region (~ 240 MB/s) but both were still at least 10x faster than curl (~ 16 MB/s).

I should have mentioned that we are implementing this by calling CWL directly in a pipeline rather than with the pgap.py wrapper.
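(For anyone following along: a direct CWL invocation would look something like the sketch below; the cwltool runner, the pgap.cwl entry point, and the input YAML name are assumptions rather than details from this thread.)

```bash
# Hypothetical direct invocation of the top-level PGAP CWL workflow,
# assuming the reference data have already been downloaded and unpacked
cwltool --outdir results pgap.cwl my_input.yaml
```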

azat-badretdin commented 9 months ago

Thank you very much for your detailed reply, Jeremy.

I should have mentioned that we are implementing this by calling CWL directly in a pipeline rather than with the pgap.py wrapper.

Yep, that's what I would do in your position. Internally we also have a pipeline for large-scale application of PGAP-external that directly executes one of the CWL graphs inside the Docker image.

The difficulty with bringing this solution into pgap.py is the conflict between the need to keep that script very basic (for example, we rely only on basic Python packages there) and the use of specialized AWS tools to download the data.

In any form, this might require additional prerequisites on the user's side, which usually raises the adoption threshold for the average user and limits our audience.
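One possible shape for such an optional fast path, purely as an illustration of the trade-off (not something pgap.py currently does), would be to use the AWS CLI only when it happens to be installed:

```bash
# Illustrative sketch only: prefer the AWS CLI when present, otherwise fall back
# to a plain HTTPS download so that no extra prerequisites are imposed.
URL="https://s3.amazonaws.com/pgap/input-<version>.tgz"
S3_URI="s3://<bucket>/input-<version>.tgz"   # hypothetical direct S3 location

if command -v aws >/dev/null 2>&1; then
    aws s3 cp "$S3_URI" input.tgz
else
    curl -L -o input.tgz "$URL"
fi
```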

I will initiate a discussion within the group on this subject.

george-coulouris commented 7 months ago

Would it be possible to make the S3 bucket itself publicly available read-only?

Hi @jvolkening, I've made this change, please give it a try.
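With anonymous read access enabled, the download should work without credentials via the CLI's --no-sign-request option (bucket and key inferred from the REST URL above; the exact key name is an assumption):

```bash
# Anonymous direct copy from the now-public bucket
aws s3 cp s3://pgap/input-<version>.tgz . --no-sign-request
```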

jvolkening commented 6 months ago

Sorry, this fell off my radar for a while...works great now with the direct transfer from S3. Thanks.