sciencehistory / chf-sufia

sufia-based hydra app

Need single job rerun option for DZI tasks #850

Closed: sanfordd closed this issue 6 years ago

sanfordd commented 7 years ago

Looking over the task rake chf:dzi:push_all[lazy]: if it makes a GET or LIST request to S3 to check whether the file(s) are there, we're being charged for every one of those requests. Depending on the method used, each run could cost somewhere between $1 and $10; the wide range is because GET is much cheaper than LIST, and I'm not 100% sure which one we're using, though I suspect it's LIST.
If we could just rerun a single object in error cases, handling failed jobs would be much cheaper.
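
For concreteness, here's a minimal sketch of the two ways that existence check could be going through the Ruby aws-sdk (I don't know which one the task actually does, and the bucket/key names are made up). A per-key HEAD is billed at the GET rate, while a prefix LIST is billed at the LIST rate:

require "aws-sdk-s3"

s3 = Aws::S3::Client.new

# Cheaper: HEAD a single key (billed at the GET rate).
def dzi_exists?(s3, bucket, key)
  s3.head_object(bucket: bucket, key: key)
  true
rescue Aws::S3::Errors::NotFound
  false
end

# Pricier: a prefix LIST (billed at the LIST rate), even though it only
# needs one key back.
def dzi_exists_via_list?(s3, bucket, key)
  resp = s3.list_objects_v2(bucket: bucket, prefix: key, max_keys: 1)
  resp.contents.any? { |obj| obj.key == key }
end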

jrochkind commented 7 years ago

I'm not sure which operation it's using; it goes through the Ruby aws-sdk, and I'm not sure what that ends up doing the way it was written. We can certainly ensure/change it to use the cheaper one -- which is the cheaper one? But that's interesting to know, a hidden cost of S3! We thought it was so cheap!

"Lazy" checks one S3 key per sufia image, not every single S3 key (there are of course multiples for multiple tiles), I'm not sure which one you calculated based on? We have around 10K sufia images.

I was also planning to run the 'delete orphans' script regularly, though -- that one DOES (possibly) check every single file on S3. Will that be a problem?

It's quite easy to supply a task that just re-creates DZI for a single object. But we should also have a shared understanding of the costs of the bulk create and delete orphans scripts, so let's be sure we understand what's what with those too.

jrochkind commented 7 years ago

But wait, every time we DELIVER these files to users we will surely be charged for a GET.

Were our calculations off? Is S3 actually not going to be affordable? That would be alarming. But we could probably fix it with a CDN.

sanfordd commented 7 years ago

I was calculating against the total number of tiles in S3 (in the millions), where running against every tile adds up over time. GET requests are fairly cheap at $0.004 per 10,000, so delivering even a fairly large number of tiles doesn't cost much, though heavy access could eventually push costs up. PUT/LIST requests are $0.005 per 1,000 -- not an issue for uploads, since we upload less than we share. Iterating over all the tiles, though, depending on how we do it, does cost a bit once we're in the millions of files. It's not unbearable, but it helps if we can avoid doing it too often.
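
To put rough numbers on that, a back-of-the-envelope with an assumed 2 million tiles (the real count is just "millions"; I don't have an exact figure):

tiles            = 2_000_000        # assumed count -- "millions" of tiles
get_per_request  = 0.004 / 10_000   # GET/HEAD: $0.004 per 10,000 requests
list_per_request = 0.005 / 1_000    # LIST/PUT: $0.005 per 1,000 requests

tiles * get_per_request   # => 0.8   ($0.80 if every tile gets one GET/HEAD)
tiles * list_per_request  # => 10.0  ($10.00 if every tile triggers one LIST)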

If the lazy regen doesn't go over every S3 tile, I think we're fine cost-wise. The orphan cleanup, on the other hand, depending on what it uses, could cost us a bit to run -- so it's not something we want a cron job running nightly.
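
That said, if the orphan sweep enumerates the bucket with paginated LIST calls rather than making one request per key, it stays cheap: each LIST page can return up to 1,000 keys. A sketch, with a made-up bucket name:

require "aws-sdk-s3"

s3   = Aws::S3::Client.new
keys = []

# The response pages automatically; each page holds up to 1,000 keys, so a
# full scan of ~2 million tiles is ~2,000 LIST requests (about $0.01),
# not 2 million per-key requests.
s3.list_objects_v2(bucket: "our-dzi-bucket").each do |page|
  keys.concat(page.contents.map(&:key))
end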

jrochkind commented 6 years ago

This has actually already been there; I just never closed the ticket. It can optionally be combined with 'lazy' too.

You can specify a work_id, and it will create DZI for all filesets in that work. (Warning: it doesn't actually get leaf filesets for child works, grr.)

WORK_IDS=hq37vp21f ./bin/rake chf:dzi:push_all[lazy]

Or you can specify one or more file set IDs:

FILE_SET_IDS=5t34sj56t,1g05fb61q bundle exec rake chf:dzi:push_all[lazy]
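
For reference, roughly how a task like this can narrow its run from those env vars (an illustrative sketch only, not the actual chf:dzi implementation -- the job name and model calls here are hypothetical):

# Illustrative sketch -- not the actual chf:dzi:push_all code.
namespace :chf do
  namespace :dzi do
    task :push_all, [:mode] => :environment do |_t, args|
      lazy = (args[:mode] == "lazy")

      file_set_ids =
        if ENV["FILE_SET_IDS"]
          ENV["FILE_SET_IDS"].split(",")
        elsif ENV["WORK_IDS"]
          # Only the works' own member file sets -- leaf file sets of child
          # works are not followed (the caveat noted above).
          ENV["WORK_IDS"].split(",").flat_map do |work_id|
            work = ActiveFedora::Base.find(work_id)
            work.members.select { |m| m.is_a?(FileSet) }.map(&:id)
          end
        else
          FileSet.all.map(&:id)
        end

      file_set_ids.each do |id|
        # Hypothetical job name; when lazy, it would skip file sets whose
        # DZI already exists on S3.
        CreateDziJob.perform_later(id, lazy: lazy)
      end
    end
  end
end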