mjordan / islandora_workbench

A command-line tool for managing content in an Islandora 2 repository
MIT License

S3? #305

Open · ajstanley opened this issue 3 years ago

ajstanley commented 3 years ago

We should be able to pull S3 files down (with proper creds). I'm thinking we could parse the contents of a cell in the file column, and if it's in the format (value, value, value), feed it to the input_dir via s3.download_file('BUCKET_NAME', 'OBJECT_NAME', 'FILE_NAME').

I'd be happy to take this one on if it seems like a good idea to anyone else.
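
Something like this, roughly (an untested sketch; the triple format and function name are just placeholders, and credentials would come from the standard boto3 chain of env vars, ~/.aws, or an IAM role):

```python
import os
import boto3  # the one new library this would require

def fetch_s3_file(cell_value, input_dir):
    """If cell_value looks like 'bucket, key, filename', download it from S3."""
    parts = [p.strip() for p in cell_value.split(',')]
    if len(parts) != 3:
        return cell_value  # Not an S3 triple; treat as a normal file path/URL.
    bucket, key, filename = parts
    local_path = os.path.join(input_dir, filename)
    boto3.client('s3').download_file(bucket, key, local_path)
    return local_path
```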

mjordan commented 3 years ago

Sure, sounds good. It might be best to wait to open a PR until I've merged #287 (I'm still working on the add_media task), which should be next week (I've set aside some vacation time to finish it).

Would the ability to fetch files from S3 require a new library?

ajstanley commented 3 years ago

Yep - one new library. I'll wait until #287 is merged - it should be pretty straightforward.

g7morris commented 2 years ago

Hi @ajstanley & @mjordan

Oooo this would be very cool (and essential to cloud-based workflows) to build as a feature. Do either of you have any plans to pursue this now that the PR above has been merged?

Would be happy to help test and document if that helps.

Thanks in advance!

ajstanley commented 2 years ago

@g7morris I kind of abandoned this when I found I could get everything from the S3 bucket via a regular URL, but I can have another go if the interest is there.

g7morris commented 2 years ago

@ajstanley If that's the case, perhaps this is more of a documentation issue then? If folks know they can point the tool at an S3 URL, or reference one in the CSV, maybe that's enough?
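
For example, maybe something like this in the CSV would be all folks need (hypothetical values, assuming a public bucket and the usual id/title/file columns):

```csv
id,title,file
s3_demo_01,Photo pulled from S3,https://my-bucket.s3.amazonaws.com/photos/image_01.tif
```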

g7morris commented 2 years ago

Either way would be most appreciated - calling the S3 bucket from within the tool, or referencing the S3 bucket URL in a CSV. Thank you both!

ajstanley commented 2 years ago

I'd kinda like to see various downloads as plugins - local, https, s3 ...

g7morris commented 2 years ago

That sounds good too @ajstanley, but in terms of using an S3 URL, is it as easy as putting the S3 bucket URL plus the object/file name in the associated CSV for Workbench? I assume one might need to make the S3 bucket publicly readable?

Ideally one could do all of this privately from an EC2 server, using the AWS CLI and calling the s3:// URI directly? That would make ingest fly by comparison.
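
For instance, could presigned URLs keep the bucket private while Workbench fetches over plain HTTPS? A rough boto3 sketch (bucket and key are placeholders):

```python
import boto3

s3 = boto3.client('s3')
# Generate a time-limited HTTPS URL for a private object; anything that can
# fetch a normal URL could then download it without the bucket being public.
url = s3.generate_presigned_url(
    'get_object',
    Params={'Bucket': 'my-private-bucket', 'Key': 'photos/image_01.tif'},
    ExpiresIn=3600,  # valid for one hour
)
print(url)
```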

mjordan commented 2 years ago

> I'd kinda like to see various downloads as plugins - local, https, s3 ...

If people would find native S3 support useful, let's add it. I originally steered away from plugins in Workbench to keep things simple, albeit "opinionated", but I'm now starting to think that enough use cases and real-world usage have emerged to justify considering them.

Addressing some bugs and optimizations has prevented me from sustaining dev time on items 1 and 2 on the Fall 2021 Roadmap, and those features are really needed, so I'd rather stick to those two items. But @ajstanley, if you want to sketch out how file-source plugins would work, I'd be open to discussing it.

One question I have before committing, though: beyond native S3 support, what other file sources would people want? If we can't think of any more, maybe we don't need the extra flexibility and complexity of plugins. Could be convinced either way.

ajstanley commented 2 years ago

@g7morris - I have limited experience with S3. On the one project where I used it extensively, we were able to get regular https: addresses for each of the objects we needed. I think the advantage of using the S3 library is that we could pass credentials to get restricted material.

@mjordan Instead of actual plugins, if we created a file class with methods for the various kinds of input, we could easily sniff out the logic to direct to the right method - i.e. if 's3' in filename: file.getS3(filename). Not a plugin as such, but some neatly organized code with instance variables to cover commonalities.
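
Roughly this shape (all names invented for illustration, not actual Workbench code):

```python
import os
import shutil
import urllib.request

class FileFetcher:
    """Sniff each 'file' value and route it to the right fetch method."""

    def fetch(self, location, input_dir):
        if location.startswith('s3://'):
            return self.get_s3(location, input_dir)
        if location.startswith(('http://', 'https://')):
            return self.get_http(location, input_dir)
        return self.get_local(location, input_dir)

    def get_s3(self, location, input_dir):
        import boto3  # the one new dependency
        bucket, _, key = location[len('s3://'):].partition('/')
        local_path = os.path.join(input_dir, os.path.basename(key))
        boto3.client('s3').download_file(bucket, key, local_path)
        return local_path

    def get_http(self, location, input_dir):
        local_path = os.path.join(input_dir, os.path.basename(location))
        urllib.request.urlretrieve(location, local_path)
        return local_path

    def get_local(self, location, input_dir):
        return shutil.copy(location, input_dir)
```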

mjordan commented 2 years ago

@ajstanley yeah, this is roughly how workbench_fields.py works now. I like this approach, since it lets us define a simple interface/logic in the main workbench and workbench_utils files while remaining fairly easy to extend.

Also makes it easier to write tests!

dara2 commented 1 year ago

Hi @mjordan and @ajstanley - We are having some issues because Workbench needs S3 buckets to be public. Is this on the roadmap - allowing Workbench to pull files from a private S3 bucket?

mjordan commented 1 year ago

I'm afraid I have zero experience with S3, and will be fully occupied with work on #373 and #292 for the next while.

Rereading the conversation above, I'm not as sanguine about introducing a refactor of the file-fetching code as I was back in January, mainly because I've fallen behind on making sure the integration tests continue to work (a bunch are broken right now). I wouldn't want to introduce such a large change without being able to run those tests.

Perhaps over the holidays I can spend some time outside of #373 and #292 catching up on making the tests work, which would put us in good shape for introducing a large change like a new file-handling class early in the new year. Workbench is now so large and complex that we really need to be writing and running more integration tests.

dara2 commented 1 year ago

Okay, that all makes sense. Thank you for the update!