samvera / valkyrie

A Data Mapper library to enable multiple backends for storage of files and metadata in Samvera
Other
34 stars 23 forks source link

Valkyrie::StorageAdapter::File - insert a higher abstraction to support cloud providers? #954

Open dchandekstark opened 4 months ago

dchandekstark commented 4 months ago

I am working on a storage adapter for S3 (in a somewhat different fashion from https://github.com/stkenny/valkyrie-storage-s3), and I find that Valkyrie::StorageAdapter::File feels awkward in a couple of spots. I wonder whether we could insert a higher level abstraction above it to support find_by in a more "natural" way for cloud providers.

First, #disk_path IMO extends the responsibility of V::SA::F too far, and StreamFile feels too opinionated (and appears to create but not clean up temp files?). Likewise rewind, close, and read seem bound to Ruby's IO class -- nothing wrong with that as such, but if you're dealing with a cloud resource, it's not really "on disk" and it's not really an IO either. What I might suggest for a higher level abstraction is a class that minimally implements #each to yield chunks of the file data; I could also see #stream to get an IO. #size is probably fine too, fwiw.

Here's a partial illustration, where the io used to instantiate the class is an instance of Aws::S3::Object

def each
  io.get do |chunk, _headers|
    yield chunk
  end
end

# @return [IO]
def stream
   io.get.body
end

Interested in any and all feedback. Thanks!

tpendragon commented 4 months ago

@dchandekstark Do you have a reference library that handles file abstraction the way you're proposing? We've leaned on IO because it's how folks expect to handle files, and on #disk_path because it's often just so necessary to be able to get that file locally for derivative processing.

I'm interested in StreamFiles not cleaning up, that wasn't the intention if it's true. A brief look at the code does look like that's the case though - folks would have to clear out that tmp dir occasionally.

tpendragon commented 4 months ago

@dchandekstark Have you looked at valkyrie-shrine for S3 support? That's largely what we've done.

dchandekstark commented 4 months ago

@dchandekstark Have you looked at valkyrie-shrine for S3 support? That's largely what we've done.

@tpendragon I haven't. At this stage, I'm looking for the "simplest thing that could possibly work"(tm), but I'll definitely check it out. I actually contemplated trying to use ActiveStorage in a more or less similar way, but it felt over-engineered -- like I should just switch to using ActiveStorage instead of trying to integrate it into Valkyrie as another storage adapter.

tpendragon commented 4 months ago

@dchandekstark It's pretty simple. In figgy we have this:

https://github.com/pulibrary/figgy/blob/c575619bfeb42a35ed0545644b21034214f62a75/config/initializers/valkyrie.rb#L129-L145

then that adapter works just like the disk one.

dchandekstark commented 4 months ago

@dchandekstark Do you have a reference library that handles file abstraction the way you're proposing? We've leaned on IO because it's how folks expect to handle files, and on #disk_path because it's often just so necessary to be able to get that file locally for derivative processing.

I'm thinking of something along the lines of the Rack spec? -- not that I'm well versed in it -- but at least the low-level notion of an enumerable.

I get the disk_path thing, but if stored file doesn't have a path natively, then I think it's the responsibility of the party that wants a path to download the data. Where and how that needs to happen can vary, and I don't really want to embed the logic into the adapter.

tpendragon commented 4 months ago

Where and how that needs to happen can vary, and I don't really want to embed the logic into the adapter.

I get that, but the motivation is that I want to suddenly be able to decide "Actually I don't want to be on disk, I want to be in S3" and have that be as easy and painless a transition as possible.

dchandekstark commented 4 months ago

I get that, but the motivation is that I want to suddenly be able to decide "Actually I don't want to be on disk, I want to be in S3" and have that be as easy and painless a transition as possible.

That's fair. I do wish the temp file stuff in StreamFile could be addressed, but I'm not sure how without impacting disk_path in a way that undermines the purpose of it. It's sort of indeterminate how long the path is supposed to persist.

tpendragon commented 4 months ago

We could use Tempfile maybe, or disk_path could accept a block and clean up after itself when the block ends πŸ€”

dchandekstark commented 4 months ago

We could use Tempfile maybe, or disk_path could accept a block and clean up after itself when the block ends πŸ€”

Tempfile + optional block would work I think.

dchandekstark commented 4 months ago

@tpendragon One more consideration on disk_path: I want to use x-sendfile to have my web server stream files that are actually on locally mounted file systems. In that case, I don't want to inadvertently download the S3 file to disk and then send it. Is there a workaround?

cgalarza commented 4 months ago

Hello! I thought I would chime in because I recently ran into a problem where I was using disk_path and had assume that it was a temp file that would get cleaned up on its own, but it didn't and then my server ran out of space, etc etc. I think part of my issue occurred because we deploying in a containerized environment and can't depend on using a tmp directory that a server might cleanup (but I am far from an expert on this).

I ended up fixing the problem by creating a method just like @tpendragon described. It accepts a block and cleans up once the block ends. At the time spent some time reading the TempFile documentation and it seemed like it is really encouraging manually closing/removing a TempFile. Another possible option I considered was creating two methods, one to create a temp file and another one to close it.

My suggestion would be keep disk_path as is, but add a method to explicitly close the file and add an additional method to take a block.

Just my two cents.

tpendragon commented 4 months ago

@tpendragon One more consideration on disk_path: I want to use x-sendfile to have my web server stream files that are actually on locally mounted file systems. In that case, I don't want to inadvertently download the S3 file to disk and then send it. Is there a workaround?

That's interesting! We do this for disk files and it does fail for our cloud ones. I think you could either proxy through nginx or apache or whatever for these, or I've considered adding a download_url to StreamFile that no-ops - I think Shrine does something similar?

tpendragon commented 4 months ago

I ended up fixing the problem by creating a method just like @tpendragon described. It accepts a block and cleans up once the block ends. At the time spent some time reading the TempFile documentation and it seemed like it is really encouraging manually closing/removing a TempFile. Another possible option I considered was creating two methods, one to create a temp file and another one to close it.

Ahhh this is great, good work @cgalarza !

tpendragon commented 4 months ago

I think I'm gonna look at adding a block to disk_path and make StreamFile clean up using it, and make close on StreamFile ensure any local tmp file is deleted.

tpendragon commented 4 months ago

@dchandekstark @cgalarza How's https://github.com/samvera/valkyrie/pull/955 look?

tpendragon commented 4 months ago

@dchandekstark Years ago I converted Figgy to cloud only as an experiment and I remember doing a thing for Downloading - https://github.com/pulibrary/figgy/compare/main...figgy_in_the_cloud might be useful to you?

(We've since decided to keep files local, but it was a good experiment in how flexible Valkyrie really is πŸ˜€ )

dchandekstark commented 4 months ago

@dchandekstark Years ago I converted Figgy to cloud only as an experiment and I remember doing a thing for Downloading - pulibrary/figgy@main...figgy_in_the_cloud might be useful to you?

Yes. It also illustrates (in the respond_to? tests) that path is not always (easily) available.

dchandekstark commented 4 months ago

@dchandekstark @cgalarza How's #955 look?

@tpendragon Seems like a good start! I do confess a little concern that explicit temp file naming is potentially unsafe (e.g., what happens when separate threads/processes act on the same file?).

dchandekstark commented 4 months ago

@tpendragon I feel that I should add to this thread that I appreciate what Valkyrie provides with respect to storage. It's simple, yet effective. I can see the attraction to Shrine and even ActiveStorage, but you don't always need or want that much extra functionality associated with your storage capability. With that said, I thought I would try to fit a cloud provider (in this case S3) into the Valkyrie framework more or less directly and adhere to the storage adapter contracts.

tpendragon commented 4 months ago

I'm all for more options! Looking forward to it!