Open dchandekstark opened 7 months ago
@dchandekstark Do you have a reference library that handles file abstraction the way you're proposing? We've leaned on IO because it's how folks expect to handle files, and on #disk_path
because it's often just so necessary to be able to get that file locally for derivative processing.
I'm interested in StreamFiles not cleaning up, that wasn't the intention if it's true. A brief look at the code does look like that's the case though - folks would have to clear out that tmp dir occasionally.
@dchandekstark Have you looked at valkyrie-shrine
for S3 support? That's largely what we've done.
@dchandekstark Have you looked at
valkyrie-shrine
for S3 support? That's largely what we've done.
@tpendragon I haven't. At this stage, I'm looking for the "simplest thing that could possibly work"(tm), but I'll definitely check it out. I actually contemplated trying to use ActiveStorage in a more or less similar way, but it felt over-engineered -- like I should just switch to using ActiveStorage instead of trying to integrate it into Valkyrie as another storage adapter.
@dchandekstark It's pretty simple. In figgy we have this:
then that adapter works just like the disk one.
@dchandekstark Do you have a reference library that handles file abstraction the way you're proposing? We've leaned on IO because it's how folks expect to handle files, and on
#disk_path
because it's often just so necessary to be able to get that file locally for derivative processing.
I'm thinking of something along the lines of the Rack spec? -- not that I'm well versed in it -- but at least the low-level notion of an enumerable.
I get the disk_path
thing, but if stored file doesn't have a path natively, then I think it's the responsibility of the party that wants a path to download the data. Where and how that needs to happen can vary, and I don't really want to embed the logic into the adapter.
Where and how that needs to happen can vary, and I don't really want to embed the logic into the adapter.
I get that, but the motivation is that I want to suddenly be able to decide "Actually I don't want to be on disk, I want to be in S3" and have that be as easy and painless a transition as possible.
I get that, but the motivation is that I want to suddenly be able to decide "Actually I don't want to be on disk, I want to be in S3" and have that be as easy and painless a transition as possible.
That's fair. I do wish the temp file stuff in StreamFile could be addressed, but I'm not sure how without impacting disk_path
in a way that undermines the purpose of it. It's sort of indeterminate how long the path is supposed to persist.
We could use Tempfile
maybe, or disk_path
could accept a block and clean up after itself when the block ends π€
We could use
Tempfile
maybe, ordisk_path
could accept a block and clean up after itself when the block ends π€
Tempfile + optional block would work I think.
@tpendragon One more consideration on disk_path
: I want to use x-sendfile to have my web server stream files that are actually on locally mounted file systems. In that case, I don't want to inadvertently download the S3 file to disk and then send it. Is there a workaround?
Hello! I thought I would chime in because I recently ran into a problem where I was using disk_path
and had assume that it was a temp file that would get cleaned up on its own, but it didn't and then my server ran out of space, etc etc. I think part of my issue occurred because we deploying in a containerized environment and can't depend on using a tmp directory that a server might cleanup (but I am far from an expert on this).
I ended up fixing the problem by creating a method just like @tpendragon described. It accepts a block and cleans up once the block ends. At the time spent some time reading the TempFile
documentation and it seemed like it is really encouraging manually closing/removing a TempFile
. Another possible option I considered was creating two methods, one to create a temp file and another one to close it.
My suggestion would be keep disk_path
as is, but add a method to explicitly close the file and add an additional method to take a block.
Just my two cents.
@tpendragon One more consideration on
disk_path
: I want to use x-sendfile to have my web server stream files that are actually on locally mounted file systems. In that case, I don't want to inadvertently download the S3 file to disk and then send it. Is there a workaround?
That's interesting! We do this for disk files and it does fail for our cloud ones. I think you could either proxy through nginx or apache or whatever for these, or I've considered adding a download_url
to StreamFile that no-ops - I think Shrine does something similar?
I ended up fixing the problem by creating a method just like @tpendragon described. It accepts a block and cleans up once the block ends. At the time spent some time reading the
TempFile
documentation and it seemed like it is really encouraging manually closing/removing aTempFile
. Another possible option I considered was creating two methods, one to create a temp file and another one to close it.
Ahhh this is great, good work @cgalarza !
I think I'm gonna look at adding a block to disk_path
and make StreamFile clean up using it, and make close
on StreamFile
ensure any local tmp file is deleted.
@dchandekstark @cgalarza How's https://github.com/samvera/valkyrie/pull/955 look?
@dchandekstark Years ago I converted Figgy to cloud only as an experiment and I remember doing a thing for Downloading - https://github.com/pulibrary/figgy/compare/main...figgy_in_the_cloud might be useful to you?
(We've since decided to keep files local, but it was a good experiment in how flexible Valkyrie really is π )
@dchandekstark Years ago I converted Figgy to cloud only as an experiment and I remember doing a thing for Downloading - pulibrary/figgy@main...figgy_in_the_cloud might be useful to you?
Yes. It also illustrates (in the respond_to? tests) that path is not always (easily) available.
@dchandekstark @cgalarza How's #955 look?
@tpendragon Seems like a good start! I do confess a little concern that explicit temp file naming is potentially unsafe (e.g., what happens when separate threads/processes act on the same file?).
@tpendragon I feel that I should add to this thread that I appreciate what Valkyrie provides with respect to storage. It's simple, yet effective. I can see the attraction to Shrine and even ActiveStorage, but you don't always need or want that much extra functionality associated with your storage capability. With that said, I thought I would try to fit a cloud provider (in this case S3) into the Valkyrie framework more or less directly and adhere to the storage adapter contracts.
I'm all for more options! Looking forward to it!
I am working on a storage adapter for S3 (in a somewhat different fashion from https://github.com/stkenny/valkyrie-storage-s3), and I find that
Valkyrie::StorageAdapter::File
feels awkward in a couple of spots. I wonder whether we could insert a higher level abstraction above it to supportfind_by
in a more "natural" way for cloud providers.First,
#disk_path
IMO extends the responsibility of V::SA::F too far, andStreamFile
feels too opinionated (and appears to create but not clean up temp files?). Likewiserewind
,close
, andread
seem bound to Ruby's IO class -- nothing wrong with that as such, but if you're dealing with a cloud resource, it's not really "on disk" and it's not really an IO either. What I might suggest for a higher level abstraction is a class that minimally implements#each
to yield chunks of the file data; I could also see#stream
to get an IO.#size
is probably fine too, fwiw.Here's a partial illustration, where the
io
used to instantiate the class is an instance ofAws::S3::Object
Interested in any and all feedback. Thanks!