shrinerb / shrine

File Attachment toolkit for Ruby applications
https://shrinerb.com
MIT License

Q: Different shrine store for different versions? #254

Closed jrochkind closed 6 years ago

jrochkind commented 6 years ago

I would like to store my :original "version" in a different store than the other versions. For my use case, they will both be on S3, but I'd like different buckets -- I have different lifecycle/backup requirements for the :original (which is irreplaceable) than for the derivative versions, which can be regenerated from the original.

Is there any reasonable way to set this up and have everything else work seamlessly once set up? (i.e. URLs for different versions come from the appropriate buckets, etc.)

janko commented 6 years ago

At the moment Shrine doesn't have a built-in way of using a different storage for the original file and its versions. We're planning to implement better versions management, which would allow this naturally.

In the meantime you could achieve this by storing the versions data in a separate attachment field from the original file. After the original file has been promoted to permanent storage, you can apply the processing:

Shrine.storages = {
  cache:          Shrine::Storage::MyStorage.new(...),
  store:          Shrine::Storage::MyStorage.new(...), # holds the original file
  versions_store: Shrine::Storage::MyStorage.new(...), # holds the generated versions
}

class ImageUploader < Shrine
  plugin :versions # allows uploading a hash of files
end

class Photo
  include ImageUploader::Attachment.new(:image)          # the original attachment
  include ImageUploader::Attachment.new(:image_versions) # the versions attachment
end

class ProcessImageSizes
  include ImageProcessing::MiniMagick

  def call(original)
    versions = {}
    versions[:large]  = resize_to_limit(original,          800, 800)
    versions[:medium] = resize_to_limit(versions[:large],  500, 500)
    versions[:small]  = resize_to_limit(versions[:medium], 300, 300)

    # upload the hash of versions to the :versions_store storage
    uploader = ImageUploader.new(:versions_store)
    uploader.upload(versions)
  end
end
photo.image = file
photo.image_versions = nil # clear versions when attachment changes
photo.save # original uploaded to permanent storage

versions = photo.image.open { |original| ProcessImageSizes.new.call(original) }
photo.image_versions_attacher.set(versions)
photo.save
jrochkind commented 6 years ago

Thank you for the possible implementation under the current code, and for the link to the discussion. There's lots to think about there; I've got to read it all a few more times and play with it.

(I find Shrine takes me more time to understand than other solutions, but I think that's a worthy trade-off for its increased flexibility, which I need for my use cases (I work in digital preservation in libraries).)

Is there any code written yet for the other approach you discuss in that thread? I agree with your conclusions on that thread, I think. At some point I might give it a try, although I'm not sure I'll have time soon.

I'd suggest that for backwards compatibility you consider releasing the new architecture as a different plugin, leaving the current versions one the same (and possibly deprecated). Although I guess that'd cause problems if there are other current plugins that depend on the current versions API with its array-style access, if the new plugin doesn't have the same API (which I'm not sure it should). Either way, if it does end up being a new plugin, I'd suggest the name derivatives rather than versions -- "derivatives" matches the thinking that the original should always be there, untouched, in case your processing went wrong; the new things are "derivatives" of it. And "derivatives" is the word used in the digital preservation community, and I think in others as well.

jrochkind commented 6 years ago

More thoughts:

It sounds like storing versions in a different DB column is the way you're leaning in your 'ideal' implementation/API too, not just in the current workaround. That makes sense to me.

E.g. they might want to generate some versions in the foreground, and others in the background, and maybe others later in another background job.

Yes. I do start to worry about DB update race conditions, though -- what if one background job generates a new version/derivative and tries to save it in the DB, for which it needs to supply the full hash (or other data structure) of all versions/derivatives, possibly overwriting a version/derivative change made by another job/thread at just the wrong time?
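
To make that concrete, here's a sketch of the lost-update pattern I'm worried about (hypothetical column and helper names):

# Job A and Job B both run this read-modify-write concurrently:
derivatives = photo.reload.image_versions_hash   # A reads {large: ...}; B reads the same
derivatives[:small] = generate_small_version     # each job adds a different key
photo.update(image_versions_hash: derivatives)   # last writer wins; the other job's key is silently lost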

Really this is a threat even with the current versions implementation, but explicitly providing affordances for different versions to be generated at different times might make it worse.

The other trick with the two-column implementation is making sure things always stay synced properly -- that changing the original upload always zeroes out the existing versions/derivatives (except for any created after the new original, of course). And ideally doing this in one DB update (as far as the ORM is concerned, and ideally in one actual DB transaction), both to minimize race conditions and to make it look like "one" update to the ORM for any auditing/tracking purposes.
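
For instance (hypothetical *_data column names), replacing the original and clearing stale derivatives could be a single statement:

# one ORM update (and one SQL UPDATE) covering both columns:
photo.update(
  image_data:          new_original.to_json, # the new original file's data hash
  image_versions_data: nil                   # wipe derivatives generated from the old original
)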

jrochkind commented 6 years ago

Although actually, at least putting versions/derivatives in their own column, separate from the original, means race conditions on version creation/updating no longer put the original at risk!

janko commented 6 years ago

Is there any code written yet for the other approach you discuss in that thread?

Not yet, unfortunately, I'd welcome pull requests.

I'd suggest that for backwards compatibility you consider releasing the new architecture as a different plugin, leaving the current versions one the same (and possibly deprecated).

Yeah, that was my plan 👍

Although I guess that'd cause problems if there are other current plugins that depend on the current versions API with its array-style access, if the new plugin doesn't have the same API (which I'm not sure it should).

I think there shouldn't be any plugins in Shrine that really depend on the versions API. I tried to make them versions-agnostic, and keep all versions-specific behaviour inside the versions plugin.

Either way, if it does end up being a new plugin, I'd suggest the name derivatives rather than versions -- "derivatives" matches the thinking that the original should always be there, untouched, in case your processing went wrong; the new things are "derivatives" of it. And "derivatives" is the word used in the digital preservation community, and I think in others as well.

Great, thanks for the suggestion! Yeah, I agree that "versions" is not accurate, and this is a good opportunity to migrate away from that terminology. I was thinking either "derivatives" or "variants" (like ActiveStorage). I think "derivatives" is more accurate.

It sounds like storing versions in a different DB column is the way you're leaning in your 'ideal' implementation/API too, not just in the current workaround. That makes sense to me.

Actually, in that Google group thread I was more inclined towards keeping the versions in the same DB column. The idea was to just add a versions top-level key:

{
  "id": "kjsd824249jdg.jpg",
  "storage": "store",
  "metadata": { ... },
  "versions": {
    "large": { ... },
    "medium": { ... },
    "small": { ... },
  }
}

I'd prefer that users only need to use one DB column. But if there are compelling reasons to use two columns, I'm open to the discussion.

Yes. I do start to worry about DB update race conditions, though -- what if one background job generates a new version/derivative and tries to save it in the DB, for which it needs to supply the full hash (or other data structure) of all versions/derivatives, possibly overwriting a version/derivative change made by another job/thread at just the wrong time?

The same thing occurred to me. At the moment Shrine's backgrounding plugin tries to minimize the chance of race conditions by retrieving the attachment after the versions have been processed and uploaded, and checking whether it has changed. It would definitely be nice to make adding versions thread-safe; I'm open to ideas.
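
Roughly, the check works like this (a simplified sketch, not the exact plugin code; Attacher#swap replaces the attachment only if it hasn't changed in the meantime):

# in the background job, after processing and uploading:
attacher = photo.image_attacher
attacher.swap(processed_file) # reloads the record; no-op if the attachment changed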

If we don't figure it out, I think it should still be ok, because it's not really a common requirement to have two background jobs adding versions for the same attachment.

The other trick with the two-column implementation is making sure things always stay synced properly -- that changing the original upload always zeroes out the existing versions/derivatives (except for any created after the new original, of course). And ideally doing this in one DB update (as far as the ORM is concerned, and ideally in one actual DB transaction), both to minimize race conditions and to make it look like "one" update to the ORM for any auditing/tracking purposes.

If we decide for the two-column implementation, then I definitely agree 👍

jrochkind commented 6 years ago

Something about the two-column storage somehow seems right to me, but I'm not sure why. I like keeping the original pristine, with just original stuff, and the derivatives off in their own column. I guess it would make it easier to track audit history for just the original and not the derivatives, which I think might be a use case. Might it also make it easier to have the derivatives use a different storage backend than the original, without it being super confusing, while allowing reuse of a lot of your existing architecture? (That is definitely a prime use case -- it's what began this thread!)

But it's just a gut feeling; I'm not sure. If it makes the rest of the implementation significantly more complex, then perhaps that would be a reason to scrap it.

I don't think it's a developer-user usability concern: I don't think it's any "harder" to add a second column for derivatives, so I don't think that's a big deal.

Not sure about the race conditions; I have to think more about it. With any hash storage (and without assuming Postgres, or access to Postgres' atomic hash-key update, which AR doesn't really support -- dunno about Sequel), one could imagine a kind of optimistic locking where the client doesn't just notice there's been a change and abort, but instead notices there's been a change, merges the desired changes on top of the already-changed hash, and saves again.

janko commented 6 years ago

@jrochkind What you say are definitely compelling reasons to use two DB columns.

I don't think it's a developer-user usability concern: I don't think it's any "harder" to add a second column for derivatives, so I don't think that's a big deal.

At first I thought it might be annoying to have to add the second DB column, but I agree that it shouldn't be a big deal. It would make handling derivatives a completely independent addition on top of the regular upload flow, leaving it unchanged, which I'm attracted to. Thanks so much for sharing your point of view!

Not sure about the race conditions; I have to think more about it. With any hash storage (and without assuming Postgres, or access to Postgres' atomic hash-key update, which AR doesn't really support -- dunno about Sequel), one could imagine a kind of optimistic locking where the client doesn't just notice there's been a change and abort, but instead notices there's been a change, merges the desired changes on top of the already-changed hash, and saves again.

Yeah, we should take into consideration the possibility of doing atomic hash-key updates on JSONB columns (for text columns this would fall back to a regular replace-all update). I think we'll have a better idea of how to handle race conditions once we have a working implementation of the derivatives plugin.
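
For example, with Sequel on PostgreSQL something like this should work (a sketch, assuming a JSONB image_versions_data column and an already-uploaded small_file; not Shrine API):

DB.extension :pg_json          # JSONB value support (Sequel.pg_jsonb)
Sequel.extension :pg_json_ops  # JSONB operators (Sequel.pg_jsonb_op / jsonb_set)

# merge one derivative's data into the JSONB column atomically, so two jobs
# writing different keys can't clobber each other:
DB[:photos].where(id: photo.id).update(
  image_versions_data: Sequel.pg_jsonb_op(:image_versions_data)
                             .set(%w[small], Sequel.pg_jsonb(small_file.data))
)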

jrochkind commented 6 years ago

There may be a way to eliminate race conditions even with text columns, but I don't yet understand how to make it work with Shrine's ORM-agnostic architecture -- and then there's figuring out whether it can be done efficiently in any DB. But in Ruby pseudocode:

def add_derivative(key, value)
  MAX_ATTEMPTS.times do
    current = fetch_current_derivatives_hash # read the current hash from the db

    # DIY one-column optimistic locking: update the column only if its current
    # value is still what we just read, i.e. it hasn't been changed by someone
    # else. Relies on the ability to do a "where" on the hash serialization;
    # may require db-appropriate indexing.
    updated_rows = where(derivatives: current)
                     .update(derivatives: current.merge(key => value))

    return if updated_rows == 1

    # 0 rows updated: optimistic locking failed, go back to the top and retry
  end

  raise "couldn't add derivative #{key}: too many concurrent updates"
end

Alternately, if you add another column, derivatives_updated_at, you don't need to do a where on the serialized hash; you can do a where on derivatives_updated_at (and of course set it to now in the update). (Ideally the DB has microsecond precision on timestamp columns, which Rails now supports when the DB does.)
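
In the same pseudocode style, the timestamp variant (hypothetical derivatives_updated_at column and fetch helper) would be:

def add_derivative(key, value)
  current, read_at = fetch_derivatives_and_timestamp # one SELECT from the db

  # update only if derivatives_updated_at still matches the value we read
  updated_rows = where(derivatives_updated_at: read_at)
                   .update(derivatives: current.merge(key => value),
                           derivatives_updated_at: Time.now)

  # 0 rows updated means someone beat us to it: loop and retry, as above
end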

Just some ideas. At some point I'd love to take a stab at the derivatives plugin, but I can't say for sure if I'll have time soon or not (that is, I can't say at the moment when/if I'll get to the point of needing it for my local work. :) ).

I do see a use case for different background jobs adding different derivatives (in my domain some derivatives are awfully expensive -- if you're starting with a 250MB TIFF, every derivative is expensive :) ), so there are race condition concerns.

I think needing to add a fixed number (one or two) of extra columns for the derivatives plugin, holding all the derivatives you want (not one column per derivative!), is a pretty small inconvenience for good semantics; you do it once in a migration and you're done. I don't think it's a major barrier.

janko commented 6 years ago

There may be a way to eliminate race conditions even with text columns, but I don't yet understand how to make it work with Shrine's ORM-agnostic architecture -- and then there's figuring out whether it can be done efficiently in any DB. But in Ruby pseudocode:

Yeah, that would be the correct "SQL way" to avoid race conditions. Initially I wanted to do it for Shrine's promotion in general (which behaves the same regardless of whether you're using versions or not), but that change was backwards-incompatible for anyone relying on callbacks getting executed. But with the new derivatives plugin we can switch to the UPDATE statement, because it's a new feature and because the API would be more explicit, so people shouldn't need to use callbacks.

Another issue that needs to be solved is detecting when the attachment has changed, as it's possible for the user to replace the current attachment with a new one before the versions have finished processing. It probably shouldn't be too difficult, though.

Alternately, if you add another column, derivatives_updated_at, you don't need to do a where on the serialized hash; you can do a where on derivatives_updated_at (and of course set it to now in the update). (Ideally the DB has microsecond precision on timestamp columns, which Rails now supports when the DB does.)

Yeah, let's keep that as a secondary option, I would just need to give it some more thought.

At some point I'd love to take a stab at the derivatives plugin, but I can't say for sure if I'll have time soon or not (that is, I can't say at the moment when/if I'll get to the point of needing it for my local work. :) ).

That would be great! 😃

I do see a use case for different background jobs adding different derivatives (in my domain some derivatives are awfully expensive -- if you're starting with a 250MB TIFF, every derivative is expensive :) ), so there are race condition concerns.

Ok, then we definitely want to support this use case.

I think needing to add a fixed number (one or two) of extra columns for the derivatives plugin, holding all the derivatives you want (not one column per derivative!), is a pretty small inconvenience for good semantics; you do it once in a migration and you're done. I don't think it's a major barrier.

Yes, I agree. There is a separate discussion happening on the Google group, where Radovan was suggesting we should have one record for each derivative (kind of like ActiveStorage does). I'm not entirely sold on this idea yet, as I think it would make the implementation more complex and give us less chance of retaining backwards compatibility (e.g. should we then also have the original file in a separate record?). It might also be a performance problem when there are lots and lots of derivatives, e.g. with video transcoding or splitting PDFs. But it should make it easier to handle race conditions. We could probably move the discussion there.

jrochkind commented 6 years ago

Thanks for your thoughts @janko-m! I definitely plan to get to this in the medium term, the next couple of months; it fits into my plans/needs at my day job. What I end up doing might be fitted to my particular use case and not (at least initially) meet needs universally, but hopefully it'll at least be a start.

I'm going to throw one additional difficult spec/question in for you. You gave a good start on creating versions/derivatives that live in their own column (which I like in itself), and thus support a different storage for versions/derivatives. What if we want (optionally) a different storage per derivative -- some in one storage, some in another -- while still keeping all derivatives in a hash in one column? Any ideas on how to use the Shrine parts to achieve that? Exactly how storage settings interact with Shrine's internal code is something I have yet to wrap my head around.

'Cause I tend to agree with your initial thoughts that one column per derivative is too much, and a separate persisted object per derivative is really too much.

jrochkind commented 6 years ago

PS: Another spec I want my solution to handle: you should actually be able to tell "the code" how to make each derivative, via some API like derivative(:whatever) { |stream| block_that_creates_deriv }, or derivative :whatever, ObjectThatCreatesDeriv.new, or whatever.

That will make it a lot more straightforward to "create all derivatives that don't exist yet", or "force re-create derivative X on all existing records, because I changed the definition", without having to duplicate code or come up with your own architecture on top.

janko commented 6 years ago

What if we want (optionally) a different storage per derivative -- some in one storage, some in another -- while still keeping all derivatives in a hash in one column? Any ideas on how to use the Shrine parts to achieve that?

This will be easy, because each uploaded file has its storage name stored in the database, which identifies where the file was uploaded. That way, when Shrine loads the Shrine::UploadedFile object from the database, all its methods like #url, #delete and others will operate on the storage assigned to that uploaded file.

# initializer
Shrine.storages[:other_store] = Shrine::Storage::MyStorage.new(...)
# uploaded file data
{
  "id": "fsoww8e9ro39s.jpg",
  "storage": "other_store",  # <==== storage
  "metadata": {
    # ...
  }
}

This is how the original file and each version are currently stored, and it's how derivatives will be stored as well. This means the derivatives API just needs to allow overriding the destination storage for each derivative (something the current versions plugin doesn't support).
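
Concretely, with the pieces that exist today it could look like this (a sketch, reusing the storage names from earlier in this thread; small_io/large_io are hypothetical processed files):

# upload each derivative with an uploader bound to the desired storage; the
# storage name then gets persisted in that file's data hash
thumbs  = ImageUploader.new(:versions_store)
archive = ImageUploader.new(:store)

derivatives = {
  small: thumbs.upload(small_io),   # data will contain "storage": "versions_store"
  large: archive.upload(large_io),  # data will contain "storage": "store"
}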

Exactly how storage settings interact with Shrine's internal code is something I have yet to wrap my head around.

IMO a good place to start would be looking at the implementation of the Shrine::UploadedFile class, which is initialized with a hash like the one I've shown above.
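
For example, this is how a data hash gets loaded back (Shrine.uploaded_file is existing API):

# round-tripping a data hash back into an UploadedFile:
uploaded_file = ImageUploader.uploaded_file(
  "id"       => "fsoww8e9ro39s.jpg",
  "storage"  => "other_store",
  "metadata" => {},
)

uploaded_file.storage # => the storage object registered under :other_store
uploaded_file.url     # URL generated by that storage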

PS: Another spec I want my solution to handle: you should actually be able to tell "the code" how to make each derivative, via some API like derivative(:whatever) { |stream| block_that_creates_deriv }, or derivative :whatever, ObjectThatCreatesDeriv.new, or whatever.

That will make it a lot more straightforward to "create all derivatives that don't exist yet", or "force re-create derivative X on all existing records, because I changed the definition", without having to duplicate code or come up with your own architecture on top.

I'm generally reluctant to add any sort of DSL, especially for processing. That's something I really wasn't a fan of when I was using CarrierWave. I think it reduces flexibility, because it won't be easy to, say, use information from the controller to affect which versions get generated. The DSL would then need to support generating a derivative from an existing derivative, conditionals, and so on.

I would like to keep the "one block of instance-level code" approach if possible. Lately I've been thinking that a Shrine::UploadedFile#download block could be the idiomatic way of generating derivatives:

photo.image.download do |original|
  pipeline = ImageProcessing::MiniMagick.source(original)

  derivatives = {
    large:  pipeline.resize_to_limit!(800, 800),
    medium: pipeline.resize_to_limit!(500, 500),
    small:  pipeline.resize_to_limit!(300, 300),
    # ...
  }

  photo.image_attacher.add_derivatives(derivatives)
  # or
  derivatives.each do |name, file|
    photo.image_attacher.add_derivative(name, file)
  end
end

So my current idea is not to define processing inside the uploader, but rather to let the user generate derivatives dynamically in the controller. That way they can easily use any variables from the controller, and they don't need to worry about how to pass them in to the uploader.

jrochkind commented 6 years ago

Awesome, thanks for the hints about a different store per version -- that seems promising!

I'm generally reluctant to add any sort of DSL, especially for processing.

I understand your point. But I think "I added a new version and need to add it to all files" is a common use case, as is "I changed the definition of a version and need to regenerate it for all files".

If I understand things right (and I may not), currently accomplishing those use cases -- without regenerating all versions -- requires manual intervention, un-DRY copy-and-paste coding, or a custom architecture you need to think through and build on top: https://github.com/shrinerb/shrine/blob/master/doc/regenerating_versions.md#reprocessing-a-single-version

I think that's unfortunate; it's such a common use case (at least for me) that there should be an easy way to say "regenerate just version X", or "generate version X only if it's not already stored, for each record". Which I think requires Shrine to know how to generate "version X", not just "all versions". Which I think requires some API that defines how to create a specific version (whether you call it a "DSL" or not -- I prefer to just think of it as an API :) ).
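
To sketch what I mean (none of this is existing Shrine API; derivative and create_derivative are made up):

class ImageUploader < Shrine
  # each derivative gets its own named definition...
  derivative :small do |original|
    ImageProcessing::MiniMagick.source(original).resize_to_limit!(300, 300)
  end
end

# ...which would let generic tooling target a single derivative:
photo.image_attacher.create_derivative(:small) # (re)generate just :small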

Perhaps it can/should allow both: a way to specify creation logic for a specific version, as well as a generic method you can use when that doesn't work for you.

Or perhaps the plugin that will meet my needs does not match your vision for a generic plugin, and I'll just end up with my own, either independent or on top of an official shrine one.

jrochkind commented 6 years ago

Right -- either un-DRY, or you have to come up with your own architecture on top of Shrine, for what I think is a very common use case. So I'm definitely interested in making this clean and painless out of the box in a versions/derivatives plugin, though I understand it may not be the one that gets endorsed by Shrine.

jrochkind commented 6 years ago

I'd like to work on this, ideally as a contribution to Shrine, as a plugin that takes care of it for you. More thoughts and questions posted to the Google group.

janko commented 6 years ago

Thank you @jrochkind, that would be greatly appreciated! I will then close this issue in favour of moving the discussion to the Google group.