janko / image_processing

High-level image processing wrapper for libvips and ImageMagick/GraphicsMagick
MIT License
869 stars 77 forks source link

Support for true streaming #60

Open jcupitt opened 4 years ago

jcupitt commented 4 years ago

Hello all,

git master libvips has a new feature which might be relevant to the image_processing gem: true streaming. There's a short note about it here:

https://libvips.github.io/libvips/2019/11/29/True-streaming-for-libvips.html

You can now connect libvips image processing pipelines directly to any source or destination and it'll process data in a fully streaming manner. For example, you could fetch bytes from one URI (perhaps an S3 bucket), process, and write to another URI (perhaps a different bucket) in parallel and with no buffering.

We don't have any benchmarks yet, but we hope that this will produce a useful drop in latency. It should be fun to experiment with anyway. I don't think (as far as I know) anyone has done this before.

This branch of ruby-vips adds support:

https://github.com/libvips/ruby-vips/tree/add-stream

You can see examples of the proposed API here:

https://github.com/libvips/ruby-vips/blob/add-stream/spec/stream_spec.rb

Sample:

      streami = Vips::Streami.new_from_descriptor 12
      image = Vips::Image.new_from_stream streami, '', access: :sequential
      streamo = Vips::Streamo.new_to_descriptor 13 
      image.write_to_stream streamo, '.png'

So that'll stream pixels between file descriptors 12 and 13 (perhaps they are pipes connected to S3 URIs), reading any supported image format, and writing as PNG.

You can do anything in the pipeline (blur, rotate, composite, etc.). There's thumbnail_stream which can efficiently thumbnail images directly from a stream source. You can also open streams multiple times, examine the header and adjust parameters without actually processing pixels.

Questions!

  1. Does this look interesting?
  2. How does image_processing hook up to storage systems like S3 at the moment?
  3. Is the proposed API enough, or do you need to be able to make a custom stream class? That would require a little more work on ruby-vips.
  4. I've no idea if support could be added without breaking your API. Perhaps you could allow URIs where you now allow filenames?
  5. Any comments, changes or fixes would be extremely welcome, of course.
renchap commented 4 years ago

I would be very interested in this. I am using image_processing with shrine and direct uploads, so my workflow is: Google Cloud Storage bucket (= image from URL) => image processing with libvips => multiple outputs to Google Cloud Storage (= HTTP requests).

I guess this would need multiple changes from this gem as well as Shrine & Shrine storages, but I think the first step would be to see how much efficient (time spend, memory usage…) this would be.

jcupitt commented 4 years ago

Yes, I wonder how helpful it will be in practice.

Uploads have to give content-length, so you need to finish the result before you can start writing. If the upload is >5MB you can chunk it, but I think most uploads will be smaller than that.

We need benchmarks!

jcupitt commented 4 years ago

I added custom input and output streams to ruby-vips and merged to master.

You can now write this, for example:

class Mystreami < Vips::Streamiu
  def initialize(filename)
    @filename = filename
    @contents = File.open(@filename, "rb").read
    @length = @contents.length
    @read_point = 0

    super()

    signal_connect "read" do |buf, len|
      bytes_available = @length - @read_point
      bytes_to_copy = [bytes_available, len].min
      buf.put_bytes(0, @contents, @read_point, bytes_to_copy)
      @read_point += bytes_to_copy

      bytes_to_copy
    end
end

signal_connect lets you attach a handler that does the read.

You can use it like this:

input_stream = Mystreami.new("some/file")
image = Vips::Image.thumbnail_stream input_stream, 128

There's a more complete example here, with output as well, and implementing seek:

https://github.com/libvips/ruby-vips/blob/master/example/stream.rb

It should be easy to hook it up to something that does read/write to URIs.

janko commented 4 years ago

@jcupitt Thanks a lot for working on this! It definitely sounds exciting, and we might add support for it in the near future.

If I would want to feed data from an IO-like object that already implements IO#read and IO#seek methods into the input stream, and retrieve chunks of processed file data from the output stream, would something like the following work?

source_io #=> some kind of IO-like object
input_stream = Vips::Streamiu.new
input_stream.signal_connect "read" do |buf, len|
  chunk = source_io.read(len)
  buf.put_bytes(0, chunk, 0, chunk.bytesize)
  chunk.bytesize
end
input_stream.signal_connect "seek" do |offset, whence|
  source_io.seek(offset, whence)
end

image = Vips::Image.new_from_stream input_stream, "", access: "sequential"

destination_io #=> some kind of IO-like object
output_stream = Vips::Streamou.new
output_stream.signal_connect "write" do |buf, len|
  chunk = buf.get_bytes(0, len)
  destination_io.write(chunk)
  chunk.bytesize
end
output_stream.signal_connect "finish" do
  destination_io.close
end

image.write_to_stream(output_stream, ".png")

For Shrine it would be a very common use case. In that sense I think it would be very beneficial to simplify this in ruby-vips, for example automatically reading/writing from/to the buffer, so that one could do something like this:

input_stream = Vips::Streamiu.new
input_stream.on_read { |length| source.read(length) } # buffer writing would happen under-the-hood
input_stream.on_seek { |offset, whence| source.seek(offset, whence) }

image = Vips::Image.new_from_stream input_stream, "", access: "sequential"

output_stream = Vips::Streamou.new
output_stream.on_write { |chunk| dest.write(chunk)  } # chunk is automatically read from the buffer
output_stream.on_finish { dest.close }

image.write_to_stream output_stream, ".png"

I'm not suggesting this should be the API, just that eventually having something like this in ruby-vips might lower the barrier entry for using streams significantly. But until the initial streams release in libvips I think the current API would suffice, just wanted to put the idea out there 😃

jcupitt commented 4 years ago

Hiya, yes, your first example should work, from a quick look.

Yes, you need to use FFI get_bytes and put_bytes at the moment. It's ugly, but I think it's necessary if you want to avoid a copy.

It's true, we could add a simple no-FFI layer over it. I think you'd just need:

module Vips
  class Streamiu
    def on_read length, &block
      signal_connect "read" do |buf, len|
        chunk = block.call(length)
        buf.put_bytes(0, chunk, 0, chunk.bytesize)
        chunk.bytesize
      end
    end
  end
end

Or would you need to add a Proc to capture the outer block? I always get confused.

janko commented 4 years ago

Thanks a lot for the review. Yes, your example would work. When libvips 8.9 is released and when we start working on adding the streaming support to ImageProcessing, I can help with adding the extra convenience layer to ruby-vips.

I've one more question. When we pass the content string to put_bytes (2nd argument), does libvips take a copy of it, making it safe to deallocate that string afterwards on the Ruby side (e.g. using chunk.clear)? Same question for chunks retrieved from get_bytes.

I would like to make this streaming memory-efficient by deallocating all content chunks we've allocated for sending to libvips. If this works, then I believe we have everything we need for adding support to ImageProcessing 💪

jcupitt commented 4 years ago

put_bytes is just a thin FFI layer over the C memcpy -- there's no memory alloc or free there.

get_bytes takes a copy, unfortunately, so you're right, we might need to be careful if we want to keep good GC behaviour.

VipsStreamo does simple write buffering, so generally write will be called on 8kb chunks of data (rather than 1000s of tiny writes), which should make it simpler.

jcupitt commented 4 years ago

I added an IO::-style layer, so this now runs:

#!/usr/bin/ruby

require 'vips'

source = File.open ARGV[0], "rb"
input_stream = Vips::Streamiu.new
input_stream.on_read { |length| source.read length }
input_stream.on_seek { |offset, whence| source.seek(offset, whence) }

dest = File.open ARGV[1], "w"
output_stream = Vips::Streamou.new
output_stream.on_write { |chunk| dest.write(chunk) }
output_stream.on_finish { dest.close }

image = Vips::Image.new_from_stream input_stream, "", access: "sequential"
image.write_to_stream output_stream, ".png"

Eg.:

./stream.rb ~/pics/k2.jpg x.png

Thanks for the idea!

janko commented 4 years ago

I added an IO::-style layer, so this now runs:

That's awesome! Thanks for adding it so quickly 😃

get_bytes takes a copy, unfortunately, so you're right, we might need to be careful if we want to keep good GC behaviour.

Did you mean "returns a copy"? I think that totally makes sense, as long as one can deallocate that string once writing it to the destination.

output_stream.on_write do |chunk|
  dest.write(chunk)
  chunk.clear
end

As I understood, the above isn't expected to raise any error within libvips, so I think we're all good 😉

Sorry for not checking these things myself, I haven't gotten to installing libvips master on my Mac yet. I will try to figure out whether I can get Homebrew to install it, so that I can do it in a single command.

jcupitt commented 4 years ago

I build on mac by installing their vips to pull in all the dependencies, then removing it again and doing the usual git clone && ./autogen.sh --prefix=/Users/john/vips && make && make install etc.

I added your chunk.clear trick, and something to stop exceptions being thrown across FFI boundaries.