libvips / pyvips

python binding for libvips using cffi
MIT License

Fast way to write pyvips image to FFmpeg stdin (and others; suggestions; big read) #198

Closed Tremeschin closed 4 years ago

Tremeschin commented 4 years ago

Hey @jcupitt, it looks like you're very active helping people here, so I thought it wouldn't hurt to ask something. This will be a bit long, as I'll give you the full context and would like some suggestions if possible.

I'm working on a project making a music visualizer, and unsurprisingly I'm dealing with lots and lots of image processing: gaussian blur, vignetting, resizing, alpha compositing, etc.

While numpy + PIL work, they aren't that fast compared to a proper image library or a GPU-accelerated canvas (the latter wouldn't quite work here because I have to move a lot of images back and forth, and textures are somewhat expensive, so I'd have to do the processing on the GPU itself; I'm not that deep into it yet).

For an 8.5 second audio file, the non-multiprocessed code takes about 3 minutes to make a 720p60 video out of it. I thought: hmm, I don't have only one thread on my CPU, so multiprocessing should work better, right? No! The IPC and mutex shenanigans I wasn't aware of didn't scale up much, though they did cut the processing time in half, down to 1m30s.

I tried things like NodeJS, and thought about using C++ from Python to alpha composite and process the images very fast; the first one didn't quite work, and I haven't tried the second one yet.

So I stumbled across pyvips, and with little effort I could alpha composite 100 particle images (at random coordinates) onto a 4k image at 84 fps!! It didn't even use much RAM and only half the CPU.

Though when piping the images to FFmpeg, we have to convert them into a format we can write to stdin that is readable by FFmpeg and compatible with its arguments.

Here comes my question after this short context: each frame I convert the pyvips image to bytes and write it to FFmpeg's stdin, using something like the lines below.
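A sketch of the sort of conversion I mean, going through numpy + PIL to get the raw bytes (it matches the loop further down this thread, not my exact lines):

import numpy as np
import pyvips
from PIL import Image

# hypothetical single frame, just to illustrate the conversion
canvas = pyvips.Image.new_from_file("walp1080.jpg").copy(interpretation="srgb")

# render the pyvips pipeline, wrap the raw buffer, then get bytes for stdin
frame = np.ndarray(
    buffer=canvas.write_to_memory(),  # assuming a uchar image
    dtype=np.uint8,
    shape=[canvas.height, canvas.width, canvas.bands],
)
data = Image.fromarray(frame).tobytes()
# self.pipe_subprocess.stdin.write(data)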

Piping the images takes about 0.0764 seconds in total per cycle of making + converting to bytes + writing to stdin, but the conversion lines alone take about 0.0617 seconds to run (those numbers are averages over 510 piped frames). That's nearly all the time spent in the loop.

I'm not sure how to ask this, but am I doing something wrong, or is there a better way of getting the full fat raw image out of a pyvips Image object and sending it to FFmpeg's stdin?

Again, this is my main bottleneck, so any advice on quickly converting the images to a video is what I need.

I use somewhat canonical piping arguments (with the ffmpeg-python package):

self.pipe_subprocess = (
    ffmpeg
    .input('pipe:', format='image2pipe', pix_fmt='rgba', r=self.context.fps, s='{}x{}'.format(self.context.width, self.context.height))
    .output(output, pix_fmt='yuv420p', vcodec='libx264', r=self.context.fps, crf=18, loglevel="quiet")
    .global_args('-i', self.context.input_file)
    .overwrite_output()
    .run_async(pipe_stdin=True)
)

I only change image2pipe to rawvideo when piping the numpy array's raw data, like this:
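That is, the same pipeline as above with only the input format changed (a sketch):

self.pipe_subprocess = (
    ffmpeg
    .input('pipe:', format='rawvideo', pix_fmt='rgba', r=self.context.fps, s='{}x{}'.format(self.context.width, self.context.height))
    .output(output, pix_fmt='yuv420p', vcodec='libx264', r=self.context.fps, crf=18, loglevel="quiet")
    .global_args('-i', self.context.input_file)
    .overwrite_output()
    .run_async(pipe_stdin=True)
)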

I've seen and read a few places before asking this, most notably:

And I've tried reading through the docs on the pyvips.Image class.

I'm looking forward to using this library for images from now on; it works really well and is VERY simple to use.

I almost cried when I saw the Image.composite method, because I had manually implemented something equivalent by hand here (spoiler: it took a while to crop and composite only the needed parts).

And it looks like pyvips handles big images like they're nothing!!

Thanks for the project; it makes using libvips from Python easy.

Tremeschin commented 4 years ago

Just found out about Most efficient region reading from tiled multi-resolution tiff images #100, so it looks like I'm creating many pipelines in libvips. I'll rewrite the code to blit the current setup I'm testing into a base canvas instead of copying a base image.

Tremeschin commented 4 years ago

Yes, the last linked issue is very promising, though I'm getting black images from the region. Investigating, but speeds are drastically improved now, about as fast as without converting to jpg and saving to FFmpeg.

Tremeschin commented 4 years ago

Ok, so while I can now pipe images to FFmpeg at 40 fps using the fetch method from a pyvips.Region, I'm a bit confused about how to make it work for this specific case. I'll explain what worked and what didn't, as well as what I think would solve this.

I saw the code in vregion.py: it inherits from a pointer object in the __init__ method (super(Region, self).__init__(pointer)), and this is called by the Region.new method if you give it a valid image pointer.

The way my code will work is to sequentially alpha composite and resize multiple images, so for example I'd do something like the following.

And just for testing, I'm throwing in some particles.

Welp talk is cheap show me the code!!

from cmn_video import FFmpegWrapper
from PIL import Image
import numpy as np
import threading
import random
import pyvips

# Map vips formats to numpy dtypes (only uchar is needed here;
# the full table is in the pyvips numpy example)
format_to_dtype = {'uchar': np.uint8}

# Video variables
width = 1920
height = 1080
fps = 60

# 8.5 seconds audio file
nframes = int(8.5 * fps)

# Images to work with
background = pyvips.Image.new_from_file("walp1080.jpg").copy(interpretation="srgb")
particle = pyvips.Image.new_from_file("particle.png").copy(interpretation="srgb")

# Start with a zeros canvas
canvas = pyvips.Image.black(width, height, bands=3).copy(interpretation="srgb")
canvas_region = pyvips.Region.new(canvas)  # Region for fetching pixels out of the canvas

# Number of particles for stressing a bit the code
nparticles = 100

# My FFmpeg wrapper class, only need to pay attention to write_to_pipe and pipe_one_time methods
# You can get this class on a blob linked below

# Needs Context and Controller to operate: Context holds "non-changing" vars, Controller holds dynamic ones

class Context:
    def __init__(self):
        self.fps = fps
        self.width = width
        self.height = height
        self.input_file = "banjo.ogg"
        self.watch_processing_video_realtime = False

class Controller:
    def __init__(self):
        self.total_steps = nframes
        self.core_waiting = False

# Create FFmpegWrapper class
ctx = Context()
con = Controller()
ff = FFmpegWrapper(ctx, con)

# Start the pipe
ff.pipe_one_time("out.mkv")

# Thread to write images to FFmpeg; 8.5 seconds of audio (doesn't need to be exact, just for stats)
threading.Thread(target=ff.pipe_writer_loop, args=(8.5,)).start()

# Loop through each frame
for index in range(nframes):

    # Add the background
    canvas = canvas.composite(background, 'over', x=0, y=0)

    # Composite nparticles on random parts of the image
    canvas = canvas.composite(
        [particle]*nparticles, 'over',
        x=[random.randint(0, width) for _ in range(nparticles)],
        y=[random.randint(0, height) for _ in range(nparticles)]
    ).gaussblur(3)

    # Get the pixels from canvas
    patch = canvas_region.fetch(0, 0, width, height)

    # Convert buffer region.fetch to PIL Image
    image = np.ndarray(buffer=patch, dtype=format_to_dtype[canvas.format], shape=[canvas.height, canvas.width, 3])
    image = Image.fromarray(image)

    # Write this image at that index on final video
    ff.write_to_pipe(index, image.tobytes())

ff.close_pipe()

While this code is mostly a demo of what I did, it won't run as-is because it's missing the FFmpegWrapper() class.

I made a gist that removes the MMV parts of the code if you want to test this snippet.

Welp, here comes the issue I haven't been able to find a fix for.

The pyvips.Region.new(canvas) is kind of a "pointer" to that pyvips Image object, but when we do canvas = canvas.composite(...), according to the documentation it returns a new Image object. For example, on the background line:

print(" Before", id(canvas))

# Add the background
canvas = canvas.composite(background, 'over', x=0, y=0)

print(" After", id(canvas))

The ids change: the memory address the canvas variable points to is a different one, as far as I understood.

So when I keep the Region pointing at the original canvas variable, it does pipe the images correctly, but it only ever sends the original background that was assigned to the canvas.

If I try to re-create the Region from the reassigned canvas inside the main loop, it usually hangs and leaks lots of memory.

I wonder where I'm right / wrong here, and whether there's a way to swap the two images, or to keep the Region's reference valid for fetching after .composite and similar functions.

Writing this in detail just in case somebody who stumbles across this in the future finds a gold mine :)

jcupitt commented 4 years ago

Hello @Tremeschin,

This sounds very interesting.

Piping the images takes about 0.0764 seconds in total per cycle of making + converting to bytes + writing to stdin, but the conversion lines alone take about 0.0617 seconds to run (those numbers are averages over 510 piped frames). That's nearly all the time spent in the loop.

libvips is a lazy image processing library, so yes, all the processing happens on the final write.

When you run a series of operations on an image, they don't execute; instead, they append nodes to a large graph structure that libvips maintains behind your back. When you connect the graph to a final output (a memory area here), the whole thing executes at once using all of your CPUs (hopefully).

When you run Region.fetch, it pulls just a single small patch from the pipeline using just one thread. It skips all the multiprocessing and buffering setup and teardown, so it can be faster if you are working with small patches (e.g. 64 x 64 pixels). It'll be slower than the usual .write_to_memory() for large images, since it can't use more than one thread.
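A rough sketch of the two paths (made-up filename, just to illustrate):

import pyvips

image = pyvips.Image.new_from_file("walp1080.jpg")
blurred = image.gaussblur(3)  # nothing is computed yet, this just builds pipeline nodes

# render the whole pipeline to memory: threaded, best for full frames
data = blurred.write_to_memory()

# or pull one small patch with a single thread: best for tiny regions
region = pyvips.Region.new(blurred)
patch = region.fetch(0, 0, 64, 64)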

You need to make and then render a new pipeline for each frame, something like this:

background = pyvips.Image.new_from_file("walp1080.jpg")
particle = pyvips.Image.new_from_file("particle.png")
for index in range(nframes):
    canvas = background

    canvas = canvas.composite(
        [particle] * nparticles, 'over',
        x=[random.randint(0, width) for _ in range(nparticles)],
        y=[random.randint(0, height) for _ in range(nparticles)]
    )
    canvas = canvas.gaussblur(3)

    ff.write_to_pipe(index, canvas.write_to_memory())

You can send write_to_memory directly to ffmpeg, I think. I guess width x height is 1920 x 1080, so it'll be much faster than fetch.
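For example, a minimal sketch with plain subprocess rather than your wrapper, assuming uchar pixels and the rawvideo demuxer:

import subprocess
import pyvips

frame = pyvips.Image.new_from_file("walp1080.jpg").gaussblur(3)  # hypothetical single frame

pipe = subprocess.Popen(
    ["ffmpeg", "-y",
     "-f", "rawvideo", "-pix_fmt", "rgb24" if frame.bands == 3 else "rgba",
     "-s", f"{frame.width}x{frame.height}", "-r", "60", "-i", "pipe:",
     "-pix_fmt", "yuv420p", "-c:v", "libx264", "out.mkv"],
    stdin=subprocess.PIPE,
)

# write the raw pixels straight from write_to_memory; in your loop you'd do this per frame
pipe.stdin.write(frame.write_to_memory())
pipe.stdin.close()
pipe.wait()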

jcupitt commented 4 years ago

... I've done things a bit like this using webgl; I expect you've had a look at that already.

Tremeschin commented 4 years ago

@jcupitt Yes, I have tried lots of things (maybe not directly webgl) but had really bad luck with most; pyvips and libvips are currently the most promising ones. I tried:

It all failed either because of the inter-process communication between Python and the other languages, or because they just didn't have all the features I needed (or I didn't know them well enough) :(

Edit: sorry, I didn't notice the comment before your last one; I'll try write_to_memory right now!

Tremeschin commented 4 years ago

libvips is a lazy image processing library, so yes, all the processing happens on the final write.

Thanks for this insight, this cleared a lot of stuff in my head.

As for your suggestion of using ff.write_to_pipe(index, canvas.write_to_memory()) directly rather than a Region and fetch: it did work after switching FFmpeg to pix_fmt="rgba", with no issues in the final video.

Performance-wise, it took 48 seconds to finish a 1080p60, 8.5-second video, a really awesome result not gonna lie. The old code would take about 1m30s with 4 Python multiprocessing Process workers, though I'm not converting SVGs or resizing a lot of images here.

Tremeschin commented 4 years ago

I have two other questions (maybe not completely related to the original issue; tell me if you want me to open new issues for them).

In other words, get a big enough "canvas" to work with.

I saw something similar across those links I referred to in the first comment, though I was a bit confused about how that'd work (I haven't tried it yet, so I'm just asking whether it's possible).

Btw, I saw you made some changes to the code; those were quick, temporary R&D scripts, so stuff isn't implemented the way I'd normally do it :)

jcupitt commented 4 years ago

It sounds like you've looked around quite a bit. I think processing.js is the main webgl wrapper people use for this kind of thing, eg.:

https://therewasaguy.github.io/p5-music-viz/

Vignetting: sure, make a mask and multiply your final image by that before the write. xyz will make an image where pixels have the value of their coordinates -- eg.:

x = pyvips.Image.xyz(width, height)
# move origin to the centre
x -= [width / 2, height / 2]
# distance from origin
d = (x[0] ** 2 + x[1] ** 2) ** 0.5
# some function of d in 0-1
vignette = (d.cos() + 1) / 2
# render to memory ready for reuse each frame (you don't want to render this image every frame)
vignette = vignette.copy_memory()

Then to apply the mask:

    ff.write_to_pipe(index, (canvas * vignette).cast("uchar").write_to_memory())

Bleeding: yes, have a look at embed, it'll add a border to an image, copying or reflecting the edges. Do that before the resize, then crop the result.
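For example, something like this (a sketch; the 50 pixel border and 2x scale are arbitrary):

import pyvips

image = pyvips.Image.new_from_file("particle.png")

margin = 50
# add a border that copies the edge pixels outwards
padded = image.embed(margin, margin,
                     image.width + 2 * margin, image.height + 2 * margin,
                     extend="copy")

# resize with the safety border in place, then crop back to the scaled content
resized = padded.resize(2)
result = resized.crop(2 * margin, 2 * margin, 2 * image.width, 2 * image.height)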

Tremeschin commented 4 years ago

Thanks, I'll try these tomorrow, as well as getting SVG loading working from svgwrite's raw SVG string.

(I didn't test "native" SVG data like points and lines; I only tried rendering an image from disk, which was actually the first thing I tried with pyvips.)

It sounds like you've looked around quite a bit.

Aha!! I've been implementing (and try-harding) whatever comes to mind, plus inspiration from other visualizers, in that project, so I've lurked a lot lately and got a good intuition for stuff like alpha compositing and parallel computing (still lacking some proper knowledge).

render to memory ready for reuse each frame (you don't want to render this image every frame)

RIP. The modular music visualizer code updates the vignetting each frame, interpolating based on the last value, as well as other things like gaussian blur, resizing the logo, radial FFT visualization bars around the logo for each audio channel, etc.

There's a demo video in the repository readme if you want to see the trouble I've put myself into porting this to pyvips :laughing:

At least performance seems way faster than my current implementation (mainly at higher resolutions). I'll just sort out this over-resize method and loading an SVG from a string before finally starting to port pyvips into the project.

Will let you know how it goes, thanks again for your insights and references, pyvips is now even more promising!!

jcupitt commented 4 years ago

pyvips has a very fast SVG loader, did you see?

x = pyvips.Image.svgload_buffer(b"""
<svg viewBox="0 0 200 200">
  <circle r="100" cx="100" cy="100" fill="#900"/>
</svg>
""")
Tremeschin commented 4 years ago

Yes, I was a bit dumb and tried images from disk rather than simple circles at first. I can easily get this SVG string from the svgwrite module; I'll test things properly tomorrow, it's a bit late for me now :)

Tremeschin commented 4 years ago

Thanks for your help. I got some interesting results testing these today, though speeds weren't 4x faster, merely 2x (the speed gain I'd get from some alternative multiprocessing under Python, at the cost of huge memory usage).

My best guess is that I'm limited by CPU processing power itself, because there is a HUGE number of pixels being computed for each frame of the final video (resize, vignetting, alpha compositing).

I'm not throwing pyvips away completely, as it's very, very memory efficient and a 100% Python solution. I'll close this issue since the main question was addressed, and if I need any more help I'll comment here again!!

Will be looking into the webgl you mentioned in the upcoming days, as GPUs are better at these rendering methods if the code is implemented the right way (the pitfalls are easy to fall into, though). Seems fun and fast :)

jcupitt commented 4 years ago

I think you're probably spending most of your time in composite. It's a complicated operation and I'm sure it could be optimized a bit more. We'd need to make a benchmark representing your use-case and profile it. For example, for each output region it needs to compute the subset of the input images which overlap that rect, and at the moment this is a simple linear search. That's fine for perhaps 10 input images, but once you have a few hundred it could start to become significant. A simple 2D map would remove that.
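Something like this would be a starting point for such a benchmark (a sketch, reusing the filenames from your snippet above):

import random
import time
import pyvips

background = pyvips.Image.new_from_file("walp1080.jpg").copy(interpretation="srgb")
particle = pyvips.Image.new_from_file("particle.png").copy(interpretation="srgb")

for nparticles in (10, 100, 500):
    start = time.time()
    out = background.composite(
        [particle] * nparticles, "over",
        x=[random.randint(0, background.width) for _ in range(nparticles)],
        y=[random.randint(0, background.height) for _ in range(nparticles)],
    )
    out.write_to_memory()  # force the pipeline to execute
    print(nparticles, "particles:", time.time() - start, "seconds")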

I mentioned webgl because I wrote a silly game to teach myself:

https://github.com/jcupitt/argh-steroids-webgl

If you try the game, there's a big explosion when your ship finally dies. It animates 10,000 large alpha blended particles at 60fps on a basic smartphone. It'll easily do 1m on a desktop, change this line for max particles:

https://github.com/jcupitt/argh-steroids-webgl/blob/gh-pages/particles.js#L271

And this line to set the size of the big explosion:

https://github.com/jcupitt/argh-steroids-webgl/blob/gh-pages/particles.js#L498

Tremeschin commented 4 years ago

I've profiled the code in the past (I should've mentioned that here but forgot): most of the time is spent on resizing and gaussian blur. Those are pretty expensive operations for the CPU to keep up with on a high-res image, and they need quite a bit of memory under PIL or cv2.

I'll take a look at the links you sent as well to get some inspiration. Here we go again, breaking things.

jcupitt commented 4 years ago

Are you using orc, by the way? libvips uses it as a runtime compiler, and it makes things like gaussblur a lot quicker. It helps resize a bit too.

16:45 $ time vips gaussblur nina.jpg x.jpg 20
real    0m2.221s
user    0m7.990s
sys 0m0.078s
✔ ~/pics 
16:45 $ time vips gaussblur nina.jpg x.jpg 20 --vips-novector
real    0m3.353s
user    0m11.942s
sys 0m0.090s
Tremeschin commented 4 years ago

Hmmm, not sure; when I get to the computer I'll check and let you know.

Tremeschin commented 4 years ago

I took a day off from coding and yup, looks like I'm using orc.

I'll be properly implementing pyvips on a branch and seeing how it goes. embed was exactly what I needed, and it's an even cleaner solution that will simplify lots of lines in the code base.

webgl + Python seems tricky; I don't really want to code from scratch or deal with IPC anymore, honestly :(

By the way, I found a typo in the documentation here: it says "vertcial" in the direction parameter description :)

Tremeschin commented 4 years ago

I have another question: can I set an anchor point on the image for alpha compositing, rotating (haven't tried yet), or resizing? Maybe I only need it for alpha compositing, though.

I can calculate this by hand easily by taking the pixel difference from the resize, dividing by two and adding an offset, but having a centre point for everything would make the coding much simpler.

Or, in the case of the alpha composite itself, just calculate the top-left point from the image scale, width and height (I'll do this for now, as it's the last bit of processing I need, and I don't care about the other transformations), like the sketch below.
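Something like this is what I mean by calculating the top-left point myself (a sketch with made-up coordinates):

import pyvips

canvas = pyvips.Image.new_from_file("walp1080.jpg")
overlay = pyvips.Image.new_from_file("particle.png")

# desired centre of the overlay on the canvas
cx, cy = 960, 540

# top-left corner so the overlay ends up centred on (cx, cy)
x = int(cx - overlay.width / 2)
y = int(cy - overlay.height / 2)

canvas = canvas.composite(overlay, "over", x=x, y=y)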

Tremeschin commented 4 years ago

What about forcing a resize to a certain resolution (width, height)? .resize only scales by a factor, and it isn't very clear in my head how that'd work.

Guess it's based on vscale (float) – "vertical scale image by this factor" – or should I perhaps even use the thumbnail functions?

Edit: ah yes, it was the thumbnail method; I thought it was just for files, but there's thumbnail_image.

jcupitt commented 4 years ago

Oh heh I fixed the typo, thanks!

No, there are no anchor points for composite.

You can use thumbnail to very quickly load + resize to pixel dimensions in one operation, or thumbnail_image as you say.

You can calculate the scales for resize as eg.:

y = x.resize(target_width / x.width, vscale=target_height / x.height)
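Or with thumbnail_image, forcing exact output dimensions (a sketch; size="force" breaks the aspect ratio):

import pyvips

x = pyvips.Image.new_from_file("walp1080.jpg")
y = x.thumbnail_image(1920, height=1080, size="force")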
Tremeschin commented 4 years ago

How can I .composite with a transparency / opacity? Or, in other words, multiply the alpha channel of an image by a constant?

I found this code snippet on the wiki:

def brighten(filein, fileout):
    im = pyvips.Image.new_from_file(filein, access="sequential")
    im *= [1.1, 1.2, 1.3]
    # if it is a jpg force a high quality
    im.write_to_file(fileout, Q=97)

But it looks like that only scales the RGB bands and not the alpha channel of the image; as in, should I be doing im *= [r, g, b, a]? Maybe I'm missing some casting or band joining here?

The premultiply method in the Image documentation wasn't very clear on how to use it, or whether it even applies this kind of "opacity".

jcupitt commented 4 years ago

You can just scale the alpha before passing the image to composite, eg.:

image *= [1, 1, 1, 0.5]
y = x.composite(image, "over")
Tremeschin commented 4 years ago

Wait, wut, that yielded very strange images and black became white. It could be an issue with my FFmpeg pixel format settings; I'll investigate, since you said it works the way I was using it!!

jcupitt commented 4 years ago

You'd need to post a specific example.