laggykiller / rlottie-python

A ctypes API for rlottie, with additional functions for getting Pillow Image.
https://rlottie-python.readthedocs.io/en/latest/
GNU Lesser General Public License v2.1

Performance question/ discussion/ thoughts? #6

Closed FredHappyface closed 5 months ago

FredHappyface commented 9 months ago

Hi, thank you very much for creating this library. Having something more portable than pyrlottie is awesome, as that's something I was kind of struggling with.

I've noticed that while there's a massive win in terms of cross-platform use, it seems to come at a slight performance cost of about 5%.

Giving myself a refresher on what pyrlottie does versus how I'm using rlottie-python:

I'm using asyncio.create_subprocess_shell with asyncio.Semaphore(multiprocessing.cpu_count()) to call the binaries directly.

With rlottie-python, I'm using concurrent.futures.ProcessPoolExecutor(max_workers=threads) as the executor.

To be honest, I'm not sure if this is a Windows issue. I'm curious if you've noticed any difference in performance between the libraries?

I think if we are in a place where performance is almost equal, then I'd like to start pointing people to this library and offer any dev support, rather than both of us maintaining two libraries that do the same thing.

I realize this is a bit of a brain dump, but I'd be really keen on hearing your thoughts. :)

Thank you

laggykiller commented 9 months ago

How did you test the two libraries? pyrlottie is slower in Windows and Linux from my testing...

Testing using this file: https://github.com/laggykiller/rlottie-python/blob/master/example/sample.tgs

test.py

from rlottie_python import LottieAnimation

anim = LottieAnimation.from_tgs("sample.tgs")
anim.save_animation("test.gif")

test-pyrlottie.py

import pyrlottie

pyrlottie.run(pyrlottie.convSingleLottie(pyrlottie.LottieFile("sample.tgs"), ["test.gif"]))

Testing on Windows: Measure-Command { python .\test.py } vs Measure-Command { python .\test-pyrlottie.py }

Testing on Linux: time python ./test.py vs time python ./test-pyrlottie.py

It would be nice if you could provide the code and commands you used for profiling both libraries. I recommend py-spy for tracking down the culprit.

With rlottie-python, I'm using concurrent.futures.ProcessPoolExecutor(max_workers=threads) as the executor.

Are you saving the frames of the tgs to image files using multiprocessing? Perhaps this is slow because spawning a new Python process is slow. Besides the cost of spawning each worker, every new Python process needs to load its own copy of the file it is converting and of the rlottie library (memory is not shared between Python processes), which adds overhead. Maybe the task of saving one frame of the animation is too small to benefit from multiprocessing (https://stackoverflow.com/questions/68892839/how-to-overcome-overhead-in-python-multiprocessing)? Also, please experiment with different numbers of threads (number of threads = number of CPU cores might not yield the best result).
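A minimal sketch of how the worker-count experiment could be run, assuming the work can be expressed as a function that takes a task index. The helper name sweep_workers and its parameters are hypothetical and not part of either library; the executor class is a parameter so the same sweep can compare processes against threads.

```python
import time
from concurrent.futures import Executor, ProcessPoolExecutor
from typing import Callable, Dict, Iterable


def sweep_workers(
    task: Callable[[int], object],
    n_tasks: int,
    worker_counts: Iterable[int],
    executor_cls: Callable[..., Executor] = ProcessPoolExecutor,
) -> Dict[int, float]:
    """Run the same batch of tasks once per worker count; return seconds taken.

    Hypothetical helper: `task` receives the task index (0..n_tasks-1).
    With ProcessPoolExecutor, `task` must be a picklable module-level function.
    """
    timings: Dict[int, float] = {}
    for workers in worker_counts:
        start = time.perf_counter()
        with executor_cls(max_workers=workers) as pool:
            # list() consumes every result, so worker exceptions surface here
            list(pool.map(task, range(n_tasks)))
        # Pool start-up and shutdown are deliberately included in the timing
        timings[workers] = time.perf_counter() - start
    return timings
```

Calling it with a module-level conversion function and a few counts (for example 1, half the cores, and all cores) from under an if __name__ == "__main__": guard should show whether process start-up cost outweighs the per-task work.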

Also, from running py-spy record -o result.svg -- python test.py I noticed rlottie-python spent much of the time on saving the rendered frames with Pillow. Perhaps saving rendered frames with Pillow is slower than gif2webp. Unfortunately, the rlottie library itself does not provide any function for saving frames to a file on the C side, so I am forced to use Pillow for this task. I could rewrite the whole project to use another binding library (such as nanobind) and write binding functions that save rendered frames in C++ instead of Python, but that is too much work for too small a benefit, and with that approach you could not manipulate the frames before saving.

two libraries that do the same thing

User may want to manipulate the frames read from lottie file with python before saving. With rlottie-python, the user eventually needs to run the 'slow' process of saving rendered frames using Pillow anyway. With pyrlottie, there is no way to read frames from lottie files directly: the user has to first save a file of rendered frames (e.g. file.gif) and then read that file back with Python, which is a big performance penalty. Hence, pyrlottie is better when just converting a lottie file to a raster image (e.g. a gif file), but worse for loading a lottie file into Python to manipulate frames.

Also, pyrlottie uses lottie2gif, and gif is effectively a lossy format here (each frame is limited to a 256-colour palette), meaning it is not possible to get the lossless rendered frames using pyrlottie.

more portable than pyrlottie

Speaking of portability, lottie (https://pypi.org/project/lottie/) is the winner, as it is a pure Python implementation (though this probably also means it is slower). It also has more functionality, such as supporting many input/output formats and vector graphics, but the rendered frames are sometimes buggy. I really hope that project can fix the buggy rendering...

btw1 I am curious why macOS build of pyrlottie is not available? lottie2gif is an executable produced by compiling rlottie, and gif2webp already has a macOS build. Is it because you don't have a Mac available? If so, I could help you compile (though this is not sustainable and you cannot trust me). However, the best way is to use GitHub Actions to compile rlottie when building the wheel (btw, storing precompiled binaries in the git repo is a bad idea; it is better to compile rlottie when building the wheel).

btw2 execution bit is lost not just in WSL, but also Linux in general. Maybe update the Readme about this? Or even better, check the permission and run the equivalent of chmod +x in your library.
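A minimal sketch of the permission check suggested above, assuming a POSIX system and that the library knows the path to its bundled binary. The function name ensure_executable is hypothetical and not part of pyrlottie.

```python
import os
import stat


def ensure_executable(path: str) -> bool:
    """Add the execute bits to `path` if they are missing (like chmod +x).

    Hypothetical helper. Returns True if the mode was changed, False if the
    file was already executable for the current user.
    """
    if os.access(path, os.X_OK):
        return False
    mode = os.stat(path).st_mode
    # Keep the existing permission bits and add execute for user/group/other
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return True
```

This would run once before the first subprocess call. On a read-only install location os.chmod raises PermissionError, so a real implementation would probably want to catch that and point the user at the Readme instead.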

FredHappyface commented 5 months ago

Hi so sorry it's taken so long to write a response. This one kinda fell down the back of the sofa on my todo list

So I've run the following test


import concurrent.futures
import multiprocessing
import time
from typing import Callable

from pyrlottie import FileMap, LottieFile, convMultLottie, convSingleLottie, run
from rlottie_python import LottieAnimation

def rlottie_py():
    anim = LottieAnimation.from_tgs("sample.tgs")
    anim.save_animation("test.gif")
    anim.lottie_animation_destroy()

def py_rlottie():
    run(convSingleLottie(LottieFile("sample.tgs"), {"test.gif"}))

def rlottie_py_mult():
    def convert_single_tgs():
        anim = LottieAnimation.from_tgs("sample.tgs")
        anim.save_animation("test.gif")
        anim.lottie_animation_destroy()

    with concurrent.futures.ThreadPoolExecutor(max_workers=multiprocessing.cpu_count()) as executor:
        # Using a set comprehension to submit tasks to the executor
        future_to_variable = {executor.submit(convert_single_tgs) for x in range(100)}

        # Wait for all tasks to complete and retrieve results
        for future in concurrent.futures.as_completed(future_to_variable):
            variable = future.result()

def py_rlottie_mult():
    run(
        convMultLottie(
            filemaps=[
                FileMap(
                    LottieFile("sample.tgs"),
                    {
                        "test.gif",
                    },
                )
                for x in range(100)
            ]
        )
    )

def timeit(fn: Callable):
    start_time = time.time()
    fn()
    end_time = time.time()
    print(f"{fn.__name__} time: {end_time - start_time} seconds")

timeit(rlottie_py)
timeit(py_rlottie)
timeit(rlottie_py_mult)
timeit(py_rlottie_mult)
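As a side note on measurement, here is a minimal sketch of a timing helper that uses time.perf_counter() (a monotonic, high-resolution clock intended for benchmarking) and keeps the fastest of several repeats to damp run-to-run noise. The name best_of and its repeats parameter are hypothetical; this is only an assumption about how the measurement could be tightened, not what was used in this thread.

```python
import time
from typing import Callable


def best_of(fn: Callable[[], object], repeats: int = 3) -> float:
    """Call fn `repeats` times and return the fastest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best
```

For example, print(f"{best_of(rlottie_py):.3f} seconds") would report the quickest of three conversions.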

The results over a few runs are:


poetry run py perf.py
rlottie_py time: 2.2334213256835938 seconds
py_rlottie time: 4.818990230560303 seconds
rlottie_py_mult time: 183.9050681591034 seconds
py_rlottie_mult time: 101.32033944129944 seconds

---

rlottie_py time: 3.886620283126831 seconds
py_rlottie time: 6.14373254776001 seconds
rlottie_py_mult time: 159.29676604270935 seconds
py_rlottie_mult time: 114.90005588531494 seconds

--

rlottie_py time: 2.1622235774993896 seconds
py_rlottie time: 4.655423879623413 seconds
rlottie_py_mult time: 150.51082158088684 seconds
py_rlottie_mult time: 116.90487170219421 seconds

Turns out the pickle library throws a temper tantrum when I use the ProcessPoolExecutor. I'll do some further experimentation with different numbers of threads, as I agree that this could have a significant impact on the results

I noticed rlottie-python spent much of the time on [saving the rendered frames with Pillow]

My understanding was (perhaps mistakenly) that pillow was pretty optimised when it comes to reading and writing of images. I'll definitely do some more research into this too

User may want to manipulate the frames read from lottie file with python before saving.

Ultimately I feel that the rlottie-python library provides many benefits over the pyrlottie lib, so if I can get the performance closer for my use-case, and hopefully contribute that knowledge back here then that'd be awesome. Having said that, giving users options is never a bad thing so I'd certainly never just drop support for a lib without a good migration period

why macOS build of pyrlottie is not available

Ultimately this is because I do not have a Mac, and attempts at running cmake haven't been particularly successful, so I've just been using the prebuilt binaries. If you'd be happy to contribute these then that'd be enormously appreciated

btw execution bit is lost not just in WSL, but also Linux in general.

Thanks for the heads up! I thought I'd squashed this bug!

laggykiller commented 5 months ago

Three problems with the testing script:

  1. Using threads instead of processes is much slower. The pickling issue can be solved by submitting rlottie_py to the executor instead of convert_single_tgs
  2. Writing to the same file may cause the processes to fight for a lock
  3. For some reason, calling rlottie_py and then rlottie_py_mult causes a deadlock. I need to investigate further... EDIT: I was testing on Linux; fixed by multiprocessing.set_start_method('spawn')
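A minimal sketch of why the first point fixes the pickling error: ProcessPoolExecutor pickles the submitted callable to send it to a worker, and functions are pickled by qualified name, so a function defined inside another function cannot be pickled. The names is_picklable, convert_one and make_nested are hypothetical stand-ins, not code from either library.

```python
import pickle


def is_picklable(obj: object) -> bool:
    """Return True if pickle can serialise obj, as a process pool would need."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


def convert_one(index: int) -> str:
    """Module-level stand-in for rlottie_py: workers can look it up by name."""
    return f"test_rlottie_python/{index}.gif"


def make_nested():
    """Return a nested function, like convert_single_tgs in the first script."""

    def convert_single() -> str:
        return "test.gif"

    return convert_single


if __name__ == "__main__":
    print("module-level:", is_picklable(convert_one))
    print("nested:", is_picklable(make_nested()))
```

Moving the worker function to module level, as the revised script below does with rlottie_py, avoids the problem without changing what it does.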

This is better:

import concurrent.futures
import multiprocessing
import os
import time
import timeit  # type: ignore
from typing import Any, Callable

from pyrlottie import FileMap, LottieFile, convMultLottie, convSingleLottie, run  # type: ignore
from rlottie_python import LottieAnimation

os.makedirs("test_rlottie_python", exist_ok=True)
os.makedirs("test_pyrlottie", exist_ok=True)

def rlottie_py(fname: str = "test"):
    anim = LottieAnimation.from_tgs("sample.tgs")
    anim.save_animation(f"test_rlottie_python/{fname}.gif")
    anim.lottie_animation_destroy()

def py_rlottie():
    run(convSingleLottie(LottieFile("sample.tgs"), {"test_pyrlottie/test.gif"}))

def rlottie_py_mult():
    with concurrent.futures.ProcessPoolExecutor(max_workers=int(multiprocessing.cpu_count())) as executor:
        # Using a set comprehension to submit tasks to the executor
        future_to_variable = {executor.submit(rlottie_py, str(i)) for i in range(100)}

        # Wait for all tasks to complete and retrieve results
        for future in concurrent.futures.as_completed(future_to_variable):
            variable = future.result()

def py_rlottie_mult():
    run(
        convMultLottie(
            filemaps=[
                FileMap(
                    LottieFile("sample.tgs"),
                    {
                        f"test_pyrlottie/{i}.gif",
                    },
                )
                for i in range(100)
            ]
        )
    )

def timeit(fn: Callable[..., Any]):
    start_time = time.time()
    fn()
    end_time = time.time()
    print(f"{fn.__name__} time: {end_time - start_time} seconds")

if __name__ == "__main__":
    multiprocessing.set_start_method("spawn")

    timeit(rlottie_py)
    timeit(py_rlottie)
    timeit(rlottie_py_mult)
    timeit(py_rlottie_mult)

The result without modification:

# Arch Linux, 16 core Gen 8 Intel desktop
rlottie_py_mult time: 32.663668155670166 seconds
py_rlottie_mult time: 5.823666095733643 seconds
# Windows 11, 8 core Gen 8 Intel mobile
rlottie_py_mult time: 42.006444215774536 seconds
py_rlottie_mult time: 18.870269536972046 seconds

The result of using processes instead of threads:

# Arch Linux, 16 core Gen 8 Intel desktop
rlottie_py_mult time: 11.285428285598755 seconds
py_rlottie_mult time: 5.824993133544922 seconds
# Windows 11, 8 core Gen 8 Intel mobile
rlottie_py_mult time: 25.225178003311157 seconds
py_rlottie_mult time: 19.338806629180908 seconds

rlottie-python still loses, but it is already much faster.


Here is the result of running py-spy record -o result.svg -- python test.py (see my previous comment for the content of test.py)

[Image: result (py-spy flame graph)]

Interactive version (download and unzip for the svg, then open it in a browser): result.zip

As mentioned, Pillow takes a long time to save the gif.

What if we save with pyav?

from typing import cast  # needed for the cast() call below

import av
import numpy as np
from av.video.stream import VideoStream

...

def rlottie_py(fname: str = "test"):
    anim = LottieAnimation.from_tgs("sample.tgs")
    frames = anim.lottie_animation_get_totalframe()
    fps = anim.lottie_animation_get_framerate()
    width, height = anim.lottie_animation_get_size()

    options = {
        "loop": "0"
    }

    with av.open(f"test_rlottie_python/{fname}.gif", "w", format="gif") as output:
        out_stream = output.add_stream("gif", rate=fps, options=options)
        out_stream = cast(VideoStream, out_stream)
        out_stream.pix_fmt = "rgb8"

        for i in range(frames):
            buffer = anim.lottie_animation_render(i)
            # Rows come first: (height, width, channels)
            frame = np.frombuffer(buffer, dtype=np.uint8).reshape((height, width, 4))
            av_frame = av.VideoFrame.from_ndarray(frame, format="bgra")
            output.mux(out_stream.encode(av_frame))
        output.mux(out_stream.encode())

    anim.lottie_animation_destroy()

The result:

# Arch Linux, 16 core Gen 8 Intel desktop
rlottie_py_mult time: 6.875451564788818 seconds
py_rlottie_mult time: 5.936105728149414 seconds
# Windows 11, 8 core Gen 8 Intel mobile
rlottie_py_mult time: 17.615680694580078 seconds
py_rlottie_mult time: 17.997378826141357 seconds

Now rlottie-python is off by just 1 second on Arch Linux, and is even winning by a bit on Windows!


btw the times when using multiprocessing.cpu_count() and int(multiprocessing.cpu_count() / 2) are similar

FredHappyface commented 5 months ago

Thank you so much for your help on this. With these optimisations I'm struggling to see what place pyrlottie really has. One option might be to provide a few helper methods for when the user wants to blindly convert from a source format to a destination format, e.g. tgs to webp

With a 512x512 source image

Using ProcessPoolExecutor I got

rlottie_py time: 1.117741584777832 seconds
py_rlottie time: 4.003070592880249 seconds
rlottie_py_mult time: 52.18350124359131 seconds
py_rlottie_mult time: 202.2332363128662 seconds

Note: no idea why the py_rlottie_mult time was so terrible for this run. Possibly Windows Defender not really liking me spawning random exes?

Using ProcessPoolExecutor and pyav I got

rlottie_py time: 0.986565351486206 seconds
py_rlottie time: 4.540838956832886 seconds
rlottie_py_mult time: 45.67705249786377 seconds

Just a note: there was a minor bug in the pyav code. It seems out_stream.width and out_stream.height need setting explicitly

with av.open(f"test_rlottie_python/{fname}.gif", "w", format="gif") as output:
    out_stream = output.add_stream("gif", rate=fps, options=options)
    out_stream = cast(VideoStream, out_stream)
    out_stream.width = width
    out_stream.height = height

laggykiller commented 5 months ago

Despite the better performance, I am not planning to add pyav-related code to rlottie-python:

I think using Pillow to save animations is good enough for this project. Providing a function for saving to file is a small bonus feature; save_animation() aims to be a minimalistic function that 'just works' and is 'good enough' for most use cases without intervention. If users want more performance / better quality / another file format, they are free to choose their own method of saving in their own code, or even just ditch Python and use the C/C++ interface of rlottie directly.

FredHappyface commented 5 months ago

Makes perfect sense tbh. Plus I've just learned that pyav doesn't support webp which caught me out somewhat

Thanks for your time on this! 🙂

laggykiller commented 5 months ago

pyav doesn't support webp

It supports encoding webp but not decoding webp.

If you'd be happy to contribute these then that'd be enormously appreciated

Opened a PR: https://github.com/FHPythonUtils/PyRlottie/pull/5


Since the performance issue is in Pillow, not in the binding code to rlottie, I am closing this.