opencv / opencv

Open Source Computer Vision Library
https://opencv.org
Apache License 2.0

V4D - A High performance visualization module with GUI #22923

Open kallaballa opened 1 year ago

kallaballa commented 1 year ago

Describe the feature and motivation

I am writing a high-performance visualization module. It should support OpenGL, 2D vector graphics (through NanoVG), a GUI system (based on ImGui) and hardware acceleration wherever possible (e.g. OpenGL, CL-GL interop and CL-VA interop). Drawing on top of streaming video should also be supported. It must support WebAssembly. I would also like to support Vulkan, DirectX and Metal - through ANGLE.

I have released an alpha version of the module and the documentation.

At the moment it depends on pull requests that haven't been merged yet, but it could work without them, just slower: #22780 and #22704

What do you think?

kallaballa commented 1 year ago

I made a rather fundamental architectural change: every context (fb, gl, nvg, ...) has its own offscreen window and copies back and forth to the main (on-screen) framebuffer using interop. I can't observe any performance hit, but I suspect some added latency (which is absolutely no priority). The really interesting part is what happens with graphics APIs other than OpenGL... for example, that way I could implement a makeshift (but fast!) Vulkan<->OpenGL<->OpenCL interop bridge.

It would look something like this:

v4d->vk([](cv::Size sz) {
    //do vulkan things
});
v4d->fb([](cv::UMat& fb){
    //access the result of the vulkan context using the interop bridge
});

The same might work with Metal, but I haven't really looked into it yet.

kallaballa commented 1 year ago

But is there any interest in vulkan support?

kallaballa commented 1 year ago

Viz2D only depends on core modules, while the samples depend on additional contrib modules. How and when should I enable building the samples and introduce the dependencies?

Something like this: https://github.com/opencv/opencv_contrib/blob/298fc7be958c10be8a8e807cf9a0dde512c13af5/modules/ovis/CMakeLists.txt#L27

Done.

I integrated https://github.com/bloomen/cxxpool which is an MIT-licensed, header-only thread pool implementation. Is that alright?

OpenCV already has its own thread pool (cv::parallel_for_); it is better to reuse it if possible. Adding more threads can result in conflicts, oversubscription and performance degradation.
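For reference, the usual cv::parallel_for_ pattern splits a range across OpenCV's internal pool (a minimal sketch, not V4D code):

#include <opencv2/core.hpp>
#include <opencv2/core/utility.hpp>

void processRows(cv::Mat& img) {
    //The body is invoked concurrently on sub-ranges of rows.
    cv::parallel_for_(cv::Range(0, img.rows), [&](const cv::Range& range) {
        for (int r = range.start; r < range.end; ++r) {
            //... process row r of img ...
        }
    });
}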

I'd love to reuse something from OpenCV, but I couldn't find anything that fits my pattern. I use the pool to allocate threads for parallel capture/write while rendering/processing. https://github.com/kallaballa/V4D/blob/a5fb870c96f6da077f5e94744a669516e7cebde8/modules/v4d/src/v4d.cpp#L345-L367

What kind of tests should I write? Performance tests?

Some basic accuracy tests verifying module functionality, e.g. creating windows, showing images, etc. Performance tests can be added in case you expect some functions to have an optimized implementation, or for comparing various HW/SW configurations.

Well, I have basic tests like displaying images, and I also have more complex demos that could be turned into performance tests.
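A basic accuracy test might look roughly like this (a sketch following OpenCV's test conventions and the fb call shown earlier in this thread; the exact test-suite setup is an assumption):

#include <opencv2/v4d/v4d.hpp>
#include <opencv2/ts.hpp>

namespace opencv_test { namespace {

TEST(V4D, display_image_smoke)
{
    //Create a small window and push a solid-color image through the framebuffer context.
    cv::Ptr<cv::v4d::V4D> window = cv::v4d::V4D::make(320, 240, "test");
    cv::UMat img(240, 320, CV_8UC4, cv::Scalar::all(127));
    EXPECT_NO_THROW(window->fb([&img](cv::UMat& framebuffer) {
        img.copyTo(framebuffer);
    }));
}

}} // namespace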

kallaballa commented 1 year ago

Implemented multi-context support for GLFW+Emscripten. So for the time being, WebAssembly support is only available with my fork until the PR is through.

kallaballa commented 1 year ago

I embedded the WASM builds of all examples and demos (except the beauty-demo) into the Doxygen documentation. There is still a lot to do, but it's fun having live applications for the tutorial code. :)

Please note that some demos have a "Start capture" button which starts a video feed from your camera.

kallaballa commented 1 year ago

I've spent the weekend optimizing rendering performance, which was especially interesting because I am targeting three APIs.

Though my success was limited (compared to my expectations), I would like to share my findings.

The two things I was looking to optimize were OpenCV's access to the framebuffer and inter-OpenGL-context sharing. At the moment there are two mechanisms in place to achieve framebuffer access: CL-GL interop and glReadPixels/glTexSubImage2D. CL-GL interop is really fast (also because I optimized it a bit) but not available on all systems. So I looked into the glReadPixels/glTexSubImage2D alternative and decided to implement PBO downloading. Basically this is a kind of asynchronous download that trades latency for throughput. OpenGL inter-context sharing is required to support several rendering APIs side by side. Also, even if just one API is used, framebuffer access requires certain GL states that might conflict with that API's own state, so this is really important for V4D.
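For reference, a classic double-buffered PBO readback looks roughly like this (a generic OpenGL sketch under assumed dimensions and format, not V4D's actual implementation):

#include <opencv2/core.hpp>
#include <GL/glew.h> //or whatever GL loader is in use

//Starts an asynchronous readback into one PBO while mapping the PBO filled
//during the previous frame - the pixels arrive one frame late.
void pboDownload(int width, int height, cv::Mat& out) {
    static GLuint pbo[2] = {0, 0};
    static int index = 0;
    const size_t size = size_t(width) * height * 4;
    if (pbo[0] == 0) {
        glGenBuffers(2, pbo);
        for (int i = 0; i < 2; ++i) {
            glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[i]);
            glBufferData(GL_PIXEL_PACK_BUFFER, size, nullptr, GL_STREAM_READ);
        }
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[index]);
    glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, nullptr); //returns immediately
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[1 - index]);
    if (void* ptr = glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, size, GL_MAP_READ_BIT)) {
        cv::Mat(height, width, CV_8UC4, ptr).copyTo(out); //copy out the previous frame
        glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    index = 1 - index;
}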

PBO download findings

OpenGL context sharing findings

kallaballa commented 1 year ago

I am still investigating ways to share context data (like virtual-webgl), which would make things much easier, but I have another solution if performance is really needed: in WASM mode I could turn off context separation entirely and advise the user to either use only one API (e.g. NanoVG) or to do manual GL state management. That would still not yield performance comparable to a desktop build, but it should be much faster and portable (if you turn off context separation on desktop too).

Anyway, I'm also doing a bit of research into the future of graphics APIs, and I think I should consider WebGPU support soon (rather than Vulkan). That said, I couldn't find any resources on the possibility of WebGPU-OpenGL interop - which would make things really easy.

kallaballa commented 1 year ago

I found a workaround based on virtual-webgl, and the minimal OpenGL example runs more than 3x faster :).

kallaballa commented 1 year ago

Fixed the occasional crash. Please don't forget to cache-reload.

kallaballa commented 1 year ago

I implemented a (still crude and a bit buggy) camera capturing interface that has very little impact on the overall FPS. That makes most or even all (depending on your machine) examples and demos actually usable. Apart from an input-event-handling API and a lot of polishing and docs, this was the last big task on the way to a beta version :)

kallaballa commented 1 year ago

Check it out: https://viel-zu.org/opencv/doxygen/html/da/d2b/v4d_optflow.html (Please don't forget to cache-reload. :))

kallaballa commented 1 year ago

I replaced the CPU-copy based camera capture code with a GPU-copy based one (using texImage2D on an HTML video element) for a little performance boost - because in WASM we have to download/upload at some point (at least most of the time) to be able to do something with the data.

kallaballa commented 1 year ago

I found time to rewrite part of the WebGL-specific implementation. I used to rely on virtual-webgl2 to provide inter-context resource sharing, but it proved not to be equipped for more complex cases. So I instead opted to copy the color buffer off a canvas (of one context) using texImage2D (into another context) - which is a GPU copy. Additionally, I needed to blend textures using shaders.
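The GPU copy hinges on the fact that WebGL's texImage2D accepts an HTMLCanvasElement as source. A hypothetical Emscripten sketch of the idea (the canvas id, the texture-registry access and the function itself are assumptions, not V4D's actual code):

#include <emscripten.h>

//Copies the color buffer of a source canvas into a texture of the current context.
void copyCanvasToTexture(int textureId) {
    EM_ASM({
        var srcCanvas = document.getElementById('v4d-offscreen'); //assumed canvas id
        var gl = GL.currentContext.GLctx; //destination WebGL2 context
        gl.bindTexture(gl.TEXTURE_2D, GL.textures[$0]);
        gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, gl.RGBA, gl.UNSIGNED_BYTE, srcCanvas);
    }, textureId);
}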

The demos are still a bit glitchy, but they work, and I finally added the beauty-demo.

kallaballa commented 12 months ago

I fixed the glitches, with only one problem remaining: on some system configurations, rendering above 60 FPS results in flickering. Does this demo flicker on your system configuration? If so, would you post your configuration?

kallaballa commented 11 months ago

Fixed the flicker at the cost of FPS (for webkit).

kallaballa commented 11 months ago

Fixed some glitches and optimized performance, e.g. the cube-demo went from about 130 FPS to around 200 FPS.

Just for future reference:

CPU: Intel i7-1160G7
GPU: Intel Tiger Lake-UP4 GT2 [Iris Xe Graphics] (rev 01)
Kernel: 6.3.6-1-default
Display Server: X.Org X Server 1.21.1.8
Browser: Chromium Version 114.0.5735.106 (openSUSE Build) stable (64-bit)

kallaballa commented 11 months ago

I got OpenGL 3.2 Core, OpenGL ES 3.0 and WebGL2 working at the same time for all examples and demos. Additionally, I ran a broader browser test for the first time (only up-to-date browsers on different platforms). WebKit is the fastest and works equally well across browsers and operating systems. On Linux I found performance problems with Firefox that are due to an inefficiency in their texImage2D call for HTML5 canvases. I believe I can work around it, but I am going to file a bug anyway. Safari doesn't work... Anyway, the big surprise for me was performance on mobile devices - it's unexpectedly good (e.g. on my Xiaomi Redmi Note 8 Pro).

kallaballa commented 11 months ago

Updated the documentation (especially the native build instructions).

kallaballa commented 11 months ago

I think I've solved all the big challenges and I am closing in on feature completeness (at least for a first release). Apart from fixing bugs, the only major things missing are a simple callback-based input-event system and complete, up-to-date API documentation. Anyway, I think I am finished with the tutorials and demos. They consist of a short description, a WASM live-demo and commented source code.

The tutorials and the demos are designed to be read one after the other, to give you a good overview of the key concepts of V4D. The demos show how to use V4D to the fullest and create programs that run mostly (the part that matters) on the GPU (when driver capabilities allow it). They are also a good starting point for your own applications, because they touch many key aspects and algorithms of OpenCV.

Feedback is very welcome!

kallaballa commented 11 months ago

Btw., because I was interested in compiling the demos right from the docs in the browser, I had a look at emception. Really interesting, but also very slow... too slow. Anyway, I'll keep an eye on it. Wouldn't it be cool to be able to edit and compile C++ examples in the docs, right in place?

kallaballa commented 11 months ago

I made an alpha release: Debian packages, an installer and a binary tarball.

have fun :)

kallaballa commented 10 months ago

I took time to familiarize myself with the future of extensions/technologies/APIs pertaining to V4D, and I think the best route to a cross-platform solution would be to put V4D in GLES3 mode and let it run atop ANGLE. GLES3 is also the requirement for Dawn, which would provide a WebGPU implementation. WebGPU interop is implemented as a canvas-copy operation (texImage2D on an HTMLCanvasElement without a CPU round-trip). The same could work for WebNN interop. Also, I found cl_intel_unified_shared_memory.

kallaballa commented 9 months ago

I half-way integrated Dawn for WebGPU support, just to realize it's not there yet for my use case - but interop works.

kallaballa commented 8 months ago

I implemented a new demo that shows how to use multiple independent OpenGL contexts/states for rendering: https://viel-zu.org/opencv/doxygen/html/d4/d4a/v4d_many_cubes.html

Also, I have a new machine to test on, with a 13th Gen Intel CPU and an NVIDIA 4070 Ti GPU. I started to test and optimize for those platforms too; especially NVIDIA support should be much better now.

kallaballa commented 8 months ago

I've had quite my share of problems with NanoGUI. NanoVG alone works really well. So I am going to take a step back and replace NanoGUI/NanoVG with bgfx/imgui, which supports rendering via NanoVG as well.

kallaballa commented 8 months ago

Won't work. I had to go through the source code to find out that custom backbuffers aren't supported in the GL renderer. So... NanoVG/imgui is the next try.

kallaballa commented 8 months ago

That's a much nicer mix of APIs! (screenshot of a visualization with a GUI) I have the basics working :)

kallaballa commented 8 months ago

A preview of the new GUI. Originally in 1080p.

https://github.com/opencv/opencv/assets/287266/3eff9258-7f93-47ba-8bcb-975678eefcbb

kallaballa commented 8 months ago

I found time to fix up the web demos. They are stable and (relatively) fast now (still slower on Firefox). More interesting, though, were my native-binary tests with multiple contexts. I took the Many_Cubes demo and tested performance with different numbers of contexts.

1 cube -> ~300 FPS
10 cubes -> ~150 FPS
100 cubes -> ~130 FPS
1000 cubes -> ~12 FPS

Please note that each of the cubes belongs to an independent GL-context/window.

kallaballa commented 8 months ago

I added a new example to the tutorial series: example_v4d_display_image_nvg. It uses NanoVG to load, transform and blit an image.

Also, I'm done porting all examples and demos, and I invested some time in performance optimization and polishing. Now is a really good time to check them out!

kallaballa commented 8 months ago

I created a mobile friendly design for the documentation based on doxygen-bootstrapped.

https://viel-zu.org/opencv/doxygen/html/d7/dfc/v4d.html

Enjoy!

kallaballa commented 8 months ago

I made a new alpha-release: https://github.com/kallaballa/V4D/releases/tag/0.0.6-alpha

kallaballa commented 8 months ago

I implemented partial context allocation, which means you can choose which subsystems will be allocated... if you don't need NanoVG, don't allocate it. Annnnnd... I implemented a simple form of pipelining. It is fascinating how differently hardware & workload combinations react to the pipelining: for some demos it yields great gains, while for others it is even slightly detrimental. You can now add workers by specifying their number in the V4D::run call.
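In code that could look like this (a sketch: the NANOVG and ALL flags appear in samples later in this thread, but the exact flag combinations and 'MyPlan' are assumptions):

//Allocate only the NanoVG subsystem instead of everything (ALL).
cv::Ptr<V4D> window = V4D::make(1280, 720, "My App", NANOVG);
//Run the plan with 2 extra pipeline workers (3 threads in total).
window->run<MyPlan>(2);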

kallaballa commented 8 months ago

My recent experiments with pipelining led me to the decision that I want to support easy interoperability with G-API, in three steps:

  1. Implement a Source and Sink for G-API streams
  2. Implement a V4D-G-API context where you can define/pass your graph and have it interact with other V4D-contexts
  3. Use G-API as an execution backend for all V4D tasks.

Points 1 and 2 are pretty easy, and I've already prototyped them. Point 3 is tricky, but I've made some promising experiments and am starting to understand how to do it properly. However, for point 3 I only want to support linear graph creation, and I won't tackle it before release 1.0.
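For point 1, a source is typically implemented against G-API's public streaming interface (a sketch; 'grabNextFrame' is a hypothetical stand-in for V4D's frame producer):

#include <opencv2/gapi.hpp>
#include <opencv2/gapi/streaming/source.hpp>

bool grabNextFrame(cv::Mat& out); //hypothetical: fills 'out' with the next V4D frame

class V4DSource final : public cv::gapi::wip::IStreamSource {
    cv::Mat frame_;
public:
    bool pull(cv::gapi::wip::Data& data) override {
        if (!grabNextFrame(frame_))
            return false; //end of stream
        data = frame_;
        return true;
    }
    cv::GMetaArg descr_of() const override {
        return cv::GMetaArg{cv::descr_of(frame_)};
    }
};

Such a source would then be handed to a compiled streaming graph via cv::gapi::wip::make_src<V4DSource>().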

kallaballa commented 8 months ago

I investigated platform support through ANGLE, and it doesn't look good: no CL-GL interop, problems with multi-threading, and more. If I'm going to implement it, it will just be an optional fallback.

Also... I ported all examples & demos to have primitive pipelining support. Not all of them benefit from it, so it is turned off by default... But the beefy ones like the beauty-demo benefit greatly!

kallaballa commented 8 months ago

Btw., pipelining currently doesn't work for WebAssembly builds, and I doubt it would be beneficial, because the WebAssembly builds spend the majority of their time in the main thread to be able to access browser resources (WebGL2, video capture, etc.).

kallaballa commented 7 months ago

Ever asked yourself what OpenGL multi-context rendering can do for you? If you get some details right (like flushing or finishing at the right time), it can do a lot performance-wise! In this video I show how rendering the same code (with context-local GL objects) on multiple contexts behaves performance- and GPU-usage-wise with different numbers of contexts.

First run: 1 cube = 1 context
Second run: 10 cubes = 10 contexts
Third run: 300 cubes = 300 contexts

Please note that this demonstration doesn't use any kind of multi-threading to enhance OpenGL rendering performance. All GL calls happen in the same thread.

https://github.com/opencv/opencv/assets/287266/30593038-d0a6-4af6-a450-7da01399f5db

kallaballa commented 7 months ago

You can find the source for the many_cubes-demo here. But beware: much has changed API-wise - more about that in my next post.

kallaballa commented 7 months ago

OK, here's the new API. The idea behind it is to enable parallel/graph execution of context calls (which run invocables) in a safe and fast manner, without changing the overall API too much and with little or no boilerplate compared to the raw algorithm. This is done by extending class Plan (which represents a serialized graph definition). So instead of executing a call right away, we store the invocable and references to its arguments. Storing references instead of values is important for automatic inference. Later on, G-API should be supported as a backend.

The five foremost API-rules:

  1. All code to be inferred must be executed through a context call (except in the setup and teardown phases, which may make calls without one).
  2. The invocable (which is passed to all context calls) must be a stateless lambda. That means you can't capture anything, not even by value, except for implicit captures.
  3. Limit implicit captures to constants and thread_locals. If you must access a non-thread-local resource, make sure it is thread-safe.
  4. Since we only store references to the arguments, the lifetime of the arguments must be coupled to the lifetime of the Plan.
  5. As previously mentioned, the lambda may take arguments. Carefully evaluate the constness of the arguments, because const arguments are treated as READ (in-edge) and non-const arguments as READ-WRITE (in/out-edge).

Some of these and other rules are already checked at compile time (e.g. the stateless-lambda requirement) and some are merely realized by convention. The compile-time checks are currently implemented using enable_if, static_assert, etc., but should eventually be implemented using C++ constraints and concepts.

A commented example:

#include <opencv2/v4d/v4d.hpp>
#include <opencv2/imgcodecs.hpp>

using namespace cv;
using namespace cv::v4d;

//We need to extend class Plan. It has several virtual functions we could override, but in our case we only need 'setup' and 'infer'.
class DisplayImageFB : public Plan {
    //Used as arguments in context calls. Being members of the Plan, their lifetime is automatically bound to it.
    UMat image_;
    UMat converted_;
public:
    //setup-phase: inferred and executed only once.
    void setup(cv::Ptr<V4D> win) override {
        //A parallel context can run in parallel in different threads, as opposed to a single context, which is globally mutexed.
        //Apart from that, parallel doesn't do any special context setup.
        //As you can see, some lambda arguments are declared const to mark them as in-edges in the graph.
        win->parallel([](cv::UMat& image, cv::UMat& converted, const cv::Size& sz){
            //Loads an image as a UMat (just in case we have hardware acceleration available)
            image = imread(samples::findFile("lena.jpg")).getUMat(ACCESS_READ);
            //We have to manually resize to framebuffer size
            resize(image, converted, sz);
            cvtColor(converted, converted, COLOR_RGB2BGRA);
        }, image_, converted_, win->fbSize()); // this is where the references are passed. Again, everything passed here must have a lifetime longer than the execution of the plan
    }

    //infer-phase: inferred and executed in a loop
    void infer(Ptr<V4D> win) override {
        //Create an fb context call and copy the prepared image to the framebuffer.
        win->fb([](UMat& framebuffer, const cv::UMat& c){
            c.copyTo(framebuffer);
        }, converted_);
    }
};

int main() {
    //Creates a V4D object
    Ptr<V4D> window = V4D::make(960, 960, "Display an Image through direct FB access");
    //Runs the plan using in total 1 thread (= 0 extra threads).
    window->run<DisplayImageFB>(0);
}

kallaballa commented 7 months ago

For now you have to browse the other samples to see all context calls in action. A relatively simple sample of how to do conditional branching in the serialized graph is the font-demo.

kallaballa commented 7 months ago

Pipelining performance demo. In this video I demonstrate how adding pipeline workers enhances performance.

First run: 1 thread
Second run: 2 threads
Third run: 3 threads

source code

https://github.com/opencv/opencv/assets/287266/54f69e33-fb48-4bae-bd33-d6d848feb429

kallaballa commented 7 months ago

And what if I want to combine multiple demos into one large graph?

https://github.com/opencv/opencv/assets/287266/0fffc39d-7b90-4bf1-b68d-cc20af2fed0d

The code (which isn't very elegant yet):

int v4d_cube_main();
int v4d_many_cubes_main();
int v4d_video_main(int argc, char **argv);
int v4d_nanovg_main(int argc, char **argv);
#define main v4d_cube_main
#include "cube-demo.cpp"
#undef main
#define main v4d_many_cubes_main
#include "many_cubes-demo.cpp"
#undef main
#define main v4d_video_main
#include "video-demo.cpp"
#undef main
#define main v4d_nanovg_main
#include "nanovg-demo.cpp"
#undef main

class MontageDemoPlan : public Plan {
    CubeDemoPlan cubePlan_;
    ManyCubesDemoPlan manyCubesPlan_;
    VideoDemoPlan videoPlan_;
    NanoVGDemoPlan nanovgPlan_;

    cv::UMat cube_;
    cv::UMat many_cubes_;
    cv::UMat video_;
    cv::UMat nanovg_;
public:
    virtual void setup(cv::Ptr<V4D> window) override {
        cubePlan_.width_ = 1920;
        cubePlan_.height_ = 1080;
        manyCubesPlan_.width_ = 1920;
        manyCubesPlan_.height_ = 1080;
        videoPlan_.width_ = 1920;
        videoPlan_.height_ = 1080;
        nanovgPlan_.width_ = 1920;
        nanovgPlan_.height_ = 1080;

        cubePlan_.setup(window);
        manyCubesPlan_.setup(window);
        videoPlan_.setup(window);
        nanovgPlan_.setup(window);
    }

    virtual void infer(cv::Ptr<V4D> window) override {
        cubePlan_.infer(window);
        window->fb([](const cv::UMat& framebuffer, cv::UMat& cube){
            framebuffer.copyTo(cube);
        }, cube_);

        manyCubesPlan_.infer(window);
        window->fb([](const cv::UMat& framebuffer, cv::UMat& many_cubes){
            framebuffer.copyTo(many_cubes);
        }, many_cubes_);

        videoPlan_.infer(window);
        window->fb([](cv::UMat& framebuffer, cv::UMat& video){
            framebuffer.copyTo(video);
        }, video_);

        nanovgPlan_.infer(window);
        window->fb([](cv::UMat& framebuffer, cv::UMat& cube, cv::UMat& many_cubes, cv::UMat& video, cv::UMat& nanovg){
            framebuffer.copyTo(nanovg);
            cv::resize(cube, framebuffer(cv::Rect(0, 0, 960, 540)), cv::Size(960, 540));
            cv::resize(many_cubes, framebuffer(cv::Rect(960, 0, 960, 540)), cv::Size(960, 540));
            cv::resize(video, framebuffer(cv::Rect(0, 540, 960, 540)), cv::Size(960, 540));
            cv::resize(nanovg, framebuffer(cv::Rect(960, 540, 960, 540)), cv::Size(960, 540));
        }, cube_, many_cubes_, video_, nanovg_);

    }

    virtual void teardown(cv::Ptr<V4D> window) override {
        cubePlan_.teardown(window);
        manyCubesPlan_.teardown(window);
        videoPlan_.teardown(window);
        nanovgPlan_.teardown(window);
    }
};

int main(int argc, char** argv) {
#ifndef __EMSCRIPTEN__
    if (argc != 2) {
        cerr << "Usage: montage-demo <video-file>" << endl;
        exit(1);
    }
    constexpr double FPS = 60;
    constexpr const char* OUTPUT_FILENAME = "montage-demo.mkv";
#else
    CV_UNUSED(argc);
    CV_UNUSED(argv);
#endif
    using namespace cv::v4d;
    cv::Ptr<MontageDemoPlan> plan = new MontageDemoPlan();
    cv::Ptr<V4D> window = V4D::make(1920, 1080, "Montage Demo", ALL);
#ifndef __EMSCRIPTEN__
    //Creates a source from a file or a device
    auto src = makeCaptureSource(window, argv[1]);
    window->setSource(src);
    //Creates a writer sink (which might be hardware accelerated)
    auto sink = makeWriterSink(window, OUTPUT_FILENAME, FPS, cv::Size(1920, 1080));
    window->setSink(sink);
#endif
    window->run(plan, 0);

    return 0;
}

kallaballa commented 7 months ago

A bit of profiling and thinking made me take the following decision on choosing the number of workers (and, later on, the maximum number of parallel tasks instead). Not passing the number-of-workers parameter defaults to -1, which instructs V4D to choose the number automatically. At the moment that will always be one extra worker, which is equal to passing 1 to run, because:

  1. If capture and/or write are involved, which are pretty slow atomic operations, pipelining will lead to parallel capture/write in 2 threads - just not at the same time. Therefore, substantial gain is to be expected in that case.
  2. If adding an extra worker doesn't yield significant gain, there is at most the overhead of one worker.
  3. This will in many cases not produce the optimal gain, but I'll settle for good gain, for now.

Passing 0 means no extra workers = 1 thread in total.
Passing N>0 means N extra workers = N+1 threads in total.
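Spelled out with the run call used throughout this thread ('MyPlan' being any Plan subclass):

window->run<MyPlan>();   //workers = -1: V4D decides; currently one extra worker, i.e. 2 threads
window->run<MyPlan>(0);  //no extra workers: 1 thread in total
window->run<MyPlan>(3);  //3 extra workers: 4 threads in total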

kallaballa commented 7 months ago

All demos combined into one big graph doing multi-context rendering and pipelining at the same time. There is still a lot to do, but it works :)

source

https://github.com/opencv/opencv/assets/287266/5782552a-acad-4d8c-a85b-68bfcd9d42b5

kallaballa commented 7 months ago

Simplified and polished a bit.

source code

Also made a video showing performance with and without pipelining.

https://github.com/opencv/opencv/assets/287266/241d251d-3341-49ae-83c1-42fef185691a

kallaballa commented 7 months ago

I wondered what combination of techniques would yield the best performance for rendering 10000 visible cubes at the same time. The short answer: minimal pipelining + multi-context rendering + rendering multiple cubes per context yields the best result. The exact values for the number of threads and contexts matter a lot, performance-wise.

My PC can render 10000 individually rotating cubes at ~220 FPS with the following code (note: the code is still buggy):

#include <opencv2/v4d/v4d.hpp>

using namespace cv::v4d;
class ManyCubesDemoPlan : public Plan {
public:
    /* Demo Parameters */
    constexpr static size_t NUM_WORKERS_ = 1;
    constexpr static size_t NUM_CONTEXTS_ = 32;
    constexpr static size_t MAX_NUM_CUBES_ = 20000 / (NUM_WORKERS_ + 1);
    static inline const std::string subtitlesFile_ = "sub.txt"; //currently unused

    /* OpenGL constants and variables */
    constexpr static GLuint TRIANGLES_ = 12;
    constexpr static GLuint VERTICES_INDEX_ = 0;
    constexpr static GLuint COLORS_INDEX_ = 1;

    //Cube vertices, colors and indices
    constexpr static float VERTICES_[24] = {
        // Front face
        0.5, 0.5, 0.5, -0.5, 0.5, 0.5, -0.5, -0.5, 0.5, 0.5, -0.5, 0.5,
        // Back face
        0.5, 0.5, -0.5, -0.5, 0.5, -0.5, -0.5, -0.5, -0.5, 0.5, -0.5, -0.5
    };

    constexpr static float VERTEX_COLORS_[24] = {
            0.0, 0.1, 0.7,
            0.1, 0.2, 0.7,
            0.2, 0.3, 0.7,
            0.3, 0.4, 0.7,
            0.2, 0.3, 0.7,
            0.1, 0.2, 0.7,
            0.1, 0.1, 0.7,
            0.0, 0.1, 0.7
    };

    constexpr static unsigned short TRIANGLE_INDICES_[36] = {
        // Front
        0, 1, 2, 2, 3, 0,

        // Right
        0, 3, 7, 7, 4, 0,

        // Bottom
        2, 6, 7, 7, 3, 2,

        // Left
        1, 5, 6, 6, 2, 1,

        // Back
        4, 7, 6, 6, 5, 4,

        // Top
        5, 1, 0, 0, 4, 5
    };
private:
    std::vector<GLuint> vao_;
    std::vector<GLuint> shaderProgram_;
    std::vector<GLuint> uniformTransform_;

    //Simple transform & pass-through shaders
    static GLuint load_shader() {
#if !defined(OPENCV_V4D_USE_ES3)
        const string shaderVersion = "330";
#else
        const string shaderVersion = "300 es";
#endif

        const string vert =
                "    #version " + shaderVersion
                        + R"(
        precision lowp float;
        layout(location = 0) in vec3 pos;
        layout(location = 1) in vec3 vertex_color;

        uniform mat4 transform;

        out vec3 color;
        void main() {
          gl_Position = transform * vec4(pos, 1.0);
          color = vertex_color;
        }
    )";

        const string frag =
                "    #version " + shaderVersion
                        + R"(
        precision lowp float;
        in vec3 color;

        out vec4 frag_color;

        void main() {
          frag_color = vec4(color, 1.0);
        }
    )";

        //Initializes the shaders and returns the program
        return cv::v4d::initShader(vert.c_str(), frag.c_str(), "frag_color");
    }

    //Initializes objects, buffers, shaders and uniforms
    static void init_scene(const cv::Size& sz, GLuint& vao, GLuint& shaderProgram, GLuint& uniformTransform) {
        glEnable (GL_DEPTH_TEST);

        glGenVertexArrays(1, &vao);
        glBindVertexArray(vao);

        unsigned int triangles_ebo;
        glGenBuffers(1, &triangles_ebo);
        glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, triangles_ebo);
        glBufferData(GL_ELEMENT_ARRAY_BUFFER, sizeof TRIANGLE_INDICES_, TRIANGLE_INDICES_,
                GL_STATIC_DRAW);

        unsigned int vertices_vbo;
        glGenBuffers(1, &vertices_vbo);
        glBindBuffer(GL_ARRAY_BUFFER, vertices_vbo);
        glBufferData(GL_ARRAY_BUFFER, sizeof VERTICES_, VERTICES_, GL_STATIC_DRAW);

        glVertexAttribPointer(VERTICES_INDEX_, 3, GL_FLOAT, GL_FALSE, 0, NULL);
        glEnableVertexAttribArray(VERTICES_INDEX_);

        unsigned int colors_vbo;
        glGenBuffers(1, &colors_vbo);
        glBindBuffer(GL_ARRAY_BUFFER, colors_vbo);
        glBufferData(GL_ARRAY_BUFFER, sizeof VERTEX_COLORS_, VERTEX_COLORS_, GL_STATIC_DRAW);

        glVertexAttribPointer(COLORS_INDEX_, 3, GL_FLOAT, GL_FALSE, 0, NULL);
        glEnableVertexAttribArray(COLORS_INDEX_);

        glBindVertexArray(0);
        glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 0);
        glBindBuffer(GL_ARRAY_BUFFER, 0);

        shaderProgram = load_shader();
        uniformTransform = glGetUniformLocation(shaderProgram, "transform");
        glViewport(0,0, sz.width, sz.height);
    }

    //Renders a rotating rainbow-colored cube on a blueish background
    static void render_scene(const cv::Size& sz, const double& x, const double& y, const double& angleMod, GLuint& vao, GLuint& shaderProgram, GLuint& uniformTransform) {
        glViewport(0,0, sz.width, sz.height);
        //Use the prepared shader program
        glUseProgram(shaderProgram);

        //Scale and rotate the cube depending on the current time.
        float angle =  fmod(double(cv::getTickCount()) / double(cv::getTickFrequency()) + angleMod, 2 * M_PI);
        double scale = 0.05;
        cv::Matx44f scaleMat(
                scale, 0.0, 0.0, 0.0,
                0.0, scale, 0.0, 0.0,
                0.0, 0.0, scale, 0.0,
                0.0, 0.0, 0.0, 1.0);

        cv::Matx44f rotXMat(
                1.0, 0.0, 0.0, 0.0,
                0.0, cos(angle), -sin(angle), 0.0,
                0.0, sin(angle), cos(angle), 0.0,
                0.0, 0.0, 0.0, 1.0);

        cv::Matx44f rotYMat(
                cos(angle), 0.0, sin(angle), 0.0,
                0.0, 1.0, 0.0, 0.0,
                -sin(angle), 0.0,cos(angle), 0.0,
                0.0, 0.0, 0.0, 1.0);

        cv::Matx44f rotZMat(
                cos(angle), -sin(angle), 0.0, 0.0,
                sin(angle), cos(angle), 0.0, 0.0,
                0.0, 0.0, 1.0, 0.0,
                0.0, 0.0, 0.0, 1.0);

        cv::Matx44f translateMat(
                1.0, 0.0, 0.0, 0.0,
                0.0, 1.0, 0.0, 0.0,
                0.0, 0.0, 1.0, 0.0,
                  x,   y, 0.0, 1.0);

        //calculate the transform
        cv::Matx44f transform = scaleMat * rotXMat * rotYMat * rotZMat * translateMat;
        //set the corresponding uniform
        glUniformMatrix4fv(uniformTransform, 1, GL_FALSE, transform.val);
        //Bind our vertex array
        glBindVertexArray(vao);
        //Draw
        glDrawElements(GL_TRIANGLES, TRIANGLES_ * 3, GL_UNSIGNED_SHORT, NULL);
    }
public:
    ManyCubesDemoPlan(cv::Size sz) : Plan(sz) {
        vao_ = std::vector<GLuint>(MAX_NUM_CUBES_);
        shaderProgram_ = std::vector<GLuint>(MAX_NUM_CUBES_);
        uniformTransform_ = std::vector<GLuint>(MAX_NUM_CUBES_);
    }

    void setup(cv::Ptr<V4D> window) override {
        for(int32_t i = 0; i < NUM_CONTEXTS_; ++i) {
            window->gl(i, [](const int32_t& ctxIdx, const cv::Size& sz, std::vector<GLuint>& vao, std::vector<GLuint>& shader, std::vector<GLuint>& uniformTrans){
                for(size_t j = 0; j < MAX_NUM_CUBES_ / NUM_CONTEXTS_; ++j) {
                    //initialize per-cube GL objects using a global cube index
                    size_t k = ctxIdx * (MAX_NUM_CUBES_ / NUM_CONTEXTS_) + j;
                    init_scene(sz, vao[k], shader[k], uniformTrans[k]);
                }
            }, size(), vao_, shaderProgram_, uniformTransform_);
        }
    }

    void infer(cv::Ptr<V4D> window) override {
        window->gl([](){
            //Clear the background
            glClearColor(0.0, 0.0, 0.0, 1);
            glClear(GL_COLOR_BUFFER_BIT);
        });

        //Render using multiple OpenGL contexts
        for(int32_t i = 0; i < NUM_CONTEXTS_; ++i) {
            window->gl(i, [](const int32_t& ctxIdx, const cv::Size& sz, std::vector<GLuint>& vao, std::vector<GLuint>& shader, std::vector<GLuint>& uniformTrans){
                for(size_t j = 0; j < MAX_NUM_CUBES_ / NUM_CONTEXTS_; ++j) {
                    size_t gw = MAX_NUM_CUBES_ / NUM_CONTEXTS_;
                    //global cube index: context index * cubes-per-context + j
                    size_t k = ctxIdx * gw + j;
                    size_t xpos =  k % gw;
                    size_t ypos =  k / gw;
                    double x = (xpos * (2.0 /gw)) -1;
                    double y = (ypos * (2.0 /gw)) -1;
                    double angle = ((x * sin((x * y) * 2 * M_PI)) + (y * cos((x * y) * 2 * M_PI)) / 2.0);
                    render_scene(sz, x, y + 1, angle, vao[k], shader[k], uniformTrans[k]);
                }
            }, size(), vao_, shaderProgram_, uniformTransform_);
        }

        window->write();
    }
};

int main() {
    cv::Ptr<ManyCubesDemoPlan> plan = new ManyCubesDemoPlan(cv::Size(1920, 1080));
    cv::Ptr<V4D> window = V4D::make(plan->size(), "Many Cubes Demo", NANOVG);

    constexpr double FPS = 60;
    constexpr const char* OUTPUT_FILENAME = "many_cubes-demo.mkv";
    auto sink = makeWriterSink(window, OUTPUT_FILENAME, FPS, plan->size());
    window->setPrintFPS(true);
    window->setShowFPS(false);
    window->setShowTracking(false);
    window->setSink(sink);
    window->run(plan, ManyCubesDemoPlan::NUM_WORKERS_);

    return 0;
}

seclorum commented 7 months ago

@kallaballa Impressive bit of showcasing. Bookmarked for further entertainment .. this thread is quite interesting!

kallaballa commented 7 months ago

I forgot to remove video writing. Without it I get 1380 FPS :). I have to redo my experiments.

seclorum commented 7 months ago

@kallaballa Will you push a WebAssembly build of things at some point in the future?

kallaballa commented 7 months ago

@seclorum Next alpha release!