godotengine / godot-proposals

Godot Improvement Proposals (GIPs)
MIT License
1.17k stars 98 forks source link

Implement threading to speed up video saving in Movie Maker mode #4751

Open Calinou opened 2 years ago

Calinou commented 2 years ago

Related to https://github.com/godotengine/godot-proposals/issues/4752.

Describe the project you are working on

The Godot editor :slightly_smiling_face:

Describe the problem or limitation you are having in your project

Godot now has offline video recording functionality. This works well, but it's fairly slow when recording at high resolutions since encoding the MJPEG can take a relatively long time. This is even worse when using PNG output as PNG is slow to encode compared to MJPEG.

Describe the feature / enhancement and how it helps to overcome the problem or limitation

Move MJPEG/PNG encoding to a thread pool, so it can be performed in parallel and avoid blocking the main thread (so it can keep on rendering frames).

This has a potential to double video writing performance (or more, on CPUs with higher core counts).

Describe how your proposal will work, with code, pseudo-code, mock-ups, and/or diagrams

If this enhancement will not be used often, can it be worked around with a few lines of script?

No.

Is there a reason why this should be core and not an add-on in the asset library?

This is about improving Movie Maker recording performance to make it more usable for video production.

myaaaaaaaaa commented 1 year ago

Alternatively for lossless video, there's also the option to add support for outputting .bmp sequences, which should be substantially faster than compressing to .png.

Calinou commented 1 year ago

Alternatively for lossless video, there's also the option to add support for outputting .bmp sequences, which should be substantially faster than compressing to .png.

This is something I indirectly proposed in https://github.com/godotengine/godot-proposals/issues/6390 :slightly_smiling_face:

Godot doesn't have save_bmp() or save_tga() functions in Image yet though. It should be relatively easy to implement those, at least for RGB8 and RGBA8 formats.

Adding a way to save uncompressed images will help improve performance to an extent, but the benefit is often much smaller compared to threading. See this comment and its replies for an example of this in SHADERed.

myaaaaaaaaa commented 1 year ago

After further investigation, I've found that uncompressed images do in fact provide optimal performance, in the sense that the fps is identical to just disabling frame writing altogether:

diff --git a/main/main.cpp b/main/main.cpp
index 87e83f65c1..dc051e74ce 100644
--- a/main/main.cpp
+++ b/main/main.cpp
@@ -3277,7 +3277,7 @@ bool Main::iteration() {
        RID main_vp_rid = RenderingServer::get_singleton()->viewport_find_from_screen_attachment(DisplayServer::MAIN_WINDOW_ID);
        RID main_vp_texture = RenderingServer::get_singleton()->viewport_get_texture(main_vp_rid);
        Ref<Image> vp_tex = RenderingServer::get_singleton()->texture_2d_get(main_vp_texture);
-       movie_writer->add_frame(vp_tex);
    }

    if ((quit_after > 0) && (Engine::get_singleton()->_process_frames >= quit_after)) {

However, there is still overhead compared to not having Movie Maker on, which is only resolved after also disabling texture downloading:

diff --git a/main/main.cpp b/main/main.cpp
index 87e83f65c1..02f83e30d1 100644
--- a/main/main.cpp
+++ b/main/main.cpp
@@ -3275,9 +3275,7 @@ bool Main::iteration() {

    if (movie_writer) {
        RID main_vp_rid = RenderingServer::get_singleton()->viewport_find_from_screen_attachment(DisplayServer::MAIN_WINDOW_ID);
        RID main_vp_texture = RenderingServer::get_singleton()->viewport_get_texture(main_vp_rid);
-       Ref<Image> vp_tex = RenderingServer::get_singleton()->texture_2d_get(main_vp_texture);
-       movie_writer->add_frame(vp_tex);
    }

    if ((quit_after > 0) && (Engine::get_singleton()->_process_frames >= quit_after)) {

In other words, the overhead comes from a render pipeline stall rather than compressing the image data:

4555 gles-sync

Desired behavior:

6622 gles-async

(Source)

I don't think there's much more that can be done from the encoding side.

Calinou commented 1 year ago

After further investigation, I've found that uncompressed images do in fact provide optimal performance

Do you have a branch for saving uncompressed images? This sounds very interesting, I'd like to test it :slightly_smiling_face:

However, there is still overhead compared to not having Movie Maker on, which is only resolved after also disabling texture downloading:

This is likely resolved by using hardware readbacks, which introduces a 2-3 frame delay. It should be possible to compensate this delay when writing the video, so that the audio is perfectly synchronized with the video like before.

Hardware readbacks are something that would be worth exposing to GDScipt too, as they make a lot of things possible that shaders alone can't do (e.g. for gameplay purposes).

myaaaaaaaaa commented 1 year ago

Do you have a branch for saving uncompressed images? This sounds very interesting, I'd like to test it :slightly_smiling_face:

See below for a proof-of-concept that you can download and git apply, obviously not written in a way suitable for inclusion into Godot:

 servers/movie_writer/movie_writer_pngwav.cpp | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/servers/movie_writer/movie_writer_pngwav.cpp b/servers/movie_writer/movie_writer_pngwav.cpp
index 12f338dfb6..74720c5879 100644
--- a/servers/movie_writer/movie_writer_pngwav.cpp
+++ b/servers/movie_writer/movie_writer_pngwav.cpp
@@ -139,11 +139,23 @@ Error MovieWriterPNGWAV::write_begin(const Size2i &p_movie_size, uint32_t p_fps,

 Error MovieWriterPNGWAV::write_frame(const Ref<Image> &p_image, const int32_t *p_audio_data) {
    ERR_FAIL_COND_V(!f_wav.is_valid(), ERR_UNCONFIGURED);
+   ERR_FAIL_COND_V(p_image->get_format() != Image::FORMAT_RGB8, ERR_UNCONFIGURED);
+
+   if (getenv("PPM")) {
+       WARN_PRINT_ONCE("Saving as PPM");
+
+       Vector<uint8_t> buffer = p_image->get_data();
+       Ref<FileAccess> fi = FileAccess::open(base_path + zeros_str(frame_count) + ".ppm", FileAccess::WRITE);
+       fi->store_string(String() + "P6\n" + itos(p_image->get_width()) + " " + itos(p_image->get_height()) + "\n255\n");
+       fi->store_buffer(buffer.ptr(), buffer.size());
+   } else {
+       WARN_PRINT_ONCE("Saving as PNG");

    Vector<uint8_t> png_buffer = p_image->save_png_to_buffer();

    Ref<FileAccess> fi = FileAccess::open(base_path + zeros_str(frame_count) + ".png", FileAccess::WRITE);
    fi->store_buffer(png_buffer.ptr(), png_buffer.size());
+   }
    f_wav->store_buffer((const uint8_t *)p_audio_data, audio_block_size);

    frame_count++;

It saves images in .ppm, which is reasonably well-supported among standard tools like ImageMagick or ffmpeg, although .bmp should be preferred in a real implementation due to its native integration with Windows.

PeterMarques commented 1 year ago

Well i think by the file extension it is a vocation call. I need to do this.

Anyway, this is the issue that gather all the data about improving video recording of engine performance. I need to understand more about this hardware readbacks.

myaaaaaaaaa commented 1 year ago

In other words, the overhead comes from a render pipeline stall rather than compressing the image data:

4555 gles-sync

See a proof-of-concept solution for this below that renders to a ringbuffer, which allows Movie Maker to download frames that were rendered ring_buffer_size = 5 frames ago. Combined with the PPM patch above, this should essentially eliminate all overhead with using Movie Maker.

Note that the use of RD::TEXTURE_USAGE_CPU_READ_BIT was a hack to disable synchronization (fencing?), which requires ring_buffer_size to be large enough to avoid a race condition where frames are downloaded while in the middle of being rendered. This is essentially the Vulkan equivalent of synchronizing threads via sleep(), and a proper fence should be used in a real implementation.

However, assuming there were no bugs in my ringbuffer implementation, removing RD::TEXTURE_USAGE_CPU_READ_BIT appears to result in texture_get_data() waiting for the entire pipeline to finish rather than just the texture copy, negating the benefits of the ringbuffer.

I don't know if the best approach is to update texture_get_data() to wait for just the texture copy, or set RD::TEXTURE_USAGE_CPU_READ_BIT and fence manually from the Movie Maker side (assuming either is possible).

diff --git a/servers/movie_writer/movie_writer.cpp b/servers/movie_writer/movie_writer.cpp
index 9df05ba94a..6348238534 100644
--- a/servers/movie_writer/movie_writer.cpp
+++ b/servers/movie_writer/movie_writer.cpp
@@ -184,7 +184,32 @@ void MovieWriter::add_frame() {

    RID main_vp_rid = RenderingServer::get_singleton()->viewport_find_from_screen_attachment(DisplayServer::MAIN_WINDOW_ID);
    RID main_vp_texture = RenderingServer::get_singleton()->viewport_get_texture(main_vp_rid);
-   Ref<Image> vp_tex = RenderingServer::get_singleton()->texture_2d_get(main_vp_texture);
+   RID main_vp_rd_texture = RenderingServer::get_singleton()->texture_get_rd_texture(main_vp_texture);
+
+   Vector2 vp_size = RD::get_singleton()->texture_size(main_vp_rd_texture);
+
+   constexpr int ring_buffer_size = 5;
+   static RID ring_buffer[ring_buffer_size];
+   static int ring_index = -1;
+   if (ring_index < 0) {
+       ring_index = 0;
+       RD::TextureFormat tf;
+       tf.width = vp_size.x;
+       tf.height = vp_size.y;
+       tf.format = RD::DATA_FORMAT_R8G8B8_SRGB;
+       tf.usage_bits = RD::TEXTURE_USAGE_CPU_READ_BIT | RD::TEXTURE_USAGE_CAN_COPY_FROM_BIT | RD::TEXTURE_USAGE_CAN_COPY_TO_BIT;
+       for (int i = 0; i < ring_buffer_size; i++) {
+           ring_buffer[i] = RD::get_singleton()->texture_create(tf, RD::TextureView());
+       }
+   }
+
+   RD::get_singleton()->texture_copy(main_vp_rd_texture, ring_buffer[ring_index], Vector3(), Vector3(), Vector3(vp_size.x, vp_size.y, 1), 0, 0, 0, 0);
+   ring_index = (ring_index+1)%ring_buffer_size;
+   if (Engine::get_singleton()->get_frames_drawn() < ring_buffer_size)
+       return;
+
+   Vector<uint8_t> data = RD::get_singleton()->texture_get_data(ring_buffer[ring_index], 0);
+   Ref<Image> vp_tex = Image::create_from_data(vp_size.x, vp_size.y, false, Image::Format::FORMAT_RGB8, data);

    RenderingServer::get_singleton()->viewport_set_measure_render_time(main_vp_rid, true);
    cpu_time += RenderingServer::get_singleton()->viewport_get_measured_render_time_cpu(main_vp_rid);
Calinou commented 1 year ago

Great work :slightly_smiling_face:

See a proof-of-concept solution for this below that renders to a ringbuffer, which allows Movie Maker to download frames that were rendered ring_buffer_size = 5 frames ago. Combined with the PPM patch above, this should essentially eliminate all overhead with using Movie Maker.

What about audio synchronization? I assume we should intentionally delay audio by a duration equivalent to 5 frames to ensure the saved image and sound stay in sync.

myaaaaaaaaa commented 1 year ago

What about audio synchronization? I assume we should intentionally delay audio by a duration equivalent to 5 frames to ensure the saved image and sound stay in sync.

Yes, audio samples should be stored inside the ringbuffer, so something like this pseudocode:

-   static RID ring_buffer[ring_buffer_size];
+   static struct {
+       RID frame_texture;
+       LocalVector<int32_t> audio_mix_buffer;
+   } ring_buffer[ring_buffer_size];

...

    RD::get_singleton()->texture_copy(main_vp_rd_texture, ring_buffer[ring_index].frame_texture, Vector3(), Vector3(), Vector3(vp_size.x, vp_size.y, 1), 0, 0, 0, 0);
+   ring_buffer[ring_index].audio_mix_buffer = audio_mix_buffer;

...

    Vector<uint8_t> data = RD::get_singleton()->texture_get_data(ring_buffer[ring_index].frame_texture, 0);
    Ref<Image> vp_tex = Image::create_from_data(vp_size.x, vp_size.y, false, Image::Format::FORMAT_RGB8, data);
+   audio_mix_buffer = ring_buffer[ring_index].audio_mix_buffer;