godotengine / godot-proposals

Godot Improvement Proposals (GIPs)
MIT License
1.07k stars · 69 forks

Unit/integration testing: Testing graphical and UI code. #1760

Open bruvzg opened 3 years ago

bruvzg commented 3 years ago

Describe the project you are working on: Godot engine.

Describe the problem or limitation you are having in your project: Unit testing was introduced in godotengine/godot#40148, but there is currently no way to automatically test GUI- or rendering-related code.

Related proposals: #1307 (testing contexts) and #1533 (the old tests had at least some rendering and UI coverage)

Describe the feature / enhancement and how it helps to overcome the problem or limitation: Implement an off-screen DisplayServer for use on headless CI, make it compatible with software Vulkan (SwiftShader) / OpenGL (OSMesa) implementations so it can run on CI machines without a GPU, and add a testing framework context with an active rendering pipeline (initialized display and rendering servers, and the normal project main loop).

Describe how your proposal will work, with code, pseudocode, mockups, and/or diagrams:

  1. The testing framework renders small, simple scenes for isolated graphical features (materials, shaders, lighting/shadows, etc.) or for the reaction of GUI elements to simulated input events, with fixed time steps for deterministic behavior.
  2. It takes screenshots at predefined moments in time (to test multiple rendering steps in succession, and to test particles/animations), and stores them (probably downscaled, to avoid overly large files and to smooth out minor differences).
  3. Screenshots are compared (by the engine or an external script) to the reference images, and marked for manual inspection if they show substantial differences (by adding a thick red border to the image, for example).
  4. Screenshots are uploaded as a build artifact (an archive with one image per test suite).
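Steps 2–3 could be sketched roughly as below. This is a minimal pure-Python illustration operating on nested lists of RGB tuples; a real implementation would use Godot's Image class or an image library, and the threshold values here are assumptions that would need tuning per test suite.

```python
# Sketch of the screenshot-comparison step; "images" are nested lists of
# (r, g, b) tuples. Thresholds are assumed values, not from the proposal.

DIFF_THRESHOLD = 0.01   # assumed: fraction of differing pixels that fails a test
CHANNEL_TOLERANCE = 8   # assumed: per-channel noise tolerance (0-255 scale)

def diff_ratio(candidate, reference):
    """Fraction of pixels whose channels differ beyond the tolerance."""
    total = len(candidate) * len(candidate[0])
    differing = 0
    for row_a, row_b in zip(candidate, reference):
        for px_a, px_b in zip(row_a, row_b):
            if any(abs(a - b) > CHANNEL_TOLERANCE for a, b in zip(px_a, px_b)):
                differing += 1
    return differing / total

def mark_for_inspection(candidate, border=2):
    """Return a copy of the image with a thick red border, the proposal's
    "flag for manual inspection" marker from step 3."""
    red = (255, 0, 0)
    width = len(candidate[0])
    bordered = [[red] * (width + 2 * border) for _ in range(border)]
    for row in candidate:
        bordered.append([red] * border + list(row) + [red] * border)
    bordered += [[red] * (width + 2 * border) for _ in range(border)]
    return bordered
```

A CI script would then call `diff_ratio` per screenshot and save the `mark_for_inspection` output into the uploaded artifact archive.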

[mockup image: gr_test]

If this enhancement will not be used often, can it be worked around with a few lines of script?: It can be used as part of CI to detect rendering, physics, and GUI regressions, and to quickly test specific hardware or driver versions for rendering issues (the same context should be usable with the normal DisplayServers as well).

Is there a reason why this should be core and not an add-on in the asset library?: It should be possible to achieve this with a module or a GDScript project, but it is probably better to have testing-related functionality in core, for cleaner CI configurations and to avoid duplicating code across multiple test projects.

Xrayez commented 3 years ago

Test contexts

The minimal testing context was introduced in godotengine/godot#40980 without rendering capabilities, but it has been working well enough for unit testing so far.

The way I see it, it may be feasible to just introduce another integration test context manually. I've previously attempted to create test contexts using doctest's dynamic filtering in https://github.com/Xrayez/godot/tree/test-contexts, but it may be too complex to maintain, and error-prone.

The main challenge is registering setup/teardown methods for a test environment, which doctest does not support directly (at least not without code duplication). The setup/teardown mechanism doctest suggests is SUBCASEs, but I think those work better for avoiding duplication within a test case itself, not for preparing the test environment.

The entry point for unit and integration testing could be rewritten to accept different test contexts.

This way, I think it would still be possible to use doctest for those tests (like godotengine/godot#42938). It means the entry point would go through an additional interface layer, so to speak.

This kind of setup would also help #1533, because it means no compatibility breakage would have to happen in the first place. (godotengine/godot#40148 did not preserve compatibility with the old tests.)

Graphical and UI code testing

I think testing graphical and UI code requires a MainLoop to be running. It's totally possible to feed input events via code, as seen in Xrayez/godot-testbed#5:

extends "res://addons/gut/test.gd"

# https://github.com/godotengine/godot/issues/32597

class TabContainerGuiInputCrash extends TabContainer:

    var ev = InputEventMouseButton.new()

    func _ready():
        var pm := PopupMenu.new()
        set_popup(pm)
        pm.queue_free()

        yield(get_tree(), "idle_frame")
        yield(get_tree(), "idle_frame")
        yield(get_tree(), "idle_frame")

        ev.pressed = true
        ev.button_index = BUTTON_LEFT
        ev.button_mask = BUTTON_LEFT
        ev.position = Vector2(0, 14)

        Input.parse_input_event(ev)

        yield(get_tree(), "idle_frame")
        yield(get_tree(), "idle_frame")

        Input.parse_input_event(ev)
        Input.parse_input_event(ev)

var container

func setup():
    var gut_window = get_parent().get_node('Gut')
    gut_window.hide() # need to hide to properly detect input event

    container = TabContainerGuiInputCrash.new()
    add_child(container)

func test_tab_container_gui_input():
    yield(yield_for(1.0, 'Hopefully no crash happens.'), YIELD)
    assert_true(true, "No crash, great!")

func teardown():
    container.queue_free()

The --fixed-fps and --disable-render-loop command-line options could potentially be used to speed up the simulation and to control the rendering loop via code with RenderingServer.force_draw(). See also godotengine/godot#43260; I'm not sure whether those methods would actually be useful for this.
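A CI wrapper could launch a test project with those flags; a minimal sketch, where the Godot binary path, the project path, and the fixed-FPS value are all assumptions:

```python
import subprocess

def build_test_command(godot_binary, project_path, fixed_fps=60):
    """Assemble a command line that runs a test project deterministically:
    --fixed-fps forces fixed time steps, and --disable-render-loop leaves
    drawing to explicit RenderingServer.force_draw() calls from script."""
    return [
        godot_binary,
        "--path", project_path,
        "--fixed-fps", str(fixed_fps),
        "--disable-render-loop",
    ]

def run_tests(godot_binary, project_path):
    # Hypothetical CI entry point: run the project, propagate its exit code.
    return subprocess.run(build_test_command(godot_binary, project_path)).returncode
```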

> 3. Screenshots are compared (by the engine or an external script) to the reference images, and marked for manual inspection if they show substantial differences (by adding a thick red border to the image, for example).

doctest could be used for this as for GDScript integration tests #1429, but may be overkill, so perhaps an extra step would be indeed required to do this.

But in theory, all this could be done from within a Godot project running on CI. This is where testing frameworks like GUT shine, in my opinion. For instance, I've been successfully running unit tests in Goost, but we still need a way to render stuff on CI.

c0d1f1ed commented 3 years ago

I noticed at https://bruvzg.github.io/using-godot-with-swiftshader-software-vulkan-emulation.html that you had to increase SwiftShader's bound descriptor set limit to 16 to get it to work with Godot. I'm curious why that's required. Currently only just over half of the Vulkan drivers support 16 or more: https://vulkan.gpuinfo.org/displaydevicelimit.php?name=maxBoundDescriptorSets. While this metric does not take deployments into account, it still seems to me that important classes of GPUs only support 8, or 4, bound descriptor sets.

I don't mind upstreaming this change to permanently increase it, but I'd love to understand how an engine like Godot uses more than 4 descriptor sets, and what might be a good balance. It seems like no GPU has 16, so I guess 8 would already suffice? Any significant advantage from increasing it to 32? Thanks!

bruvzg commented 3 years ago

> I noticed at https://bruvzg.github.io/using-godot-with-swiftshader-software-vulkan-emulation.html that you had to increase SwiftShader's bound descriptor set limit to 16 to get it to work with Godot.

Godot's RenderingDeviceVulkan supports up to 16 descriptor sets, but 6 should be fine for the current version.

Edit: Actually it might work with 4 since https://github.com/godotengine/godot/pull/44175 was merged.

bruvzg commented 3 years ago

> Actually it might work with 4 since godotengine/godot#44175 was merged.

I have checked the current master of Godot, and it's working with a limit of 4 descriptor sets, so this change is not necessary anymore.

fire commented 3 years ago

Has anyone tried Robot Framework for visual tests?

Need to evaluate:

https://robotframework.org/#documentation

We can use Robot Framework and pick one of the available libraries that supports Vulkan.

fire commented 3 years ago

I have made a prototype using robotframework.

This sample does two things:

https://github.com/fire/robotframework-godot

fire commented 3 years ago

Added a video recording task.

Using vmaf we were able to get a score of 97.430362 for the same video and 65.083790 for different videos.

```
.\data\ffmpeg-N-100672-gf3f5ba0bf8-win64-lgpl-shared-vulkan\bin\ffmpeg.exe -i default_1.webm -pix_fmt yuv420p default_1.y4m
.\data\ffmpeg-N-100672-gf3f5ba0bf8-win64-lgpl-shared-vulkan\bin\ffmpeg.exe -i godot_1.webm -pix_fmt yuv420p godot_1.y4m
copy godot_1.y4m godot_2.y4m
.\data\vmaf.exe --reference .\godot_1.y4m --distorted .\default_1.y4m
.\data\vmaf.exe --reference .\godot_1.y4m --distorted .\godot_2.y4m
```

I haven't written a script for it yet, but it's also possible to take a screenshot, run comparison stats, and produce a visual diff. I used a single-executable build of reg-cli.
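Such a script might gate on the vmaf score like the sketch below. The pass threshold is an assumption (the runs above scored ~97.4 for the same video and ~65.1 for different ones), and the exact output format varies between vmaf versions, so the parser is illustrative only.

```python
import re

# Assumed pass threshold, sitting between the ~97.4 "same video" score
# and the ~65.1 "different video" score measured above.
VMAF_PASS_THRESHOLD = 90.0

def parse_vmaf_score(output):
    """Extract the score from vmaf's console output; assumes a line like
    'VMAF score: 97.430362' (the exact format varies between versions)."""
    match = re.search(r"VMAF score\s*[:=]\s*([0-9.]+)", output)
    if match is None:
        raise ValueError("no VMAF score found in output")
    return float(match.group(1))

def videos_match(output, threshold=VMAF_PASS_THRESHOLD):
    """True if the distorted video is close enough to the reference."""
    return parse_vmaf_score(output) >= threshold
```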

Calinou commented 2 years ago

I have a proof of concept that uses Nut.js here: https://github.com/Calinou/godot/tree/add-editor-ui-tests/misc/ui_tests

For the editor, I don't know what kind of "workflows" would be best to apply within the automated tests though. Creating a basic project automatically, running it then stopping it would be useful, but it wouldn't be testing a whole lot of functionality.

Also, I haven't figured out how to run it on a headless server (with Xvfb + Lavapipe/SwiftShader) yet.

fire commented 2 years ago

I was using Robot Framework because it can run the editor, using image recognition to find buttons, and execute the process under SwiftShader.

@nikitalita worked on SwiftShader CI/CD integration.

Edited:

I evaluated Nut.js; it doesn't seem to support everything that's needed. https://robotframework.org/#resources

nikitalita commented 1 year ago

My initial attempts at visual regression testing have revealed that output can vary wildly between video cards and even between driver versions. It's not really noticeable to the human eye, but a 1-to-1 comparison, or even a fuzzy comparison with a >95% similarity threshold, will fail on frame captures unless the test environment is set up exactly the same for the baseline and the subsequent tests (preferably the exact same machine). @myaaaaaaaaa have you encountered this?

Calinou commented 1 year ago

> My initial attempts at visual regression testing have revealed that output can vary wildly between video cards and even between driver versions. It's not really noticeable to the human eye, but a 1-to-1 comparison, or even a fuzzy comparison with a >95% similarity threshold, will fail on frame captures unless the test environment is set up exactly the same for the baseline and the subsequent tests (preferably the exact same machine). @myaaaaaaaaa have you encountered this?

See How (not) to test graphics algorithms. A dssim check should work decently if it has a large enough threshold, but in general, it's recommended to have a few "complete" test images rather than a lot of "partial" tests covering isolated features. This may be counter-intuitive, but it makes checking for regressions a lot less time-consuming. We should be careful about "alarm fatigue" in general when it comes to this kind of regression testing, as it's an easy trap to fall into.
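A dssim-based check could be wired into CI along these lines. The threshold here is an assumption that would need tuning (too tight, and driver differences cause the alarm fatigue mentioned above); the parsing assumes dssim's usual "score, then filename" output line.

```python
import subprocess

# dssim reports 0.0 for identical images; larger means more different.
# This threshold is an assumed value, not a recommendation.
DSSIM_THRESHOLD = 0.02

def parse_dssim_score(stdout):
    """dssim prints '<score>\t<file>' per compared image; take the score."""
    return float(stdout.split()[0])

def images_match(reference, candidate, threshold=DSSIM_THRESHOLD):
    """Invoke the dssim CLI on two image files and gate on its score."""
    result = subprocess.run(["dssim", reference, candidate],
                            capture_output=True, text=True, check=True)
    return parse_dssim_score(result.stdout) <= threshold
```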

mariomadproductions commented 2 months ago

I wonder if this would be useful for "whole game" tests. The developer would record the inputs, RNG seed and movie for a playthrough. The movie or perceptual hash of the movie would be stored, and then the inputs and RNG seed would be used to replay the movie and compare with the developer's playthrough. This could be useful to automatically test if a game still functions correctly when ported to another platform/godot version. A self-test option could also be included in published builds, for players to use. For the self-test, as the full thing might take too long for large and performance-heavy games, there could just be an option for a cut-down playthrough, or playthrough of a test level/test suite.
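A perceptual hash with an allowed Hamming distance could give that leeway. A minimal average-hash sketch over grayscale frames, where the frame representation (nested lists of 0-255 values) and the distance budget are assumptions:

```python
def average_hash(pixels):
    """Average hash of a grayscale frame: each bit records whether a pixel
    is brighter than the frame's mean. Real implementations downscale the
    frame first (e.g. to 8x8) so the hash is a fixed 64 bits."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]

def hamming_distance(hash_a, hash_b):
    return sum(a != b for a, b in zip(hash_a, hash_b))

def frames_match(frame_a, frame_b, max_distance=4):
    """Tolerate a few flipped bits to absorb small rendering differences;
    the distance budget is an assumed value that would need tuning."""
    return hamming_distance(average_hash(frame_a), average_hash(frame_b)) <= max_distance
```

Hashing sampled frames of the replayed movie and comparing them against the recorded playthrough's hashes would keep the stored baseline small, at the cost of only catching fairly coarse differences.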

Calinou commented 2 months ago

> I wonder if this would be useful for "whole game" tests.

Godot's physics engines are not deterministic, so this wouldn't be useful unless your game doesn't rely on the physics engine at all (and uses its own deterministic physics implementation).

Subtle differences in rendering (due to different GPU hardware or driver versions) can also be introduced, which would cause the hash to be invalid.

mariomadproductions commented 2 months ago

Makes sense, regarding the physics.

For the differences in rendering, I think perceptual hashes are designed to allow leeway for small changes. And I'd think you'd want to detect large differences in a game when using different GPU/driver configurations?

But maybe this should be a separate discussion thread, actually.

Calinou commented 2 months ago

> For the differences in rendering, I think perceptual hashes are designed to allow leeway for small changes. And I'd think you'd want to detect large differences in a game when using different GPU/driver configurations?

Yes, tools like dssim can be used to calculate a similarity score between two images. Tweaking the score threshold is an art in itself though, and you need to record your videos using lossless compression, which results in huge files.