godotengine / godot

Godot Engine – Multi-platform 2D and 3D game engine
https://godotengine.org
MIT License
89.26k stars 20.23k forks source link

Queue free on node with script running async coroutine with tree timer still lets coroutine run one last time on next frame #93608

Open hsandt opened 2 months ago

hsandt commented 2 months ago

Tested versions

System information

Godot v4.2.1.stable - Ubuntu 22.04.4 LTS 22.04 - X11 - Vulkan (Mobile) - dedicated NVIDIA GeForce GTX 860M (nvidia; 535.161.07) - Intel(R) Core(TM) i7-4710HQ CPU @ 2.50GHz (8 Threads)

Issue description

Calling queue_free on node, or a parent of a node that is currently running a coroutine (async method) that awaits for get_tree().create_timer(...) with argument process_in_physics=true will, under a certain timing, still lets the coroutine run one last time on the next frame.

I expect queue_free to delay node destruction to the end of the frame, but not at late as the end of the next frame.

To repro the bug, I had to use await get_tree().create_timer(0, false, true).timeout inside the coroutine: I found await get_tree().physics_frame not to repro the bug, and that process_in_physics=true was needed to match process_in_physics=true in the create_timer mentioned above. Or, you can use create_timer with process_in_physics=false in both the main script and coroutine to match, it will also cause the bug.

Some timer durations work, but not others, I'm not sure why, maybe it's related to the main and sub coroutines having to get in sync at the end of the frame.

On main code side (calling queue_free), both await get_tree().physics_frame and create_timer worked to repro the bug.

Finally, free doesn't cause the bug but can obviously cause other issues when dealing with more complex nodes used elsewhere.

Real-world case

My enemy AI is run via an async method, which is very convenient to define an action flow over time, but has the disadvantage of not being easily cancellable. Fortunately, destroying a node, or a parent node of the one with the script running the coroutine (at least when not static) will stop any running coroutine. The issue is just that it's not frame-precise and therefore I can have remarkable leaks of the coroutine even after a manual game restart consisting in clearing all entities manually (such as a projectile being spawned out of the blue on the first frame after restart).

Using process_in_physics=false should be the minimal fix but I really want to sync my AI logic and actions with physics process to be safe. So my workaround was to make sure to clear all projectiles at least one frame after clearing the enemies.

Steps to reproduce

  1. Open MRP demo scene, made of a Main node and an AsyncRun node
  2. Play the demo scene and observe output

You should see LOOP 3 times, the last one happening after queue_free:

ASyncRun: start LOOP AsyncRun: _physics_process frame 0 Main: _physics_process frame 0 ASyncRun: start LOOP AsyncRun: _physics_process frame 1 Main: _physics_process frame 1 Main: called async_run_node.queue_free(), AsyncRun coroutine should stop by next frame ASyncRun: start LOOP ... despite node being already queued for deletion! Main: _physics_process frame 2 Main: _physics_process frame 3 Main: _physics_process frame 4

  1. Swap the order of Main and AsyncRun
  2. Repeat step 2: note how the order of physics_process frames counter debug changes in the Output log, but eventually the same problem appears
  3. Variant: in AsyncRun.gd, replace create_timer 3rd argument process_in_physics to false. In main.gd, replace the secondawait get_tree().physics_frame (after run_async call) with await get_tree().replace (0.017, false, false).timeout (if you uncomment the existing line, remember to replace the 3rd argument with false there too) => also reproduces the bug

Simplified code if you prefer running it in your own project:

# AsyncRun.gd
extends Node2D

var frames: int = 0

func run_async():
    # the loop is merely to make it easier to spam LOOP after every timer
    while true:
        print("ASyncRun: start LOOP")
        if is_queued_for_deletion():
            print("... despite node being already queued for deletion!")

        # 0, 1/60. (0.0166...) and 2/60. (0.0333...) seem to trigger the bug,
        # but not 3/60. nor 4/60.
        # you can also pass process_in_physics=false but then
        # you must also use create_timer in main.gd, also with
        # process_in_physics=false
        await get_tree().create_timer(0, false, true).timeout

func _physics_process(_delta):
    if frames < 5:
        print("AsyncRun: _physics_process frame %d" % frames)
        frames += 1
# main.gd
extends Node2D

@onready var async_run_node = $"../AsyncRun"

var frames: int = 0

func _ready():
    # just to be safe on start
    await get_tree().physics_frame

    async_run_node.run_async()

    await get_tree().physics_frame
    # or
    # await get_tree().create_timer(0.017, false, true).timeout

    async_run_node.queue_free()
    print("Main: called async_run_node.queue_free(), AsyncRun coroutine should stop by next frame")

func _physics_process(_delta):
    if frames < 5:
        print("Main: _physics_process frame %d" % frames)
        frames += 1

Tree hierarchy:

Minimal reproduction project (MRP)

v4.2.2 - Queue free on node fails to stop coroutine last run after tree timer.zip

4X3L82 commented 2 months ago

Emphasis mine:

My enemy AI is run via an async method, which is very convenient to define an action flow over time, but has the disadvantage of not being easily cancellable. I really want to sync my AI logic and actions with physics process to be safe

Don't you want it both ways? First you want async AI logic, but at the same time you want to sync AI logic? Comes across as wanting to eat your cake and have it too.

Plus, if you want things to be "physics process"-safe, shouldn't you put all your AI logic into _physics_process()? Personally, that sounds like the wrong place (overkill in frequency, not to mention enemy logic isn't a "physics" thing), but you are the one wanting it to be "physics process"-safe.

ajreckof commented 2 months ago

This is the log I have (cut to the usefull part), if you do not have something similar then you can ignore whatever I'm stating below:

AsyncRun: _physics_process frame 4
Main: _physics_process frame 4
ASyncRun: start LOOP
Main: called async_run_node.queue_free(), AsyncRun coroutine should stop by next frame
AsyncRun: _physics_process frame 5
Main: _physics_process frame 5
ASyncRun: start LOOP
... despite node being already queued for deletion!
Main: _physics_process frame 6

The signal physics_frame is called before the associated function _physic_process. This means where you thought this was happening at the end of the previous frame it was happening at the start of the next frame it then went on doing physic_process for all nodes and finally doing a turn of all timers tied to physic process (which is where you have the call being done while the node is queued for deletion) finally it would free all associated Node.

If you have a different log then it is either something that has already been fixed (tested on master) or something that is platform dependent (tested on MacOS)

hsandt commented 1 month ago

Emphasis mine:

My enemy AI is run via an async method, which is very convenient to define an action flow over time, but has the disadvantage of not being easily cancellable. I really want to sync my AI logic and actions with physics process to be safe

Don't you want it both ways? First you want async AI logic, but at the same time you want to sync AI logic? Comes across as wanting to eat your cake and have it too.

Plus, if you want things to be "physics process"-safe, shouldn't you put all your AI logic into _physics_process()? Personally, that sounds like the wrong place (overkill in frequency, not to mention enemy logic isn't a "physics" thing), but you are the one wanting it to be "physics process"-safe.

The second sync is about syncing enemy behavior to physics frame as opposed to render frames. I still use the async coroutine system for convenience (otherwise I need to write a giant match statement inside a big physics_process method to identify the different phases of the enemy behavior, with a bunch of custom timers to advance to next phase - or use a full-fledged system by a Behavior Tree). await get_tree().physics_frame for instance uses await but it explicitly waits for a physics frame. The idea is to avoid missing on frames on slow machines that the user could exploit to make the AI "dummier".

hsandt commented 1 month ago

The signal physics_frame is called before the associated function _physic_process. This means where you thought this was happening at the end of the previous frame it was happening at the start of the next frame it then went on doing physic_process for all nodes and finally doing a turn of all timers tied to physic process (which is where you have the call being done while the node is queued for deletion) finally it would free all associated Node.

Yes it would make sense if the async code happens at the start of the next frame. I guess there's not much to do about it then...

One workaround I see is to check for if is_queued_for_deletion(): after every await, but it's quite cumbersome. I do remember that in C# (it was for Unity but probably applicable to Godot Mono) someone suggested to add Cancellation Tokens everywhere, + a helper method to pass it to every waiting call, which made it more convenient (no need for an extra line).

Right now this issue only happens when I restart the game at a particular frame which would spawn an extra projectile (see my "Real-world case"), so I just waited an extra frame before clearing the projectile...

Or, I could add a more game-specific fix, such as setting the death flag immediately and checking it before any character action (for instance, the script managing Projectile spawn could refuse to spawn if owner is dead).

EDIT: Another workaround is to mix and match physics process and process since it will force the other call to wait for next frame... but it's not very intuitive and may cause non-physics issues mentioned above.

kleonc commented 1 month ago

The output printed within the MRP is kinda misleading as it doesn't show what's the physics frame when queue_free() is being called. Doing so shows that the node is indeed freed at the end of the same frame (or rather that it doesn't process during the next frame). Also the custom frame counters are misleading as well, because they are incremented somewhere in the middle of the engine's physics frame processing (within Node._physics_process method). Meaning they don't indicate the actual in-engine frame.

Changed scripts from the MRP

```gdscript # AsyncRun.gd extends Node2D var frames: int = 0 func run_async(): while true: print("[%d][%d] ASyncRun: start LOOP" % [Engine.get_physics_frames(), frames]) if is_queued_for_deletion(): print("[%d][%d] ... despite node being already queued for deletion!" % [Engine.get_physics_frames(), frames]) await get_tree().create_timer(0, false, true).timeout func _physics_process(_delta): if frames < 5: print("[%d][%d] AsyncRun: _physics_process" % [Engine.get_physics_frames(), frames]) frames += 1 ``` ```gdscript # main.gd extends Node2D @onready var async_run_node = $"../AsyncRun" var frames: int = 0 func _ready(): print("[Engine.get_physics_frames()][self.frames]") await get_tree().physics_frame async_run_node.run_async() await get_tree().physics_frame async_run_node.queue_free() print("[%d][%d] Main: called async_run_node.queue_free(), AsyncRun coroutine should stop by next frame" % [Engine.get_physics_frames(), frames]) func _physics_process(_delta): if frames < 5: print("[%d][%d] Main: _physics_process" % [Engine.get_physics_frames(), frames]) frames += 1 ```

Output (v4.2.2.stable.official [15073afe3]):

[Engine.get_physics_frames()][self.frames]
[0][0] ASyncRun: start LOOP
[0][0] AsyncRun: _physics_process
[0][0] Main: _physics_process
[0][1] ASyncRun: start LOOP
[1][1] Main: called async_run_node.queue_free(), AsyncRun coroutine should stop by next frame
[1][1] AsyncRun: _physics_process
[1][1] Main: _physics_process
[1][2] ASyncRun: start LOOP
[1][2] ... despite node being already queued for deletion!
[2][2] Main: _physics_process
[3][3] Main: _physics_process
[4][4] Main: _physics_process

Output (v4.3.rc2.official [3978628c6]):

[Engine.get_physics_frames()][self.frames]
[1][0] ASyncRun: start LOOP
[1][0] AsyncRun: _physics_process
[1][0] Main: _physics_process
[1][1] ASyncRun: start LOOP
[2][1] Main: called async_run_node.queue_free(), AsyncRun coroutine should stop by next frame
[2][1] AsyncRun: _physics_process
[2][1] Main: _physics_process
[2][2] ASyncRun: start LOOP
[2][2] ... despite node being already queued for deletion!
[3][2] Main: _physics_process
[4][3] Main: _physics_process
[5][4] Main: _physics_process

Engine.get_physics_frames() returning different values between the versions is caused by #94039.


Here's the relevant source: https://github.com/godotengine/godot/blob/4359c28feafb39dd0157744a52f6746d49a38c37/scene/main/scene_tree.cpp#L475-L504

Regarding MRP:


Given that:

I don't see any bug in here, seems like intended behavior. :thinking:

I'm not sure if/what docs would need to be improved (suggestions of course welcomed :slightly_smiling_face:).