NavigationAgent2D doesn't find shortest past with NavigationLink2D of cost zero

godotengine / godot

Godot Engine – Multi-platform 2D and 3D game engine

https://godotengine.org

MIT License

90.99k stars 21.16k forks source link

NavigationAgent2D doesn't find shortest past with NavigationLink2D of cost zero #84388

Closed sandygk closed 11 months ago

sandygk commented 1 year ago

Godot version

Godot v4.1.2.stable

System information

Godot v4.1.2.stable - Windows 10.0.22621 - Vulkan (Forward+) - dedicated NVIDIA GeForce RTX 3060 Laptop GPU (NVIDIA; 31.0.15.3699) - AMD Ryzen 9 6900HS with Radeon Graphics (16 Threads)

Issue description

I have a simple scene with a NaviationAgent2D and a NavigationLink2D with travel cost 0 to simulate a teleportation portal, and a Target. The path the agent finds is a straight line to the target, which is suboptimal, since going through the link/portal is shorter because it has a travel cost of 0.

Steps to reproduce

Run the minimal reproduction project and notice the path that it finds is not the correct one:

Minimal reproduction project

nav-agent-2d-bug.zip

nav-agent-2d-bug-different-polygons.zip

smix8 commented 1 year ago

The path the agent finds is a straight line to the target, which is suboptimal

I get why you see it that way but from the perspective of the involved pathfinding this result is perfection. It found that both positions are on the same polygon so moved in a straight line to the target position.

That setup will not work with pure A* pathfinding. You would need to query a link map first. It would not even work with a different custom pathfinding algorithm, e.g. Dijkstra, because both start and target polygon are the same, so the result is always a straight path line without a link map that gets queried first.

A pathfinding uses a greedy heuristic with euclidean distance towards the target position, the heuristic part is core why A is so efficient but it works against it to find alternative and objectively "better" path options that do not follow the direction to the target position. Since the path to the target is unobstructed and on the same polygon it finds a direct path corridor and returns that path. At that point the job is done as far as the A pathfinding is concerned and it does an early exit, it does not search any other polygons for "better paths". So from the perspective of A pathfinding this is a perfect optimal path.

Links with A will only be used reliably when they are the only connection between navigation mesh islands or the polygon surface cost of walking with polygons in the direction of the target is far higher than the link. A link that is left or right of the direction will only be used by chance with enough obstruction in the direction and a link that is the opposite direction of the target will only be found when all other polygon search options are exhausted. That is the nature of greedy A pathfinding.

To create the mentioned link map you would map all your link connections with precalculate path cost in a dictionary and query that first, then compare it to the actual "normal" path and see what is better and set your agent on that path instead. You can look at my old Godot 3 JumpLinks addon to see how it is basically done with GDScript. https://github.com/smix8/Godot_3D_Navigation_Jump_Links. You can see in the demo video https://www.youtube.com/watch?v=jijXbmYJYfM that the pink agent often switches directions suddenly to use links because when it updates the path it first looks up if the direct path or the cached path in the link map is shorter.

sandygk commented 1 year ago

I am not sure how the mesh is transformed into a graph to it pass to the pathfinding algorithm, the simplest way I can think of, is to pick points from the mesh to create the nodes (let's say at 1px distance for instance), and if 2 points are next to each other, then we create an edge to connect them. If this is the case, then the above statement is incorrect:

That setup will not work with pure A* pathfinding. You would need to query a link map first. It would not even work with a different custom pathfinding algorithm, e.g. Dijkstra, because both start and target polygon are the same, so the result is always a straight path line without a link map that gets queried first.

In the case of Dijkstra, is guaranteed to return the optimal path as long as the weights (or costs) are not negative, which the engine enforces with travel cost >= 0.

In the case of A* it is also guaranteed to return the shortest path as long as the heuristic is admissible, which I don't see why we would pick one that isn't.

Thanks for the help, I'll definitely take a look at your suggestion, but I am pretty sure this is a bug in the pathfinding algorithm that we should fix in the Engine

sandygk commented 1 year ago

@smix8, I looked at the code and I see what you mean, the code assumes that if both the start and end points are in the same convex polygon then the solution is the straight line, but this assumption is incorrect as soon as you add nav links, so the code need to be updated, otherwise it doesn't return the shortest path as illustrated in the example above.

But there is also another issue going on, that has nothing to do with being on the same polygon, see the following image:

The agent still ignores the link, even though the path it takes is way longer, in this case, the start and end point are not in the same polygon.

It only detects the link when the agent is really close to the link, like in the following image:

I added a minimal repo named nav-agent-2d-bug-different-polygons to illustrate this example

sandygk commented 1 year ago

@smix8, I just looked at the code and I understand what you mean by:

That setup will not work with pure A* pathfinding.

Even though theoretically A* returns the shortest path when the heuristic function is admissible, we are using the Euclidean distance as a heuristic in our implementation, and this doesn't really work with links since links pretty much make the Geometry not Euclidean, and the heuristic is no longer admissible. Admissible just means heuristically estimated distance ≤ the actual distance, which is obviously false with links of cost 0/

But I am still pretty sure it should work with Dijkstra, I don't see why it wouldn't.

I would investigate to see if we can use any other heuristic that is still admissible for a geometry with links.

If the heuristic function ( h ) in A always returns 0, then the formula for ( f(n) ) which is ( g(n) + h(n) ) simplifies to just ( g(n) ), which is the actual cost from the start node to node ( n ). This makes A behave exactly like Dijkstra's algorithm, exploring nodes based solely on the known cost from the start node without any guidance towards the goal.

So I'll make that change to turn A* into Dijkstra in my local branch and check if it works with that change. Just to make sure.

sandygk commented 1 year ago

Yup, I modified the A* to turn it into Dijkstra in my local version of Godot, and that basically solves the issue, obviously it takes longer to compute, but for a small case like this is insignificant:

The modifications to turn A* into Dijkstra are very simple, for the most part is just removing this line:

https://github.com/godotengine/godot/blob/5ee983188de97ae027f9b9c1443438063f708a7e/modules/navigation/nav_map.cpp#L364

That would basically set the heuristic function to 0

We also need to fix the corner case where the start and end are in the same polygon, but that seems relatively simple.

What I would suggest is to add Dijkstra as an alternative Pathfinding Algorithm since A* doesn't really work well with links, but Dijkstra does.

We even have the UI pretty much ready for that:

I can work on the PR to add that feature. But I probably need to create a feature request first, right?

smix8 commented 1 year ago

Dijkstra will fix your specific test project because there are only ~5 polygons in the direction of the link and 10+ in the direction to the target position. It will not fix the issue in general.

It would still be a mostly circular exploration pattern of polygons, so when link and target position are placed at mostly equal polygon numbers the link gets overlooked because the pathfinding will do an early exist as soon as the target position polygon is found.

So what also needs to change is on top of the Dijkstra pathfinding option the early exit also needs to be made optional / with a polygon search cap. E.g. -1 is no cap, the search will go until it runs out of polygons, and any >0 positive number will be the number of additional polygons searched after a path search finds the polygon with the target position.

If you want to work and test around with this be my guest. In general it is not that many of those issues are not known but more that they have been rescheduled to a later point to be paired with planned reworks and performance improvements as battles have to be picked carefully around a navigation team the size of 0.5.

Since Godot 4 is release one constrain is that changes that worsen the pathfinding for everyone to fix specific niche issues are only acceptable with general performance gains for a net positive. Especially the begin_poly == end_poly branch is very hot, as user projects love to spam path searches while staying on the same polygon forever.

sandygk commented 1 year ago

It would still be a mostly circular exploration pattern of polygons, so when link and target position are placed at mostly equal polygon numbers the link gets overlooked because the pathfinding will do an early exist as soon as the target position polygon is found.

I see what you mean and it makes sense. But it's still not 100% correct, look at the following example using the above-mentioned Dijkstra implementation, it uses the link even though there are many more polygons in that path. Maybe you were thinking it will behave like BFS, but keep in mind Dijkstra still takes the cost of traveling to a polygon into consideration.

But your point is still valid, as soon as Dijkstra finds the end polygon, it finishes and that is incorrect (for a simple fix for that, check my next comment.)

If you want to work and test around with this be my guest. In general it is not that many of those issues are not known but more that they have been rescheduled to a later point to be paired with planned reworks and performance improvements as battles have to be picked carefully around a navigation team the size of 0.5.

I'll give it a try, I'll post here if I make progress.

Since Godot 4 is release one constrain is that changes that worsen the pathfinding for everyone to fix specific niche issues are only acceptable with general performance gains for a net positive. Especially the begin_poly == end_poly branch is very hot, as user projects love to spam path searches while staying on the same polygon forever.

That makes sense, but considering this will be a different pathfinding algorithm and the user has to opt in, it won't really affect any user that doesn't select it

sandygk commented 1 year ago

Dijkstra will fix your specific test project because there are only ~5 polygons in the direction of the link and 10+ in the direction to the target position. It will not fix the issue in general.

I think Dijkstra still works, it's just that we need to implement it correctly and be clever about it.

Here is an idea: What if we create polygons for the start and end points. Basically the polygons will have an area of 0, and they will be the same point repeated 3 times, similar to the hack we use for turning links into polygons. I am pretty sure this will take care of all the issues we mentioned and seems pretty straightforward to implement.

It takes care of the edge case where both points are in the same Polygon, since now that is never going to happen if we use these tiny polygons as the end and start polygons. It also guarantees that if the algorithm reaches the end polygon, it actually reaches the end point

Thoughts?

sandygk commented 1 year ago

Another approach could be to modify the A* heuristic to allow for links.

Specifically, the issue arises with reachable Non-Euclidean links that enable faster travel than what's possible in Euclidean geometry, making the Euclidean distance heuristic inadmissible. During link polygon creation, we can identify and mark Non-Euclidean links. When listing reachable polygons, we then check for reachable Non-Euclidean links and only then apply the modifications below.

Utilize the start and end dummy polygons as previously mentioned.
Adjust the heuristic to account for Non-Euclidean links, reformulated as:

$\ h(n) = \min(d(n,\text{goal}), \min_{l \in \text{Non-Euclidean links}} (d(n,l) + \text{entry cost}(l) + d(l,\text{goal})) ) \$

Here, $d(n,\text{{goal}})$ and $d(n,l)$ are computed using Euclidean distance. However, $d(l,\text{{goal}})$ must consider potential traversal through other Non-Euclidean links, requiring precomputation.

Precompute shortest distances between each Non-Euclidean link and the goal using Dijkstra's algorithm:
- Construct a graph where nodes represent Non-Euclidean link endpoints and the goal.
- Connect all link exits to all link entries, with the cost being the sum of the Euclidean distance plus the entry cost.
- Also connect all link exits to the goal, with the cost being the Euclidean distance from the exit to the goal.
- Connect all entries to their respective exits, with the cost being the product of the link length and its distance.
- Execute Dijkstra's algorithm from the goal to each entry, precomputing $d(n,\text{{goal}})$ for each reachable Non-Euclidean link.

Thoughts?

smix8 commented 1 year ago

The hack of the navigation links with their mini polygon is not a good example to follow, it only exists because at that time things were under Godot 4.0 feature freeze and release pressure. There was some informal agreement and merge condition to fix it after release for 4.1 but then the link contributor disappeared. The navigation edge connection calculation did that part better in that they did not add additional polygons (but they did other performance things worse).

Hacking with a corrupted zero surface polygon is not the way to go. The navigation map would throw such things rightfully back at users. The navigation mesh system is build around convex polygons with an expected surface area that can also be queried by users.

There was some discussion about adding waypoints to the NavigationServer to migrate the AStar classes to the server. This would also add a "point" alternative to polygons which could also be used for link positions but no work has started on this.

Pre-computing could be an additional option for static scenes but there also need to be a general solution that works for runtime changes. You need to consider that users will commonly do stuff like this in their release projects: linkoveruse This requires a hierarchical approach so chunks of the map can be computed in the background which is currently not an option with how the navigation map syncs its map state.

sandygk commented 1 year ago

The hack of the navigation links with their mini polygon is not a good example to follow, it only exists because at that time things were under Godot 4.0 feature freeze and release pressure. There was some informal agreement and merge condition to fix it after release for 4.1 but then the link contributor disappeared. The navigation edge connection calculation did that part better in that they did not add additional polygons (but they did other performance things worse).

I will take a look at the navigation edge connection calculation, and try to find the discussion where the informal agreement to fix it was discussed to see what is the alternative. To be honest I don't find the solution too bad, I agree it feels hacky, but A* is a graph algorithm, and we are working with polygons and links, so there has to be data transformation to be able to feed them to the algorithm. I feel like we are bound to do tricks like that just due to the nature of the problem

Hacking with a corrupted zero surface polygon is not the way to go. The navigation map would throw such things rightfully back at users. The navigation mesh system is build around convex polygons with an expected surface area that can also be queried by users.

Maybe I didn't explain myself properly, but the zero surface polygon around the end point will only happen inside the get_path function locally, as a mechanism to ensure that once we reach the last polygon we actually reached the goal and we can stop. We actually only need to do it for the goal, we don't even need it for the start point. The current implementation assumes that if we reach the goal's polygon we can go directly to the goal, this assumption is incorrect and a bug, as has been pointed out. Wrapping the end point into a 0 surface polygon is a simple way to achieve this, and it has no repercussion in the system as a hole because this polygon won't be added to the navigation map, it will be just locally store in the get_path function and cleared out once the function returns.

So what also needs to change is on top of the Dijkstra pathfinding option the early exit also needs to be made optional / with a polygon search cap. E.g. -1 is no cap, the search will go until it runs out of polygons, and any >0 positive number will be the number of additional polygons searched after a path search finds the polygon with the target position.

The solution of keep going until the queue is empty is extremely inefficient (unless I am missing something) since we will be then iterating over all the paths from start to goal, this has an exponential time complexity. This is even worse than Dijkstra, and the whole point of having a heuristic is to avoid looking at certain paths, but with this solution we will look at them anyway.

There was some discussion about adding waypoints to the NavigationServer to migrate the AStar classes to the server. This would also add a "point" alternative to polygons which could also be used for link positions but no work has started on this.

If you can point me out to these discussions, I will take a look. Honestly, I don't even know what you mean by waypoints and how can we turn polygons into waypoints in a manner that is more or equally efficient that what we have. Also, it seems to me that's a complete rewriting of the feature, which is not what this bug is about, this is about fixing the problems with the navigation system not returning the shortest paths, I rather fix the system that we have that rewrite it in the hopes that the new rewrite will fix the issues that is has. But I'll be glad to take a look at that if you can point me in the right direction.

Pre-computing could be an additional option for static scenes but there also need to be a general solution that works for runtime changes. You need to consider that users will commonly do stuff like this in their release projects:

Not sure what you mean by that, if you are talking about when I said:

Precompute shortest distances between each Non-Euclidean link and the goal using Dijkstra's algorithm

Keep in mind these pre-computation will also happen inside the get_path function, so it will still be dynamic, it has to since both the start and goal points are dynamic. It's precomputed, so we can compute it once and use it every time we use the heuristic

sandygk commented 1 year ago

I just realized that my way of thinking about Euclidean vs Non-Euclidean is not the right framework. Euclidean distance assumes constant travel cost, the problem here is not with links being Non-Euclidean, the problem appears when we have different travel costs involved.

Here is an example:

Let's say the red area is mud and really slow to walk (high travel cost) and the green area is grass and fast to walk (low travel cost), and the purple lines are links to connect both areas.

Our current heuristic won't find the proper path which is walking through the grass, but not because the link is not Euclidean, the link could have the same travel cost as the mud, and it will still be better to use the grass. So the suggestion I made in this comment won't really fix the issue for the general case.

Here is the line where we add the cost of the heuristic

cost += (np->entry.distance_to(end_point) * np->poly->owner->get_travel_cost());

The problem here is with this part: np->poly->owner->get_travel_cost()) we are using the travel cost of the current polygon, but that is incorrect, as show in the image above, there could be a better travel cost. This turns the heuristic into not admissible and therefore doesn't guarantee actual shortest path, but just a path.

There is a simple and straightforward way to fix this, we can use the minimum reachable travel cost. That is, from all the regions that we can reach, we pick the lowest travel cost. This is straightforward to fix and works for all scenarios guaranteeing returning the shortest path.

sandygk commented 1 year ago

So to summarize, there are the two issues I found:

The heuristic used is non-admissible: This is a requirement to find the shortest path. A quick solution for that is, in this line: cost += (np->entry.distance_to(end_point) * np->poly->owner->get_travel_cost()); instead of using the travel cost of the current polygon, we should use the lowest travel of the reachable regions. For an extra bit of performance we can also add the cost of entering the goal's region if it's a different region than the current.
The stopping condition is incorrect: it assumes that if we reach the goal's polygon we can go directly to the goal, but this also yields suboptimal paths. For this the solution is to wrap the goal into its own zero surface polygon. That way if this is polygon reached, we can guarantee that the shortest path was found.

I'll see if I can put together a draft PR with the proposed solutions, so we can discuss in the PR in more concrete terms

sandygk commented 1 year ago

I'm working on the PR, but I am a bit stuck at the moment. The current implementation assumes we should not visit the same polygon multiple times, but this assumption is incorrect if we want to find the shortest path. Sometimes the shortest path may re-enter the same polygon. I am trying to find a way to allow for this without having a huge impact in the performance.

I am researching techniques to achieve this, or at least some approximation, I'll investigate what other engines like Unity do, if possible

sandygk commented 1 year ago

@smix8, so I have the changes working on a POC branch in my local, and they fix the above-mentioned issues, during the implementation I found a few issues with the current implementation and what I would suggest doing is a small refactoring to address the following issues, but I wanted to run it by you first, if you have a chance:

So what I am proposing is:

Create the graph using the centroid of the polygons as nodes, instead of using the polygons themselves: This seems to be the most popular approach from the research I made, it seems like Unity does this according to their documentation. It would look like this: This approach simplifies the code since we will be dealing with actual nodes, that way we can add links as two nodes and represent the start and end points as nodes as well. Right now since the algorithm deals with polygons instead, we need to turn the links into polygons, and the same has to be done for the end point to address one of the issues mentioned on a previous comment, this solution is hacky and error-prone.
Refactor the way we represent connection across nodes: Currently the connections are stored inside the edges of the polygon, this is a bad model to represent the connections in the mesh as soon as you introduce links, a link endpoint usually is located inside a polygon, not on the edge, so storing the connection between the link endpoint and the polygon in one of the edges of the polygon doesn't really make much sense. Connections can still have a reference to its owner, whether it is a polygon, a link, or a point. It should also have the pathway start and end, which is used in the path optimization.
Refactor NavigationPoly to VisitedNode: This class just represents a node that has being visited by the algorithm. It may still store pretty much the same information, just that the poly field should really be owner and store either a polygon a link or a point
Fix the heuristic function to use the minimum reachable cost instead of the cost of the current polygon: This will fix the vast majority of the problems introduced when there are multiple travel costs. The minimum reachable cost could be precomputed on the sync function. This introduces a performance issue when the minimum reachable cost is 0 since it will basically turn A* into Dijkstra, for this I suggest we give a warning to the user that setting travel cost to 0 may affect performance or don't allow them to do it.
Add the start and end point to the graph to fix current stopping condition: Currently the search stops when it reaches the last polygon, this can return suboptimal results. This is particularly bad when both start and end are in the same polygon, adding the end point fixes that.
Add enter cost to the nodes before entering the queue: We currently add it after we get it out of the queue which affect the result, this is a small fix. Note, I am saying queue here in the theoretical sense, it really is a list.
Maybe split the algorithm into multiple functions for readability : This is just a suggestion, but currently we have a really long function to do the entire process, I think it could be split into several functions to improve readability. Stuff like: finding start and end polygon, the search, the path optimization, etc. can be moved to its own function without affecting performance and in my opinion improving readability.

I am happy to attempt to implement all this, but I wanted to run it by you first

QbieShay commented 12 months ago

Hey :wave:

First of all, thank you a lot for taking the time to look into this and even coding your own solution. It's awesome to see so much interest in improving the engine!

Regarding what you propose, I think there's a mismatch between the way we work and what you are proposing. As of now we are at 4.x stable, so compatibility breakage is out of the question unfortunately. By compatibility breakage, I don't only mean changing API, but also significantly change behaviours that users rely on.

Lastly, big refactors are hard to review and tackle, and it would be easier to approach it with smaller changes. While I understand that this is not ideal in the prospect of clean well designed code, it is the reality of maintaining a software that thousands of people use daily. And I know how painful it is to work this way, I've just done a massive refactor of particles and I had to make loads of compromises to keep things working with the previous implementation. So, I understand that what I'm saying here is not pleasant.

All of this to say that the best way to improve navigation would be to find ways to offer what you're proposing without changing existing behaviours and by making smaller changes. I hope this isn't discouraging for you, I want to reiterate that your energy to look into this is really really appreciated :pray:

Feel free also to join our developer chat, if you're not there already https://chat.godotengine.org/ I find it easier sometimes to exchange some messages rather than having very async conversation on github.

sandygk commented 11 months ago

Hi @QbieShay,

First, thank you for the encouragement and the kind words.

What you are saying makes a lot of sense. I'll do my best to add changes little by little. I am actually working on a first PR focussing mostly on naming consistency and removing unnecessary complexity while leaving the behavior intact. I'll I'll keep you posted, thanks for the help

sandygk commented 11 months ago

Hey @QbieShay , @smix8 , to iterate over the process of addressing the above-mentioned issues, I created the following PR to Enhancing Readability and Reducing Complexity in NavMap sync and get_path Methods. The PR only aims to reduce some of the code complexity and improve its readability while preserving its functionality. I found that to be a good first step. So I submit it here for your consideration, if you find the time. Thanks for all the help!

sandygk commented 11 months ago

I created the following issues to track the issues mentioned here individually, so we can close this one

https://github.com/godotengine/godot-proposals/issues/8496 https://github.com/godotengine/godot-proposals/issues/8497 https://github.com/godotengine/godot/issues/85238 https://github.com/godotengine/godot-proposals/issues/8499 https://github.com/godotengine/godot-proposals/issues/8501