[ROTKR-69] A combat neural network directs an enemy to attack the player

hatfield-c commented 3 years ago

A neural network should be created which directs enemy ships to fight the player.

The network should prioritize:

Shooting inside a target box
Avoiding obstaces

hatfield-c commented 3 years ago

Nice.

hatfield-c commented 3 years ago

The agent should receive a penalty if it gets too close to the player ship during an attack run. In general, the agent should be incentivized to maintain a minimum distance between it and the player.

hatfield-c commented 3 years ago

Required inputs for combat:

Position of the player (relative to agent)
Distance from player (magnitude of position of player)
Direction of player movement (velocity)
Speed of player movement
Direction of player's forward vector
Angular velocity of player
Direction of agent's movement
Speed of agent (magnitude of direction)
Angular velocity of agent
Angle between agent's front and player
Angle between agent's side and player
Ray Sensors from cannons
Ray Sensors for obstacles

Required outputs for combat:

Acceleration value
Turn value
Cannon fire value

hatfield-c commented 3 years ago

Punishment Structure:

Very small punishment every frame
- Incentivize action.
- Should be set up so that a total of -1 is applied at the maximum episode length reset.
Small punishment only when outside a certain distance
- Incentivize the agent to get close to the player.
Moderate punishment when colliding with object
- Incentivize agent to avoid obstacles
Severe punishment when way too close to the player
- Incentivize agent to keep distance from player
- Must be greater than the reward for hitting the target, otherwise the enemy will just sit on top of the target and keep firing.

Reward Structure:

Large reward when cannon ball strikes inside of target sphere
- Incentivize agent to hit rats

hatfield-c commented 3 years ago

Training environment structure:

Create simple, sample Wye geometry. Has a flat ceiling, four walls, and some variations based on the angles of walls.
- One wall at 45 degree angle.
- Two walls at 45 degree angle
- Three walls at 45 degree angle
- All walls at 45 degree angle
The simple geometry will sink with time, to mimic a real environment.
The Collection Chamber prefabs will be used to implement obstacles
With the Collection Chamber prefabs, we can use the Collection Chamber spawn points to choose were to spawn things
We create a "target" prefab, which has 3 spheres on it, and each sphere has a collider that is a trigger. When an agent's cannon ball strikes the sphere, it receives a reward.
The target prefab chooses a random position, and moves toward it.
When the agent gets a hit, the episode is ended, and the environment reset.

hatfield-c commented 3 years ago

There should be two different Agent scripts as well. One used for training, which collects a lot of data, and one used for Inference, which is optimized.

hatfield-c commented 3 years ago

The enemy ships perceive much of their environment through raycasts, which work fine for perceiving terrain. There are also raycasts which come out of the cannons of agent ships, which attempt to perceive the rats on the deck of the ship. However, these raycasts can be blocked by the hunks of the ship as well, and thus the raycasts will need to factor the hunks into its decisions. We can thus introduce a small reward for destroying hunks, but keep the largest reward for destroying rats.

Also, it might be worth it to have the agent perceive the given height of the water directly over its center, as well as some additional points in a radius around the agent.

hatfield-c commented 3 years ago

The reward and punishment values will be stored in a class, and defined as percentages. These values will be accessible by any other script (such as through static variables), and will then be referenced throughout the code to apply rewards/punishments. The percentage definition will help ensure that total reward values never exceed past -1 or 1.

The current percentages to be implemented will be:

Punishment

Frame: 10%
Too Far: 15%
Terrain Collision: 20%
Too Close: 25%
Player Collision: 30%

Reward

Hit Hunk: 5%
Break Hunk: 10%
Hit Rat: 20%
Break Rat: 65%

hatfield-c commented 3 years ago

Note: if we have some issues getting the ship to line itself up, it might be worth it to introduce a very small reward given to the ship simply for lining itself up for a good shot (i.e. if the angle between the ship and player is zero within a certain margin).

So long as the reward is kept small, and rewards are normalized such that their total is is within the range [-1, 1], this should simply incentivize the ship to engage in good behavior while also setting itself up for success for better behavior later.

A potential problem might arise where this causes the ship neglect punishments so long as it maintains an angle of zero, but this should be mitigated so long as we keep this reward small.

hatfield-c commented 3 years ago

The reward structure will need to be reworked, due to an issue where the ship is punished more for surviving a long period of time, then experiencing a large punishment that ends the episode. This occurs due to the fact that the frame punishment slowly accumulates and the episode-end punishment being static, such that the total punishment that occurs at time t = 1 will be less than the total punishment that occurs at time t = 2.

Furthermore, some of the punishments that can occur are mutual exclusive i.e. the terrain collide and the player collide punishments. Thus, since they cannot ever be added together, their values are not dependent on each other.

The second point is easy to resolve - we can now set these mutually exclusive values to whatever we please without concerning what the value of the other exclusives are.

The first point however, requires that we make the non-time dependent episode-end punishments become time dependent through some sort annealing schedule. While numerous possible annealing schedules exist, for now we can simply use a linear decrease with time. Thus, there will be a minimum value and a maximum value for these punishments, and when the episode is ended and these punishments applied the time elapsed will determine a Lerped punishment value between these minimum/maximum values.

To further ensure that the first point does not occur even under this new paradigm, we must introduce a constrain on the episode-end punishments and the per-frame punishments. The constraint is as follows:

The maximum/starting value of an episode-end punishment must be strictly larger than the sum of all total per-frame punishments

That is to say, if we were to get the value of a per-frame punishment summed across all frame, and we add all these total sums of the different per-frame punishments together, then this final sum must be strictly less than any episode-end punishment.

It thus makes more sense to define the per-frame punishments in whatever way we see fit, and then define the episode-end punishments based off of them.

hatfield-c commented 3 years ago

The new percentages to be implemented will be:

Punishment -Frame: 5% -Too Far: 5% -Too Close: 15% -Terrain Collision: 50% -Player Collision: 75%

Reward -Hit Hunk: 5% -Break Hunk: 10% -Hit Rat: 20% -Break Rat: 65%

hatfield-c commented 3 years ago

The environment seems ready to go. To help facilitate learning at a greater speed, we will be using the curriculum methodology, wherein we give the agent simpler problems to solve at first, and then make the problems more difficult as it masters each one.

While there is support for automated curricula in the Unity ML Agents package, for the time being we will be implementing the curriculum changes manually. Thus, the curriculum will go through the following steps:

Start
- Target does not move or rotate
- No terrain
- Wye geometry lowers itself into water
Target can rotate itself
Target can move slowly, at constant speed
Target can can move at constant speed
Target's speed will be randomized
Terrain now spawns
Water now has waves

Once each step of the curriculum is complete, the resulting neural network will be properly archived and committed to the github repository. This neural network will then be used as the starting point for next step, and so the process continues.

hatfield-c commented 3 years ago

To help speed the learning process along, we can use imitation learning so that the Agent has a better idea of how to shoot the player.

hatfield-c commented 3 years ago

We've had some luck with training, but the convergence rate is still not quite what we'd like it to be. Thus, there will be modifications to how we do rewards, observations, as well as actions.

hatfield-c commented 3 years ago

Actions will now be discrete, rather than continuous:

action[0] (forward : 5):
- 0 : Do nothing
- 1 : Forward fast
- 2 : Forward slow
- 3 : Backward fast
- 4 : Backward slow
action[1] (turn : 5):
- 0 : Do nothing
- 1 : Turn right fast
- 2 : Turn right slow
- 3 : Turn left fast
- 4 : Turn left slow
action[2] (shoot : 2):
- 0 : Do nothing
- 1 : Activate cannons

hatfield-c commented 3 years ago

Observations will be significantly reworked, reduced, and simplified. This will help reduce the amount of statistical space that the agent must explore and make decisions across, while also allowing the agent to determine relationships between states and rewards.

agent.transform.forward DOT directionToPlayer
agent.transform.right DOT directionToPlayer
agent.transform.left DOT directionToPlayer
agent.transform.forward DOT agent.velocity.normalized
agent.velocity.normalized DOT directionToPlayer
agentVelocity.normalized (whole Vector3 given)
agentSpeed
playerSpeed
distance to player
agent.angularVelocity.y

hatfield-c commented 3 years ago

Every-frame rewards will be reworked. The distance punishments and the inaction punishments will be made mutually exclusive, such that every frame only one of the following will be applied:

Punishment if too close too the player
Punishment if too far from the player
If neither are true, then an inaction punishment.

For cumulative punishment values, we can take the largest of these.

We will also be adding a punishment for the angular velocity as well, to help encourage the agent not to spin itself around over and over again.

hatfield-c commented 3 years ago

The curriculumn will be redefined:

Start
- Target does not move or rotate
- No terrain
- Wye geometry lowers itself into water
- Only one of the Training Wye types will be used at a time.
Target can rotate itself
Target can move slowly, at constant speed
Target can can move at constant speed
Target's speed will be randomized
Terrain now spawns
Water now has waves

hatfield-c commented 3 years ago

A stable neural network was successfully trained which controls the ship to move towards the player, stop, rotate, and then shoot the player pretty reliably. It took about 2 hours to train, and while it occasionally crashes into the player it works quite well.

However, to reach this point I cut out the visual sensors and the water height samplers. Upon reflection, we can probably do without the samplers, however the sensors are necessary so that the agent can successfully perceive obstacles within its environment, and potentially use them to help facilitate aiming.

But therein lies the problem. After the stable neural network was made, I attempted to train another network using the same observations, but with the visual sensors added back in. This time, the training process didn't converge at all, and the agent ended up sitting in place and spinning around. I believe that this is because the sensors which come default with the ML-Agents package are sending either extraneous or poorly formatted data, including the following possibilities:

On observation, the spherecast distance, hit/miss, tag, etc are all sent to separate input nodes of the neural network
If a tag is found that isn't in the "sensible" tag list, then some sort of flag such as "-1" is passed into one of the previously mentioned inputs.

I will need to dig into the ML-Agents source code to verify this. If my hunch turns out to be true however, then we can likely create our own very simple sensor and plug its observations in manually in the CollectObservations function. To help keep things simple and the input space small, we can just have each sensor return hit/miss data (1 or 0), and nothing else.

This paradigm would still allow us to perform tag discrimination, such that if a non-sensible tag is encountered, then a 0 is passed into the input.

However, I will do some more research before I head down this path.

hatfield-c commented 3 years ago

Essentially, disregard most of the previous comments, as the AI training processes has undergone numerous revisions.

The present iteration of the neural networks are in a workable state, but they are by no means perfect. We will absolutely need to retrain them in the future, as they sometimes have issues with avoiding the ceiling as the Wye sinks. Nevertheless, the results thus far have been promising.

While attempting to train a single network to perform a strafing behavior, I accidentally trained a few others as well which, while not the intended result, still display interesting and useful behaviors. The list of behaviors I have created are as follows:

strafe
- Slowly circles into the player, before lining up for a shot, and then driving past them in a straight line.
timid
- Fears the player, and will only occasionally come in for a fight - otherwise, it just circles around the player's effective range.
fastOrbit
- Quickly drives straight up to the player, and then orbits around the player at its fastest speed.
rammingSpeed
- Disregards all notions of safety of etiquette, and attempts to ram the player over and over again. May be useful for suicide bombers.

I have also finished the shooting script for the enemy ship as well. It is simple and deterministic, and doesn't do a great job of aiming. But it works, and is still quite effective at shooting hunks off the player ship.

As such, I am going to call this issue done, and pass it on to QA. To QA this issue:

Open up the master scene
Proceed through the levels
Verify that the enemy ships seek out and shoot at the player

hatfield-c / rotkr

[ROTKR-69] A combat neural network directs an enemy to attack the player #69