hatfield-c opened this issue 4 years ago
Nice.
The agent should receive a penalty if it gets too close to the player ship during an attack run. In general, the agent should be incentivized to maintain a minimum distance between it and the player.
Required inputs for combat:
Required outputs for combat:
Punishment Structure:
Reward Structure:
Training environment structure:
There should be two different Agent scripts as well. One used for training, which collects a lot of data, and one used for Inference, which is optimized.
The enemy ships perceive much of their environment through raycasts, which work fine for perceiving terrain. There are also raycasts which come out of the cannons of agent ships, which attempt to perceive the rats on the deck of the ship. However, these raycasts can be blocked by the hunks of the ship as well, and thus the agent will need to factor the hunks into its decisions. We can therefore introduce a small reward for destroying hunks, but keep the largest reward for destroying rats.
Also, it might be worth it to have the agent perceive the height of the water directly over its center, as well as at some additional points in a radius around the agent.
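To pin down what I mean, a rough sketch (inside the Agent subclass; `WaterSystem.GetWaterHeight()` is a made-up placeholder for however the water simulation exposes its height, and the radius and sample count are arbitrary):

```csharp
// Rough sketch only. WaterSystem.GetWaterHeight() is a placeholder, and the
// radius/sample count are arbitrary. Called from CollectObservations(VectorSensor).
void AddWaterHeightObservations(VectorSensor sensor)
{
    const int samples = 4;
    const float radius = 5f;
    Vector3 center = transform.position;

    // Water height directly at the agent's center, relative to the agent.
    sensor.AddObservation(WaterSystem.GetWaterHeight(center) - center.y);

    // A few extra samples on a circle around the agent.
    for (int i = 0; i < samples; i++)
    {
        float angle = i * Mathf.PI * 2f / samples;
        Vector3 point = center + new Vector3(Mathf.Cos(angle), 0f, Mathf.Sin(angle)) * radius;
        sensor.AddObservation(WaterSystem.GetWaterHeight(point) - center.y);
    }
}
```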
The reward and punishment values will be stored in a class and defined as percentages. These values will be accessible by any other script (such as through static variables), and will then be referenced throughout the code to apply rewards/punishments. The percentage definition will help ensure that total reward values never fall outside the range [-1, 1].
The current percentages to be implemented will be:
Punishment
Reward
Note: if we have some issues getting the ship to line itself up, it might be worth introducing a very small reward given to the ship simply for lining itself up for a good shot (i.e. if the angle between the ship and the player is within a certain margin of zero).
So long as the reward is kept small, and rewards are normalized such that their total is within the range [-1, 1], this should simply incentivize the ship to engage in good behavior while also setting itself up for success for better behavior later.
A potential problem might arise where this causes the ship to neglect punishments so long as it maintains an angle of zero, but this should be mitigated so long as we keep this reward small.
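A minimal sketch of what that could look like, inside the Agent subclass (the `player` reference, `GetCannonForward()`, the margin, and the reward value are all placeholders):

```csharp
// Sketch of the alignment reward. The real reward value would come from the
// reward-percentage class and stay very small so it never dominates.
void RewardGoodAlignment()
{
    Vector3 toPlayer = (player.position - transform.position).normalized;
    float angle = Vector3.Angle(GetCannonForward(), toPlayer); // in degrees

    const float marginDegrees = 5f;
    const float alignmentReward = 0.0001f; // deliberately tiny per frame

    if (angle <= marginDegrees)
    {
        AddReward(alignmentReward);
    }
}
```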
The reward structure will need to be reworked, due to an issue where the ship receives more total punishment for surviving a long period of time and then suffering a large punishment that ends the episode than it would for ending the episode early. This occurs because the per-frame punishment slowly accumulates while the episode-end punishment is static, such that the total punishment for an episode ending at time t = 1 will be less than the total punishment for an episode ending at time t = 2.
Furthermore, some of the punishments that can occur are mutually exclusive, e.g. the terrain collision and the player collision punishments. Since they can never be added together, their values are not dependent on each other.
The second point is easy to resolve - we can now set these mutually exclusive values to whatever we please, without worrying about the values of the other exclusives.
The first point, however, requires that we make the non-time-dependent episode-end punishments time dependent through some sort of annealing schedule. While numerous possible annealing schedules exist, for now we can simply use a linear decrease with time. Thus, there will be a minimum value and a maximum value for these punishments, and when the episode ends and a punishment is applied, the elapsed time will determine a punishment value lerped between the minimum and maximum.
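A minimal sketch of that lerp, assuming the StepCount/MaxStep counters exposed by the ML-Agents Agent base class (the min/max arguments are placeholders that would live alongside the other reward values):

```csharp
// Sketch: linearly anneal an episode-end punishment based on elapsed episode time.
float GetEpisodeEndPunishment(float maxPunishment, float minPunishment)
{
    float t = MaxStep > 0 ? (float)StepCount / MaxStep : 1f; // 0 at episode start, 1 at the step cap
    return Mathf.Lerp(maxPunishment, minPunishment, t);      // decreases linearly with time
}

// e.g. on a terrain collision that ends the episode:
// AddReward(-GetEpisodeEndPunishment(maxTerrainPunishment, minTerrainPunishment));
// EndEpisode();
```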
To further ensure that the first point does not occur even under this new paradigm, we must introduce a constraint on the episode-end punishments and the per-frame punishments. The constraint is as follows:
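(Formalized roughly, with notation that's mine rather than anything in the code: $p_i(t)$ is the value of per-frame punishment $i$ applied at frame $t$, $T$ is the maximum episode length, and $P_j$ is any one of the episode-end punishments.)

$$\sum_{i} \sum_{t=1}^{T} p_i(t) \;<\; \min_j P_j$$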
That is to say, if we take each per-frame punishment summed across all frames, and then add all of these per-frame totals together, this final sum must be strictly less than any episode-end punishment.
It thus makes more sense to define the per-frame punishments in whatever way we see fit, and then define the episode-end punishments based off of them.
The new percentages to be implemented will be:
Punishment

- Frame: 5%
- Too Far: 5%
- Too Close: 15%
- Terrain Collision: 50%
- Player Collision: 75%

Reward

- Hit Hunk: 5%
- Break Hunk: 10%
- Hit Rat: 20%
- Break Rat: 65%
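As a sketch, the statically accessible class described earlier would end up looking something like this with these values plugged in (class and field names are mine; the episode-end entries would presumably be the maximum ends of the lerped ranges described above):

```csharp
// Sketch of the statically accessible reward/punishment class, using the
// percentages above as fractions of the [-1, 1] reward range.
public static class RewardConfig
{
    // Punishments (applied as negative rewards)
    public const float Frame            = 0.05f;
    public const float TooFar           = 0.05f;
    public const float TooClose         = 0.15f;
    public const float TerrainCollision = 0.50f;
    public const float PlayerCollision  = 0.75f;

    // Rewards
    public const float HitHunk   = 0.05f;
    public const float BreakHunk = 0.10f;
    public const float HitRat    = 0.20f;
    public const float BreakRat  = 0.65f;
}

// Usage elsewhere, e.g. when a cannonball breaks a rat:
// AddReward(RewardConfig.BreakRat);
// ...or on a player collision that ends the episode:
// AddReward(-RewardConfig.PlayerCollision);
// EndEpisode();
```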
The environment seems ready to go. To help facilitate learning at a greater speed, we will be using the curriculum methodology, wherein we give the agent simpler problems to solve at first, and then make the problems more difficult as it masters each one.
While there is support for automated curricula in the Unity ML Agents package, for the time being we will be implementing the curriculum changes manually. Thus, the curriculum will go through the following steps:
Once each step of the curriculum is complete, the resulting neural network will be properly archived and committed to the GitHub repository. This neural network will then be used as the starting point for the next step, and so the process continues.
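If we ever want to avoid doing the swap by hand, I believe more recent versions of the trainer can seed a new run from a previous one; something along these lines (run IDs and config path are made up):

```
mlagents-learn config/enemy_ship.yaml --run-id=curriculum_step_2 --initialize-from=curriculum_step_1
```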
To help speed the learning process along, we can use imitation learning so that the Agent has a better idea of how to shoot the player.
We've had some luck with training, but the convergence rate is still not quite what we'd like it to be. Thus, there will be modifications to how we handle rewards, observations, and actions.
Actions will now be discrete, rather than continuous:
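For reference, handling discrete branches in the agent would look roughly like the sketch below. The branch meanings are illustrative placeholders rather than the final action set, and this assumes the ActionBuffers API from recent ML-Agents releases.

```csharp
// Sketch: inside the Agent subclass. Each discrete branch is an integer choice.
public override void OnActionReceived(ActionBuffers actions)
{
    int steer  = actions.DiscreteActions[0]; // e.g. 0 = none, 1 = port, 2 = starboard
    int thrust = actions.DiscreteActions[1]; // e.g. 0 = hold, 1 = forward
    int fire   = actions.DiscreteActions[2]; // e.g. 0 = hold fire, 1 = fire cannons

    // ApplySteering/ApplyThrust/FireCannons are placeholders for however the ship is driven.
    ApplySteering(steer);
    ApplyThrust(thrust);
    if (fire == 1)
    {
        FireCannons();
    }
}
```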
Observations will be significantly reworked, reduced, and simplified. This will help reduce the amount of statistical space that the agent must explore and make decisions across, while also allowing the agent to determine relationships between states and rewards.
Every-frame rewards will be reworked. The distance punishments and the inaction punishments will be made mutually exclusive, such that every frame only one of the following will be applied:
For cumulative punishment values, we can take the largest of these.
We will also be adding a punishment for angular velocity, to help encourage the agent not to spin itself around over and over again.
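Putting those pieces together, the per-frame punishment pass could look something like this sketch (the thresholds, values, `player`, `shipBody`, and `IsIdle()` are placeholders, and applying only the largest triggered punishment is my reading of the mutual-exclusion rule):

```csharp
// Sketch: only one of the mutually exclusive per-frame punishments is applied
// (the largest one that triggered), plus a small angular velocity punishment.
void ApplyFramePunishments()
{
    float distance = Vector3.Distance(transform.position, player.position);

    float distancePunishment = 0f;
    if (distance < tooCloseDistance)
    {
        distancePunishment = tooClosePunishment;
    }
    else if (distance > tooFarDistance)
    {
        distancePunishment = tooFarPunishment;
    }

    float inaction = IsIdle() ? inactionPunishment : 0f; // IsIdle() is a placeholder

    // Mutually exclusive: apply only the largest punishment that triggered this frame.
    AddReward(-Mathf.Max(distancePunishment, inaction));

    // Discourage spinning in place over and over.
    AddReward(-angularPunishmentScale * shipBody.angularVelocity.magnitude);
}
```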
The curriculum will be redefined:
A stable neural network was successfully trained which controls the ship to move towards the player, stop, rotate, and then shoot the player pretty reliably. It took about 2 hours to train, and while it occasionally crashes into the player it works quite well.
However, to reach this point I cut out the visual sensors and the water height samplers. Upon reflection, we can probably do without the samplers; however, the sensors are necessary so that the agent can successfully perceive obstacles within its environment, and potentially use them to help facilitate aiming.
But therein lies the problem. After the stable neural network was made, I attempted to train another network using the same observations, but with the visual sensors added back in. This time, the training process didn't converge at all, and the agent ended up sitting in place and spinning around. I believe that this is because the sensors which come by default with the ML-Agents package are sending either extraneous or poorly formatted data, including the following possibilities:
I will need to dig into the ML-Agents source code to verify this. If my hunch turns out to be true, however, then we can likely create our own very simple sensor and plug its observations in manually in the CollectObservations function. To help keep things simple and the input space small, we can just have each sensor return hit/miss data (1 or 0), and nothing else.
This paradigm would still allow us to perform tag discrimination, such that if an irrelevant tag is encountered, a 0 is passed into the input.
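If it comes to that, the hand-rolled sensor could be as simple as the sketch below, called from `CollectObservations` (the ray count, spread, range, and tag names are placeholders):

```csharp
// Sketch: a very simple replacement for the packaged ray sensor. Each ray adds a
// single hit/miss observation (1 or 0), and irrelevant tags are treated as misses.
void AddSimpleRayObservations(VectorSensor sensor)
{
    const int rayCount = 8;
    const float range = 100f;
    string[] sensibleTags = { "Terrain", "Player", "Hunk" };

    for (int i = 0; i < rayCount; i++)
    {
        float angle = i * 360f / rayCount;
        Vector3 direction = Quaternion.Euler(0f, angle, 0f) * transform.forward;

        float hitValue = 0f;
        if (Physics.Raycast(transform.position, direction, out RaycastHit hit, range))
        {
            // Tag discrimination: anything outside the list still reads as a miss.
            if (System.Array.IndexOf(sensibleTags, hit.collider.tag) >= 0)
            {
                hitValue = 1f;
            }
        }

        sensor.AddObservation(hitValue);
    }
}
```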
However, I will do some more research before I head down this path.
Essentially, disregard most of the previous comments, as the AI training process has undergone numerous revisions.
The present neural networks are in a workable state, but they are by no means perfect. We will absolutely need to retrain them in the future, as they sometimes have issues avoiding the ceiling as the Wye sinks. Nevertheless, the results thus far have been promising.
While attempting to train a single network to perform a strafing behavior, I accidentally trained a few others as well which, while not the intended result, still display interesting and useful behaviors. The list of behaviors I have created is as follows:
I have also finished the shooting script for the enemy ship. It is simple and deterministic, and doesn't do a great job of aiming, but it works, and is still quite effective at shooting hunks off the player ship.
As such, I am going to call this issue done, and pass it on to QA. To QA this issue:
A neural network should be created which directs enemy ships to fight the player.
The network should prioritize: