Cheng Xue*, Vimukthini Pinto*, Chathura Gamage*
Ekaterina Nikonova, Peng Zhang, Jochen Renz
School of Computing
The Australian National University
Canberra, Australia
{cheng.xue, vimukthini.inguruwattage, chathura.gamage}@anu.edu.au
{ekaterina.nikonova, p.zhang, jochen.renz}@anu.edu.au
Humans are well-versed in reasoning about the behaviors of physical objects and choosing actions accordingly to accomplish tasks, while it remains a major challenge for AI. To facilitate research addressing this problem, we propose a new testbed that requires an agent to reason about physical scenarios and take an action appropriately. Inspired by the physical knowledge acquired in infancy and the capabilities required for robots to operate in real-world environments, we identify 15 essential physical scenarios. We create a wide variety of distinct task templates, and we ensure all the task templates within the same scenario can be solved by using one specific strategic physical rule. By having such a design, we evaluate two distinct levels of generalization, namely the local generalization and the broad generalization. We conduct an extensive evaluation with human players, learning agents with varying input types and architectures, and heuristic agents with different strategies. Inspired by how human IQ is calculated, we define the physical reasoning quotient (Phy-Q score) that reflects the physical reasoning intelligence of an agent using the physical scenarios we considered. Our evaluation shows that 1) all agents are far below human performance, and 2) learning agents, even with good local generalization ability, struggle to learn the underlying physical reasoning rules and fail to generalize broadly. We encourage the development of intelligent agents that can reach the human level Phy-Q score.
* equal contribution
The research paper can be found here: https://www.nature.com/articles/s42256-022-00583-4
We consider 15 physical scenarios in the Phy-Q benchmark. First, we consider the basic physical scenarios associated with applying forces directly on the target objects, i.e., the effect of a single force and the effect of multiple forces. On top of simple force application, we also include the scenarios associated with more complex motion including rolling, falling, sliding, and bouncing, which are inspired by the physical reasoning capabilities developed in human infancy. Furthermore, we define the objects' relative weight, the relative height, the relative width, the shape differences, and the stability scenarios, which require physical reasoning abilities infants typically acquire at a later stage. In addition, we incorporate clearing paths, adequate timing, manoeuvring capabilities, and taking non-greedy actions, which are required to overcome challenges for robots to work safely and efficiently in physical environments. To sum up, the physical scenarios we consider and the corresponding physical rules that can be used to achieve the goal of the associated tasks are:
Based on the above physical scenarios, we develop the Phy-Q benchmark in Angry Birds. Phy-Q contains tasks from 75 task templates belonging to the fifteen scenarios. The goal of an agent is to destroy all the pigs (green-coloured objects) in the tasks by shooting a given number of birds from the slingshot. Shown below are fifteen example tasks in Phy-Q representing the fifteen scenarios and the solutions for those tasks.
Task | Description |
---|---|
1. Single force: A single force needs to be applied to the pig to destroy it with a direct bird shot. | |
2. Multiple forces: Multiple forces need to be applied to destroy the pig with multiple bird shots. | |
3. Rolling: The circular object needs to be rolled onto the pig, which is unreachable for the bird from the slingshot, to destroy the pig. | |
4. Falling: The circular object needs to fall onto the pig to destroy it. | |
5. Sliding: The square object needs to slide into the pig, which is unreachable for the bird from the slingshot, to destroy the pig. | |
6. Bouncing: The bird needs to bounce off the platform (dark-brown object) to hit and destroy the pig. | |
7. Relative weight: The small circular block is lighter than the big circular block. Of the two blocks, only the small circular block can be rolled to reach and destroy the pig. | |
8. Relative height: The square block on top of the taller rectangular block will not fall through the gap due to the height of the rectangular block. Hence the square block on top of the shorter rectangular block needs to be toppled to fall through the gap and destroy the pig. | |
9. Relative width: The bird cannot go through the lower entrance, which has a narrow opening. Hence the bird needs to be shot at the upper entrance to reach the pig and destroy it. | |
10. Shape difference: The circular block on two triangle blocks can be rolled down by breaking one triangle block, whereas the circular block on two square blocks cannot be rolled down by breaking one square block. Hence, the triangle block needs to be destroyed to make the circular block roll and fall onto the pig, destroying it. | |
11. Non-greedy actions: A greedy action tries to destroy the highest number of pigs in a single bird shot. If the two pigs resting on the circular block are destroyed, then the circular block will roll down and block the entrance to the pig below. Hence, the lower pig needs to be destroyed first and then the upper two pigs. | |
12. Structural analysis: The bird needs to be shot at the weak point of the structure to break its stability and destroy the pigs. Shooting elsewhere does not destroy the two pigs with a single bird shot. | |
13. Clearing paths: First, the rectangle block needs to be positioned correctly to open the path for the circular block to reach the pig. Then the circular block needs to be rolled to destroy the pig. | |
14. Adequate timing: First, the two circular objects need to be rolled to the ramp. Then, after the first circle passes the prop and before the second circle reaches the prop, the prop needs to be destroyed to make the second circle fall onto the lower pig. | |
15. Manoeuvring: The blue bird splits into three other birds when it is tapped in flight. The blue bird needs to be tapped at the correct position to manoeuvre the birds to reach the two separated pigs. |
Screenshots of the 75 task templates are shown below. x.y represents the yth template of the xth scenario. The indexes of the scenarios are: 1. single force, 2. multiple forces, 3. rolling, 4. falling, 5. sliding, 6. bouncing, 7. relative weight, 8. relative height, 9. relative width, 10. shape difference, 11. non-greedy actions, 12. structural analysis, 13. clearing paths, 14. adequate timing, and 15. manoeuvring:
1.1 | 1.2 | 1.3 |
1.4 | 1.5 | 2.1 |
2.2 | 2.3 | 2.4 |
2.5 | 3.1 | 3.2 |
3.3 | 3.4 | 3.5 |
3.6 | 4.1 | 4.2 |
4.3 | 4.4 | 4.5 |
5.1 | 5.2 | 5.3 |
5.4 | 5.5 | 6.1 |
6.2 | 6.3 | 6.4 |
6.5 | 6.6 | 7.1 |
7.2 | 7.3 | 7.4 |
7.5 | 8.1 | 8.2 |
8.3 | 8.4 | 9.1 |
9.2 | 9.3 | 9.4 |
10.1 | 10.2 | 10.3 |
10.4 | 11.1 | 11.2 |
11.3 | 11.4 | 11.5 |
12.1 | 12.2 | 12.3 |
12.4 | 12.5 | 12.6 |
13.1 | 13.2 | 13.3 |
13.4 | 13.5 | 14.1 |
14.2 | 15.1 | 15.2 |
15.3 | 15.4 | 15.5 |
15.6 | 15.7 | 15.8 |
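The x.y labels above can be enumerated programmatically. The following is a minimal sketch: the scenario names and per-scenario template counts are read off the text and labels above, while the names `SCENARIOS`, `TEMPLATES_PER_SCENARIO`, `template_labels`, and `scenario_name` are hypothetical helpers and not part of the repository.

```python
# Scenario names in the index order given in the text (1 to 15).
SCENARIOS = [
    "single force", "multiple forces", "rolling", "falling", "sliding",
    "bouncing", "relative weight", "relative height", "relative width",
    "shape difference", "non-greedy actions", "structural analysis",
    "clearing paths", "adequate timing", "manoeuvring",
]

# Number of templates in each scenario, counted from the x.y labels above.
TEMPLATES_PER_SCENARIO = [5, 5, 6, 5, 5, 6, 5, 4, 4, 4, 5, 6, 5, 2, 8]


def template_labels():
    """Return every x.y label (e.g. '1.1') in scenario order."""
    return [
        f"{x}.{y}"
        for x, count in enumerate(TEMPLATES_PER_SCENARIO, start=1)
        for y in range(1, count + 1)
    ]


def scenario_name(label):
    """Map an x.y template label to the name of scenario x."""
    scenario_index = int(label.split(".")[0])
    return SCENARIOS[scenario_index - 1]
```

For example, `scenario_name("14.2")` looks up the scenario of the second adequate-timing template.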
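We develop a Task Generator that can generate tasks for the task templates we designed for each scenario.
To run it, go to tasks/task_generator, copy the task templates you want to generate tasks from into input (level templates can be found in tasks/task_templates), and run
python generate_tasks.py <number of tasks to generate>
The generated tasks will be available in output.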
We generated 100 tasks from each of the 75 task templates for the baseline analysis. For convenience, we have grouped the 15 scenarios into 3 categories. The scenarios belonging to each category are: category 1 (1.1 single force and 1.2 multiple forces), category 2 (2.1 rolling, 2.2 falling, 2.3 sliding, and 2.4 bouncing), and category 3 (3.1 relative weight, 3.2 relative height, 3.3 relative width, 3.4 shape difference, 3.5 non-greedy actions, 3.6 structural analysis, 3.7 clearing paths, 3.8 adequate timing, and 3.9 manoeuvring). Here x.y represents the yth scenario of the xth category. The generated tasks can be found in tasks/generated_tasks.zip. After extracting this file, the generated tasks are located in the following folder structure:
generated_tasks/
  -- index of the category/
    -- index of the scenario/
      -- index of the template/
        -- task files named as categoryIndex_scenarioIndex_templateIndex_taskIndex.xml
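The file naming scheme above can be decoded with a few lines of Python. This is a sketch under stated assumptions: that all four indices are plain 1-based integers, and that the category-local scenario index maps onto the 1 to 15 scenario numbering in the order the categories are listed. The names `TaskId`, `parse_task_filename`, and `global_scenario_index` are hypothetical and do not come from the repository.

```python
import re
from typing import NamedTuple

# Scenarios per category, as listed in the text: 2, 4 and 9.
SCENARIOS_PER_CATEGORY = {1: 2, 2: 4, 3: 9}

_TASK_FILE = re.compile(r"^(\d+)_(\d+)_(\d+)_(\d+)\.xml$")


class TaskId(NamedTuple):
    category: int
    scenario: int  # index of the scenario within its category
    template: int
    task: int


def parse_task_filename(name: str) -> TaskId:
    """Split categoryIndex_scenarioIndex_templateIndex_taskIndex.xml."""
    match = _TASK_FILE.match(name)
    if match is None:
        raise ValueError(f"not a task file name: {name!r}")
    return TaskId(*(int(group) for group in match.groups()))


def global_scenario_index(category: int, scenario: int) -> int:
    """Convert a category-local scenario index to the 1..15 numbering."""
    count = SCENARIOS_PER_CATEGORY.get(category)
    if count is None or not 1 <= scenario <= count:
        raise ValueError(f"unknown scenario {category}.{scenario}")
    offset = sum(n for c, n in SCENARIOS_PER_CATEGORY.items() if c < category)
    return offset + scenario
```

Under these assumptions, scenario 3.8 (adequate timing) maps to scenario 14 in the 1 to 15 numbering.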
If you want to design your own task templates, you can use the interactive Task Template Designer tool we have provided, which is developed in Unity.
To design your own task template:
Open tasks/task_template_designer in Unity.
Select Level Editor -> Edit Level in the top menu of the Unity editor.
Save the template with the Save Level button in the Level Editor menu.
To generate tasks using your own task template, see the tasks/task_generator/utils/generate_variations.py script of the Task Generator.
Tested environments:
Before running agents, please:
Go to sciencebirdsgames and unzip Linux.zip
Go to sciencebirdslevels/generated_tasks and unzip fifth_generation.zip
Run Java heuristic agents: Datalab and Eagle Wings:
Go to Utils and in terminal run
python PrepareTestConfig.py --os [Linux/MacOS]
Go to sciencebirdsgames/Linux and in terminal run
java -jar game_playing_interface.jar
Go to sciencebirdsagents/HeuristicAgents/ and in terminal run Datalab
java -jar datalab_037_v4_java12.jar 1
or Eagle Wings
java -jar eaglewings_037_v3_java12.jar 1
Note that the integer 1 at the end controls the number of agents to run. You can set it to whichever integer value suits you best.
Run Random Agent and Pig Shooter:
Go to sciencebirdsagents/ and in terminal run Random Agent
./TestPythonHeuristicAgent.sh RandomAgent
or Pig Shooter
./TestPythonHeuristicAgent.sh PigShooter
Run the DQN and Deep Relational baselines:
For the symbolic agent, go to sciencebirdsagents/Utils, open Parameters.py and set agent to be DQNDiscreteAgent, network to be DQNSymbolicDuelingFC_v2 for DQN and DQNRelationalSymbolic for Deep Relational, and state_repr_type to be "symbolic".
For the image agent, go to sciencebirdsagents/Utils, open Parameters.py and set agent to be DQNDiscreteAgent, network to be DQNImageResNet for DQN and DQNRelationalImage for Deep Relational, and state_repr_type to be "image".
Go to sciencebirdsagents/
In terminal, after granting execution permission, train the agent for within-template training
./TrainLearningAgent.sh within_template
and for within-scenario training
./TrainLearningAgent.sh benchmark
Models will be saved to sciencebirdsagents/LearningAgents/saved_model
To test learning agents, go to the folder sciencebirdsagents and run, for within-template performance,
python TestAgentOfflineWithinTemplate.py
and for within-capability performance,
python TestAgentOfflineWithinCapability.py
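The four DQN-style configurations described above are easy to mix up, so a small lookup table can help double-check the values before editing Parameters.py. This is only a sketch: the agent, network, and state_repr_type values come from the text, but the dictionary layout and the `baseline_settings` helper are hypothetical, and how Parameters.py actually stores these values is not shown here.

```python
# Network name for each (baseline, state representation) pair, as listed above.
DQN_NETWORKS = {
    ("DQN", "symbolic"): "DQNSymbolicDuelingFC_v2",
    ("Deep Relational", "symbolic"): "DQNRelationalSymbolic",
    ("DQN", "image"): "DQNImageResNet",
    ("Deep Relational", "image"): "DQNRelationalImage",
}


def baseline_settings(baseline: str, state_repr_type: str) -> dict:
    """Return the three values to set for one DQN-style baseline."""
    try:
        network = DQN_NETWORKS[(baseline, state_repr_type)]
    except KeyError:
        raise ValueError(
            f"no baseline {baseline!r} with state type {state_repr_type!r}"
        ) from None
    return {
        "agent": "DQNDiscreteAgent",  # the same agent class for all four
        "network": network,
        "state_repr_type": state_repr_type,
    }
```

Calling `baseline_settings("Deep Relational", "image")` returns the three values to copy into the parameter file for the image-based relational baseline.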
Run the OpenAI Stable Baseline 3 agents:
For the symbolic agent, go to sciencebirdsagents/Utils, open Parameters.py and set agent to be "ppo" or "a2c" and state_repr_type to be "symbolic".
For the image agent, go to sciencebirdsagents/Utils, open Parameters.py and set agent to be "ppo" or "a2c" and state_repr_type to be "image".
Go to sciencebirdsagents/
In terminal, after granting execution permission, train the agent for within-template training
./TrainAndTestOpenAIStableBaselines.sh within_template
and for within-scenario training
./TrainAndTestOpenAIStableBaselines.sh benchmark
Models will be saved to sciencebirdsagents/OpenAIModelCheckpoints and the tensorboard log will be saved to OpenAIStableBaseline
We provide a gym-like environment. A simple demo can be found at demo.py:
from SBAgent import SBAgent
from SBEnvironment.SBEnvironmentWrapper import SBEnvironmentWrapper

# use the game score as the reward and run the game 50 times faster
env = SBEnvironmentWrapper(reward_type="score", speed=50)
level_list = [1, 2, 3]  # level list for the agent to play
dummy_agent = SBAgent(env=env, level_list=level_list)  # initialise agent
dummy_agent.state_representation_type = 'image'  # use the image representation as state
env.make(agent=dummy_agent, start_level=dummy_agent.level_list[0],
         state_representation_type=dummy_agent.state_representation_type)  # initialise the environment

s, r, is_done, info = env.reset()  # get ready for running
for level_idx in level_list:
    is_done = False
    while not is_done:
        # the agent always shoots at (-100, -100) relative to the slingshot
        s, r, is_done, info = env.step([-100, -100])

    env.current_level = level_idx + 1  # move to the next level once this one is finished
    if env.current_level > level_list[-1]:  # end the game when all levels in the level list are played
        break
    s, r, is_done, info = env.reload_current_level()  # go to the next level
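The demo always passes the same action to env.step. As a small variation, the fixed action could be replaced by a random one drawn from the ranges that the code outline gives for RandomAgent.py, namely x in (-100, -10) and y in (-100, 100). The sketch below is an assumption-laden illustration: `random_shot` is a hypothetical helper, and sampling integers (to match the integer action used in the demo) is a choice made here, not necessarily what RandomAgent.py does.

```python
import random


def random_shot(rng=None):
    """Sample an [x, y] action relative to the slingshot.

    Integers are drawn strictly inside the open intervals
    x in (-100, -10) and y in (-100, 100).
    """
    rng = rng or random
    x = rng.randint(-99, -11)  # randint is inclusive on both ends
    y = rng.randint(-99, 99)
    return [x, y]
```

In the demo loop, `env.step(random_shot())` would then take the place of `env.step([-100, -100])`.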
The ./sciencebirdsagents folder contains all the relevant source code of our agents. Below is the outline of the code (this is a simple description; detailed documentation is in progress):
- Client
  - agent_client.py: Includes all communication protocols.
- final_run: Place to store tensor board results.
- HeuristicAgents
  - datalab_037_v4_java12.jar: State-of-the-art java agent for Angry Birds.
  - eaglewings_037_v3_java12.jar: State-of-the-art java agent for Angry Birds.
  - PigShooter.py: Python agent that shoots at the pigs only.
  - RandomAgent.py: Random agent that chooses to shoot from $x \in (-100,-10)$ and $y \in (-100,100)$.
  - HeuristicAgentThread.py: A thread wrapper to run multiple instances of heuristic agents.
- LearningAgents
  - RLNetwork: Folder that includes all DQN structures that can be used as an input to DQNDiscreteAgent.py.
  - saved_model: Place to save trained models.
  - LearningAgent.py: Inherited from the SBAgent class, a base class to implement learning agents.
  - DQNDiscreteAgent.py: Inherited from LearningAgent, a DQN agent that has a discrete action space.
  - LearningAgentThread.py: A thread wrapper to run multiple instances of learning agents.
  - Memory.py: A script that includes different types of memories. Currently, we have normal memory, PrioritizedReplayMemory and PrioritizedReplayMemory with balanced samples.
- SBEnvironment
  - SBEnvironmentWrapper.py: A wrapper class to provide a gym-like environment.
  - SBEnvironmentWrapperOpenAI.py: A wrapper class to provide a gym-like environment for OpenAI Stable Baseline 3 agents.
  - Server.py: A wrapper class for the game server for the OpenAI Stable Baseline 3 agents.
- StateReader: Folder that contains files to convert the symbolic state representation to inputs to the agents.
- Utils
  - Config.py: Config class used to pass parameters to agents.
  - GenerateCapabilityName.py: Generates a list of capability names for agents to train on.
  - GenerateTemplateName.py: Generates a list of template names for agents to train on.
  - LevelSelection.py: Class that includes different strategies to select levels. For example, an agent may choose to go to the next level if it passes the current one, or only when it has played the current level a predefined number of times.
  - NDSparseMatrix.py: Class to store the converted symbolic representation in a sparse matrix to save memory.
  - Parameters.py: Training/testing parameters passed to the agent.
  - PrepareTestConfig.py: Script to generate the config file for the game console, used for testing agents only.
  - trajectory_planner.py: Calculates two possible trajectories given a directly reachable target point. It returns None if the target is not reachable by the bird.
- demo.py: A demo to showcase how to use the framework.
- SBAgent.py: Base class for all agents.
- MultiAgentTestOnly.py: Tests python heuristic agents by running multiple instances on one particular template.
- TestAgentOfflineWithinCapability.py: Uses the saved models in LearningAgents/saved_model to test the agent's within-capability performance on the test set.
- TestAgentOfflineWithinTemplate.py: Uses the saved models in LearningAgents/saved_model to test the agent's within-template performance on the test set.
- TrainLearningAgent.py: Script to train DQN baseline agents on a particular template with a defined mode.
- TestPythonHeuristicAgent.sh: Bash script to test a heuristic agent's performance on all templates.
- TrainLearningAgent.sh: Bash script to train DQN baseline agents to test both local and broad generalization.
- OpenAI_StableBaseline_Train.py: Python script to run OpenAI Stable Baseline 3 agents on a particular template with a defined mode.
- TrainAndTestOpenAIStableBaselines.sh: Bash script to run OpenAI Stable Baseline 3 agents to test both local and broad generalization.

Symbolic Representation data of game objects is stored in a JSON object. The JSON object describes an array where each element describes a game object. Game object categories and their properties are described below:
Ground: the lowest unbreakable flat support surface
Platform: Unbreakable obstacles
Trajectory: the dots that represent the trajectories of the birds
Slingshot: Unbreakable slingshot for shooting the bird
Red Bird:
all objects below have the same representation as red bird
Blue Bird:
Yellow Bird:
White Bird:
Black Bird:
Small Pig:
Medium Pig:
Big Pig:
TNT: an explosive block
Wood Block: Breakable wooden blocks
Ice Block: Breakable ice blocks
Stone Block: Breakable stone blocks
Round objects are also represented as polygons with a list of vertices
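Since every object, round ones included, is described by a list of vertices inside a JSON array, a consumer typically filters the array by category and derives geometry from the vertices. The sketch below illustrates that idea only. The per-object properties are not listed here, so the keys "type" and "vertices" and the sample payload are hypothetical placeholders and will need to be replaced by the real property names.

```python
import json


def objects_of_type(symbolic_json: str, wanted: str) -> list:
    """Return the array elements whose (hypothetical) "type" key matches."""
    objects = json.loads(symbolic_json)
    return [obj for obj in objects if obj.get("type") == wanted]


def bounding_box(vertices):
    """Axis-aligned bounding box (min_x, min_y, max_x, max_y) of a polygon."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical payload: key names and values are made up for illustration.
example = json.dumps([
    {"type": "Small Pig", "vertices": [[10, 20], [14, 20], [14, 24], [10, 24]]},
    {"type": "Wood Block", "vertices": [[0, 0], [8, 0], [8, 2], [0, 2]]},
])

pigs = objects_of_type(example, "Small Pig")
pig_box = bounding_box(pigs[0]["vertices"])
```

A bounding box is a coarse summary; agents that need exact shapes would work on the vertex list directly.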
Symbolic Representation with noise
Message ID | Request | Format (byte[ ]) | Return | Format (byte[ ]) |
---|---|---|---|---|
1-10 | Configuration Messages | |||
1 | Configure team ID; configure running mode | [1][ID][Mode]. ID: 4 bytes. Mode: 1 byte (COMPETITION = 0, TRAINING = 1) | Four bytes array. The first byte indicates the round; the second specifies the time limit in minutes; the third specifies the number of available levels | [round info][time limit][available levels]. Note: in training mode, the return will be [0][0][0]. As the round info is not used in training, the time limit will be 600 hours, and the number of levels needs to be requested via message ID 15 |
2 | Set simulation speed, speed $\in$ [0.0, 50.0]. Note: this command can be sent at any time during playing to change the simulation speed | [2][speed]. speed: 4 bytes | OK/ERR | [1]/[0] |
11-30 | Query Messages | |||
11 | Do Screenshot | [11] | Width, height, image bytes. Note: this command only returns screenshots without symbolic representation | [width][height][image bytes]. width, height: 4 bytes |
12 | Get game state | [12] | One byte indicates the ordinal of the state | [0]: UNKNOWN [1]: MAIN_MENU [2]: EPISODE_MENU [3]: LEVEL_SELECTION [4]: LOADING [5]: PLAYING [6]: WON [7]: LOST |
14 | Get the current level | [14] | four bytes array indicates the index of the current level | [level index] |
15 | Get the number of levels | [15] | four bytes array indicates the number of available levels | [number of level] |
23 | Get my score | [23] | A 4 bytes array indicating the number of levels followed by ([number_of_levels] * 4) bytes array with every four slots indicates a best score for the corresponding level | [number_of_levels][score_level_1]....[score_level_n]. Note: This should be used carefully for the training mode, because there may be large amount of levels used in the training. Instead, when the agent is in winning state, use message ID 65 to get the score of a single level at winning state |
31-50 | In-Game Action Messages | |||
31 | Shoot using the Cartesian coordinates [Safe mode*] | [31][fx][fy][dx][dy][t1][t2]. focus_x: the x coordinate of the focus point. focus_y: the y coordinate of the focus point. dx: the x coordinate of the release point minus focus_x. dy: the y coordinate of the release point minus focus_y. t1: the release time. t2: the gap between the release time and the tap time. If t1 is set to 0, the server will execute the shot immediately. The length of each parameter is 4 bytes | OK/ERR | [1]/[0] |
32 | Shoot using Polar coordinates [Safe mode*] | [32][fx][fy][theta][r][t1][t2]. theta: release angle. r: the radial coordinate. The length of each parameter is 4 bytes | OK/ERR | [1]/[0] |
33 | Sequence of shots [Safe mode*] | [33][shots length][shot message ID][Params]...[shot message ID][Params]. Maximum sequence length: 16 shots | An array with each slot indicates good/bad shot. The bad shots are those shots that are rejected by the server | For example, the server received 5 shots, and the third one was not executed due to some reason, then the server will return [1][1][0][1][1] |
41 | Shoot using the Cartesian coordinates [Fast mode**] | [41][fx][fy][dx][dy][t1][t2]. The length of each parameter is 4 bytes | OK/ERR | [1]/[0] |
42 | Shoot using Polar coordinates [Fast mode**] | [42][fx][fy][theta][r][t1][t2]. The length of each parameter is 4 bytes | OK/ERR | [1]/[0] |
43 | Sequence of shots [Fast mode**] | [43][shots length][shot message ID][Params]...[shot message ID][Params]. Maximum sequence length: 16 shots | An array with each slot indicates good/bad shot. The bad shots are those shots that are rejected by the server | For example, the server received 5 shots, and the third one was not executed due to some reason, then the server will return [1][1][0][1][1] |
34 | Fully Zoom Out | [34] | OK/ERR | [1]/[0] |
35 | Fully Zoom In | [35] | OK/ERR | [1]/[0] |
51-60 | Level Selection Messages | |||
51 | Load a level | [51][Level]. Level: 4 bytes | OK/ERR | [1]/[0] |
52 | Restart a level | [52] | OK/ERR | [1]/[0] |
61-70 | Science Birds Specific Messages | |||
61 | Get Symbolic Representation With Screenshot | [61] | Symbolic Representation and corresponding screenshot | [symbolic representation byte array length][Symbolic Representation bytes][image width][image height][image bytes] symbolic representation byte array length: 4 bytes image width: 4 bytes image height: 4 bytes |
62 | Get Symbolic Representation Without Screenshot | [62] | Symbolic Representation | [symbolic representation byte array length][Symbolic Representation bytes] |
63 | Get Noisy Symbolic Representation With Screenshot | [63] | noisy Symbolic Representation and corresponding screenshot | [symbolic representation byte array length][Symbolic Representation bytes][image width][image height][image bytes] |
64 | Get Noisy Symbolic Representation Without Screenshot | [64] | noisy Symbolic Representation | [symbolic representation byte array length][Symbolic Representation bytes] |
65 | Get Current Level Score | [65] | current score. Note: this score can be requested at any time at Playing/Won/Lost state. This is used for agents that take intermediate score seriously during training/reasoning. To get the winning score, please make sure to execute this command when the game state is "WON" | [score]. score: 4 bytes |
* Safe Mode: The server will wait until the state is static after making a shot. | ||||
** Fast mode: The server will send back a confirmation once a shot is made. The server will not do any check for the appearance of the won page. |
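To make the byte layouts in the table concrete, the sketch below builds a Cartesian shot message (ID 31 or 41) and decodes two replies (game state, ID 12, and scores, ID 23) with Python's struct module. The table gives field sizes but not the byte order, signedness, or width of the message ID, so big-endian signed 4-byte integers and a 1-byte message ID are assumptions made here. The function names are hypothetical and are not taken from agent_client.py.

```python
import struct
from enum import IntEnum


class GameState(IntEnum):
    """State ordinals as listed for message ID 12."""
    UNKNOWN = 0
    MAIN_MENU = 1
    EPISODE_MENU = 2
    LEVEL_SELECTION = 3
    LOADING = 4
    PLAYING = 5
    WON = 6
    LOST = 7


def encode_cartesian_shot(fx, fy, dx, dy, t1, t2, fast=False):
    """Build [31 or 41][fx][fy][dx][dy][t1][t2].

    Assumed layout: 1-byte message ID, then six big-endian signed
    4-byte integers.
    """
    message_id = 41 if fast else 31
    return struct.pack(">B6i", message_id, fx, fy, dx, dy, t1, t2)


def decode_game_state(reply: bytes) -> GameState:
    """Interpret the single-byte reply to message ID 12."""
    return GameState(reply[0])


def decode_scores(reply: bytes) -> list:
    """Decode [number_of_levels][score_level_1]...[score_level_n].

    Assumed layout: big-endian signed 4-byte integers throughout.
    """
    (number_of_levels,) = struct.unpack_from(">i", reply, 0)
    return list(struct.unpack_from(f">{number_of_levels}i", reply, 4))
```

Setting `fast=True` only changes the message ID; per the footnotes, the difference between safe and fast mode lies in when the server sends its confirmation.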
The play data folder contains two zip files: human_player_data.zip and baseline_agent_data.zip.
The human player data on Phy-Q is given in human_player_data.zip. This includes summarized data for 20 players. Each .csv file corresponds to one player, and the following are the columns.