microsoft / TextWorld

TextWorld is a sandbox learning environment for the training and evaluation of reinforcement learning (RL) agents on text-based games.

Intermediate Rewards and policy commands not working #205

Closed hadivafaii closed 4 years ago

hadivafaii commented 4 years ago

Hi,

It appears that having policy_commands=True and intermediate_reward=True in infos_to_request leads to bugs. For instance, done and has_won return False even after winning a quest. Here is my test code:

import gym
import textworld.gym

print(textworld.__version__, gym.__version__)
>>> 1.1.1 0.10.4

Let's make a simple game

!tw-make tw-simple --rewards dense --goal brief --seed 1234 --output games/my_game.ulx -v -f
>>> Global seed: 1234
>>> Game generated: ~dir/games/my_game.ulx

>>> Objective:
>>> The dinner is almost ready! It's only missing a grilled half of a bag of chips.

>>> Walkthrough:
>>> open antique trunk > take old key from antique trunk > unlock wooden door with old key > open wooden door > go east > open screen door > go east > go south > take half of a bag of chips > go north > go west > put half of a bag of chips on stove

>>> -= Stats =-
>>> Nb. locations: 6
>>> Nb. objects: 28

Load the game using policy_commands=True and intermediate_reward=True in infos_to_request

max_step = 50
path = ["./games/my_game.ulx"]

infos_to_request = textworld.EnvInfos(
    intermediate_reward=True, policy_commands=True, has_won=True, has_lost=True)

env_id = textworld.gym.register_games(
    path, request_infos=infos_to_request, max_episode_steps=max_step)

env = gym.make(env_id)

Get the optimal sequence of commands from the walkthrough

walkthrough = "open antique trunk > take old key from antique trunk > unlock wooden door with old key > open wooden door > go east > open screen door > go east > go south > take half of a bag of chips > go north > go west > put half of a bag of chips on stove"
policy_cmds = walkthrough.split(' > ')

Play the game using policy commands

indx = 0
all_scores = []
done = False

obs, infos = env.reset()

for indx, cmd in enumerate(policy_cmds):
    obs, score, done, infos = env.step(cmd)
    all_scores.append(score)

    print(score, done, infos)

>>> 1 False {'has_won': False, 'has_lost': False, 'intermediate_reward': 0, 'policy_commands': []}
>>> 2 False {'has_won': False, 'has_lost': False, 'intermediate_reward': 0, 'policy_commands': []}
>>> 3 False {'has_won': False, 'has_lost': False, 'intermediate_reward': 0, 'policy_commands': []}
>>> 4 False {'has_won': False, 'has_lost': False, 'intermediate_reward': 0, 'policy_commands': []}
>>> 5 False {'has_won': False, 'has_lost': False, 'intermediate_reward': 0, 'policy_commands': []}
>>> 6 False {'has_won': False, 'has_lost': False, 'intermediate_reward': 0, 'policy_commands': []}
>>> 7 False {'has_won': False, 'has_lost': False, 'intermediate_reward': 0, 'policy_commands': []}
>>> 8 False {'has_won': False, 'has_lost': False, 'intermediate_reward': 0, 'policy_commands': []}
>>> 10 False {'has_won': False, 'has_lost': False, 'intermediate_reward': 0, 'policy_commands': []}
>>> 10 False {'has_won': False, 'has_lost': False, 'intermediate_reward': 0, 'policy_commands': []}
>>> 10 False {'has_won': False, 'has_lost': False, 'intermediate_reward': 0, 'policy_commands': []}
>>> 11 False {'has_won': False, 'has_lost': False, 'intermediate_reward': 0, 'policy_commands': []}

policy_commands and intermediate_reward return nothing, but there is more:

print(done, infos, all_scores)
>>> False {'has_won': False, 'has_lost': False, 'intermediate_reward': 0, 'policy_commands': []} [1, 2, 3, 4, 5, 6, 7, 8, 10, 10, 10, 11]
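As a side note, the per-step score gains implied by all_scores can be recovered by differencing (plain Python, independent of TextWorld). Since the game was built with --rewards dense, one would expect intermediate_reward to track these gains rather than stay at 0:

```python
# Cumulative scores returned by env.step after each walkthrough command.
all_scores = [1, 2, 3, 4, 5, 6, 7, 8, 10, 10, 10, 11]

# Per-step gain: difference between consecutive cumulative scores
# (prepend 0 for the score before the first command).
gains = [b - a for a, b in zip([0] + all_scores, all_scores)]
print(gains)  # [1, 1, 1, 1, 1, 1, 1, 1, 2, 0, 0, 1]
```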
print(obs)
>>> You put the half of a bag of chips on the stove.
>>> 
>>> Your score has just gone up by one point.
>>> 
>>> 
>>>                                *** The End ***
>>> 
>>> You scored 11 out of a possible 11, in 13 turn(s).
>>> 
>>> 
>>> Would you like to RESTART, RESTORE a saved game, QUIT or UNDO the last command?
>>> > 

According to obs we have won but done and has_won are still False.

Trying the same thing but with policy_commands=False and intermediate_reward=False leads to correct behavior for done and has_won. I also tried this with a tw-cooking game and the same thing happened. Am I missing something or is this a bug? I upgraded gym to the latest version, but that didn't help either.

MarcCote commented 4 years ago

Hi, thanks for the report. I think that has been fixed recently (ref: #184 ).

Do you mind testing with the master branch? Note that has_won and has_lost have been renamed to won and lost (see code below).

import gym

import textworld
import textworld.gym

max_step = 50
path = ["./games/my_game.ulx"]

infos_to_request = textworld.EnvInfos(
    intermediate_reward=True, policy_commands=True, won=True, lost=True)

env_id = textworld.gym.register_games(
    path, request_infos=infos_to_request, max_episode_steps=max_step)

env = gym.make(env_id)

walkthrough = "open antique trunk > take old key from antique trunk > unlock wooden door with old key > open wooden door > go east > open screen door > go east > go south > take half of a bag of chips > go north > go west > put half of a bag of chips on stove"
policy_cmds = walkthrough.split(' > ')

indx = 0
all_scores = []
done = False

obs, infos = env.reset()

for indx, cmd in enumerate(policy_cmds):
    obs, score, done, infos = env.step(cmd)
    all_scores.append(score)

    print(score, done, infos)

print(obs)

Which should output:

1 False {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}
2 False {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}
3 False {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}
4 False {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}
5 False {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}
6 False {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}
7 False {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}
8 False {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}
10 False {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}
10 False {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}
10 False {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}
11 True {'won': True, 'lost': False, 'intermediate_reward': 1, 'policy_commands': []}
You put the half of a bag of chips on the stove.

Your score has just gone up by one point.

                               *** The End ***

You scored 11 out of a possible 11, in 13 turn(s).

Would you like to RESTART, RESTORE a saved game, QUIT or UNDO the last command?
hadivafaii commented 4 years ago

I installed the master branch using pip3 install https://github.com/Microsoft/TextWorld/archive/master.zip and it solved the won and done issues. However, I'm still not getting the expected behavior for intermediate_reward and policy_commands.

Here is my code:


import gym
import textworld.gym

print(textworld.__version__, gym.__version__)

>>> 1.1.1 0.15.4

!tw-make tw-simple --rewards dense --goal brief --seed 1234 --output games/my_game.ulx -v -f

>>> Global seed: 1234
>>> Game generated: /home/hadivafa/Dropbox/jup/RLnTW/TW_notebooks/games/my_game.ulx

>>> Objective:
>>> The dinner is almost ready! It's only missing a grilled half of a bag of chips.

>>> Walkthrough:
>>> open antique trunk > take old key from antique trunk > unlock wooden door with old key > open wooden door > go east > open screen door > go east > go south > take half of a bag of chips > go north > go west > put half of a bag of chips on stove

>>> -= Stats =-
>>> Nb. locations: 6
>>> Nb. objects: 28

max_step = 50
path = ["./games/my_game.ulx"]

infos_to_request = textworld.EnvInfos(
    intermediate_reward=True, policy_commands=True, won=True, lost=True)

env_id = textworld.gym.register_games(
    path, request_infos=infos_to_request, max_episode_steps=max_step)

env = gym.make(env_id)

walkthrough = "open antique trunk > take old key from antique trunk > unlock wooden door with old key > open wooden door > go east > open screen door > go east > go south > take half of a bag of chips > go north > go west > put half of a bag of chips on stove"
policy_cmds = walkthrough.split(' > ')

from pprint import pprint
pprint(policy_cmds)

>>> ['open antique trunk',
>>>  'take old key from antique trunk',
>>>  'unlock wooden door with old key',
>>>  'open wooden door',
>>>  'go east',
>>>  'open screen door',
>>>  'go east',
>>>  'go south',
>>>  'take half of a bag of chips',
>>>  'go north',
>>>  'go west',
>>>  'put half of a bag of chips on stove']

done = False
obs, infos = env.reset()

for indx, cmd in enumerate(policy_cmds):
    obs, score, done, infos = env.step(cmd)

    print('_' * 33, "Step.%d, cmd: %s" % (indx, cmd), '_' * 33)
    print("score: %d, done: %s, infos: %s\n\n" % (score, done, infos))

Which outputs:

_________________________________ Step.0, cmd: open antique trunk _________________________________
score: 1, done: False, infos: {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}

_________________________________ Step.1, cmd: take old key from antique trunk _________________________________
score: 2, done: False, infos: {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}

_________________________________ Step.2, cmd: unlock wooden door with old key _________________________________
score: 3, done: False, infos: {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}

_________________________________ Step.3, cmd: open wooden door _________________________________
score: 4, done: False, infos: {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}

_________________________________ Step.4, cmd: go east _________________________________
score: 5, done: False, infos: {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}

_________________________________ Step.5, cmd: open screen door _________________________________
score: 6, done: False, infos: {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}

_________________________________ Step.6, cmd: go east _________________________________
score: 7, done: False, infos: {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}

_________________________________ Step.7, cmd: go south _________________________________
score: 8, done: False, infos: {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}

_________________________________ Step.8, cmd: take half of a bag of chips _________________________________
score: 10, done: False, infos: {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}

_________________________________ Step.9, cmd: go north _________________________________
score: 10, done: False, infos: {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}

_________________________________ Step.10, cmd: go west _________________________________
score: 10, done: False, infos: {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}

_________________________________ Step.11, cmd: put half of a bag of chips on stove _________________________________
score: 11, done: True, infos: {'won': True, 'lost': False, 'intermediate_reward': 1, 'policy_commands': []}

It seems like intermediate_reward and policy_commands aren't doing what they are supposed to do. For this specific example, infos["intermediate_reward"] should be +1 whenever a step makes progress on the quest, but it returns zero for all but the last step. And infos["policy_commands"] should return the optimal action (series of actions?) at any given stage of the game, but it returns an empty list. Is this a bug, or do I need to do something else to access this information?
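To illustrate the behaviour being expected here, a plain-Python sketch (not the TextWorld API; remaining_policy is a hypothetical helper for illustration): if policy_commands lists the optimal commands still needed, then after executing the first k walkthrough steps it would be the walkthrough suffix:

```python
# Hypothetical sketch of the expected policy_commands semantics:
# after k optimal steps, the commands still needed are the suffix
# of the walkthrough. `remaining_policy` is an illustrative helper,
# not part of the TextWorld API.
walkthrough = "open antique trunk > take old key from antique trunk > go east"
policy_cmds = walkthrough.split(' > ')

def remaining_policy(steps_taken):
    # Commands still needed to finish the quest.
    return policy_cmds[steps_taken:]

print(remaining_policy(1))  # ['take old key from antique trunk', 'go east']
print(remaining_policy(3))  # [] -- empty once the quest is done
```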

Thanks, Hadi

MarcCote commented 4 years ago

PR #206 should fix it. Let me know if that works for you.

hadivafaii commented 4 years ago

Tried it on tw-simple and custom games, and it works.