Hi, thanks for the report. I think that has been fixed recently (ref: #184). Do you mind testing with the master branch? Note that has_won and has_lost have been renamed to won and lost (see code below).
import gym
import textworld
import textworld.gym

max_step = 50
path = ["./games/my_game.ulx"]

# Request the additional infos (note: won/lost instead of has_won/has_lost).
infos_to_request = textworld.EnvInfos(
    intermediate_reward=True, policy_commands=True, won=True, lost=True)
env_id = textworld.gym.register_games(
    path, request_infos=infos_to_request, max_episode_steps=max_step)
env = gym.make(env_id)

walkthrough = "open antique trunk > take old key from antique trunk > unlock wooden door with old key > open wooden door > go east > open screen door > go east > go south > take half of a bag of chips > go north > go west > put half of a bag of chips on stove"
policy_cmds = walkthrough.split(' > ')

# Follow the walkthrough and print the requested infos at every step.
all_scores = []
done = False
obs, infos = env.reset()
for indx, cmd in enumerate(policy_cmds):
    obs, score, done, infos = env.step(cmd)
    all_scores.append(score)
    print(score, done, infos)

print(obs)
Which should output:
1 False {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}
2 False {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}
3 False {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}
4 False {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}
5 False {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}
6 False {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}
7 False {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}
8 False {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}
10 False {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}
10 False {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}
10 False {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}
11 True {'won': True, 'lost': False, 'intermediate_reward': 1, 'policy_commands': []}
You put the half of a bag of chips on the stove.
Your score has just gone up by one point.
*** The End ***
You scored 11 out of a possible 11, in 13 turn(s).
Would you like to RESTART, RESTORE a saved game, QUIT or UNDO the last command?
I installed the master branch using pip3 install https://github.com/Microsoft/TextWorld/archive/master.zip and it solved the won and done issues. However, I'm still not getting the expected behavior for intermediate_reward and policy_commands.
Here is my code:
import gym
import textworld.gym
print(textworld.__version__, gym.__version__)
>>> 1.1.1 0.15.4
!tw-make tw-simple --rewards dense --goal brief --seed 1234 --output games/my_game.ulx -v -f
>>> Global seed: 1234
>>> Game generated: /home/hadivafa/Dropbox/jup/RLnTW/TW_notebooks/games/my_game.ulx
>>> Objective:
>>> The dinner is almost ready! It's only missing a grilled half of a bag of chips.
>>> Walkthrough:
>>> open antique trunk > take old key from antique trunk > unlock wooden door with old key > open wooden door > go east > open screen door > go east > go south > take half of a bag of chips > go north > go west > put half of a bag of chips on stove
>>> -= Stats =-
>>> Nb. locations: 6
>>> Nb. objects: 28
max_step = 50
path = ["./games/my_game.ulx"]
infos_to_request = textworld.EnvInfos(
    intermediate_reward=True, policy_commands=True, won=True, lost=True)
env_id = textworld.gym.register_games(
    path, request_infos=infos_to_request, max_episode_steps=max_step)
env = gym.make(env_id)
walkthrough = "open antique trunk > take old key from antique trunk > unlock wooden door with old key > open wooden door > go east > open screen door > go east > go south > take half of a bag of chips > go north > go west > put half of a bag of chips on stove"
policy_cmds = walkthrough.split(' > ')
from pprint import pprint
pprint(policy_cmds)
>>> ['open antique trunk',
>>> 'take old key from antique trunk',
>>> 'unlock wooden door with old key',
>>> 'open wooden door',
>>> 'go east',
>>> 'open screen door',
>>> 'go east',
>>> 'go south',
>>> 'take half of a bag of chips',
>>> 'go north',
>>> 'go west',
>>> 'put half of a bag of chips on stove']
done = False
obs, infos = env.reset()
for indx, cmd in enumerate(policy_cmds):
    obs, score, done, infos = env.step(cmd)
    print('_' * 33, "Step.%d, cmd: %s" % (indx, cmd), '_' * 33)
    print("score: %d, done: %s, infos: %s\n\n" % (score, done, infos))
Which outputs:
_________________________________ Step.0, cmd: open antique trunk _________________________________
score: 1, done: False, infos: {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}
_________________________________ Step.1, cmd: take old key from antique trunk _________________________________
score: 2, done: False, infos: {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}
_________________________________ Step.2, cmd: unlock wooden door with old key _________________________________
score: 3, done: False, infos: {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}
_________________________________ Step.3, cmd: open wooden door _________________________________
score: 4, done: False, infos: {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}
_________________________________ Step.4, cmd: go east _________________________________
score: 5, done: False, infos: {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}
_________________________________ Step.5, cmd: open screen door _________________________________
score: 6, done: False, infos: {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}
_________________________________ Step.6, cmd: go east _________________________________
score: 7, done: False, infos: {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}
_________________________________ Step.7, cmd: go south _________________________________
score: 8, done: False, infos: {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}
_________________________________ Step.8, cmd: take half of a bag of chips _________________________________
score: 10, done: False, infos: {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}
_________________________________ Step.9, cmd: go north _________________________________
score: 10, done: False, infos: {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}
_________________________________ Step.10, cmd: go west _________________________________
score: 10, done: False, infos: {'won': False, 'lost': False, 'intermediate_reward': 0, 'policy_commands': []}
_________________________________ Step.11, cmd: put half of a bag of chips on stove _________________________________
score: 11, done: True, infos: {'won': True, 'lost': False, 'intermediate_reward': 1, 'policy_commands': []}
It seems like intermediate_reward and policy_commands aren't doing what they are supposed to do. For this specific example, infos["intermediate_reward"] should be +1 for every step, but it returns zeros for all but the last step. And infos["policy_commands"] should return the optimal action (series of actions?) at any given stage of the game, but it returns an empty list. Is this a bug, or should I do something else to have access to this information?
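To make it concrete, here is roughly how I expected to use these two infos once they are populated: follow infos["policy_commands"] instead of the hard-coded walkthrough and accumulate infos["intermediate_reward"]. This is only a sketch, and it assumes policy_commands holds the remaining optimal commands from the current state:

# Sketch of how I expected to consume these infos (env as registered above).
obs, infos = env.reset()
done = False
shaped_return = 0
while not done and infos["policy_commands"]:
    cmd = infos["policy_commands"][0]              # next optimal command from the current state
    obs, score, done, infos = env.step(cmd)
    shaped_return += infos["intermediate_reward"]  # I'd expect +1 here on every on-policy step
print(shaped_return, done, infos["won"])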
Thanks, Hadi
PR #206 should fix it. Let me know if that works for you.
Tried it on tw-simple and custom games, and it works.
Hi,

It appears that having policy_commands=True and intermediate_reward=True in infos_to_request leads to bugs. For instance, done and has_won return False even after winning a quest. Here is my test, in four steps (sketched below):

1. Make a simple game.
2. Load the game using policy_commands=True and intermediate_reward=True in infos_to_request.
3. Get the optimal sequence of commands from the walkthrough.
4. Play the game using the policy commands.
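Condensed, the test looks roughly like this (same game, walkthrough, and loop as in the full snippet quoted earlier in the thread):

# 1. Make a simple game:
#    tw-make tw-simple --rewards dense --goal brief --seed 1234 --output games/my_game.ulx -v -f

import gym
import textworld
import textworld.gym

# 2. Load the game with policy_commands=True and intermediate_reward=True in infos_to_request.
infos_to_request = textworld.EnvInfos(
    intermediate_reward=True, policy_commands=True, won=True, lost=True)
env_id = textworld.gym.register_games(
    ["./games/my_game.ulx"], request_infos=infos_to_request, max_episode_steps=50)
env = gym.make(env_id)

# 3. Optimal sequence of commands, taken from the walkthrough printed by tw-make.
walkthrough = ("open antique trunk > take old key from antique trunk > "
               "unlock wooden door with old key > open wooden door > go east > "
               "open screen door > go east > go south > take half of a bag of chips > "
               "go north > go west > put half of a bag of chips on stove")
policy_cmds = walkthrough.split(' > ')

# 4. Play the game using the policy commands, printing the requested infos at each step.
obs, infos = env.reset()
for indx, cmd in enumerate(policy_cmds):
    obs, score, done, infos = env.step(cmd)
    print(score, done, infos)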
policy_commands and intermediate_reward return nothing, but there is more. According to obs we have won, yet done and has_won are still False. Trying the same thing with policy_commands=False and intermediate_reward=False leads to correct behavior for done and has_won. I tried this using a tw-cooking game and the same thing happened. Am I missing something, or is this a bug? I upgraded gym to the latest version, but it didn't help either.