In your paper, you mention that the action scorer module produces two outputs: one action (e.g. "go", "eat") and one object (e.g. "east", "apple"). I wonder how your architecture deals with illegal actions, as in the following example.
Given a state s, the possible actions are:
a1: eat apple
a2: go east
However, the action scorer will score every possible word among the actions ("go", "eat") and the objects ("east", "apple"), which results in 4 possible actions:
a1: eat apple (legal action) --> score: 0.9
a2: go east (legal action) --> score: 0.08
a3: eat east (illegal action) --> score: 0.01
a4: go apple (illegal action) --> score: 0.01
In such a scenario, how does your architecture deal with illegal actions? Do you just look up only the legal actions in the table?
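To make the question concrete, here is a minimal sketch of the "only look at legal actions" option I have in mind. It is my guess, not your implementation: the per-word scores are made-up numbers, averaging the two scores is just one assumed way to combine them, and the names `score_all_pairs` and `best_legal_command` are hypothetical.

```python
from itertools import product

# Hypothetical per-word scores (made-up numbers, not taken from the paper).
action_scores = {"eat": 0.9, "go": 0.1}
object_scores = {"apple": 0.9, "east": 0.1}

# Legal commands in state s, as in the example above.
legal_commands = {("eat", "apple"), ("go", "east")}


def score_all_pairs(action_scores, object_scores):
    """Score every (action, object) pair; averaging is an assumed combination rule."""
    return {
        (a, o): (sa + so) / 2
        for (a, sa), (o, so) in product(action_scores.items(), object_scores.items())
    }


def best_legal_command(pair_scores, legal_commands):
    """Drop illegal pairs, then return the highest-scoring legal one."""
    legal_scores = {p: s for p, s in pair_scores.items() if p in legal_commands}
    return max(legal_scores, key=legal_scores.get)


pair_scores = score_all_pairs(action_scores, object_scores)
print(len(pair_scores))                                 # 4
print(best_legal_command(pair_scores, legal_commands))  # ('eat', 'apple')
```

With these made-up numbers, the illegal pairs "eat east" and "go apple" each get 0.5, which is higher than the legal "go east" at 0.1, so without some kind of filtering an illegal command could be ranked above a legal one.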