Closed gdjmck closed 8 months ago
Thanks for your question. The observation includes an action mask indicating which actions are feasible, and when sampling actions from the policy network we set the selection probabilities according to that mask. Please check these two lines of code: land use, road
Therefore, the policy network will in practice never output an infeasible action. The InfeasibleActionError is left in only for debugging.
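A minimal sketch of the masked-sampling idea described above (the function and variable names here are illustrative assumptions, not the repo's actual code): infeasible actions get their logits set to negative infinity, so they receive zero probability and can never be sampled.

```python
import numpy as np

def sample_masked_action(logits, action_mask, rng):
    """Sample an action index, giving zero probability to masked-out actions.

    action_mask[i] == 1 means action i is feasible, 0 means infeasible.
    """
    # Infeasible actions get -inf logits, hence exp(-inf) = 0 probability.
    masked_logits = np.where(action_mask.astype(bool), logits, -np.inf)
    # Numerically stable softmax over the masked logits.
    exp = np.exp(masked_logits - masked_logits.max())
    probs = exp / exp.sum()
    return int(rng.choice(len(logits), p=probs))

rng = np.random.default_rng(0)
logits = np.array([2.0, 0.5, 1.0, -1.0])
mask = np.array([1, 0, 1, 0])  # actions 1 and 3 are infeasible
action = sample_masked_action(logits, mask, rng)
assert mask[action] == 1  # a sampled action is always feasible
```

Because every infeasible action has exactly zero probability under this scheme, the feasibility check inside the environment should never trigger as long as the mask passed to the sampler matches the one used by `env.step()`.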
In the agent.py module, after an action is sampled by the `_select_action()` method, it is passed to `env.step()`. Inside `env.step()`, the `self._current_land_use_mask` object is used to filter out invalid positions via `self._current_land_use_mask[action] == 0`, and if the value is 0 an InfeasibleActionError is raised.
My question is: there is no guarantee that the sampled action will not land on a position where `self._current_land_use_mask` is 0, and training would stop on this exception. Is this a problem, or should I simply not raise the exception and instead replace the action with a random choice among the valid positions in `self._current_land_use_mask`?
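The fallback being proposed could be sketched roughly like this (a hypothetical helper with assumed names, not the repo's actual `env.step()` logic): keep the current strict behaviour behind a flag, and otherwise resample uniformly from the feasible positions.

```python
import numpy as np

class InfeasibleActionError(ValueError):
    """Raised when a sampled action falls on a masked-out position."""

def validate_or_resample(action, land_use_mask, rng, strict=True):
    # With strict=True, mimic the current behaviour: raise on an infeasible
    # action so masking bugs surface during debugging. With strict=False,
    # silently replace the action with a uniformly random feasible position.
    if land_use_mask[action] == 0:
        if strict:
            raise InfeasibleActionError(f"action {action} is masked out")
        action = int(rng.choice(np.flatnonzero(land_use_mask)))
    return action

rng = np.random.default_rng(0)
mask = np.array([1, 0, 1, 0])
assert validate_or_resample(0, mask, rng) == 0                     # feasible: unchanged
assert validate_or_resample(1, mask, rng, strict=False) in (0, 2)  # resampled
```

Note that silently resampling hides any mismatch between the mask used by the policy and the mask used by the environment, which is presumably why the exception exists in the first place.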