uidilr / gail_ppo_tf

TensorFlow implementation of Generative Adversarial Imitation Learning (GAIL) with discrete actions
MIT License

Discriminator optimisation with each [state,action] #6

Closed shamanez closed 6 years ago

shamanez commented 6 years ago

Hi, when you optimize the discriminator to output probabilities (later taken as the reward) for each [state, action] tuple, you consider the whole batch.

By doing this, don't we lose the sequential behavior of the actions? We also lose the start-to-end connection of a trajectory, because the neural network only takes one pair into account at a single time step.

What about using an LSTM or RNN?

uidilr commented 6 years ago

Changing the discriminator and policy networks to an LSTM or RNN may be helpful for POMDPs. See this page for a discussion of the usefulness of sequential models in RL.

If I am not mistaken, the "start-to-end connection of a trajectory" is not required in the GAIL formulation.
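
For what it's worth, here is a minimal sketch (not the exact code in this repo) of how a GAIL discriminator is typically trained on independently shuffled (state, action) pairs, assuming TensorFlow 1.x, a discrete action space one-hot encoded into `act_dim`, and hypothetical dimensions:

```python
import numpy as np
import tensorflow as tf  # assumes TensorFlow 1.x, as used in this repo

obs_dim, act_dim = 4, 2  # hypothetical sizes (e.g. CartPole with one-hot actions)

obs_ph = tf.placeholder(tf.float32, [None, obs_dim])
act_ph = tf.placeholder(tf.float32, [None, act_dim])    # one-hot discrete actions
label_ph = tf.placeholder(tf.float32, [None, 1])        # 1 = expert pair, 0 = agent pair

# A plain feed-forward discriminator: each (s, a) pair is scored independently.
x = tf.concat([obs_ph, act_ph], axis=1)
h = tf.layers.dense(x, 64, activation=tf.nn.tanh)
logits = tf.layers.dense(h, 1)

loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=label_ph, logits=logits))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

def sample_pairs(obs, acts, batch_size):
    """Draw a random mini-batch of (s, a) pairs; trajectory ordering is discarded."""
    idx = np.random.randint(len(obs), size=batch_size)
    return obs[idx], acts[idx]
```

Because expert and agent pairs are sampled independently and shuffled, the loss never sees which time step a pair came from; that is the sense in which the start-to-end structure of a trajectory does not enter the GAIL objective.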

shamanez commented 6 years ago

Generally speaking, since we lose all the temporal structure when optimizing the reward generator (the discriminator of the GAN), isn't that a problem?

In a normal GAN for generating images it is totally fine, but in this type of scenario?

shamanez commented 6 years ago

Let's say we use GAIL to optimize an RL agent that guides humans through a set of tasks in an office environment. Say the state is represented by visual information and the actions are movements between places like the photocopy machine, table, computer, etc. When it comes to optimizing the discriminator, we will always feed it the same states and actions again and again.

I understand that we input state-action pairs, so basically the discriminator considers how good the action is given the state (if I am not wrong).

But I feel like this makes the architecture somewhat data-inefficient, because there can be many ways of doing a certain task that might not be represented in the expert's dataset, especially in the office scenario.

uidilr commented 6 years ago

The usefulness of sequential models (e.g. LSTM and RNN) depends on the type of environment.

In an MDP, the reward is defined as one of r(s), r(s, a), or r(s, a, s'). This means the reward at the current state must be independent of past information. From this perspective, in an MDP, a sequential model is not suitable for the discriminator in imitation learning. Although policy-gradient algorithms (e.g. TRPO and PPO) do not assume an MDP, using a sequential model for the discriminator can make it difficult to learn a good policy because the reward may become highly stochastic.
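
To make the point about r(s, a) concrete, here is a tiny sketch of how the per-step surrogate reward is derived from the discriminator output for a single pair; -log(1 - D) is one common choice in GAIL implementations (this repo may use a slightly different form, e.g. log(D)), and either way it depends only on the current pair:

```python
import numpy as np

def gail_reward(d_prob):
    """Surrogate reward from the discriminator output D(s, a) for one pair."""
    # Clip to avoid log(0); the reward depends only on the current (s, a), never on past steps.
    return -np.log(np.clip(1.0 - d_prob, 1e-10, 1.0))

d_probs = np.array([0.3, 0.9, 0.5])   # D(s_t, a_t) along a trajectory
rewards = gail_reward(d_probs)        # one reward per time step, computed independently
```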

In a POMDP setting, using sequential models for the discriminator and policy may help.

In your office environment, the visual information does not seem sufficient to solve the task, so it may be useful to use sequential models for the discriminator and policy.
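
If you do go the sequential route for a POMDP, a rough sketch of a recurrent discriminator that scores a whole observation-action sequence might look like this (TF 1.x style; the dimensions, padding scheme, and names are illustrative assumptions, not code from this repo):

```python
import tensorflow as tf  # assumes TensorFlow 1.x

obs_dim, act_dim, max_len = 4, 2, 50  # hypothetical sizes; sequences padded to max_len

seq_obs_ph = tf.placeholder(tf.float32, [None, max_len, obs_dim])
seq_act_ph = tf.placeholder(tf.float32, [None, max_len, act_dim])
seq_len_ph = tf.placeholder(tf.int32, [None])      # true length of each padded sequence
label_ph = tf.placeholder(tf.float32, [None, 1])   # 1 = expert sequence, 0 = agent sequence

# Concatenate observation and one-hot action at every time step and run an LSTM over them.
inputs = tf.concat([seq_obs_ph, seq_act_ph], axis=2)
cell = tf.nn.rnn_cell.LSTMCell(64)
_, final_state = tf.nn.dynamic_rnn(cell, inputs,
                                   sequence_length=seq_len_ph,
                                   dtype=tf.float32)
logits = tf.layers.dense(final_state.h, 1)  # one score per whole sequence

loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=label_ph, logits=logits))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
```

Note that scoring the final hidden state gives one reward per sequence; if you need a per-step reward for PPO/TRPO you would instead score every hidden state, and the reward then depends on the history up to that step.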

shamanez commented 6 years ago

"In your office environment, the visual information seems not sufficient information to solve the task. It is useful to use sequential models for Discriminator and Policy."

Thanks a lot for the response. That's what I was thinking too. Because in my office task I have very few visual frames which are drastically different from each other. In cart-pole problem state is continuous. So it's fine to optime state and action pairs as a batch when it comes to the discriminator. But in my setting state visual information has not much of variety. Even though there are lot pixels, there are only few interactive objects.

So to confirm your advice do you think I should use RNN structure to both discriminator and policy ?

Thanks a lot

uidilr commented 6 years ago

> So, to confirm your advice, do you think I should use an RNN structure for both the discriminator and the policy?

Yes, I think you should.