rddy / mimi

Code for the paper, "First Contact: Unsupervised Human-Machine Co-Adaptation via Mutual Information Maximization"
MIT License

question: mutual information as a simple reward signal for general RL algorithms #3

Closed: guyko81 closed this issue 2 years ago

guyko81 commented 2 years ago

Hi,

I thought it would be easiest to ask here: I was wondering whether simply computing the mutual information between the user input $x_t$ and the state transition $(s_t, s_{t+1})$, and feeding it to the system as the reward $r_t$, would let any general RL algorithm (TD3, DDPG, PPO) use it as the training target. By maximizing the mutual information, the agent could then navigate as the user intends. I'm probably missing something important, but could you give me some feedback, please?
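
For concreteness, here is roughly what I have in mind (just a sketch: `mi_score` stands in for some trained pointwise mutual-information estimator, e.g. a MINE/InfoNCE-style critic, and `env.get_user_input` is a hypothetical hook for the user's command; neither is from this repo):

```python
# Sketch: use a pointwise MI score between the user input x_t and the
# transition (s_t, s_{t+1}) as the reward r_t for any off-the-shelf RL algorithm.

def rollout_with_mi_reward(env, policy, mi_score, max_steps=200):
    """Collect one episode where r_t = MI score of (x_t, s_t, s_{t+1})."""
    transitions = []
    s_t = env.reset()
    for _ in range(max_steps):
        x_t = env.get_user_input(s_t)       # user's command at time t (hypothetical hook)
        a_t = policy(s_t, x_t)              # interface policy maps (state, input) -> action
        s_next, _, done, _ = env.step(a_t)  # ignore the env's own reward
        r_t = mi_score(x_t, s_t, s_next)    # pointwise MI estimate used as the reward
        transitions.append((s_t, x_t, a_t, r_t, s_next, done))
        s_t = s_next
        if done:
            break
    return transitions
```

The collected transitions could then be fed to TD3/DDPG/PPO exactly as if $r_t$ were an ordinary environment reward.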

Thanks!

rddy commented 2 years ago

Yes, I think this is a promising idea, especially for scaling up the number of policy parameters. There is some prior work on deep unsupervised RL that takes a similar approach to training autonomous agents: Emergent Real-World Robotic Skills via Unsupervised Off-Policy Reinforcement Learning and DADS.

rddy commented 2 years ago

I implemented a simple version of this suggestion here. The reward $r_t$ for the state transition $(s_t, x_t, s_{t+1})$ is simply the corresponding individual term from the sum in Equation 2 in the paper. Inside envs.RewardModelCursorEnv, we periodically re-train the mutual information estimator parameters $\phi$ and $\psi$ on a rolling window of the last N episodes, then use the trained estimator to compute the mutual information rewards for future transitions.
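
In pseudocode, the pattern looks roughly like this (a sketch of the idea only, not the actual envs.RewardModelCursorEnv code; the `mi_estimator` object with `fit`/`score` methods is hypothetical):

```python
from collections import deque

class MIRewardWrapper:
    """Sketch: re-fit the MI estimator (phi, psi) every few episodes on a
    rolling window of transitions, and use it to score new transitions as rewards."""

    def __init__(self, env, mi_estimator, window_size=50, retrain_every=10):
        self.env = env
        self.mi_estimator = mi_estimator            # hypothetical: exposes fit() and score()
        self.episodes = deque(maxlen=window_size)   # rolling window of the last N episodes
        self.retrain_every = retrain_every
        self.episode_count = 0
        self.current_episode = []
        self.s_t = None

    def reset(self):
        if self.current_episode:
            self.episodes.append(self.current_episode)
            self.current_episode = []
            self.episode_count += 1
            if self.episode_count % self.retrain_every == 0:
                # re-train phi and psi on all transitions in the window
                self.mi_estimator.fit([t for ep in self.episodes for t in ep])
        self.s_t = self.env.reset()
        return self.s_t

    def step(self, a_t, x_t):
        s_next, _, done, info = self.env.step(a_t)
        # reward = the per-transition term from the MI objective (Eq. 2)
        r_t = self.mi_estimator.score(x_t, self.s_t, s_next)
        self.current_episode.append((self.s_t, x_t, s_next))
        self.s_t = s_next
        return s_next, r_t, done, info
```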

guyko81 commented 2 years ago

Sorry, but I don't see how the user plays a role in the current notebook example. As far as I understand, PPO is trained in a random setting right now. Is that just because it's a simplified version? My original idea was to replace the gp_optimizer with a standard RL algorithm, so that each component of the learning process could be swapped out. For any environment (Montezuma's Revenge, Tetris, a bionic arm, etc.) this MI process could serve as a reward-generator tool, while the solution could come from any RL algorithm: the current state-of-the-art ones (right now PPO/TD3) or newer ones like Diffuser. The reward-generator tool could also be based on user feedback; see the sketch below. That way the reward could be biased over time, so it would serve as a reward-engineering mechanism.
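
To make the interface I'm imagining concrete (purely illustrative names, nothing from this repo):

```python
class MIRewardGenerator:
    """Sketch: an MI-based reward generator that any RL algorithm can query,
    and that user feedback can bias over time (reward engineering)."""

    def __init__(self, mi_estimator, feedback_weight=0.1):
        self.mi_estimator = mi_estimator   # any trained MI estimator with a score() method
        self.feedback_weight = feedback_weight
        self.feedback = []                 # list of (state_predicate, bonus) pairs from the user

    def reward(self, x_t, s_t, s_next):
        r = self.mi_estimator.score(x_t, s_t, s_next)
        # bias the MI reward with whatever feedback the user has given so far
        for predicate, bonus in self.feedback:
            if predicate(s_next):
                r += self.feedback_weight * bonus
        return r

    def add_feedback(self, predicate, bonus):
        """E.g. add_feedback(lambda s: s[0] > 0, +1.0) to favour certain states."""
        self.feedback.append((predicate, bonus))
```

The RL side (PPO, TD3, or anything newer) would only ever call `reward(...)`, so the two components stay decoupled.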

rddy commented 2 years ago

The current implementation simulates the user's commands, since I haven't tried tuning PPO to be sample-efficient enough for a real human user yet.