Yes, I think this is a promising idea, especially for scaling up the number of policy parameters. There is some prior work on deep unsupervised RL that takes a similar approach to training autonomous agents: Emergent Real-World Robotic Skills via Unsupervised Off-Policy Reinforcement Learning and DADS.
I implemented a simple version of this suggestion here. The reward $r_t$ for the state transition $(s_t, x_t, s_{t+1})$ is simply the individual term from the sum in Equation 2 in the paper. Inside `envs.RewardModelCursorEnv`, we periodically re-train the mutual information estimator parameters $\phi$ and $\psi$ on a rolling window of the last N episodes, then use the trained estimator to compute the mutual information rewards for future transitions.
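In rough pseudocode, the periodic re-training could look like the sketch below. This is a hand-written illustration rather than the repo's actual code: `GaussianHead`, `RollingMIRewarder`, and the maximum-likelihood fitting loop are assumptions standing in for whatever estimator the parameters $\phi$ and $\psi$ actually parameterize.

```python
import collections
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Predicts a fixed-variance Gaussian over the user input x."""
    def __init__(self, in_dim: int, x_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, x_dim),
        )

    def log_prob(self, inp, x):
        mean = self.net(inp)
        return torch.distributions.Normal(mean, 1.0).log_prob(x).sum(-1)

class RollingMIRewarder:
    """Periodically re-fits q_psi(x_t | s_t, s_{t+1}) and p_phi(x_t | s_t)
    on a rolling window of recent episodes, then scores new transitions."""
    def __init__(self, state_dim, x_dim, window_episodes=50, lr=1e-3):
        self.q_psi = GaussianHead(2 * state_dim, x_dim)  # q_psi(x_t | s_t, s_{t+1})
        self.p_phi = GaussianHead(state_dim, x_dim)      # p_phi(x_t | s_t)
        self.opt = torch.optim.Adam(
            list(self.q_psi.parameters()) + list(self.p_phi.parameters()), lr=lr)
        self.episodes = collections.deque(maxlen=window_episodes)

    def add_episode(self, s, x, s_next):
        """Each argument is a tensor with one row per time step."""
        self.episodes.append((s, x, s_next))

    def retrain(self, steps=200, batch_size=256):
        s = torch.cat([e[0] for e in self.episodes])
        x = torch.cat([e[1] for e in self.episodes])
        s_next = torch.cat([e[2] for e in self.episodes])
        for _ in range(steps):
            idx = torch.randint(s.shape[0], (batch_size,))
            # Maximum likelihood for both models on the rolling window.
            loss = -(self.q_psi.log_prob(torch.cat([s[idx], s_next[idx]], -1), x[idx]).mean()
                     + self.p_phi.log_prob(s[idx], x[idx]).mean())
            self.opt.zero_grad()
            loss.backward()
            self.opt.step()

    @torch.no_grad()
    def reward(self, s, x, s_next):
        """Per-transition reward: one term of the variational MI estimate."""
        return (self.q_psi.log_prob(torch.cat([s, s_next], -1), x)
                - self.p_phi.log_prob(s, x)).item()
```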
Sorry, but I don't see how the user plays a role in the current notebook example. As far as I understand, PPO is currently trained in a random setting. Is that just because it's a simplified version? My original idea was to replace the `gp_optimizer` with a simple RL algorithm, so that each component of the learning process could be swapped out. For any environment (Montezuma's Revenge, Tetris, a bionic arm, etc.) this MI process could serve as a reward-generator tool, while the solution could come from any RL algorithm (the current state of the art, such as PPO/TD3, or newer ones like Diffuser). And the reward-generator tool could be based on user feedback. That way the reward could be biased over time, so it could serve as a reward-engineering mechanism. See the sketch below.
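For concreteness, a minimal sketch of what such a reward-generator wrapper could look like, assuming a Gymnasium-style environment interface (the repo's envs may differ) and hypothetical `mi_reward_fn` / `user_command_fn` callables (e.g. the rolling MI estimator sketched above and a keyboard or simulated-user interface):

```python
import gymnasium as gym

class MIRewardWrapper(gym.Wrapper):
    """Replaces the wrapped env's native reward with an MI-based reward
    computed from (s_t, x_t, s_{t+1}), where x_t is the user's command."""
    def __init__(self, env, mi_reward_fn, user_command_fn):
        super().__init__(env)
        self.mi_reward_fn = mi_reward_fn        # (s, x, s_next) -> float
        self.user_command_fn = user_command_fn  # (s) -> x, real or simulated user

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self._last_obs = obs
        self._last_cmd = self.user_command_fn(obs)
        info["user_command"] = self._last_cmd
        return obs, info

    def step(self, action):
        obs, _, terminated, truncated, info = self.env.step(action)
        # Ignore the env's own reward and substitute the MI-based one.
        reward = self.mi_reward_fn(self._last_obs, self._last_cmd, obs)
        self._last_obs = obs
        self._last_cmd = self.user_command_fn(obs)
        info["user_command"] = self._last_cmd
        return obs, reward, terminated, truncated, info
```

In practice the command $x_t$ would presumably also need to be part of the agent's observation (e.g. concatenated to `obs`) so the policy can condition on it; an off-the-shelf PPO or TD3 implementation could then be trained on the wrapped environment unchanged.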
The current implementation simulates the user's commands, since I haven't tried tuning PPO to be sample-efficient enough for a real human user yet.
Hi,
I thought it would be easiest to ask here: I was wondering whether simply calculating the mutual information between the user input $x_t$ and the states $s_t$ and $s_{t+1}$, and giving it to the system as a simple reward $r_t$, would let any general RL algorithm (TD3, DDPG, PPO) use it as the target. By maximizing the mutual information, the agent could navigate as per the user's request. I'm probably missing something important, but could you give me some feedback, please?
Thanks!
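For what it's worth, one way to write the proposed reward concretely is via the variational lower bound used in skill-discovery work such as DADS; this is an illustration with the $\psi$ and $\phi$ models mentioned above, not necessarily the exact form of the paper's Equation 2:

```latex
% Per-step reward as one term of a variational estimate of I(x_t ; s_{t+1} | s_t),
% with learned models q_psi(x_t | s_t, s_{t+1}) and p_phi(x_t | s_t).
I(x_t;\, s_{t+1} \mid s_t)
  \;\gtrsim\; \mathbb{E}\big[\log q_\psi(x_t \mid s_t, s_{t+1}) - \log p_\phi(x_t \mid s_t)\big],
\qquad
r_t \;:=\; \log q_\psi(x_t \mid s_t, s_{t+1}) - \log p_\phi(x_t \mid s_t).
```

Maximizing the return $\sum_t r_t$ with any standard RL algorithm then maximizes a variational estimate of the mutual information between the user's inputs and the induced state transitions, which matches the per-transition reward described in the replies above.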