qzed / irl-maxent

Maximum Entropy and Maximum Causal Entropy Inverse Reinforcement Learning Implementation in Python
MIT License

Supporting MDPs with negative reward states? #4

Open kierad opened 1 year ago

kierad commented 1 year ago

Hello, thanks for sharing your code. Is it possible to use this for MDPs with negative reward states?

I've tried setting negative rewards inside setup_mdp() in example.py, for example:

def setup_mdp():
    """
    Set-up our MDP/GridWorld
    """
    # create our world
    world = W.IcyGridWorld(size=5, p_slip=0.2)

    # set up the reward function
    reward = np.zeros(world.n_states)
    reward[-1] = 1.0      # positive reward at the final (terminal) state
    reward[17] = -0.75    # negative-reward states
    reward[18] = -0.75
    reward[19] = -0.75

    # set up terminal states
    terminal = [24]

    return world, reward, terminal

-0.75 seems to be about the lowest value I can set; anything lower than that, and running example.py results in an error:

Traceback (most recent call last):
  File "/Users/kierad/Documents/GitHub/irl-maxent/src/example.py", line 141, in <module>
    main()
  File "/Users/kierad/Documents/GitHub/irl-maxent/src/example.py", line 113, in main
    trajectories, expert_policy = generate_trajectories(world, reward, terminal)
  File "/Users/kierad/Documents/GitHub/irl-maxent/src/example.py", line 51, in generate_trajectories
    tjs = list(T.generate_trajectories(n_trajectories, world, policy_exec, initial, terminal))
  File "/Users/kierad/Documents/GitHub/irl-maxent/src/irl_maxent/trajectory.py", line 128, in <genexpr>
    return (_generate_one() for _ in range(n))
  File "/Users/kierad/Documents/GitHub/irl-maxent/src/irl_maxent/trajectory.py", line 126, in _generate_one
    return generate_trajectory(world, policy, s, final)
  File "/Users/kierad/Documents/GitHub/irl-maxent/src/irl_maxent/trajectory.py", line 77, in generate_trajectory
    action = policy(state)
  File "/Users/kierad/Documents/GitHub/irl-maxent/src/irl_maxent/trajectory.py", line 169, in <lambda>
    return lambda state: np.random.choice([*range(policy.shape[1])], p=policy[state, :])
  File "mtrand.pyx", line 956, in numpy.random.mtrand.RandomState.choice
ValueError: probabilities are not non-negative

And even with the above setup_mdp, the IRL methods don't seem to produce negative reward estimates (see the colourbar I've added):

True rewards: [image reward_estimate_true]

Estimated rewards with maxent: [image reward_estimate_maxent]

Estimated rewards with causal maxent: [image reward_estimate_maxent_causal]

qzed commented 1 year ago

Ah, this is kinda interesting, thanks for spotting this (needless to say, I didn't test with negative rewards...)

For the crash: What's happening here is that it's not the IRL-maxent algorithm itself that fails, but the "expert data" generation part of the example. As far as I understand, IRL-maxent (both the causal and the normal variant) should handle negative rewards just fine (more on that later).

To generate the "expert data", I just run a normal value iteration, use that to compute a stochastic policy for the "expert" by computing Q(s, a) / V(s) (where V(s) = sum_a Q(s, a)), and then execute that policy to create some trajectories (see here). The policy computation is where this goes wrong, because simply dividing those values is a bit nonsensical: for positive Q(s, a) and V(s) it does work out to a valid probability distribution, but as you've spotted, as soon as any of those values are negative, it doesn't.
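To make the failure mode concrete, here is a small illustrative sketch (hypothetical values, not the actual code from trajectory.py) of what that naive normalization does once Q-values go negative:

import numpy as np

# Hypothetical Q-values for a single state with 4 actions; some are
# negative because the reward function contains negative entries.
q = np.array([0.6, -0.4, 0.3, -0.1])

# Naive normalization: Q(s, a) / V(s) with V(s) = sum_a Q(s, a)
p = q / q.sum()
print(p)  # values 1.5, -1.0, 0.75, -0.25: not a valid probability distribution

# np.random.choice then rejects it:
# np.random.choice(len(q), p=p)  # ValueError: probabilities are not non-negative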

What this should have been doing in the first place is using a softmax to compute the probability distribution. I've fixed that in 918044c110924ddb81c850dc63a1b03e60adc98b.
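A minimal sketch of that idea (assuming a Q-table of shape (n_states, n_actions); not necessarily the exact code from the commit):

import numpy as np

def stochastic_policy_from_q(q):
    """Per-state softmax over Q-values; valid even for negative Q-values."""
    z = q - q.max(axis=1, keepdims=True)   # subtract per-state max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Each row sums to one and is non-negative, so np.random.choice is happy:
q = np.array([[0.6, -0.4, 0.3, -0.1]])
print(stochastic_policy_from_q(q))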

For recovering negative rewards: I guess theoretically, the algorithm should be able to do that, but it might be hard to "convince" it to do so. Essentially, it just tries to find some reward parameters that results in the same distribution of states visited as in the expert trajectories. And those parameters are generally not unique, meaning multiple parameters can result in the same state visitation frequency (and policy).