Open kierad opened 1 year ago
Ah, this is kinda interesting, thanks for spotting this (needless to say, I didn't test with negative rewards...)
For the crash: What's happening here is that the IRL-maxent algorithm itself doesn't fail, but the "expert data" generation part of the example does. As far as I understand, IRL-maxent (both causal and normal) should support negative rewards just fine (more on that later).
To generate the "expert data", I'm just running a normal value iteration, using that to compute a stochastic policy for the "expert" by computing Q(s, a) / V(s) (where V(s) = sum_a Q(s, a)), and then using that policy to generate some trajectories (see here). The policy computation part is where this goes wrong, because just dividing those values is a bit nonsensical. For positive Q(s, a) and V(s), that does work out to a valid probability distribution, but as you've spotted: as soon as any of those values are negative, it doesn't.
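To illustrate the problem with hypothetical Q-values (not taken from the example code): dividing by the sum still gives values that sum to one, but once any Q-value is negative, the result can contain negative "probabilities".

```python
import numpy as np

# Hypothetical Q-values for a single state with two actions; negative
# rewards can easily make Q-values negative.
q = np.array([-1.0, 2.0])
v = q.sum()    # V(s) = sum_a Q(s, a) = 1.0

p = q / v      # "policy" computed by naive normalization
print(p)       # [-1.  2.] -- contains a negative entry
print(p.sum()) # 1.0 -- sums to one, but is not a valid distribution
```

So the entries still sum to one, but `p[0]` is negative, which is exactly what trips up the trajectory sampling.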
What this should have been doing in the first place is using a softmax to compute the probability distribution. I've fixed that in 918044c110924ddb81c850dc63a1b03e60adc98b.
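A minimal sketch of the softmax-based policy (the function name and temperature parameter here are illustrative, not the exact code from the commit):

```python
import numpy as np

def softmax_policy(q, temperature=1.0):
    """Stochastic policy from Q-values via a softmax. Always yields a
    valid probability distribution, regardless of the sign of the Q-values."""
    z = q / temperature
    z = z - z.max()      # shift for numerical stability (doesn't change result)
    e = np.exp(z)
    return e / e.sum()

p = softmax_policy(np.array([-1.0, 2.0]))
# all entries lie in (0, 1) and sum to 1, even with negative Q-values
```

The exponential maps any real Q-value to a positive number, so negative rewards are no longer a problem, and higher-value actions still get higher probability.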
For recovering negative rewards: I guess theoretically, the algorithm should be able to do that, but it might be hard to "convince" it to do so. Essentially, it just tries to find some reward parameters that result in the same distribution of visited states as in the expert trajectories. And those parameters are generally not unique, meaning multiple parameter settings can result in the same state visitation frequency (and policy).
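One simple way to see this non-uniqueness: shifting all rewards by a constant changes the value function but not the optimal policy, so two quite different reward vectors (one entirely non-negative, one with negative entries) can explain the same expert behaviour. A toy demonstration on a made-up 3-state chain MDP (purely illustrative, not from the repo):

```python
import numpy as np

# Deterministic 3-state chain: action 0 moves left, action 1 moves right,
# clamped at the ends. next_state[s, a] gives the successor state.
n_states, gamma = 3, 0.9
next_state = np.array([[0, 1], [0, 2], [1, 2]])

def greedy_policy(reward, iters=500):
    """Run value iteration and return the greedy policy."""
    v = np.zeros(n_states)
    for _ in range(iters):
        q = reward[:, None] + gamma * v[next_state]
        v = q.max(axis=1)
    return q.argmax(axis=1)

r = np.array([0.0, -0.5, 1.0])
pi1 = greedy_policy(r)
pi2 = greedy_policy(r - 2.0)         # shift all rewards by a constant
print(np.array_equal(pi1, pi2))      # True: same policy, different rewards
```

Since the expert data only constrains the policy (via the state visitation frequencies), nothing forces the learned parameters toward the negative-valued member of such an equivalence class.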
Hello, thanks for sharing your code. Is it possible to use this for MDPs with negative reward states?
I've tried setting negative rewards inside `setup_mdp()` in `example.py`, e.g. like:

-0.75 seems to be around the lowest I can set - lower than that, and running `example.py` results in an error:

And even with the above `setup_mdp`, the IRL methods don't seem to produce negative reward estimates (see the colourbar I've added):

True rewards:
Estimated rewards with maxent:
Estimated rewards with causal maxent: