ikostrikov / rlpd

MIT License

Details on the data used to train in Adroit Sparse environments #2

Closed StoneT2000 closed 1 year ago

StoneT2000 commented 1 year ago

I was reading through the paper but couldn't find details on the exact data used for Adroit Sparse tests.

One question I have is for all the demonstrated results how many transitions (or how many episodes/demos) are used for each of the 3 tasks as part of the offline buffer?

Another question: going through the code, I noticed that by default only 90% of the Adroit expert dataset seems to be used, and it also appears that behavior cloning (BC) data is included. What is this BC data, and how was it collected?

philipjball commented 1 year ago

How many transitions (or how many episodes/demos) are used for each of the 3 tasks as part of the offline buffer?

I think we only used 22 expert demos, and then 450 BC trajectories. There are a couple of weird quirks in this dataset. 1) I think the horizons of any given trajectory may actually "overrun" the horizon encoded in the MuJoCo environment. A challenge is therefore "task stitching": making sure we find not just a solution, but the optimal solution. 2) The "expert" demos aren't actually perfect; sometimes they fail to solve the task at all. This matters especially when evaluating settings with very few expert demos. Fun fact: if you sort the demos by return (largest to smallest) and use just the top 5 demos, RLPD will still work (see the sketch below).
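For the "top 5 demos" experiment, here is a minimal sketch of how one might rank demo episodes by return and keep only the top k before seeding the offline buffer. It assumes a D4RL-style dict dataset with `observations`, `actions`, `rewards`, and `terminals` arrays; the field names and the helper itself are illustrative, not the exact RLPD loading code:

```python
# Hypothetical sketch (not from the RLPD codebase): split a D4RL-style dict
# dataset into episodes, rank them by return, and keep only the top-k demos.
import numpy as np

def top_k_episodes(dataset, k=5):
    """dataset: dict of arrays ('observations', 'actions', 'rewards', 'terminals')."""
    # Episode boundaries: indices where an episode ends. For demos that "overrun"
    # the environment horizon, you may need to also split on a 'timeouts' flag.
    ends = np.where(dataset["terminals"] > 0)[0]
    starts = np.concatenate([[0], ends[:-1] + 1])

    episodes = []
    for s, e in zip(starts, ends + 1):
        ep = {key: arr[s:e] for key, arr in dataset.items()}
        episodes.append((ep["rewards"].sum(), ep))

    # Sort by return, largest first, and keep the top k.
    episodes.sort(key=lambda pair: pair[0], reverse=True)
    top = [ep for _, ep in episodes[:k]]

    # Re-concatenate into a flat dataset to seed the offline buffer.
    return {key: np.concatenate([ep[key] for ep in top]) for key in dataset}
```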

What is this BC data / how was the BC data collected?

This is a good question. We got this data from the AWAC paper, so for concrete implementation-level details you may need to reach out to Ashvin Nair. From my brief chats with him, the BC data comes from a simple 2-layer MLP trained to clone the human expert data, which is then deployed (e.g., the BC agent is rolled out 500 times). The "problem" is that the human data is effectively a POMDP w.r.t. this class of function approximator, because the human actions are likely influenced by the history up to that point in time rather than purely the current observation. This means the BC agent is likely pretty suboptimal. I didn't do this, but a useful sanity check would be to look at this BC data and see how "good" it is (e.g., return, visualize the trajectories).
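A minimal sketch of such a sanity check, again assuming a D4RL-style dict layout with `rewards` and `terminals` (the field names are assumptions, not the actual AWAC/RLPD data format):

```python
# Hypothetical sanity check (my own sketch, not from the paper or repo): summarize
# per-episode returns of the BC-generated trajectories to gauge how "good" they are.
import numpy as np

def summarize_returns(dataset):
    """dataset: dict with 'rewards' and 'terminals' arrays, episodes stored back to back."""
    ends = np.where(dataset["terminals"] > 0)[0]
    starts = np.concatenate([[0], ends[:-1] + 1])
    returns = np.array([dataset["rewards"][s:e + 1].sum() for s, e in zip(starts, ends)])

    print(f"episodes:          {len(returns)}")
    print(f"return mean / std: {returns.mean():.1f} / {returns.std():.1f}")
    print(f"return min / max:  {returns.min():.1f} / {returns.max():.1f}")
    # With a sparse reward, a positive return means the task was solved at least once.
    print(f"fraction with any success: {(returns > 0).mean():.2%}")
    return returns
```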