ikostrikov / rlpd

MIT License

Details on the data used to train in Adroit Sparse environments #2

Closed StoneT2000 closed 1 year ago

StoneT2000 commented 1 year ago

I was reading through the paper but couldn't find details on the exact data used for Adroit Sparse tests.

One question I have is for all the demonstrated results how many transitions (or how many episodes/demos) are used for each of the 3 tasks as part of the offline buffer?

Another question: going through the code, I noticed that by default only 90% of the Adroit expert dataset seems to be used, and it also appears that behavior cloning (BC) data is included. What is this BC data, and how was it collected?

philipjball commented 1 year ago

How many transitions (or how many episodes/demos) are used for each of the 3 tasks as part of the offline buffer?

I think we only used 22 expert demos, and then 450 BC trajectories. There are a couple of weird quirks in this dataset. 1) I think the horizons of any given trajectory may actually "overrun" the horizon encoded in the MuJoCo environment. A challenge is therefore "task stitching": making sure we find not just a solution, but the optimal solution. 2) The "expert" demos aren't actually perfect; sometimes they fail to solve the task at all. This matters especially when evaluating settings with very few expert demos. Fun fact: if you sort the demos by return (largest to smallest) and use just the top 5 demos, RLPD will still work (see the sketch below).
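For the "top 5 demos" experiment, here is a minimal sketch of how one might rank demo episodes by return and keep only the top k before seeding the offline buffer. It assumes a D4RL-style dict dataset with `observations`, `actions`, `rewards`, and `terminals` arrays; the field names and the helper itself are illustrative, not the exact RLPD loading code:

```python
# Hypothetical sketch (not from the RLPD codebase): split a D4RL-style dict
# dataset into episodes, rank them by return, and keep only the top-k demos.
import numpy as np

def top_k_episodes(dataset, k=5):
    """dataset: dict of arrays ('observations', 'actions', 'rewards', 'terminals')."""
    # Episode boundaries: indices where an episode ends. For demos that "overrun"
    # the environment horizon, you may need to also split on a 'timeouts' flag.
    ends = np.where(dataset["terminals"] > 0)[0]
    starts = np.concatenate([[0], ends[:-1] + 1])

    episodes = []
    for s, e in zip(starts, ends + 1):
        ep = {key: arr[s:e] for key, arr in dataset.items()}
        episodes.append((ep["rewards"].sum(), ep))

    # Sort by return, largest first, and keep the top k.
    episodes.sort(key=lambda pair: pair[0], reverse=True)
    top = [ep for _, ep in episodes[:k]]

    # Re-concatenate into a flat dataset to seed the offline buffer.
    return {key: np.concatenate([ep[key] for ep in top]) for key in dataset}
```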

What is this BC data / how was the BC data collected?

This is a good question. We got this data from the AWAC paper, so for concrete implementation-level details you may need to reach out to Ashvin Nair. From my brief chats with him, the BC data comes from a simple 2-layer MLP trained to clone the human expert data, which is then deployed (e.g., the BC agent is rolled out 500 times). The "problem" is that the human data is effectively a POMDP w.r.t. this class of function approximator, because the human actions are likely influenced by the history up to that point in time rather than purely the current observation. This means the BC agent is likely pretty suboptimal. I didn't do this, but a useful sanity check would be to look at this BC data and see how "good" it is (e.g., return, visualize the trajectories).
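A minimal sketch of such a sanity check, again assuming a D4RL-style dict layout with `rewards` and `terminals` (the field names are assumptions, not the actual AWAC/RLPD data format):

```python
# Hypothetical sanity check (my own sketch, not from the paper or repo): summarize
# per-episode returns of the BC-generated trajectories to gauge how "good" they are.
import numpy as np

def summarize_returns(dataset):
    """dataset: dict with 'rewards' and 'terminals' arrays, episodes stored back to back."""
    ends = np.where(dataset["terminals"] > 0)[0]
    starts = np.concatenate([[0], ends[:-1] + 1])
    returns = np.array([dataset["rewards"][s:e + 1].sum() for s, e in zip(starts, ends)])

    print(f"episodes:          {len(returns)}")
    print(f"return mean / std: {returns.mean():.1f} / {returns.std():.1f}")
    print(f"return min / max:  {returns.min():.1f} / {returns.max():.1f}")
    # With a sparse reward, a positive return means the task was solved at least once.
    print(f"fraction with any success: {(returns > 0).mean():.2%}")
    return returns
```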