anh-nn01 opened this issue 1 month ago
Yes, the datasets that we released are sourced from the replay buffers of single-task agents, meaning that we train each single-task agent for 3M steps and save the entire replay buffer. Aside from the initial seeding phase where random actions are taken, actions are therefore generated by agents at various stages of training which makes the dataset quite diverse. Training an offline RL algorithm strictly on random data is unlikely to yield much success since most tasks cannot be solved with a random policy (e.g. pick and place would not be solved from data of a gripper moving around randomly in 3D space).
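A minimal sketch of the collection scheme described above: every transition is saved, so the buffer mixes random seeding-phase actions with actions from the agent at many stages of training. The function and argument names here are hypothetical, just to illustrate the idea:

```python
def collect_replay_buffer(policy_act, rand_act, total_steps=1000, seed_steps=100):
    """Save every transition over training, so the resulting buffer mixes
    random seeding actions with actions from an improving policy."""
    buffer = []
    for step in range(total_steps):
        if step < seed_steps:
            action = rand_act()        # seeding phase: random actions
        else:
            action = policy_act(step)  # agent at its current training stage
        buffer.append((step, action))  # transitions from all stages are kept
    return buffer

# toy stand-ins for the environment and agent, just to show the data's shape
buf = collect_replay_buffer(
    policy_act=lambda s: f"policy@{s}",
    rand_act=lambda: "random",
)
```

Because the whole buffer is kept, early entries come from a near-random policy and late entries from a near-converged one, which is what makes the released datasets diverse.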
> If possible, can you also share/publicize the code used for offline data collection?
We rely on torchrl for our replay buffer implementation. The `Storage` object has a `save` function which saves the replay buffer to disk. Documentation: https://pytorch.org/rl/stable/reference/generated/torchrl.data.replay_buffers.Storage.html#torchrl.data.replay_buffers.Storage.save
I am implementing data collection for an offline dataset to train the multitask model on a different set of tasks outside of mt80.
According to the paper:

> The datasets are sourced from the replay buffers of 240 single-task agents.

Does this mean that the transitions in the offline data use actions from `tdmpc2.act()`, where `tdmpc2` is the trained world model for each task? Do you observe any convergence issues and/or reduced sample efficiency if we use random actions from `env.rand_act()` instead of actions from the agent policy `tdmpc2.act()`?

If possible, can you also share/publicize the code used for offline data collection? It would be extremely helpful for anyone who wants to start their research and use TDMPC2 as a benchmark.
Thank you very much!