Excited by the work, great paper and open release.
I am interested in testing some ideas that will involve pretraining (e.g. architecture changes, etc.), likely without access to a real-world setup, at least at first. Just starting to look at the codebase.
Curious about recommendations for sim/offline evaluation. 1) For evaluation, are there any recommendations or best practices for splitting the datasets into train/validation/test, or for holding out entire RT-X datasets? Which offline metrics seem to be the most useful proxies for real-world performance? 2) I saw there are examples provided for sim finetuning. Are there any results you could share for simulated environments? And are there any sim envs that "work" for zero-shot evaluation in addition to finetuning?
Thanks for your interest and for the great questions!
In general, there is no single offline metric that correlates directly with real-world performance, since multiple factors can determine the success rate of a rollout: besides how closely the policy tracks the ground-truth end-effector position and orientation, the timing of closing the gripper can be quite critical. That said, we plot many metrics during training that should give a good overview.
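To make the kinds of metrics described above concrete, here is a minimal sketch of offline action-prediction metrics: pose tracking error plus gripper-timing accuracy. The function name, array shapes, and the 7-DoF action layout (3 position deltas, 3 orientation deltas, 1 binary gripper command) are assumptions for illustration, not the actual Octo logging code.

```python
import numpy as np

def offline_action_metrics(pred_actions, gt_actions):
    """Compare predicted vs. ground-truth actions over a trajectory.

    Both arrays are assumed to have shape (T, 7): 3 end-effector
    position deltas, 3 orientation deltas, and 1 gripper command.
    """
    # Tracking error on end-effector position and orientation deltas.
    pos_mse = float(np.mean((pred_actions[:, :3] - gt_actions[:, :3]) ** 2))
    rot_mse = float(np.mean((pred_actions[:, 3:6] - gt_actions[:, 3:6]) ** 2))
    # Gripper timing: fraction of timesteps where the thresholded
    # open/close command matches the ground truth.
    gripper_acc = float(
        np.mean((pred_actions[:, 6] > 0.5) == (gt_actions[:, 6] > 0.5))
    )
    return {"pos_mse": pos_mse, "rot_mse": rot_mse, "gripper_acc": gripper_acc}
```

Aggregating these per held-out dataset (rather than over the full mixture) makes it easier to see which embodiments or scenes a change in the architecture helps or hurts.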
We are actively looking into sim envs for exactly your use-case, stay tuned for some updates with the next release of Octo!
Thanks!