kyegomez / RT-2

Democratization of RT-2 "RT-2: New model translates vision and language into action"

How to use model #17

Closed AdrianSkib closed 10 months ago

AdrianSkib commented 1 year ago

Hi.

First of all, thank you so much for providing this repo. This model opens up so many possibilities, but only if it's open source.

From the paper I understand that the output of the model is either a string of words, a string of words with an action, or just an action, depending on the output mode. The action consists of a token indicating whether to finish the previous action before starting a new one, changes in x, y, and z, rotations around x, y, and z, and a desired extension of the end effector (gripper).
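For concreteness, this is roughly how I picture one decoded action record. It is just a sketch of my reading of the paper; the class and field names are my own and do not come from this repo:

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class RT2Action:
    # Whether to finish the previous action before starting a new one.
    terminate: bool
    # Changes in end-effector position: (dx, dy, dz).
    delta_position: Tuple[float, float, float]
    # Rotations around x, y, and z: (droll, dpitch, dyaw).
    delta_rotation: Tuple[float, float, float]
    # Desired extension/opening of the gripper.
    gripper_extension: float
```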

The example provided runs smoothly, but I'm unsure how to convert the eval_logits to the above output format.

Could you provide an example of this final step?

Thanks a lot in advance!

Upvote & Fund

Fund with Polar

kyegomez commented 10 months ago

@AdrianSkib This outputs just textual tokens; it does not output the action sequences. If you would like to make this happen, you can integrate TokenLearner:

https://zeta.apac.ai/en/latest/zeta/nn/modules/token_learner/
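For anyone who lands here later, a minimal sketch of the kind of de-tokenization step the paper describes: take the model's predicted token ids for the action positions and map each one back to a discretized bin, then de-normalize into the continuous action vector. None of this is implemented in the repo; the number of action tokens, the bin count, the value ranges, and the greedy argmax over `eval_logits` are all placeholder assumptions to illustrate the shape of the mapping.

```python
import torch

# Assumptions (not from this repo): the fine-tuned model emits 8 action tokens
# at the end of the sequence, each drawn from 256 discretized bins, and the
# vocabulary ids reserved for those bins are known (assumed 0..255 here).
NUM_ACTION_TOKENS = 8
NUM_BINS = 256
# Per-dimension (low, high) de-normalization ranges -- placeholder values.
ACTION_RANGES = [(0.0, 1.0)] + [(-1.0, 1.0)] * 6 + [(0.0, 1.0)]


def decode_action(eval_logits: torch.Tensor) -> list:
    """Greedy-decode the last NUM_ACTION_TOKENS positions of eval_logits
    (shape [seq_len, vocab_size]) into a continuous 8-dim action vector:
    [terminate, dx, dy, dz, droll, dpitch, dyaw, gripper]."""
    token_ids = eval_logits[-NUM_ACTION_TOKENS:].argmax(dim=-1)
    # Map vocabulary ids back to bin indices (identity here; adapt this to
    # however action tokens were assigned in your tokenizer during training).
    bin_ids = token_ids.clamp(max=NUM_BINS - 1).tolist()
    return [
        low + (b / (NUM_BINS - 1)) * (high - low)
        for b, (low, high) in zip(bin_ids, ACTION_RANGES)
    ]
```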