Open ssvision opened 1 month ago
I am also confused about the model’s output. In the example, an input sequence of 6 image frames with shape (2, 3, 6, 224, 224) produces an output of shape (2, 6, 11, 256). I would expect that, after applying argmax or something similar, the output would have shape (2, 6, action). Does this mean the output represents the actions for the next six time steps? Or is there another way to obtain the action for a single step?
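To make the shape question concrete, here is a minimal NumPy sketch of what "applying argmax" would do to a (2, 6, 11, 256) tensor. It assumes, without confirmation from this thread, that the 11 is the number of action dimensions and the 256 is the number of discretization bins per dimension. The function name `decode_action_logits`, the value range [-1, 1], and the choice of the last frame as the "current" action are all hypothetical and only there for illustration.

```python
import numpy as np


def decode_action_logits(logits, low=-1.0, high=1.0):
    """Turn per-bin scores into bin indices and continuous values.

    logits: array of shape (batch, frames, action_dims, bins).
    low/high: ASSUMED value range covered by the bins (hypothetical).
    """
    num_bins = logits.shape[-1]
    # Pick the highest-scoring bin for every action dimension.
    bin_ids = np.argmax(logits, axis=-1)  # (batch, frames, action_dims)
    # Map bin index 0..num_bins-1 linearly onto [low, high].
    actions = low + bin_ids / (num_bins - 1) * (high - low)
    return bin_ids, actions


# Random stand-in for the model output discussed above.
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 6, 11, 256))

bin_ids, actions = decode_action_logits(logits)
print(bin_ids.shape)  # (2, 6, 11)

# ASSUMPTION: the prediction at the last frame is the action to execute now.
current_action = actions[:, -1]
print(current_action.shape)  # (2, 11)
```

Under these assumptions, argmax gives (2, 6, 11) rather than (2, 6, action): one bin index per action dimension per frame. Whether the six frame positions correspond to six future steps or to per-frame predictions is exactly the open question and is not settled by this sketch.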
My current setup consists of a Universal Robots UR5e 6-DoF arm with an OnRobot gripper. An Intel RealSense camera is mounted on the head and is static (think of it as a single-arm humanoid robot with a head-mounted camera). When I run the model, i.e. pass it an image and an instruction, it is supposed to output 7 values for the action space: (x, y, z, roll, pitch, yaw, gripper state). My questions are as follows: