Closed — ShengranHu closed this issue 3 months ago
Thanks so much for your interest! The VLM has no information about the object ID, but the LLM does. Hence, through imitation learning, the VLM learns to manipulate the correct objects. I hope this addresses your question.
Thanks for your answer.
I think I now understand how it works, but it is still somewhat counter-intuitive to me. For held-out tasks (e.g. the seen/unseen valid/test sets), the VLM cannot get any info about the object ID, and the object IDs it learned during training will not help. I am surprised that it can still predict the correct object ID in this scenario (I think that's even theoretically impossible).
Would you consider releasing the evaluation code for your model in ALFWorld sometime? Thanks.
I understand your concern. We actually do imitation learning on the 134 unseen tasks (unseen in the SFT stage), which is exactly consistent with other baselines such as vision-only or language-only agents. Moreover, this is consistent with the common evaluation methodology in imitation or reinforcement learning (i.e., training and evaluating within the same environment). How to obtain accurate object IDs from visual input alone is still an open problem, and methods from the field of scene graph construction may help. We will continue to explore this direction. Thanks so much for your insightful comments.
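For readers unfamiliar with the setup being discussed: the core idea (imitating an expert whose actions already contain the correct object IDs) can be sketched as plain behavior cloning. Everything below is a toy assumption for illustration (a linear softmax policy, made-up dimensions and action strings), not the paper's actual model or code:

```python
# Toy behavior-cloning sketch: a "student" policy is trained with
# cross-entropy to reproduce expert actions that include object IDs
# (e.g. "pick up pan 2"). Purely illustrative; all names/shapes are assumptions.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class StudentPolicy:
    def __init__(self, obs_dim, n_actions, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.01, size=(obs_dim, n_actions))
        self.lr = lr

    def act_probs(self, obs):
        return softmax(obs @ self.W)

    def imitation_step(self, obs, expert_action):
        """One cross-entropy (behavior-cloning) gradient step toward the expert."""
        probs = self.act_probs(obs)
        grad = probs.copy()
        grad[np.arange(len(obs)), expert_action] -= 1.0  # dL/dlogits
        self.W -= self.lr * obs.T @ grad / len(obs)
        # Negative log-likelihood of the expert's actions:
        return -np.log(probs[np.arange(len(obs)), expert_action]).mean()

# Demo: two stand-in "observations"; the expert (the LLM in the discussion
# above) labels each with the action containing the correct object ID.
ACTIONS = ["pick up pan 1", "pick up pan 2"]
obs = np.eye(2)              # stand-ins for pixel observations
expert = np.array([0, 1])    # expert actions, with object IDs resolved
policy = StudentPolicy(obs_dim=2, n_actions=2)
for _ in range(200):
    policy.imitation_step(obs, expert)
pred = policy.act_probs(obs).argmax(axis=1)
print([ACTIONS[i] for i in pred])
```

The point of the sketch is the one made in the thread: the student never sees object IDs as an explicit input; it only learns to reproduce expert actions within the same environment, which is why this works under the standard imitation-learning evaluation but does not solve ID grounding from pixels in a new environment.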
And yes, I will release the evaluation code.
Thank you very much for your answer!
Hello. It's a nice paper and thank you for releasing the model code.
I have a question about your environment setting. From your paper and your SFT data, the inputs to the VLM at each step include only the task description and the pixel observation, and the action space for the VLM agent is something like "pick up pan 2". However, I am confused about how the VLM can know the object IDs from the pixel observation (e.g., how does it know the pan in the pixel observation is pan 2?). Or does the VLM agent also receive a text observation?
Thank you in advance.