How did VLM get object id from the pixel observation?

stevenyangyj / Emma-Alfworld

Official code for the paper: Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld


How did VLM get object id from the pixel observation? #2

Closed ShengranHu closed 3 months ago

ShengranHu commented 3 months ago

Hello. It's a nice paper and thank you for releasing the model code.

I have a question about your environment setting. From your paper and your SFT data, the inputs to the VLM at each step include only the task description and the pixel observation, and the action space for the VLM agent is something like "pick up pan 2". However, I am confused about how the VLM can know the object IDs from the pixel observation (e.g., how does it know that the pan in the pixel observation is pan 2?). Or does the VLM agent also get a text observation?

Thank you in advance.

stevenyangyj commented 3 months ago

Thanks so much for your interest! The VLM has no information about the object IDs, but the LLM does. Hence, through imitation learning, the VLM learns to manipulate the correct objects. I hope this addresses your question.
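To make the idea above concrete, here is a minimal sketch of how imitation targets could carry object IDs even though the student only sees pixels. It is an illustration under stated assumptions, not the repository's code: `Step`, `toy_llm_expert`, and `build_imitation_pairs` are hypothetical names, the frame is a placeholder string rather than an image, and the "expert" is a simple regex stand-in for an LLM that reads the parallel text observation.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    """One environment step in both modalities (hypothetical structure)."""
    frame: str      # placeholder for the pixel observation, e.g. an image path
    text_obs: str   # parallel text observation; object IDs appear only here


def toy_llm_expert(task: str, text_obs: str) -> str:
    """Stand-in for the LLM expert (assumption): it reads the text
    observation, so it can name the target object together with its ID."""
    target = task.split()[-1]  # e.g. "pan" from "pick up a pan"
    match = re.search(rf"\b{re.escape(target)} (\d+)\b", text_obs)
    if match is None:
        return "look"  # fallback action when the target is not mentioned
    return f"pick up {target} {match.group(1)}"


def build_imitation_pairs(
    task: str,
    episode: List[Step],
    expert: Callable[[str, str], str],
) -> List[Tuple[Tuple[str, str], str]]:
    """Pair (task, frame) inputs with expert actions as imitation targets."""
    pairs = []
    for step in episode:
        action = expert(task, step.text_obs)        # label comes from the text side
        pairs.append(((task, step.frame), action))  # student input omits text_obs
    return pairs


episode = [
    Step(frame="frame_000.png",
         text_obs="On the stoveburner 1, you see a pan 2."),
]
pairs = build_imitation_pairs("pick up a pan", episode, toy_llm_expert)
print(pairs)
```

In this sketch the ID in the target string comes entirely from the text side, so a student trained on such pairs can only reproduce IDs it has effectively memorized for the environments it was trained in.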

ShengranHu commented 3 months ago

Thanks for your answer.

I think I now understand how it works, but it is still somewhat counter-intuitive to me. For held-out tasks (e.g., the seen/unseen valid/test sets), the VLM cannot get any information about the object IDs, and the object IDs it learned during training will not help. I am surprised that it can still predict the correct object ID in this scenario (I think that is even theoretically impossible).

Would you consider releasing the evaluation code for your model in ALFworld sometime? Thanks.

stevenyangyj commented 3 months ago

I understand your concern. We actually do imitation learning on the 134 unseen tasks (unseen during the SFT stage), which is exactly consistent with other baselines such as vision-only or language-only agents. Moreover, this is consistent with the common evaluation methodology in imitation or reinforcement learning (i.e., training and evaluation within the same environment). How to get accurate object IDs from visual input alone is still an open problem, and methods from the field of scene graph construction may help. We will continue to explore this direction. Thanks so much for your insightful comments.

And yes, I will release the evaluation code.

ShengranHu commented 3 months ago

Thank you very much for your answer!