microsoft / GenerativeImage2Text

GIT: A Generative Image-to-text Transformer for Vision and Language
MIT License

About generation results. #19

Closed victorup closed 1 year ago

victorup commented 2 years ago

Hi,

When I use GIT_LARGE_COCO to generate captions, many of the results contain "[ unused0 ]" tokens.

What is "[ unused0 ]"? Does it mean an unknown word? Why does the model generate so many "[ unused0 ]" tokens, and how can I avoid this?

Thanks!

amsword commented 2 years ago

Can you first try the image shared in the README and see whether [unused0] still appears? In the meantime, can you also share the image you are testing? I can test it on my side to investigate further.

victorup commented 2 years ago

The caption for the image in the README is generated very well. Sorry, I'm not able to share the images. But I checked my test set further and found that the "[ unused0 ]" tokens usually appear when the image contains a strange object, so I think the token may mean an unknown word. I will clean my data for better results. Thank you!

prashantkandel12 commented 2 years ago

I've got the same issue, and I can share the image along with its caption.

The output is: [unused0] sticking his tongue out

amsword commented 2 years ago

Thanks for reporting this. We will investigate this issue further. I tried the base version, which gives reasonable results.

amsword commented 2 years ago

In CC12M, person names are replaced with \<PERSON>, and here [unused0] is used in place of that special token. CC12M is used for the LARGE model but not for the BASE model, so [unused0] can be interpreted as a person.

Original CC12M data examples:

['The source of Anime quotes & Manga quotes : Photo <PERSON>, Manga Quotes, '
 'Art Images, Fan Art, Thoughts, Think, Anime, Crying, Random\n',
 '<PERSON> with Bindi, <PERSON> and <PERSON> before he left the zoo, and lost '
 "contact with his late son's family. Photo: Getty Images\n",
 'The wedding of <PERSON> and Ashleigh McDonald Photography 11\n',
 'An artist rendering shows Supreme Court Justices from left, <PERSON>, '
 '<PERSON>, <PERSON>, <PERSON>, Chief Justice <PERSON>, <PERSON>, <PERSON>, '
 '<PERSON>, and <PERSON> inside Supreme Court in W\n',
 "'(Day 10) <PERSON>'s team clinches the BAT Grad Academy's 'Best Place to "
 "Work For' award at the business simulation!'' Do you aspire to work in top "
 'global company? Look no further. BAT is well known for being one of the '
 "world's best companies to work at, certified as a Top Employer around the "
 'world. BAT Malaysia has also won several HR excellence awards, leading in '
 'several categories including Employee Engagement and Best Companies to Work '
 'for in Asia. This is testament to the initiatives and efforts invested into '
 "their people agenda.'\n"]

cc12M paper image
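To make the mapping described above concrete, here is a minimal sketch of how a `<PERSON>` tag in a CC12M caption could end up as `[unused0]` in the training text. The function name `map_person_tag` is hypothetical and this is not the repository's actual preprocessing code; it only illustrates the substitution.

```python
# Hypothetical illustration only: not the actual GIT preprocessing code.
PERSON_TAG = "<PERSON>"      # anonymized name tag found in CC12M captions
PLACEHOLDER = "[unused0]"    # token that appears in place of the tag

def map_person_tag(caption: str) -> str:
    """Replace every CC12M person tag with the placeholder token."""
    return caption.replace(PERSON_TAG, PLACEHOLDER)

example = "The wedding of <PERSON> and Ashleigh McDonald Photography 11"
print(map_person_tag(example))
```

A model trained on text like this would learn to emit `[unused0]` wherever a caption would otherwise name a person, which matches the outputs reported in this thread.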

prashantkandel12 commented 2 years ago

Thank you for the answer. The results were the same for me too: the base model had no issues, but the large model gave me these results.

DDuan-zw commented 2 years ago

Is there any way to solve it?

amsword commented 2 years ago

One way is to retrain the large model without using such special tokens. I will try to do this.
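Until a retrained model is available, a possible stopgap is to post-process the generated captions. The sketch below assumes that `[unused0]` always stands for a person, as suggested earlier in this thread; the helper name `replace_unused0` and the default replacement text are assumptions, not part of the repository.

```python
import re

# Matches "[unused0]" as well as the spaced variant "[ unused0 ]".
UNUSED0 = re.compile(r"\[\s*unused0\s*\]")

def replace_unused0(caption: str, replacement: str = "a person") -> str:
    """Swap the placeholder token for readable text and tidy whitespace."""
    cleaned = UNUSED0.sub(replacement, caption)
    # Collapse any repeated spaces left over from the substitution.
    return re.sub(r"\s+", " ", cleaned).strip()

print(replace_unused0("[unused0] sticking his tongue out"))
```

This only changes how the caption reads; it does not change what the model predicts, so retraining remains the cleaner fix.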

prashantkandel12 commented 1 year ago

I have noticed another interesting thing about the generation results: the model gives the same output, "digital art selected for the #", for all of the following images, which were all generated with Stable Diffusion.

amsword commented 1 year ago

@prashantkandel12 @DDuan-zw I removed the offensive captions in cc12m dataset and retrained the large-sized model. Please check the details here.

For the image of Einstein, the GIT_LARGE_R_COCO model predicts: 'a black and white photo of a man sticking his tongue out.'

amsword commented 1 year ago

> I have noticed another interesting thing with the generation results: It is giving same output "digital art selected for the #" for all of the following images: All of these images were generated by using stable diffusion.

@prashantkandel12 I tried the GIT_LARGE_COCO model, and the output is as follows. I assume you were using the pretrained model. For demo purposes, it is recommended to use the fine-tuned models, as the pretraining dataset is quite noisy.

3d rendering of a woman wearing a virtual reality headset.