joeyz0z / ConZIC

Official implementation of "ConZIC: Controllable Zero-shot Image Captioning by Sampling-Based Polishing"
MIT License

A puzzled question when running the zero-shot demo #13

Closed CrazyBrick closed 1 year ago

CrazyBrick commented 1 year ago

Thank you very much for your excellent work! I am using the script you mentioned ("To run zero-shot captioning on images"), and I have a question: ConZIC seems to prioritize, or tend toward, recognizing women in image descriptions (many men are described as women). What are the possible reasons for this? Is it related to the distribution of the training dataset you used, or to the alpha and beta values (I used --alpha 0.5 --beta 1.5)?

joeyz0z commented 1 year ago

Could you provide your test images so that I can reproduce and analyze the gender bias problem you encountered?

CrazyBrick commented 1 year ago

@joeyz0z Sure! Could you provide your email address?

joeyz0z commented 1 year ago

My email is zzequn99@163.com.

joeyz0z commented 1 year ago

I have received your email. The test images and results are interesting (KunKun). After some trials, I found that the generated texts do not match the image contents well (reported CLIP score < 0.3) because alpha=0.5 is too high: it pushes the word distribution toward BERT's original predictions instead of ConZIC's image-guided ones. Our default value of alpha is 0.02, and only minor adjustments to it are recommended. Also, for multiple images we recommend run.py, which is designed for inference on image datasets (batch size > 1); demo.py is meant for simple demonstration, and we now have a more convenient UI demo: python app.py. The following are some results on your test images with the default hyper-parameter settings. [result images]
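For intuition, here is a schematic sketch of how alpha and beta trade off during sampling, based only on the behavior described above (the function name and the additive mixing are assumptions, not the repository's actual code; the beta value below is taken from the user's setting, since the default is not stated in this thread): BERT proposes candidate words at a position, CLIP scores the resulting sentences against the image, and the two signals are mixed with weights alpha and beta.

```python
import torch

def blend_word_distribution(bert_probs, clip_scores, alpha=0.02, beta=1.5):
    """Schematic of ConZIC-style candidate scoring (illustrative sketch only).

    bert_probs:  (K,) BERT probabilities for K candidate words at one position
    clip_scores: (K,) CLIP image-text similarities of the K candidate sentences
    Returns a sampling distribution over the K candidates.
    """
    fluency = bert_probs / bert_probs.sum()       # normalized LM signal
    matching = torch.softmax(clip_scores, dim=0)  # normalized image signal
    # Raising alpha from 0.02 toward 0.5 shifts weight onto the fluency term,
    # so captions drift toward BERT's priors instead of the image content.
    mixed = alpha * fluency + beta * matching
    return mixed / mixed.sum()

# Toy example with three candidate words:
bert_probs = torch.tensor([0.6, 0.3, 0.1])      # BERT prefers candidate 0
clip_scores = torch.tensor([0.20, 0.35, 0.25])  # CLIP prefers candidate 1
print(blend_word_distribution(bert_probs, clip_scores))
```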

CrazyBrick commented 1 year ago

Thank you for your sincere and prompt reply; it really helped me a lot. (I think that when the sentence length is too long, the caption contains a lot of redundant and inaccurate information. That is why I initially set length=7, increased the fluency weight alpha, and reduced beta.) I've tried all of your advice.

(From the current results, I find it quite interesting that ConZIC seems to perform OCR or something like it. Is this something you added, or is it part of the baseline?)

I like the results in your pictures; how can I fully reproduce them?

joeyz0z commented 1 year ago

Usually, a CLIP score of 0.38 is considered a good result, and the CLIP score correlates positively with sentence length. In branch v0.2 we added a simple trick named stable_replace, which to some extent helps with faster convergence and a higher score. Currently we do not use OCR or anything similar; in fact, the original CLIP paper claims some ability at zero-shot OCR. The results in my pictures were produced with the code in branch v0.2.
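The stable_replace behavior isn't spelled out in this thread, so the following is only a guess at the kind of trick it could be (check branch v0.2 for the real implementation): during polishing, a position's word is replaced only when the candidate does not lower the sentence score, so each pass cannot make the caption worse.

```python
def stable_replace_step(current_word, current_score, candidate_word, candidate_score):
    """Hypothetical sketch of a 'stable replace' step. This is an assumed
    reading of the trick's name, not ConZIC's actual v0.2 code: accept a
    replacement only if it does not decrease the sentence score, so the
    polishing loop converges monotonically.
    """
    if candidate_score >= current_score:
        return candidate_word, candidate_score
    return current_word, current_score

# Usage inside a polishing loop (schematic):
word, score = "cat", 0.31
word, score = stable_replace_step(word, score, "kitten", 0.34)  # accepted
word, score = stable_replace_step(word, score, "dog", 0.29)     # rejected
```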