[Feature request] Request for Image Captioning Script for Japanese Text-to-Image Generation

alfredplpl commented 10 months ago

feature

I hope this message finds you well. I am reaching out to extend my heartfelt gratitude for the assistance you rendered the other day. It was of immense help and I am truly appreciative.

I am currently working on a project involving Japanese text-to-image generation and am in need of a script capable of image captioning to advance this endeavor. Your expertise in this domain could significantly accelerate the progress of this project, and I would be extremely grateful for any assistance or guidance you could provide.

For example, I tried image captioning with your CLI script, and it works great. [image: stable-diffusion-xl (1)]

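USER (translated from Japanese): Please describe this illustration in Japanese in as much detail as possible. Pay attention to the facial expression, hair color, eye color, type of ears, clothing, and clothing colors. Avoid repetition in the description.
ASSISTANT (translated from Japanese): This illustration depicts a character from Japanese anime or manga. The female character has large, fox-like ears and a gentle expression. She is wearing a purple shirt and an orange jacket. Her hair is soft and long, falling to her shoulders. She also has large audio cups wrapped around both ears. This character has an appealing design for fans of Japanese anime and manga.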

I would like to caption a very large number of images (approx. 400k+).

If you have an existing script or could point me in the direction of resources or individuals proficient in this domain, it would be greatly appreciated. I am more than willing to discuss this further at your convenience, and am open to collaboration or any form of assistance you could extend.

Thanks in advance.

alfredplpl commented 10 months ago

I wrote cli_batch.py: https://github.com/alfredplpl/LLaVA/blob/main/llava/serve/cli_batch.py May I send a PR?

For example, I ran it with the following command:

python -m llava.serve.cli_batch --model-path liuhaotian/llava-v1.5-13b \
--load-8bit \
--prompt 'このイラストを日本語でできる限り詳細に説明してください。表情や髪の色、目の色、耳の種類、服装、服の色など注意して説明してください。説明は反復を避けてください。' \
--image-folder '/mnt/NVM/test' \
--output-csv '/mnt/NVM/test/metadata.csv'

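Then I got the following CSV file (captions translated from Japanese):

file_name,text
17.jpg,</s>
18.jpg,"This illustration was produced by the Japanese animation studio “Ghibli” and depicts the main character, a girl. She has long black hair and amber eyes. She is wearing a purple dress that reaches her waist, with a white ribbon at the collar. Her face is calm, and she is smiling. Her ears are small and shaped like human ears. She has her arms wrapped around her stomach.</s>"
9.jpg,</s>
12.jpg,"“I am a slender woman with purple hair and purple eyes. I have fair skin and wear my purple hair tied in a ponytail. I have purple eyes, and my purple hair and eyes create an exquisite harmony. I have fair skin, and my purple hair and eyes bring out my distinctive look. I wear my purple hair tied in a ponytail. My purple hair and eyes bring out my distinctive look. I have fair skin, and my purple hair and eyes bring out my distinctive look. I wear my purple hair tied in a ponytail. My purple hair and eyes bring out my distinctive look. I have fair skin, and my purple hair and eyes bring out my distinctive look. I wear my purple hair tied in a ponytail. My purple hair and eyes bring out my distinctive"
11.jpeg,"“I am a Japanese artist and the person who drew this illustration. This illustration is based on Japanese culture and depicts a woman in traditional Japanese clothing. She is wearing a “kimono”, a traditional Japanese garment; her hair is a vivid black, her eyelashes are thin and long, her eyes are large and graceful, and her ears are small and drooping. Her expression is calm and graceful, reflecting the Japanese sense of beauty. This illustration was drawn to introduce Japanese culture and can also be used as material for understanding Japanese traditions and culture.”</s>"
817.jpeg,"“I am a slender, tall woman with purple hair and deep purple eyes. I am wearing a purple shirt and black jeans. My expression is a faint smile. My hair hangs down to my shoulders. My ears are shaped like human ears. My outfit is a purple shirt and black jeans, and I am wearing these clothes. I am a slender, tall woman with purple hair and deep purple eyes.”</s>"
19.jpeg,"“I am a woman in my twenties with purple hair and deep black eyes. I have a gentle smile and a calm expression. My hair reaches my shoulders and is straight and glossy. Around my face, I have a slightly vacant expression.

I am wearing a white collared shirt. The shirt reaches to the shoulders, and the collar hangs down. From the shoulders to the waist, under the shirt, there is a white belt. The belt hangs down to the bottom edge of the shirt and goes around the waist.

I am wearing white socks. The socks are white socks and reach down to the ankles. White socks are wrapped around my ankles.

I am pulling up a white curtain and shining in through the window. Daylight is shining in through the window. The area around the window is covered with white curtains. The curtains hang from the top of the window to the bottom.”</s>"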
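
The `</s>` at the end of each caption above looks like the model's end-of-sequence token leaking into the output, and the rows for 17.jpg and 9.jpg contain nothing else. A minimal post-processing sketch, assuming the `file_name,text` header shown above; `clean_captions` is a hypothetical helper and is not part of `cli_batch.py`:

```python
import csv
import io

EOS = "</s>"  # end-of-sequence marker as it appears in the CSV above


def clean_captions(csv_text):
    """Return (cleaned_rows, empty_files) for CSV text with file_name,text columns."""
    cleaned, empty = [], []
    # The csv module handles quoted fields that span several lines.
    for row in csv.DictReader(io.StringIO(csv_text, newline="")):
        text = row["text"].strip()
        if text.endswith(EOS):
            text = text[: -len(EOS)].rstrip()
        if text:
            cleaned.append({"file_name": row["file_name"], "text": text})
        else:
            # Nothing left once the marker is removed: the caption is missing.
            empty.append(row["file_name"])
    return cleaned, empty


sample = 'file_name,text\n17.jpg,</s>\n18.jpg,"A girl in a purple dress.\n\nShe is smiling.</s>"\n'
rows, empty = clean_captions(sample)
print(rows)   # captions with the marker stripped
print(empty)  # files whose caption was empty
```

Files reported as empty would need to be captioned again; the sketch only finds them.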

alfredplpl commented 10 months ago

Sorry, I forgot to include the image embedding in the prompt.

After fixing that, I ran the code with the following command:

python -m llava.serve.cli_batch --model-path liuhaotian/llava-v1.5-13b \
--load-8bit \
--system-prompt 'あなたは日本語を喋る人工知能です。誠実に画像をもとに日本語で応答を返してください。' \
--user-prompt 'このイラストを日本語でできる限り詳細に説明してください。表情や髪の色、目の色、耳の種類、服装、服の色など注意して説明してください。説明は反復を避けてください。' \
--image-folder '/mnt/NVM/test' \
--output-csv '/mnt/NVM/test/metadata.csv'

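Then I got the following CSV file (captions translated from Japanese):

file_name,text
17.jpg,"This illustration is a manga character written in Japanese. She wears her hair down and has piercings in her ears. She is wearing revealing clothes, has a large bust, and it comes down to her waist. She has thick eyebrows, large eyes, and thin lips. She is holding a washing machine, and there are water droplets on her clothes. Her clothes are white, which emphasizes how revealing they are.</s>"
18.jpg,"This illustration depicts a woman sitting in an office chair. She is wearing a white collared shirt and a blue skirt. Her hair is short and light-colored, she is smiling, and she is resting her hand on the chair. Two other people are also depicted around her. One is sitting in front of the chair, and the other is standing behind it.</s>"
9.jpg,"This illustration depicts a woman caught in the rain. She is wearing a white raincoat and holding a white umbrella to keep off the rain. She is carrying a black bag on her back. Her hair is black, short, and tidy. Her eyes are large, her expression looks dissatisfied, and she seems unhappy about the rain. Her ears are closed. She is wearing a white collar, with a white dress shirt over it. Her outfit is a white dress shirt and a black skirt, and she is wet from the rain.</s>"
12.jpg,"This illustration is a scene drawn in a Japanese-language manga. In the center, a smiling woman with thick eyebrows, large eyes, and a red face is depicted. She has a large bust, a thick waist, and short hair. She is wearing a white shirt and a green skirt.

Several other people are depicted around her. Among them is a woman with large eyes and a red face. The other people have short hair and are wearing white shirts.

This illustration depicts the story of a Japanese-language manga, showing the women's expressions and clothing and the kind of scene they are in.</s>"
11.jpeg,"This illustration depicts a female character with wings. She is wearing a short, bare outfit that reaches her waist. Her hair reaches her waist, and her wings are a distinctive feature. Her expression is graceful, her eyes are large, and her ears have a distinctive shape. Her outfit has a winged-dragon-like design with an open chest. Her clothes are pale blue, mixed with wing feathers.</s>"
817.jpeg,"This illustration can be described in Japanese.

The illustration depicts a woman with purple hair and purple eyes. She has fair skin and is wearing a purple outfit. Her outfit has a white lining, with a purple jacket and skirt.

Her hair reaches her shoulders, and her ears have pink accents. She is also wearing red shoes.

This illustration can be described in Japanese.</s>"
19.jpeg,"This illustration can be described in Japanese.

1. Expression: The woman in the illustration is smiling. She has a bright, cheerful expression against a purple background.

2. Hair color: Her hair is purple, in a rich and beautiful hairstyle.

3. Eye color: Her eyes are purple, and she has beautiful, large eyes.

4. Type of ears: Her ears are purple, beautifully drawn in the same color as her hair.

5. Clothing: She is wearing purple clothes. The clothes fit her beautiful figure and are drawn in the same color as her hair.

6. Clothing color: Her clothes are purple, drawn in the same color as her hair.

This illustration is of a beautiful woman, with purple used as the main color. Her expression, hair, eyes, ears, clothing, and clothing color emphasize her beauty.</s>"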

rakataprime commented 10 months ago

@alfredplpl Hey, I looked through your code and saw that a batch size of 1 is used throughout. A number of other issues also mention not being able to increase the batch size above 1 for inference. For a dataset of your size, I think you would want to push the batch size as high as a V100 or A100 allows. Did you have issues increasing the batch size here too?

alfredplpl commented 10 months ago

@rakataprime Indeed, a batch size of 1 seems inefficient. I will look into whether the batch size can be changed. Also, if the batch size can be adjusted, I would like to add that option.

rakataprime commented 10 months ago

@alfredplpl I opened a PR in your repo: https://github.com/alfredplpl/LLaVA/pull/1

This is a toy example that treats the number of images in the folder as the batch size; it should be updated to split the images into batch-size chunks.
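
Splitting the folder into batch-size chunks could look like the sketch below. It is only an illustration: `chunked` is a hypothetical helper, not code from the PR, and the file names are taken from the examples in this thread.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


names = ["17.jpg", "18.jpg", "9.jpg", "12.jpg", "11.jpeg"]
for batch in chunked(names, 2):
    print(batch)  # the last batch is shorter when the count is not a multiple of 2
```

Each yielded list would then be passed to the model as one batch, so memory use depends on `--batch-size` rather than on the folder size.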

It shows the batch size being used and includes the work from the batch-size fix PR https://github.com/haotian-liu/LLaVA/pull/696, which hasn't been merged into main yet. This gives faster results, but I haven't optimized it yet: the data needs to come in through a dataloader, and the pipeline needs to be integrated with a Ray job so it can be run on chunked datasets across a Ray cluster. I am testing with the 7B model in 4-bit mode on 15 512x512 images on a 4090. I commented out the system prompt since I wasn't sure whether I needed it for my work, which is similar but produces English captions for text-to-image instead.
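
One way to get closer to that, sketched with the standard library only: write captions to the CSV batch by batch and skip files that are already listed, so an interrupted run over a large folder can resume. Everything here is an assumption for illustration; `caption_batch` stands in for the actual model call and is not a LLaVA function.

```python
import csv
import os
import tempfile


def already_done(csv_path):
    """Return the set of file names already present in the output CSV."""
    if not os.path.exists(csv_path):
        return set()
    with open(csv_path, newline="", encoding="utf-8") as f:
        return {row["file_name"] for row in csv.DictReader(f)}


def caption_folder(file_names, csv_path, caption_batch, batch_size=4):
    """Caption the pending files in batches, appending rows as each batch finishes."""
    done = already_done(csv_path)
    pending = [name for name in file_names if name not in done]
    write_header = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(["file_name", "text"])
        for start in range(0, len(pending), batch_size):
            batch = pending[start:start + batch_size]
            # caption_batch is a placeholder: it takes a list of names and returns one caption per name
            for name, text in zip(batch, caption_batch(batch)):
                writer.writerow([name, text])
            f.flush()  # keep finished batches on disk if the run is interrupted
    return len(pending)


with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "metadata.csv")
    fake = lambda batch: [f"caption for {name}" for name in batch]
    # First run: all three files are pending.
    print(caption_folder(["17.jpg", "18.jpg", "9.jpg"], out, fake, batch_size=2))
    # Second run: only the new file 12.jpg is pending.
    print(caption_folder(["17.jpg", "18.jpg", "9.jpg", "12.jpg"], out, fake, batch_size=2))
```

The same loop body could later be moved into a dataloader worker or a cluster job; the sketch does not attempt either.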

alfredplpl commented 10 months ago

Thanks to @rakataprime, the execution speed has improved. Also, it has been refactored and looks cleaner. https://github.com/alfredplpl/LLaVA/blob/main/llava/serve/cli_batch.py

For example:

python -m llava.serve.cli_batch --model-path liuhaotian/llava-v1.5-13b \
--load-4bit \
--user-prompt "Please describe this illustration in English as detailed as possible. Pay attention to details such as facial expressions, hair color, eye color, type of ears, clothing, color of the clothing, and the description of the background. Avoid repetition in your explanation." \
--image-folder /mnt/NVM/test   \
--output-csv '/mnt/NVM/test/metadata2.csv' \
--batch-size 4

Then I got:

file_name,text
17.jpg,"The image features a woman standing in a room, wearing a towel. She is holding a toothbrush in her hand, possibly brushing her teeth. The woman has black hair and is wearing a white towel. The room appears to be a bathroom, with a sink visible in the background. The woman is the main focus of the scene, and her actions suggest a casual, everyday moment."
alphonse-mucha_zodiac-1896.jpg,"The image features a beautiful woman with long, flowing hair, wearing a crown. She is the central figure in the scene, surrounded by a variety of other people and elements. There are at least 13 other people in the image, some of them closer to the woman and others farther away.

The background is filled with intricate patterns and designs, adding to the overall artistic quality of the image. There are also two clocks visible in the background, one towards the left side and the other towards the right side of the image. The combination of the woman, the people, and the intricate background creates"
18.jpg,"The image features a woman sitting in a chair with her legs crossed. She is wearing a white shirt and blue skirt, and she appears to be wearing black stockings. The woman is holding her chin with her hand, possibly deep in thought or contemplating something.

In the background, there is another person partially visible, sitting at a desk with a laptop. The scene also includes a dining table and a chair placed nearby. The woman's crossed legs and the presence of the laptop in the background suggest that this could be a work or study environment."
9.jpg,"The image features a woman wearing a blue shirt and black pants, holding an umbrella to protect herself from the rain. She is also carrying a backpack on her back. The woman appears to be walking down a street, possibly in a city, as there are multiple cars visible in the background. The scene captures the essence of a rainy day, with the woman trying to stay dry while going about her daily activities."
12.jpg,"The image is a cartoon illustration featuring a woman with a heart in her eye, standing next to a man. The woman is wearing a white shirt and a brown coat, while the man is wearing a tie. Both of them are smiling and appear to be enjoying their time together.

In the background, there are several other people present, but they are not the main focus of the scene. The woman with the heart in her eye seems to be the central figure in the illustration, and her expression conveys a sense of happiness and affection."
11.jpeg,"The image features a woman with blue hair and a blue dress, standing on one leg and holding a sword. She appears to be a warrior or a character from a video game. The woman is positioned in the center of the image, and her sword is held in a ready stance.

In the background, there are two cars visible, one on the left side and another on the right side of the image. The scene also includes a few other elements, such as a chair located near the center of the image and a handbag placed on the ground. The overall composition of the image suggests a dynamic and action-"
817.jpeg,"The image features a young woman dressed in a black and white outfit, standing in a white background. She has long, pink hair and is wearing a black skirt. The woman is also wearing a pair of boots, which are red in color. Her outfit is complemented by a black and white tie, adding a touch of elegance to her overall appearance. The woman's facial expression is neutral, and her eyes are open, giving her a calm and composed demeanor."
19.jpeg,"The image features a woman with long, pink hair and blue eyes. She is wearing glasses and a pink shirt, which complements her unique hair color. The woman appears to be smiling, giving off a cheerful and friendly vibe. 

The background of the image is a blend of pink and purple hues, adding to the overall aesthetic of the scene. The woman's hair is blowing in the wind, giving a sense of motion and liveliness to the image."
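
Two of the captions above (alphonse-mucha_zodiac-1896.jpg and 11.jpeg) stop mid-sentence, which suggests they hit a generation length limit. A rough heuristic for flagging such rows, assuming a caption that does not end in closing punctuation was cut off; `looks_truncated` is a hypothetical helper:

```python
EOS = "</s>"
# Characters that plausibly end a finished caption, in English or Japanese.
TERMINATORS = (".", "!", "?", "。", "」", '"', "”")


def looks_truncated(caption):
    """Return True when a non-empty caption does not end with closing punctuation."""
    text = caption.strip()
    if text.endswith(EOS):
        text = text[: -len(EOS)].rstrip()
    return bool(text) and not text.endswith(TERMINATORS)


print(looks_truncated("The overall composition of the image suggests a dynamic and action-"))
print(looks_truncated("The woman appears to be smiling."))
```

Flagged rows could be regenerated with a larger token budget, if the script exposes one.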

hcwei13 commented 8 months ago

When executing the above code, I encountered the following issues:

file_name,text
frame_005.png,"The image captures a female athlete in a red shirt and black shorts, running on a track during a competition. She is in the middle of a race, with her arms outstretched, and appears to be in the process"
frame_015.png,"The image features a man standing on a field, holding a flag with a combination of red, white, and blue colors. He is wearing a black shirt and appears to be celebrating. The man is the main focus of the scene,"
frame_009.png,### explanation.
frame_008.png,### explanation.
frame_012.png,### explanation.
frame_007.png,### explanation.
frame_016.png,### explanation.
frame_013.png,### explanation.
frame_003.png,### explanation.
frame_004.png,### explanation.
frame_010.png,### explanation.
frame_006.png,### explanation.
frame_001.png,### explanation.
frame_011.png,### explanation.
frame_002.png,### explanation.
frame_014.png,### explanation.### explanation.### explanation.### explanation.### explanation.### explanation.### explanation.### explanation.### explanation.### explanation.### explanation.### explanation.###

This was with batch_size=2. Every description after the first batch is broken. Have you encountered this problem as well? @alfredplpl @rakataprime
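
Rows like these can be flagged automatically before the CSV is used for training. A minimal sketch, assuming a caption is suspect when the same text appears for several files or when it is one short unit repeated; the function names and thresholds are hypothetical:

```python
from collections import Counter


def repeated_unit(text, max_unit=40):
    """Return the shortest prefix whose repetition reproduces text (a cut-off tail is allowed), else None."""
    text = text.strip()
    for size in range(1, min(max_unit, len(text) // 2) + 1):
        unit = text[:size]
        repeats = -(-len(text) // size)  # ceiling division
        if (unit * repeats)[: len(text)] == text:
            return unit
    return None


def find_degenerate(rows, min_duplicates=3):
    """Return file names whose caption is shared by many files or is a repeated unit."""
    counts = Counter(text.strip() for _, text in rows)
    flagged = []
    for name, text in rows:
        if counts[text.strip()] >= min_duplicates or repeated_unit(text) is not None:
            flagged.append(name)
    return flagged


rows = [
    ("frame_005.png", "The image captures a female athlete in a red shirt and black shorts."),
    ("frame_009.png", "### explanation."),
    ("frame_008.png", "### explanation."),
    ("frame_012.png", "### explanation."),
    ("frame_014.png", "### explanation.### explanation.###"),
]
print(find_degenerate(rows))
```

This only detects the symptom; it says nothing about why batches after the first one fail.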

haotian-liu / LLaVA

[Feature request] Request for Image Captioning Script for Japanese Text-to-Image Generation #675