mikeknapp / candy-machine

Image Dataset Tagger for Stable Diffusion / Lora / DreamBooth Training

KeyError #1

Open khronimo opened 2 months ago

khronimo commented 2 months ago

"one-piece swimsuit" seems to trigger a KeyError.

On a side note, it would be nice to have WD auto tagging as optional as it doesn't suit every dataset.

[2024-06-12 15:16:28,877] ERROR in app: Exception on /project/test/get [POST]
Traceback (most recent call last):
  File "I:\Tools\candy-machine\candy-machine-main\venv\lib\site-packages\flask\app.py", line 1473, in wsgi_app
    response = self.full_dispatch_request()
  File "I:\Tools\candy-machine\candy-machine-main\venv\lib\site-packages\flask\app.py", line 882, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "I:\Tools\candy-machine\candy-machine-main\venv\lib\site-packages\flask_cors\extension.py", line 178, in wrapped_function
    return cors_after_request(app.make_response(f(*args, **kwargs)))
  File "I:\Tools\candy-machine\candy-machine-main\venv\lib\site-packages\flask\app.py", line 880, in full_dispatch_request
    rv = self.dispatch_request()
  File "I:\Tools\candy-machine\candy-machine-main\venv\lib\site-packages\flask\app.py", line 865, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
  File "I:\Tools\candy-machine\candy-machine-main\server\main.py", line 85, in get_project
    project = Project(project_name)
  File "I:\Tools\candy-machine\candy-machine-main\server\project.py", line 68, in __init__
    self._load()
  File "I:\Tools\candy-machine\candy-machine-main\server\project.py", line 132, in _load
    self.auto_tags = self._get_filtered_auto_tags(self.project_layout)
  File "I:\Tools\candy-machine\candy-machine-main\server\project.py", line 450, in _get_filtered_auto_tags
    tag_info.to_dict() for tag_info in self._analyze_auto_tags()
  File "I:\Tools\candy-machine\candy-machine-main\server\project.py", line 500, in _analyze_auto_tags
    del tags[tag]
KeyError: 'one-piece swimsuit'

mikeknapp commented 2 months ago

@khronimo Thanks so much Nic for this report. This should be fixed now if you do a git pull. I've made a note to make the image tagger optional. Would you like a different one, or do you just not want to use one at all? (I'm mainly doing it to recommend new tags for the project; the auto tags don't get applied to images unless you choose to add them.)
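For reference, the crash came from `del tags[tag]` running for a tag that had already been removed from the dict. The general shape of the fix is a guard like the sketch below (just an illustration of the pattern, not necessarily the exact change in the commit):

```python
# Sketch of a defensive version of the failing line in _analyze_auto_tags.
# dict.pop() with a default never raises KeyError, unlike `del tags[tag]`.
def remove_tag(tags: dict[str, int], tag: str) -> None:
    tags.pop(tag, None)
```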

khronimo commented 2 months ago

Fantastic. Yeah, it would be great to have more options for captioning. I did see your comment about exposing an API; that would be fantastic and extremely useful, as there are a host of options like CogVLM, Phi-3 Vision, and LLaVA, and some of them are super easy to script against an OpenAI-compatible server like Ollama/LM Studio.
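To illustrate how little scripting that takes, here's a minimal sketch of captioning one image through an OpenAI-compatible endpoint; the base URL, port, model name and file name below are placeholders for whatever your local Ollama/LM Studio server exposes:

```python
import base64
import requests

# Assumed local endpoint; Ollama and LM Studio both expose an
# OpenAI-compatible /v1/chat/completions route, but the port and model vary.
BASE_URL = "http://localhost:11434/v1"
MODEL = "llava"  # placeholder model name

def caption_image(path: str, prompt: str) -> str:
    # Send the image inline as a base64 data URL alongside the text prompt.
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    payload = {
        "model": MODEL,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    }
    resp = requests.post(f"{BASE_URL}/chat/completions", json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

print(caption_image("matrix01.jpg", "Describe the clothing in this image."))
```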

Perhaps captions could be stored in a JSON file with a title, subtitle, and maybe the prompt used (a rough sketch of this follows the example below). You could run something like CogVLM a few times with a different focus, e.g. clothing, background. Then everything could be displayed alongside the existing lists below the image ("CAPTION .TXT FILE", "AUTO TAGS (FYI)").

e.g.:

"CAPTION .TXT FILE" "CogLVM" - "Clothing" "CogLVM" - "Background" "LLava" - "" "AUTO TAGS (FYI)"

mikeknapp commented 2 months ago

Oh interesting! I'm learning a lot from you - thanks - I hadn't even considered going into this much detail with the autotags!

Out of interest, what size datasets are you working with, and how many datasets are you creating on average per month? Is this something you'd want to do for all of them, or a subset?

Also, do you use this workflow because you don't want to caption the images yourself (i.e. you want an automated solution), or because the LLMs are coming up with better descriptions than you could have thought of yourself?

(Context: I originally wasn't going to add auto tagging to Candy Machine, but only did so because I wanted an "overview" of the tag vocabulary of a dataset so that I could create the initial set of tags. I'm still trying to figure out the "optimal" workflow.)

mikeknapp commented 2 months ago

I'm playing around with CogVLM2 now. Can you share what prompt(s) you would use to caption an image?

khronimo commented 2 months ago

> Out of interest, what size datasets are you working with, and how many datasets are you creating on average per month? Is this something you'd want to do for all of them, or a subset?

I have millions of untagged images; so far I've been manually going through them and building datasets from selections. At this stage I've only been doing a couple a month. I've got some good ideas on how to manage catalogues at scale, but I haven't built anything around that yet.

> Also, do you use this workflow because you don't want to caption the images yourself (i.e. you want an automated solution), or because the LLMs are coming up with better descriptions than you could have thought of yourself?

A lot of people use automated captioning (I believe SD3 was largely captioned with CogVLM). I feel the models are almost there but not quite: they miss a lot of detail and they hallucinate too much, leading to lower-quality captions. But things are improving fast as the models scale and the quality of the datasets used to train them improves. At this stage I like the idea of taking a semi-supervised approach: either manually adding suggested tags, or going over the generated ones to remove erroneous tags. By using a combination of models and prompts to drag out as much detail as possible, and manually checking for errors, I think it's possible to produce far better datasets in less time.

> (Context: I originally wasn't going to add auto tagging to Candy Machine, but only did so because I wanted an "overview" of the tag vocabulary of a dataset so that I could create the initial set of tags. I'm still trying to figure out the "optimal" workflow.)

I'm still figuring things out there too.

> I'm playing around with CogVLM2 now. Can you share what prompt(s) you would use to caption an image?

Honestly, I'm pretty much still experimenting with those things. I did a lot more playing around previously with the LLaVA 1.6 Mistral and Vicuna models when they were SOTA (state of the art).

Here is one example using an older LLaVA 1.6 Mistral 7B model. With a bit of work and playing around, I think it would be possible to create a nice workflow using several models and prompts to get structured JSON results with some nice suggested tags, a lot better than just running images through something like WD alone. One thing's for sure: it's got a hell of a lot of potential.

[Image attachment: matrix01]

Prompt:

Examine the following image closely and provide detailed information about this image, use following format to organize your response:

Keywords describing the lighting of the scene: keyword 1, keyword 2, keyword 3, ...
Keywords describing the Framing of the image:
Keywords describing scene location and background: 
Keywords listing objects in the scene:

Instructions:
Use the format provided above.
Use the same titles.
Include all categories
Put keywords inline with the category and separate them with a comma ","
Be very descriptive and include as many keywords as possible
Do not create new categories

Response:

Keywords describing the lighting of the scene: dim, subdued, moody, artificial, greenish tint
Keywords describing the Framing of the image: close-up, side by side, corridor, hallway, doors, windows
Keywords describing scene location and background: indoor, urban, futuristic, dystopian, modern, sleek, minimalist, metallic, concrete, glass, mirrors, lights, shadows, reflections
Keywords listing objects in the scene: people, clothing, accessories, weapons, technology, architecture, design, style, fashion, leather, sunglasses, coats, ties, belts, shoes, hair, makeup, facial expressions, body language, posture, attitude, expression, emotion, mood, atmosphere, setting, environment

The image depicts a scene with two individuals standing in a hallway or corridor. The lighting is dim and subdued, creating a moody and artificial atmosphere with a greenish tint. The framing of the image is close-up, focusing on the two people side by side. The background features an urban, futuristic, dystopian setting with modern, sleek, minimalist, metallic, concrete, glass, mirrors, lights, shadows, and reflections.

The scene location appears to be indoors, possibly within a building or facility. The architecture is characterized by doors, windows, and other architectural elements that contribute to the overall futuristic and dystopian aesthetic of the image.

In terms of objects in the scene, there are two people wearing clothing, accessories, and sunglasses. They are also carrying weapons and technology, which further emphasizes the futuristic and dystopian theme. The presence of these items suggests that the individuals may be involved in some sort of conflict or confrontation.

The overall mood and atmosphere of the image convey a sense of tension and unease, with the two people standing in the hallway looking serious and focused. Their body language and facial expressions suggest that they are prepared for whatever comes their way. The image does not provide enough information to determine the exact nature of their relationship or the context of the scene.
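As an aside, a response that sticks to this "Keywords describing ...: keyword, keyword" layout is easy to split back into per-category tag lists; a rough sketch in Python, assuming the model kept the format (the helper name is just illustrative):

```python
# Rough sketch: turn a "Category: keyword, keyword, ..." response into a
# dict of category -> list of tags, ignoring the free-form prose paragraphs.
def parse_keyword_response(text: str) -> dict[str, list[str]]:
    tags: dict[str, list[str]] = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip prose lines without a category prefix
        category, _, keywords = line.partition(":")
        if not category.strip().lower().startswith("keywords"):
            continue
        tags[category.strip()] = [k.strip() for k in keywords.split(",") if k.strip()]
    return tags
```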

Prompt:

Examine the following image closely and provide detailed information about it. Use the following format to organize your response:

Exposure: [Bright/Balanced/Dim]
Location: [Outdoors/Indoors]
Multiple People: [Yes/No]
Framing: [Full Body/Medium Shot/Close-up on Body Part/Wide Shot]
Clothing: [Clothing item(s)]

Response:

Exposure: Dim
Location: Indoors
Multiple People: Yes
Framing: Close-up on body parts (specifically, the individuals are shown from the waist up)
Clothing: Leather jackets and sunglasses

The image features two people standing in a hallway or corridor. The exposure is dim, creating a moody and artificial atmosphere. The location appears to be indoors, possibly within a building or facility. The framing of the image is close-up on body parts, specifically from the waist up. The individuals are wearing leather jackets and sunglasses, which further emphasizes the futuristic and dystopian theme of the scene.