kohya-ss / sd-scripts

Apache License 2.0

Regarding the Multi-Level Caption #1643

Open sdbds opened 3 weeks ago

sdbds commented 3 weeks ago

I'm building a VLM captioning program that may use multi-level captions similar to PG3.

As far as I know, most DiT model training since SD3 has used a multi-level caption + random matching strategy.

Since natural-language captions are becoming more and more popular, maybe we can just add a shuffle that picks among the different caption levels.

There are several possible ways to store the captions:

1. Keep using separate .txt files, with multiple lines representing the different caption levels (similar to the sampler prompt file).

2. Load multiple text files with different extensions.

3. Use a dictionary file such as JSON to hold the different captions.
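
The three storage options above can be sketched roughly as follows. This is only an illustration under stated assumptions, not how sd-scripts actually loads captions: the function names, the `.short`/`.long` extensions, and the JSON layout (image key mapped to a dictionary of level name to caption) are all hypothetical.

```python
import json
import random
from pathlib import Path


def captions_from_multiline_txt(path):
    """Option 1: one caption level per non-empty line of a single .txt file."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def captions_from_extensions(image_path, extensions):
    """Option 2: one caption level per sidecar file (hypothetical extensions)."""
    captions = []
    for ext in extensions:
        candidate = Path(image_path).with_suffix(ext)
        if candidate.exists():
            captions.append(candidate.read_text(encoding="utf-8").strip())
    return captions


def captions_from_json(path, image_key):
    """Option 3: JSON dictionary, assumed layout {image_key: {level: caption}}."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return list(data[image_key].values())


def pick_caption(captions, rng=random):
    """Randomly pick one caption level, e.g. once per sample per training step."""
    return rng.choice(captions)
```

Whichever storage format is used, all three loaders return a plain list of captions, so the random selection step stays the same.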

gesen2egee commented 3 weeks ago

Currently, you can achieve this by putting multiple lines in a .txt file and using --enable_wildcards. JSON is also supported (but it seems that only fine-tuning can read JSON).

With Flux, when caching text encoder (TE) outputs, only one fixed .npz can be stored. I work around this by copying the dataset into different folders with different captions and caching the TE outputs for each one manually. It would be even better if the .npz could support saving wildcards.

sdbds commented 3 weeks ago

I just noticed this. I guess caching multiple .npy arrays and saving them to a single .npz is also an option.
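
A minimal sketch of that idea, assuming one text encoder output array per caption variant: NumPy's `np.savez` can store several named arrays in one .npz, and one of them can be picked at random on load. The key scheme (`variant_0`, `variant_1`, ...) and the function names are hypothetical and do not reflect the cache format sd-scripts uses.

```python
import random

import numpy as np


def save_caption_variants(file, embeddings):
    """Store one array per caption variant under keys variant_0, variant_1, ..."""
    arrays = {f"variant_{i}": emb for i, emb in enumerate(embeddings)}
    np.savez(file, **arrays)


def load_random_variant(file, rng=random):
    """Open the archive and return (key, array) for one randomly chosen variant."""
    with np.load(file) as archive:
        key = rng.choice(sorted(archive.files))
        return key, archive[key]
```

Because each variant is stored as its own array, the variants do not need to share a shape, and only the selected array is read from the archive.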

kohya-ss commented 3 weeks ago

Multi-line captions are supported with wildcard notation, so the easiest solution would be to store them in a single .npz file, but that would take some time to implement.