Most images in the fine-tuning corpus have a single clear subject, such as a character or a puppy. However, when the image content is more mixed, for example publicity posters that contain characters, scenery, prominent text (both Chinese and English), and even proper nouns such as China Petrochemical or China Mobile, fine-tuning does not work well. Are there specific requirements for the images used in the fine-tuning corpus?
This is an interesting and broad topic for diffusion models. Here are some suggestions for fine-tuning:
For training a subject LoRA (for example, a specific anime character), it is preferable to use a small set of images in which the subject's appearance stays consistent.
For training a style LoRA (for example, ink painting), the more images you can use, the better. Additionally, the LoRA rank should be high.
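To make the rank advice concrete, here is a minimal NumPy sketch of the LoRA update rule (the shapes and hyperparameters are hypothetical, not Kolors' actual training code). It shows why a higher rank gives the adapter more capacity: the number of trainable parameters grows linearly with the rank.

```python
import numpy as np

# Sketch of a LoRA update: W' = W + (alpha / r) * B @ A,
# where A (r x d_in) and B (d_out x r) are the only trained parameters
# and the base weight W stays frozen. Shapes are illustrative.
rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 320, 320, 64, 64  # higher rank -> more capacity for style

W = rng.normal(size=(d_out, d_in))              # frozen base weight
A = rng.normal(scale=0.01, size=(rank, d_in))   # trainable down-projection
B = np.zeros((d_out, rank))                     # trainable up-projection, zero-init

W_adapted = W + (alpha / rank) * B @ A          # equals W at init, since B is zero

# Trainable parameters per adapted layer: r * (d_in + d_out)
print(A.size + B.size)  # 40960 at rank 64; halving the rank halves this
```

Because B starts at zero, the adapted model is identical to the base model at initialization; training then moves only A and B, so raising the rank is a cheap way to give a style LoRA more expressive power.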
When synthesizing Chinese characters, the base model may struggle with generating long contexts, even if some models utilize powerful LLMs for encoding (such as the ChatGLM in Kolors). Current diffusion models are primarily capable of synthesizing simple, short texts.