tyxsspa / AnyText

Official implementation of the paper "AnyText: Multilingual Visual Text Generation And Editing"
Apache License 2.0

Have you ever pretrained the vae or unet in anytext_v1.1.ckpt? #87

Zhaohuii-Wang opened this issue 2 months ago

Zhaohuii-Wang commented 2 months ago

I found that the UNet parameters in anytext_sd15_scratch.ckpt (from SD1.5) and anytext_v1.1.ckpt are not the same. Did you ever pretrain the VAE or UNet in anytext_v1.1.ckpt? If so, that contradicts the paper, which says the UNet is frozen. At the same time, when I finetune anytext_sd15_scratch.ckpt on about 200,000 of my own images (perhaps hard to generate, because they are real handwritten paper images), I find it is almost impossible to produce valid glyphs. Does the ablation experiment in the paper (200,000 images) use anytext_sd15_scratch.ckpt or anytext_v1.1.ckpt? Is a large amount of data needed to support the generation of valid glyphs? Thanks!

tyxsspa commented 2 months ago

Hi, all UNet and VAE weights are frozen during training. For anytext_v1.1.ckpt, the UNet was replaced with a community model (Realistic_Vision), which is more aesthetically appealing than the original SD1.5 (as mentioned in Appendix 7 of the paper). All ablation experiments are trained using anytext_sd15_scratch.ckpt, which trains AnyText "from scratch" except for some weights copied from the base model's UNet. Regarding your training issue, I wonder whether the OCR annotations are accurate enough for your handwritten paper images. If so, training should be fine (switching anytext_v1.1.ckpt's UNet back to the original SD1.5 and continuing training may work better than training from scratch; a rough sketch of that swap is below); if not, i.e. the handwritten characters are too hard for the OCR model to recognize, the data may indeed be unsuitable for training. BTW, would you like to share one or two of your handwritten paper images?
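A minimal sketch of swapping the UNet in anytext_v1.1.ckpt back to original SD1.5 weights. The file names and the "model.diffusion_model." key prefix are assumptions based on common LDM/ControlNet checkpoint layouts; verify them against your actual checkpoints before relying on this:

```python
# Sketch: overwrite the UNet tensors in an AnyText checkpoint with the
# original SD1.5 UNet tensors. Paths and key prefix are assumptions.
import torch

anytext = torch.load("anytext_v1.1.ckpt", map_location="cpu")
sd15 = torch.load("v1-5-pruned.ckpt", map_location="cpu")

# Some checkpoints nest weights under "state_dict", some do not.
anytext_sd = anytext.get("state_dict", anytext)
sd15_sd = sd15.get("state_dict", sd15)

UNET_PREFIX = "model.diffusion_model."
replaced = 0
for key, weight in sd15_sd.items():
    if key.startswith(UNET_PREFIX) and key in anytext_sd:
        anytext_sd[key] = weight  # replace Realistic_Vision UNet tensor
        replaced += 1
print(f"Replaced {replaced} UNet tensors.")

torch.save({"state_dict": anytext_sd}, "anytext_v1.1_sd15_unet.ckpt")
```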

Zhaohuii-Wang commented 2 months ago

Of course. Here are a few examples from my training process, which is the result of retraining my own VAE and unlocking all UNet parameters while training the ControlNet. My data contains not only handwriting but also a lot of printed text; more than 95% of it is real-scene paper data.

Training log: [image]

It worked well: [image] (masked text is "super") [image] (masked text is "fish")

Just changing the mask position and text: [image] The "r" of "tensor" becomes "k", and there are other errors on the validation set, e.g. some textures in the inpainted region are confused and unreal.

If I do not replace the VAE, the model seems unable to adapt to my data: the generated glyphs are bold and obviously different from the surrounding texture, very unreal. [image]

And because I replaced the VAE, I need to unlock the UNet, or replace it with a UNet pretrained on my data, so that it adapts to the new distribution; otherwise the decoded image is very cluttered. But the problem is: if I unlock and finetune anytext_v1.1.ckpt, it works well during training, but on the validation set, or even just with a different mask position, quality drops; and if I finetune anytext_sd15_scratch.ckpt or my own pretrained UNet, I cannot generate valid glyphs at all. [image] (the true text is "proud")
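For reference, in the ControlNet-style codebase that AnyText builds on, "unlocking all UNet parameters" usually corresponds to the `sd_locked` flag in the training script. A minimal sketch, assuming ControlNet's `cldm.model` helpers; the config and checkpoint paths are placeholders and AnyText's actual script may differ:

```python
# Sketch of the "unlock the UNet" setting in a ControlNet-style training
# script. Paths below are hypothetical placeholders.
from cldm.model import create_model, load_state_dict

model = create_model("./models_yaml/anytext_sd15.yaml").cpu()
model.load_state_dict(load_state_dict("./anytext_v1.1.ckpt", location="cpu"))

# sd_locked=False adds the UNet decoder weights to the optimizer (the
# "unlock" described above); sd_locked=True keeps the base UNet frozen.
model.sd_locked = False
```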

tyxsspa commented 2 months ago

it works well during training, but on the validation set, or even just with a different mask position, quality drops

Obviously, the model is overfitting the training data. I think unlocking all UNet parameters and training on only 200,000 images will inevitably cause that.

if I finetune anytext_sd15_scratch.ckpt or my own pretrained UNet, I cannot generate valid glyphs at all

Did you also replace the VAE with your own pretrained one there? This looks like underfitting on this small but hard dataset.

In my opinion: first, since you trained your own VAE on this specific data, you certainly need to replace it in anytext_v1.1.ckpt. Then, as the latent space has changed, you must also use the matching pretrained UNet, rather than unlocking the whole UNet during training (which may cause overfitting on a small, specific dataset). During training, I think it is necessary to freeze your own pretrained VAE and UNet and finetune all the other weights on AnyWord-3M, which anytext_v1.1.ckpt was originally trained on. Once the other weights match the new VAE and UNet and regain the previous capability on a relatively large dataset (3M), you can try finetuning on your own small but hard data. A minimal sketch of the freezing step is below.
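A minimal sketch of the freezing step, assuming LDM-style attribute names (`first_stage_model` for the VAE, `model.diffusion_model` for the UNet); adapt these to AnyText's actual module names:

```python
# Sketch: freeze the pretrained VAE and UNet, train only the rest.
# Attribute names below are LDM-style assumptions.
import torch

def freeze_vae_and_unet(ldm_model):
    for p in ldm_model.first_stage_model.parameters():      # custom VAE: frozen
        p.requires_grad = False
    for p in ldm_model.model.diffusion_model.parameters():  # matching UNet: frozen
        p.requires_grad = False

def trainable_params(ldm_model):
    # Everything still unfrozen (ControlNet branch, text-embedding modules,
    # etc.) is what gets finetuned on AnyWord-3M first, then on the small set.
    return [p for p in ldm_model.parameters() if p.requires_grad]

# Usage:
#   freeze_vae_and_unet(model)
#   optimizer = torch.optim.AdamW(trainable_params(model), lr=1e-5)
```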

hyp22 commented 1 month ago

I would like to ask how the JSON file in the dataset is generated for training.