mrluin / TextualDegRemoval

[arXiv 2023] Improving Image Restoration through Removing Degradations in Textual Representations
https://arxiv.org/abs/2312.17334
Apache License 2.0

What about the explicit text? #2

Open lyf1212 opened 5 months ago

lyf1212 commented 5 months ago

Thank you for your wonderful work! I am really curious about the "clean text" produced after the "Textual Restoration" step — have you tried to decode it? Or is there no way to decode it into words a human can understand, only an implicit feature vector? If so, why do you describe your method as "text restoration"? It seems more like an auto-regressive operation based on self-attention.

mrluin commented 3 months ago

Thanks for your attention. We have tried decoding the projected words back into the image, and some of the words indeed reflect the degradation patterns of the restoration tasks we chose in the training phase. But we have not tried to decode them into explicit text.
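One straightforward way to get explicit words out of a restored text embedding (not something the authors say they did — this is a hypothetical sketch) is a nearest-neighbor lookup against a vocabulary embedding table, e.g. the token embeddings of a CLIP text encoder. The function and toy data below are illustrative only:

```python
# Hypothetical sketch: map restored token embeddings to explicit words by
# nearest-neighbor lookup in a vocabulary embedding table. In practice the
# table would be the text encoder's token embedding matrix; here it is toy data.
import numpy as np

def nearest_words(restored_emb, vocab_emb, vocab):
    """Return, for each restored token embedding, the vocabulary word
    whose embedding has the highest cosine similarity."""
    r = restored_emb / np.linalg.norm(restored_emb, axis=1, keepdims=True)
    v = vocab_emb / np.linalg.norm(vocab_emb, axis=1, keepdims=True)
    sim = r @ v.T                       # (num_tokens, vocab_size)
    return [vocab[i] for i in sim.argmax(axis=1)]

# Toy example: a 3-word "degradation vocabulary" with 4-d embeddings.
vocab = ["noise", "blur", "rain"]
vocab_emb = np.eye(3, 4)                        # toy one-hot embeddings
restored = np.array([[0.9, 0.1, 0.0, 0.0],      # closest to "noise"
                     [0.0, 0.2, 0.95, 0.0]])    # closest to "rain"
print(nearest_words(restored, vocab_emb, vocab))  # ['noise', 'rain']
```

Whether the nearest tokens form readable text depends on how far the restored embeddings drift from the discrete token manifold, which may be why only some projected words reflect recognizable degradation patterns.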

lyf1212 commented 3 months ago

Thank you for your reply. I have a few further concerns:

  1. As you describe in your answer, you can get reasonable results by projecting the "clean text embedding" back into the image space — so what is the point of performing restoration in the textual space? Why not implement this function in the image space directly?
  2. The paper lacks analysis of the function and effect of the proposed "img-to-text" and "text-restoration" modules, which are its core claims. I notice you use a simple LDM loss function during phase-1 training, so it is hard for me to understand these two trained components thoroughly.
  3. Since you use other SotA restoration models directly, and the gains in PSNR and SSIM are small in most of your experiments, I think it is more important to provide further theoretical or experimental analysis of the "img-to-text" and "text-restoration" modules. It might even be more effective to use image embeddings, which carry more detail and high-level information, rather than converting them into the textual space. Thanks again for your nice work, which has inspired me a lot!
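For context on the "small PSNR gains" point: PSNR is a direct function of mean squared error, so sub-dB improvements correspond to tiny MSE reductions. A minimal NumPy sketch of the standard definition (toy images, not the paper's evaluation code):

```python
# Minimal PSNR computation for images scaled to [0, max_val].
import numpy as np

def psnr(ref, test, max_val=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

clean = np.zeros((8, 8))
noisy = clean + 0.1           # uniform error of 0.1 -> MSE = 0.01
print(psnr(clean, noisy))     # 20.0 dB
```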