Question on the Pretrain Datasets

Your work is absolutely inspiring and solid, but I still have some questions:

In the pretraining process, what dataset is adopted for those high-cost tasks? Is the dataset large-scale, or could it be arbitrary?
I notice that in your dataset code files you apply several low-cost degradations (such as JPEG, blur, and noise) to the clean images, and there is no high-cost degradation anywhere in the pretraining process. How can the encoder benefit from such pretraining on downstream tasks with complex degradations?

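Hello! I greatly admire your work, which is very instructive for the pretraining work I am currently doing. I also have great respect for your group's work, and not long ago I attended a talk by Prof. Chao Dong in person. I have the following questions about this work:
1. What dataset is used in the pretraining process? Does it need to be large-scale, or will clean images of any kind do?
2. I noticed that in the datasets file of your released code, only some low-cost degradations are added to the clean image to obtain the reference and the input. Although the two have different degradations and different content, both are essentially low-cost degradations. Is degradation transfer on low-cost degradations alone enough for the encoder to learn more fundamental features, and thereby improve performance on high-cost tasks?
I would appreciate your guidance. Thank you!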

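Hello, and thank you for your interest in our work. Here are the answers to your questions:
1. Pretraining uses the DF2K dataset. We cropped this dataset into patches, which gave roughly 20,000+ image patches of 480x480, and then trained by randomly sampling from these patches. We later also pretrained on ImageNet, and the experiments showed that ImageNet pretraining brings further improvement on some downstream tasks (for example, dehazing).
2. The main advantage of low-cost degradations is that the data can be synthesized in almost unlimited quantity. Training on such a large amount of data lets the model learn better features and gives it a better initialization, which helps finetuning on downstream high-cost tasks. High-cost data is hard to obtain, so it is difficult to use it to support pretraining.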
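As a rough illustration of the data pipeline described in the reply, the sketch below crops an image into 480x480 patches and synthesizes low-cost degradations on randomly sampled patches. It is not the repository's code: the function names, the noise and blur ranges, and the box-blur kernel are assumptions chosen for illustration, and JPEG compression is left out so the example only needs NumPy. The pairing of an input and a reference with different content and different degradations follows the description in the question.

```python
import numpy as np

PATCH = 480  # patch size mentioned in the reply


def crop_patches(img, size=PATCH, stride=PATCH):
    """Cut an HxWxC image into size x size patches on a regular grid."""
    h, w = img.shape[:2]
    return [img[y:y + size, x:x + size]
            for y in range(0, h - size + 1, stride)
            for x in range(0, w - size + 1, stride)]


def add_gaussian_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise to a float image in [0, 1]."""
    noisy = img + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0.0, 1.0)


def box_blur(img, k):
    """Blur with a k x k mean filter (k odd), using edge padding."""
    pad = k // 2
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(img)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)


def random_degrade(img, rng):
    """Apply one randomly chosen low-cost degradation (assumed ranges)."""
    if rng.random() < 0.5:
        return add_gaussian_noise(img, sigma=rng.uniform(0.01, 0.1), rng=rng)
    return box_blur(img, k=int(rng.choice([3, 5, 7])))


def sample_pair(patches, rng):
    """Sample two different clean patches and degrade each independently."""
    i, j = rng.choice(len(patches), size=2, replace=False)
    return random_degrade(patches[i], rng), random_degrade(patches[j], rng)
```

Because every call to `sample_pair` draws new patches and new degradation parameters, the number of distinct training pairs is effectively unbounded, which is the property the reply highlights for low-cost degradations.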

lyh-18 / DegAE_DegradationAutoencoder
