We speculate that the ViT from CLIP already generalizes well, and that training it on a small amount of data might lead to catastrophic forgetting. Both the intermediate FC layer and the LLM are involved in the training process. The best training strategy nevertheless remains an open question, and we welcome further discussion on this matter.