Is the code implementation inconsistent with the paper? According to the paper, the training input should be text rather than images (Text as Image), but the code appears to use images for training.
Ah, that's a misuse of the variable name "image". See the definition of the model's forward function: I call it in different ways during training and testing.
The training code in question:
https://github.com/guozix/TaI-DPT/blob/1333ecaa32bfffb4f2eb916f5532afb88ac457fe/trainers/Caption_distill_double.py#L464C65-L464C65