I am currently training a CLIP model on my own dataset.
For the image part, the images come from different sets of sequential frames, and for the text part, the caption is scene, weather, light, for example: 'urban, sunny, backlight'.
The loss isn't converging.
My current suspicion is that some images within a batch come from the same sequence, so they look alike and are likely to share the same caption. The contrastive loss will still treat them as negatives of each other, which makes it hard for the model to learn anything useful.
Any advice or ideas would be greatly appreciated.
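One thing you could try for the false-negative issue you describe is masking out same-caption pairs in the contrastive loss, so two images with an identical caption are never pushed apart. Below is a minimal sketch of that idea in PyTorch; the function name and signature are hypothetical, not from the CLIP codebase:

```python
import torch
import torch.nn.functional as F

def clip_loss_with_label_mask(image_emb, text_emb, captions, temperature=0.07):
    """Symmetric CLIP-style InfoNCE loss that excludes 'false negatives':
    off-diagonal pairs whose captions are identical (e.g. two frames from
    the same sequence both labeled 'urban, sunny, backlight').
    Hypothetical sketch, not an official CLIP API."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (B, B) similarity matrix

    # same[i][j] is True when captions i and j match; off-diagonal matches
    # are positives mislabeled as negatives, so mask them out of the softmax.
    same = torch.tensor([[ci == cj for cj in captions] for ci in captions])
    false_neg = same & ~torch.eye(len(captions), dtype=torch.bool)
    logits = logits.masked_fill(false_neg, float('-inf'))

    targets = torch.arange(len(captions))
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```

An alternative (or complementary) fix is at the sampling level: build batches with a sampler that draws at most one frame per sequence, so near-duplicate images never appear in the same batch in the first place.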