Closed mcairlangga2 closed 9 months ago
Hi @mcairlangga2,
Yes, it is possible however we did not consider this research direction. With the recent advancements in NLP & Vision-Language modeling, it will be a worth exploring problem.
As per the implementation is concerned, it would be replacing RoberTa at this file with CLIP or other text decoder. Good Luck
Dear Authors, Thank you for the great work.
I want to ask a question. Is it possible to change the text encoder to other models such as CLIP instead of using RoBERTa? Have you considered and tried another Text Encoder? If it's possible how to change it in the code?
Thank you!