Open Wangdanchunbufuz opened 7 months ago
Thank you very much for your excellent paper. I plan to implement a dense captioning model on my own dataset. My baseline model is not Transformer-based. Can I still use the Cross-Modal Cycle Consistency loss that you proposed?
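Hi, thanks for your interest. The loss is independent of the architecture, so as long as your data and model outputs fit (that is, your model produces a sequence of clip-level embeddings and a sequence of sentence-level embeddings in a shared space), you can certainly use the Cross-Modal Cycle Consistency loss.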
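To illustrate why the architecture does not matter, here is a minimal, framework-free sketch of a cycle-consistency loss of this kind. It is an illustration under assumptions, not the code from the repository: the function names are hypothetical, only one direction (sentence → clip → sentence) is shown, and soft nearest neighbors are assumed to use a softmax over negative squared Euclidean distances. The only inputs are two lists of embedding vectors, so any encoder that yields them would do. In a real training setup the same steps would be written with the tensor operations of an autodiff framework so that gradients can flow.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_nearest_neighbor(query, candidates):
    """Weighted average of candidates; closer candidates get more weight."""
    weights = softmax([-sq_dist(query, c) for c in candidates])
    dim = len(query)
    return [sum(w * c[d] for w, c in zip(weights, candidates)) for d in range(dim)]

def cycle_consistency_loss(clip_embs, sent_embs):
    """Hypothetical one-directional sketch: sentence -> clip -> sentence.

    clip_embs: list of clip embedding vectors (from any video encoder)
    sent_embs: list of sentence embedding vectors (from any text encoder)
    Both are assumed to live in the same embedding space.
    """
    total = 0.0
    for i, sent in enumerate(sent_embs):
        # Step 1: soft nearest neighbor of sentence i among the clips.
        clip_nn = soft_nearest_neighbor(sent, clip_embs)
        # Step 2: cycle back and compute a soft location in the sentence sequence.
        beta = softmax([-sq_dist(clip_nn, s) for s in sent_embs])
        mu = sum(b * j for j, b in enumerate(beta))
        # Step 3: penalize the gap between the start index and the soft location.
        total += (i - mu) ** 2
    return total / len(sent_embs)
```

Nothing here refers to attention layers or to any other model internals, which is why a non-Transformer captioning model can use such a loss as long as it exposes both embedding sequences.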
Can I add your loss function to a non-Transformer model, such as the one in "Event-Centric Hierarchical Representation for Dense Video Captioning"?