Closed — Puiching-Memory closed this issue 6 months ago
We follow SVD's pipeline. If a video contains a lot of text, it is hard to generate because the captioning model cannot capture that text.
In the future, we plan to use an OCR model to generate additional captions for training, which should make the model capable of text generation.
In your report, you use OCR to identify the text in frames and then eliminate scenes with too much text.
I would like to know why too much text hurts the model's generation quality.
If so, does that mean it is difficult to improve the model on text-heavy content, such as newspapers, streets with billboards, and the various signs and lane markings on roads?
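The scene-filtering step mentioned above could be sketched roughly as follows. This is only an illustration, not the project's actual code: the function name, threshold, and data layout are assumptions, and the per-frame word counts would in practice come from an OCR engine (e.g. Tesseract or PaddleOCR) run on sampled frames.

```python
def filter_text_heavy_scenes(scene_word_counts, max_avg_words=5):
    """Keep scenes whose average OCR word count per frame stays at or
    below a threshold.

    scene_word_counts: dict mapping a scene id to a list of per-frame
    word counts produced by an OCR pass (hypothetical format).
    """
    kept = []
    for scene_id, counts in scene_word_counts.items():
        avg = sum(counts) / len(counts)
        if avg <= max_avg_words:
            kept.append(scene_id)
    return kept

# Example: a street scene with little text is kept, a newspaper
# close-up with dense text is dropped.
print(filter_text_heavy_scenes({"street": [1, 0, 2], "newspaper": [40, 55]}))
# → ['street']
```

The threshold would need tuning per dataset; a stricter cutoff removes more captioning failures but also shrinks the training set.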