Feature Request: Text position detection in TrOCR

Model I am using: TrOCR

Description:

I have been using TrOCR for text recognition in images, and it has been performing well. However, for some use cases it is crucial not only to recognize the text but also to determine its position within the image. This would be especially useful in document digitalization and analysis, where the position of text can carry significant meaning.

Proposed Solution:

Extend the TrOCR API with an additional method or parameter that enables text position detection.
The method or parameter could return the bounding box coordinates (X, Y, Width, Height) of each detected text element, at the character, word, or sentence level.
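
To make the request more concrete, here is a rough sketch of the kind of result I have in mind. Nothing below exists in TrOCR today: `TextBox`, its fields, and `merge_boxes` are hypothetical names used only for illustration, and I am assuming pixel coordinates with the origin at the top-left corner of the image. The helper shows how word-level boxes could be combined into a sentence-level box.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TextBox:
    """Hypothetical result type: one recognized text element and its position."""
    text: str
    x: int       # left edge in pixels (assumed top-left origin)
    y: int       # top edge in pixels
    width: int
    height: int
    level: str   # "character", "word", or "sentence"


def merge_boxes(boxes: List[TextBox], level: str = "sentence") -> TextBox:
    """Combine finer-grained boxes (e.g. words) into one enclosing box."""
    if not boxes:
        raise ValueError("need at least one box to merge")
    left = min(b.x for b in boxes)
    top = min(b.y for b in boxes)
    right = max(b.x + b.width for b in boxes)
    bottom = max(b.y + b.height for b in boxes)
    # Join the texts in the order given; the caller is assumed to pass
    # the boxes in reading order.
    text = " ".join(b.text for b in boxes)
    return TextBox(text, left, top, right - left, bottom - top, level)


# Made-up example values, not real model output.
words = [
    TextBox("Invoice", x=40, y=20, width=90, height=24, level="word"),
    TextBox("No.", x=138, y=22, width=36, height=22, level="word"),
]
line = merge_boxes(words)
```

The exact shape (a list of objects, a dict, or extra fields on the existing output) is open for discussion; the important part is that each recognized element carries its (X, Y, Width, Height) alongside the text.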

Use Case:

This feature would be helpful in various scenarios such as:

- Document digitalization, where the position of text is crucial for understanding the document structure.
- Image analysis, where text position could provide additional context.

Additional Information:

I'm willing to contribute to this.

I tried searching for this a lot, but maybe I'm missing something. If so, please let me know how to get it done.
