qurator-spk / eynollah

Document Layout Analysis
Apache License 2.0

How to use the generated PAGE-XML as input to TrOCR? #58

Closed by jarrod-dexter 2 years ago

jarrod-dexter commented 2 years ago

Thanks for the great library!

I am new to CV/NLP and I am trying to figure out what to do with the generated PAGE-XML (I struggled to find accessible online content about the format).

I am essentially trying to figure out how to extract a set of single sentences I could use as input to TrOCR as mentioned in the following post: https://github.com/microsoft/unilm/issues/451#issuecomment-961408406.

Would it be possible to provide some guidance?

Many thanks!

J

cneud commented 2 years ago

Sorry for the late reply!

Here is some background info on the PAGE-XML format:

More specifically, regarding your question: I am not very familiar with TrOCR yet, but I assume you would have to extract the text lines (TextLine) with their bounding boxes/polygons (Coords) from Eynollah's PAGE-XML output, crop the corresponding text line images from the page image, and feed those snippets to TrOCR for text recognition/prediction.