Closed stweil closed 4 months ago
The reason why I looked at transkribus-to-prima is that we noticed a strange new thing in Transkribus PAGE files: they contain text regions (TextRegion) with text lines (TextLine) which contain text (TextEquiv), but the text for the region itself (TextEquiv) is empty. Example: first text region in https://raw.githubusercontent.com/UB-Mannheim/reichsanzeiger-gt/main/page-xml/1820_84_0220.xml. Converting that PAGE XML file to text with ocr-transform therefore results in missing text.
If that is a common problem with Transkribus files, adding a fix for it to transkribus-to-prima might be a good idea.
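To make the idea concrete, here is a minimal sketch in Python of what such a fix could do: for every TextRegion whose own TextEquiv is empty, copy the joined text of its direct TextLine children into it. This is an illustration only, not how transkribus-to-prima or ocr-transform work. The function names are hypothetical, joining lines with a newline is an assumption, and regions that have no TextEquiv element at all are left untouched.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Return the tag name without its XML namespace prefix."""
    return tag.rsplit("}", 1)[-1]


def find_unicode(element):
    """Return the Unicode element of the element's own TextEquiv, or None."""
    for child in element:
        if local(child.tag) == "TextEquiv":
            for sub in child:
                if local(sub.tag) == "Unicode":
                    return sub
    return None


def direct_text(element):
    """Return the text of the element's own TextEquiv ('' if absent or empty)."""
    node = find_unicode(element)
    return (node.text or "") if node is not None else ""


def fill_empty_region_text(root):
    """Fill empty region-level text from line-level text (hypothetical helper).

    Returns the number of regions that were changed.
    """
    fixed = 0
    for region in root.iter():
        if local(region.tag) != "TextRegion":
            continue
        if direct_text(region).strip():
            continue  # region already has text
        lines = [direct_text(c) for c in region if local(c.tag) == "TextLine"]
        if not any(line.strip() for line in lines):
            continue  # nothing to copy
        target = find_unicode(region)
        if target is None:
            continue  # assumption: only fill an existing, empty TextEquiv
        target.text = "\n".join(lines)  # assumption: newline as line separator
        fixed += 1
    return fixed
```

It could be applied to a file with `tree = ET.parse(path)`, then `fill_empty_region_text(tree.getroot())`, then `tree.write(...)`. Matching on local tag names keeps the sketch independent of the PAGE schema version used in the namespace.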
See also https://github.com/UB-Mannheim/reichsanzeiger-gt/issues/1.
Signed-off-by: Stefan Weil <sw@weilnetz.de>