I looked at `transkribus-to-prima` because we noticed a strange new thing in Transkribus PAGE files: they contain text regions (`TextRegion`) whose text lines (`TextLine`) contain text (`TextEquiv`), but the `TextEquiv` of the region itself is empty. Example: the first text region in https://raw.githubusercontent.com/UB-Mannheim/reichsanzeiger-gt/main/page-xml/1820_84_0220.xml. Converting that PAGE XML file to text with `ocr-transform` therefore results in missing text.
If that is a common problem with Transkribus files, adding a fix for it to `transkribus-to-prima` might be a good idea.

See also https://github.com/UB-Mannheim/reichsanzeiger-gt/issues/1.
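
To make the idea concrete, here is a minimal sketch of what such a fix could look like, written in Python with only the standard library. It is not how `transkribus-to-prima` works; the function name `fill_empty_region_text` is hypothetical, and joining the line texts with newlines is an assumption about what the region text should contain.

```python
import xml.etree.ElementTree as ET


def fill_empty_region_text(root):
    """Fill empty region-level TextEquiv/Unicode from the region's text lines.

    Hypothetical helper: returns the number of regions that were changed.
    """
    # Take the namespace from the root tag so no PAGE schema version is hard-coded.
    ns = root.tag[1:root.tag.index("}")] if root.tag.startswith("{") else ""

    def q(name):
        return f"{{{ns}}}{name}" if ns else name

    fixed = 0
    for region in root.iter(q("TextRegion")):
        # Direct child only: this is the region-level TextEquiv, not a line's.
        equiv = region.find(q("TextEquiv"))
        unicode_el = equiv.find(q("Unicode")) if equiv is not None else None
        if unicode_el is not None and (unicode_el.text or "").strip():
            continue  # region already has text, leave it alone

        # Collect the text of the region's own lines, in document order.
        lines = []
        for line in region.findall(q("TextLine")):
            text = line.findtext(f"{q('TextEquiv')}/{q('Unicode')}")
            if text:
                lines.append(text)
        if not lines:
            continue  # nothing to copy from

        # Assumption: appending at the end of the region is acceptable here;
        # strict schema validation may require a specific child order.
        if equiv is None:
            equiv = ET.SubElement(region, q("TextEquiv"))
        if unicode_el is None:
            unicode_el = ET.SubElement(equiv, q("Unicode"))
        unicode_el.text = "\n".join(lines)  # assumed join convention
        fixed += 1
    return fixed
```

For a file on disk, this could be applied with `tree = ET.parse(path)`, `fill_empty_region_text(tree.getroot())` and `tree.write(...)`; calling `ET.register_namespace("", ns)` with the file's namespace beforehand keeps the default namespace in the output.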
Originally posted by @stweil in https://github.com/kba/transkribus-to-prima/issues/17#issuecomment-1207271758