kba / transkribus-to-prima

Convert Transkribus PAGE-XML to standard PAGE-XML

Empty TextEquivs #21

Open kba opened 2 weeks ago

kba commented 2 weeks ago
The reason I looked at `transkribus-to-prima` is that we noticed something new and strange in Transkribus page files: they contain text regions (`TextRegion`) whose text lines (`TextLine`) contain text (`TextEquiv`), but the region's own `TextEquiv` is empty. Example: the first text region in https://raw.githubusercontent.com/UB-Mannheim/reichsanzeiger-gt/main/page-xml/1820_84_0220.xml. Converting that PAGE XML file to text with `ocr-transform` therefore results in missing text.

If that is a common problem with Transkribus files, adding a fix for it to transkribus-to-prima might be a good idea.
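
To make the idea concrete, here is a minimal sketch of what such a fix could look like. It is not the actual `transkribus-to-prima` code. The function name `fill_empty_region_text` and the helpers are hypothetical, and joining the line texts with a newline is an assumption about the desired region text. The sketch reads the namespace from the root element rather than hard-coding a PAGE version.

```python
import xml.etree.ElementTree as ET


def _ns(root):
    # Return the '{namespace}' prefix of the root tag, or '' if there is none.
    return root.tag[: root.tag.index("}") + 1] if root.tag.startswith("{") else ""


def _direct_text(elem, ns):
    # Text of the element's own TextEquiv/Unicode child, or '' if absent or empty.
    uni = elem.find(f"{ns}TextEquiv/{ns}Unicode")
    return (uni.text or "") if uni is not None else ""


def fill_empty_region_text(root):
    """Fill empty region-level text from the region's TextLine children.

    Hypothetical helper: returns the number of regions that were changed.
    """
    ns = _ns(root)
    changed = 0
    for region in root.iter(f"{ns}TextRegion"):
        if _direct_text(region, ns).strip():
            continue  # region already has text, leave it alone
        lines = [_direct_text(line, ns) for line in region.findall(f"{ns}TextLine")]
        if not any(text.strip() for text in lines):
            continue  # nothing to copy from
        equiv = region.find(f"{ns}TextEquiv")
        if equiv is None:
            # Appended as last child; a strict schema may require a different
            # position (for example before a TextStyle element).
            equiv = ET.SubElement(region, f"{ns}TextEquiv")
        uni = equiv.find(f"{ns}Unicode")
        if uni is None:
            uni = ET.SubElement(equiv, f"{ns}Unicode")
        uni.text = "\n".join(lines)  # assumption: one line per TextLine
        changed += 1
    return changed
```

Usage would be along the lines of `tree = ET.parse(path); fill_empty_region_text(tree.getroot()); tree.write(out_path)`. Note that `ElementTree` rewrites namespace prefixes on output unless the namespace is registered first with `ET.register_namespace`.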

See also https://github.com/UB-Mannheim/reichsanzeiger-gt/issues/1.

Originally posted by @stweil in https://github.com/kba/transkribus-to-prima/issues/17#issuecomment-1207271758