kba / transkribus-to-prima

Convert Transkribus PAGE-XML to standard PAGE-XML

Fix some typos (found by codespell) #17

Closed stweil closed 2 weeks ago

stweil commented 1 year ago

Signed-off-by: Stefan Weil <sw@weilnetz.de>

stweil commented 1 year ago

I looked at transkribus-to-prima because we noticed something new and strange in Transkribus PAGE-XML files: they contain text regions (TextRegion) whose text lines (TextLine) carry text (TextEquiv), while the TextEquiv of the region itself is empty. Example: the first text region in https://raw.githubusercontent.com/UB-Mannheim/reichsanzeiger-gt/main/page-xml/1820_84_0220.xml. Converting that PAGE-XML file to text with ocr-transform therefore results in missing text.

If that is a common problem with Transkribus files, adding a fix for it to transkribus-to-prima might be a good idea.
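To illustrate what such a fix could look like, here is a minimal Python sketch. It is not code from transkribus-to-prima; the function name `fill_empty_region_text` and the sample document are made up for illustration. It assumes that the region text should be the line texts joined with newlines, and it only touches regions that already have a TextEquiv element with a missing or empty Unicode child. Regions without any TextEquiv are left alone, because inserting a new element at the schema-correct position needs more care.

```python
import xml.etree.ElementTree as ET


def fill_empty_region_text(root):
    """Fill empty region-level TextEquiv/Unicode from the region's text lines.

    Hypothetical helper; returns the number of regions that were changed.
    """
    # Derive the namespace prefix from the root tag, e.g. "{http://...}PcGts".
    ns = root.tag[: root.tag.index("}") + 1] if root.tag.startswith("{") else ""
    changed = 0
    for region in root.iter(f"{ns}TextRegion"):
        # Region-level TextEquiv is a direct child of TextRegion.
        equiv = region.find(f"{ns}TextEquiv")
        if equiv is None:
            continue  # Assumption: only repair an existing, empty TextEquiv.
        # Collect the text of the direct TextLine children, in document order.
        lines = []
        for line in region.findall(f"{ns}TextLine"):
            text = line.findtext(f"{ns}TextEquiv/{ns}Unicode")
            if text:
                lines.append(text)
        if not lines:
            continue
        unicode_el = equiv.find(f"{ns}Unicode")
        if unicode_el is None:
            unicode_el = ET.SubElement(equiv, f"{ns}Unicode")
        if not (unicode_el.text or "").strip():
            unicode_el.text = "\n".join(lines)
            changed += 1
    return changed


# Simplified, made-up sample (Coords and other required elements omitted).
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="example.jpg" imageWidth="100" imageHeight="100">
    <TextRegion id="r1">
      <TextLine id="l1"><TextEquiv><Unicode>first line</Unicode></TextEquiv></TextLine>
      <TextLine id="l2"><TextEquiv><Unicode>second line</Unicode></TextEquiv></TextLine>
      <TextEquiv><Unicode></Unicode></TextEquiv>
    </TextRegion>
  </Page>
</PcGts>"""

root = ET.fromstring(SAMPLE)
print(fill_empty_region_text(root))
```

Running the function a second time on the same tree changes nothing, since a region whose TextEquiv already contains text is skipped.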

See also https://github.com/UB-Mannheim/reichsanzeiger-gt/issues/1.