jscancella / NYTribuneOCRExperiments

Experiments in generating better OCR for the NY Tribune newspapers from Chronam
GNU General Public License v3.0

Convert from hOCR to ALTO #1

Open jscancella opened 3 years ago

jscancella commented 3 years ago

Look at https://github.com/UB-Mannheim/ocr-fileformat for converting Tesseract hOCR output to the ALTO format used by Chronam.

jscancella commented 3 years ago

Windows PowerShell

docker run --rm -it -v ${pwd}:/data ubma/ocr-fileformat ocr-transform hocr alto2.0 0001_xStart0_xEnd937.hocr 0001_xStart0_xEnd937.alto -- '!indent=yes'

Linux bash

docker run --rm -it -v "$PWD":/data ubma/ocr-fileformat ocr-transform hocr alto2.0 0001_xStart0_xEnd937.hocr 0001_xStart0_xEnd937.alto -- '!indent=yes'
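
To convert more than one file, the same command can be wrapped in a loop. Below is a minimal bash sketch, assuming every hOCR file sits in the current directory and ends in .hocr. The names alto_name, convert_all, and DRY_RUN are hypothetical helpers made up for this example, not part of ocr-fileformat. The -it flags are dropped on the assumption that no interactive terminal is needed inside a loop.

```shell
#!/usr/bin/env bash

# Hypothetical helper: derive the ALTO output name from an hOCR input name,
# e.g. page.hocr -> page.alto
alto_name() {
  printf '%s\n' "${1%.hocr}.alto"
}

# Hypothetical helper: convert every .hocr file in the current directory.
# Set DRY_RUN=1 to print the docker commands instead of running them.
convert_all() {
  local f
  for f in *.hocr; do
    # If nothing matches, the glob stays literal; skip it.
    [ -e "$f" ] || continue
    local cmd=(docker run --rm -v "$PWD":/data ubma/ocr-fileformat
      ocr-transform hocr alto2.0 "$f" "$(alto_name "$f")" -- '!indent=yes')
    if [ "${DRY_RUN:-0}" = 1 ]; then
      printf '%s\n' "${cmd[*]}"
    else
      "${cmd[@]}"
    fi
  done
}
```

After sourcing the script, run DRY_RUN=1 convert_all first to check the generated commands, then convert_all to do the actual conversion.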

jscancella commented 3 years ago

Example file: 0001_xStart0_xEnd937.hocr.txt. Rename it to 0001_xStart0_xEnd937.hocr before running the commands above.