mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0

What methods do you recommend for converting PageXML to Alto #545

particitae closed this issue 12 months ago

particitae commented 1 year ago

Hi, I've tried several tools (XSLT stylesheets and standalone software) and none of them work. What methods do you recommend for converting PageXML to ALTO? Perhaps the better way is to use the kraken library.

Thanks for your answer.

mittagessen commented 12 months ago

Sorry for the delay. You can use kraken, but the conversion isn't round-trippable and you will lose information that isn't "useful" for processing inside kraken. A fairly decent tool seems to be ocr-fileformat (https://github.com/UB-Mannheim/ocr-fileformat), but I haven't personally used it.

particitae commented 12 months ago

ocr-fileformat doesn't work.

The conversion with the hOCR format seems OK.

mittagessen commented 12 months ago

On 23/10/09 06:09AM, Particitae wrote:

> ocr-fileformat doesn't work.
>
> The conversion with the hOCR format seems OK.

Hm, OK. What doesn't work, or what information are you losing? If you're only interested in regions and lines and don't mind everything else being discarded, you can use the XMLPage -> Segmentation -> serialization.serialize pipeline to convert between the different formats, but it is absolutely not designed to be lossless.
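
To make the "regions and lines only" trade-off concrete, here is a minimal standard-library Python sketch of such a lossy conversion. It is not kraken's implementation and does not use the XMLPage or serialization.serialize code path; the function name `pagexml_to_alto` and its helpers are hypothetical. It assumes PAGE files that store polygons in a `points` attribute on `Coords` (2013 and later schema versions), and the output has not been validated against the ALTO schema.

```python
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v4#"


def _local(tag):
    """Drop the '{namespace}' prefix so any PAGE schema version matches."""
    return tag.rsplit("}", 1)[-1]


def _children(elem, name):
    """Direct children of elem with the given local tag name."""
    return [c for c in elem if _local(c.tag) == name]


def _bbox(elem):
    """ALTO box attributes from the element's own Coords polygon, or {}."""
    coords = _children(elem, "Coords")
    pts = [p.split(",") for p in coords[0].get("points", "").split()] if coords else []
    if not pts:
        return {}
    xs = [int(float(x)) for x, _ in pts]
    ys = [int(float(y)) for _, y in pts]
    box = {"HPOS": min(xs), "VPOS": min(ys),
           "WIDTH": max(xs) - min(xs), "HEIGHT": max(ys) - min(ys)}
    return {k: str(v) for k, v in box.items()}


def _text(line):
    """Line-level transcription (TextEquiv/Unicode); word-level text is ignored."""
    for equiv in _children(line, "TextEquiv"):
        for uni in _children(equiv, "Unicode"):
            return uni.text or ""
    return ""


def pagexml_to_alto(page_xml):
    """Convert a PAGE XML string to a bare-bones ALTO string (hypothetical helper).

    Only text regions, text lines, their bounding boxes and line text survive.
    Polygons, baselines, reading order, metadata and so on are discarded.
    """
    root = ET.fromstring(page_xml)
    page = next((e for e in root.iter() if _local(e.tag) == "Page"), None)
    if page is None:
        raise ValueError("no <Page> element found")

    def q(name):
        return f"{{{ALTO_NS}}}{name}"

    alto = ET.Element(q("alto"))
    desc = ET.SubElement(alto, q("Description"))
    ET.SubElement(desc, q("MeasurementUnit")).text = "pixel"
    layout = ET.SubElement(alto, q("Layout"))
    out_page = ET.SubElement(layout, q("Page"), {
        "WIDTH": page.get("imageWidth", "0"),
        "HEIGHT": page.get("imageHeight", "0"),
    })
    space = ET.SubElement(out_page, q("PrintSpace"))

    # Nested regions are flattened: every TextRegion becomes one TextBlock.
    regions = [e for e in page.iter() if _local(e.tag) == "TextRegion"]
    for r_idx, region in enumerate(regions):
        block = ET.SubElement(space, q("TextBlock"), {
            "ID": region.get("id", f"block_{r_idx}"), **_bbox(region)})
        for l_idx, line in enumerate(_children(region, "TextLine")):
            box = _bbox(line)
            out_line = ET.SubElement(block, q("TextLine"), {
                "ID": line.get("id", f"line_{r_idx}_{l_idx}"), **box})
            # One String per line, reusing the line box: no word segmentation.
            ET.SubElement(out_line, q("String"), {"CONTENT": _text(line), **box})

    ET.register_namespace("", ALTO_NS)
    return ET.tostring(alto, encoding="unicode")
```

Calling `pagexml_to_alto(open("page.xml", "rb").read())` on a PAGE file (the filename is a placeholder) returns the ALTO document as a string. Collapsing polygons to bounding boxes and dropping baselines is exactly the kind of loss discussed above, so a sketch like this only fits when regions, lines and line text are all you need.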