kermitt2 / pdfalto

PDF to XML ALTO file converter
GNU General Public License v2.0

ALTO version with latest release #44

Open ghost opened 5 years ago

ghost commented 5 years ago

Previously, we used pdfalto to generate an ALTO XML file from the PDF and then https://github.com/filak/hOCR-to-ALTO to convert the ALTO XML to an hOCR file. With the newest release of pdfalto this no longer works, since the ALTO version seems to have changed. Can you share which version of ALTO pdfalto currently produces?

Aazhar commented 5 years ago

Could you provide more information? Are there any logs or an error stack trace?

Aazhar commented 5 years ago

The ALTO schema version didn't change; version 3.1 has been used since the first pdfalto release: https://github.com/kermitt2/pdfalto/blob/master/schema/alto.xsd

ghost commented 5 years ago

Earlier, the schema in the ALTO XML was: xmlns="http://www.loc.gov/standards/alto/ns-v3#", but now I get: xmlns="http://www.loc.gov/standards/alto/v3/alto.xsd"

Aazhar commented 5 years ago

This was updated because the first link is wrong; it does not point to the schema.

burki commented 5 years ago

@Aazhar The schema location and the namespace URL don't have to be identical. xmlns should be http://www.loc.gov/standards/alto/ns-v3# (see targetNamespace="http://www.loc.gov/standards/alto/ns-v3#" in http://www.loc.gov/standards/alto/v3/alto.xsd).

For schema location, you can use something like

xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/standards/alto/v3/alto.xsd"

kermitt2 commented 3 years ago

Added xsi:schemaLocation with d49bf77204d1700b7263cb2641aa508c33058c9c

<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#" xsi:schemaLocation="http://www.loc.gov/standards/alto/v3/alto.xsd">
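
For downstream tools that have to cope with files from different pdfalto releases, the namespace difference discussed in this thread can be detected before conversion. The following is a minimal sketch using only the Python standard library; the function names `root_namespace` and `classify_alto_namespace` and the returned labels are hypothetical and not part of pdfalto or hOCR-to-ALTO. The two URIs are the ones quoted above.

```python
import xml.etree.ElementTree as ET

# Namespace URIs quoted in this thread.
ALTO_NS = "http://www.loc.gov/standards/alto/ns-v3#"
XSD_URL_AS_NS = "http://www.loc.gov/standards/alto/v3/alto.xsd"


def root_namespace(xml_text: str) -> str:
    """Return the namespace URI of the root element, or "" if it has none."""
    root = ET.fromstring(xml_text)
    # ElementTree encodes qualified names as "{namespace}localname".
    if root.tag.startswith("{"):
        return root.tag[1:].split("}", 1)[0]
    return ""


def classify_alto_namespace(xml_text: str) -> str:
    """Label the root namespace (labels are illustrative, not a standard)."""
    ns = root_namespace(xml_text)
    if ns == ALTO_NS:
        return "alto-v3"
    if ns == XSD_URL_AS_NS:
        return "xsd-url-as-namespace"
    return "unknown"
```

A converter that expects the `ns-v3#` namespace could call `classify_alto_namespace` on the file contents first and report a clear error for the other cases instead of silently matching no elements.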