kermitt2 / pdfalto

PDF to XML ALTO file converter
GNU General Public License v2.0

pdfalto does not generate valid alto xml files #70

Open mauvilsa opened 4 years ago

mauvilsa commented 4 years ago

I have cloned the repository, successfully compiled the pdfalto tool as instructed in the readme, and processed a pdf file to get a few files as output, including an xml that appears to be an alto xml file. However, if I validate the generated xml against the alto schema, it fails for multiple reasons.

This seems to be an important issue since the whole objective of the pdfalto tool is generating alto files.

Steps to reproduce:

  1. Use the pdfalto tool to convert a pdf to an alto xml, i.e. pdfalto file.pdf file_alto.xml
  2. Download the alto schema linked in the readme of pdfalto, i.e. wget https://raw.githubusercontent.com/kermitt2/pdfalto/master/schema/alto.xsd
  3. Validate generated alto xml against the schema, i.e. xmllint --noout --schema alto.xsd file_alto.xml
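
The validation in step 3 can also be scripted. The following is a minimal Python sketch, assuming xmllint is installed and on the PATH; the helper names xmllint_command and validates are hypothetical and not part of pdfalto.

```python
import subprocess


def xmllint_command(xml_path, schema_path):
    """Build the same xmllint call as in step 3 of the reproduction steps."""
    return ["xmllint", "--noout", "--schema", schema_path, xml_path]


def validates(xml_path, schema_path="alto.xsd"):
    """Return True if xmllint exits with status 0, i.e. the file validates.

    Assumption: xmllint is available on the PATH and alto.xsd was
    downloaded beforehand (step 2).
    """
    result = subprocess.run(
        xmllint_command(xml_path, schema_path),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

Calling validates("file_alto.xml") would then give a single boolean that can be used in a test script.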

There are several reasons for the xml to be invalid:

  1. The namespace in the xml is wrong: it has xmlns="http://www.loc.gov/standards/alto/v3/alto.xsd", but it must be http://www.loc.gov/standards/alto/ns-v3#
  2. If the namespace is fixed to the correct one, the xml still fails to validate
    • The date format in the processingDateTime is invalid
    • FONTSTYLE and FONTCOLOR don't have the correct format.
    • TextLine element(s) in the wrong location.
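
The namespace problem in point 1 can be checked without a schema validator. This is a small sketch using only the Python standard library; the function names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URI that the ALTO v3 schema expects on the root element.
ALTO_V3_NS = "http://www.loc.gov/standards/alto/ns-v3#"


def root_namespace(xml_text):
    """Return the namespace URI of the root element, or "" if it has none."""
    tag = ET.fromstring(xml_text).tag  # ElementTree stores tags as "{uri}name"
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""


def has_alto_v3_namespace(xml_text):
    """True if the root element is in the ALTO v3 namespace."""
    return root_namespace(xml_text) == ALTO_V3_NS
```

A document whose root declares the schema URL as its namespace, as described above, fails this check, while one that declares http://www.loc.gov/standards/alto/ns-v3# passes it.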

For one example that I did, the specific errors are:

file_alto.xml:10: element processingDateTime: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v3#}processingDateTime': 'Tue Sep  3 21:30:12 2019
' is not a valid value of the union type '{http://www.loc.gov/standards/alto/ns-v3#}dateTimeType'.
file_alto.xml:21: element TextStyle: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v3#}TextStyle', attribute 'FONTCOLOR': '#000000' is not a valid value of the atomic type 'xs:hexBinary'.
file_alto.xml:21: element TextStyle: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v3#}TextStyle', attribute 'FONTSTYLE': [facet 'minLength'] The value '' has a length of '0'; this underruns the allowed minimum length of '1'.
file_alto.xml:21: element TextStyle: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v3#}TextStyle', attribute 'FONTSTYLE': '' is not a valid value of the list type '{http://www.loc.gov/standards/alto/ns-v3#}fontStylesType'.
file_alto.xml:22: element TextStyle: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v3#}TextStyle', attribute 'FONTCOLOR': '#000000' is not a valid value of the atomic type 'xs:hexBinary'.
file_alto.xml:23: element TextStyle: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v3#}TextStyle', attribute 'FONTCOLOR': '#000000' is not a valid value of the atomic type 'xs:hexBinary'.
file_alto.xml:23: element TextStyle: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v3#}TextStyle', attribute 'FONTSTYLE': [facet 'minLength'] The value '' has a length of '0'; this underruns the allowed minimum length of '1'.
file_alto.xml:23: element TextStyle: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v3#}TextStyle', attribute 'FONTSTYLE': '' is not a valid value of the list type '{http://www.loc.gov/standards/alto/ns-v3#}fontStylesType'.
file_alto.xml:24: element TextStyle: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v3#}TextStyle', attribute 'FONTCOLOR': '#000000' is not a valid value of the atomic type 'xs:hexBinary'.
file_alto.xml:25: element TextStyle: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v3#}TextStyle', attribute 'FONTCOLOR': '#000000' is not a valid value of the atomic type 'xs:hexBinary'.
file_alto.xml:26: element TextStyle: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v3#}TextStyle', attribute 'FONTCOLOR': '#000000' is not a valid value of the atomic type 'xs:hexBinary'.
file_alto.xml:27: element TextStyle: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v3#}TextStyle', attribute 'FONTCOLOR': '#000000' is not a valid value of the atomic type 'xs:hexBinary'.
file_alto.xml:27: element TextStyle: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v3#}TextStyle', attribute 'FONTSTYLE': [facet 'minLength'] The value '' has a length of '0'; this underruns the allowed minimum length of '1'.
file_alto.xml:27: element TextStyle: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v3#}TextStyle', attribute 'FONTSTYLE': '' is not a valid value of the list type '{http://www.loc.gov/standards/alto/ns-v3#}fontStylesType'.
file_alto.xml:28: element TextStyle: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v3#}TextStyle', attribute 'FONTCOLOR': '#000000' is not a valid value of the atomic type 'xs:hexBinary'.
file_alto.xml:29: element TextStyle: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v3#}TextStyle', attribute 'FONTCOLOR': '#000000' is not a valid value of the atomic type 'xs:hexBinary'.
file_alto.xml:29: element TextStyle: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v3#}TextStyle', attribute 'FONTSTYLE': [facet 'minLength'] The value '' has a length of '0'; this underruns the allowed minimum length of '1'.
file_alto.xml:29: element TextStyle: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v3#}TextStyle', attribute 'FONTSTYLE': '' is not a valid value of the list type '{http://www.loc.gov/standards/alto/ns-v3#}fontStylesType'.
file_alto.xml:30: element TextStyle: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v3#}TextStyle', attribute 'FONTCOLOR': '#000000' is not a valid value of the atomic type 'xs:hexBinary'.
file_alto.xml:31: element TextStyle: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v3#}TextStyle', attribute 'FONTCOLOR': '#000000' is not a valid value of the atomic type 'xs:hexBinary'.
file_alto.xml:31: element TextStyle: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v3#}TextStyle', attribute 'FONTSTYLE': [facet 'minLength'] The value '' has a length of '0'; this underruns the allowed minimum length of '1'.
file_alto.xml:31: element TextStyle: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v3#}TextStyle', attribute 'FONTSTYLE': '' is not a valid value of the list type '{http://www.loc.gov/standards/alto/ns-v3#}fontStylesType'.
file_alto.xml:32: element TextStyle: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v3#}TextStyle', attribute 'FONTCOLOR': '#000000' is not a valid value of the atomic type 'xs:hexBinary'.
file_alto.xml:32: element TextStyle: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v3#}TextStyle', attribute 'FONTSTYLE': [facet 'minLength'] The value '' has a length of '0'; this underruns the allowed minimum length of '1'.
file_alto.xml:32: element TextStyle: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v3#}TextStyle', attribute 'FONTSTYLE': '' is not a valid value of the list type '{http://www.loc.gov/standards/alto/ns-v3#}fontStylesType'.
file_alto.xml:37: element TextLine: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v3#}TextLine': This element is not expected. Expected is one of ( {http://www.loc.gov/standards/alto/ns-v3#}Shape, {http://www.loc.gov/standards/alto/ns-v3#}TextBlock, {http://www.loc.gov/standards/alto/ns-v3#}Illustration, {http://www.loc.gov/standards/alto/ns-v3#}GraphicalElement, {http://www.loc.gov/standards/alto/ns-v3#}ComposedBlock ).
file_alto.xml:416: element TextLine: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v3#}TextLine': This element is not expected. Expected is one of ( {http://www.loc.gov/standards/alto/ns-v3#}Shape, {http://www.loc.gov/standards/alto/ns-v3#}TextBlock, {http://www.loc.gov/standards/alto/ns-v3#}Illustration, {http://www.loc.gov/standards/alto/ns-v3#}GraphicalElement, {http://www.loc.gov/standards/alto/ns-v3#}ComposedBlock ).
file_alto.xml fails to validate
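
The date and TextStyle errors above point to three value-level changes: an ISO 8601 timestamp, a FONTCOLOR without the leading '#', and no FONTSTYLE attribute when it would be empty. The sketch below illustrates these in Python; it is not the pdfalto implementation (which is C++), the function names are hypothetical, and it assumes an English/C locale for the day and month names.

```python
from datetime import datetime


def to_xs_datetime(ctime_text):
    """Convert a ctime-style timestamp to an ISO 8601 (xs:dateTime) string.

    Assumption: day and month names are in English, as in the log above.
    """
    parsed = datetime.strptime(ctime_text.strip(), "%a %b %d %H:%M:%S %Y")
    return parsed.isoformat()


def clean_text_style(attrs):
    """Return a copy of TextStyle attributes that avoids the errors above."""
    cleaned = dict(attrs)
    # xs:hexBinary allows hex digits only, so drop the leading '#'.
    if "FONTCOLOR" in cleaned:
        cleaned["FONTCOLOR"] = cleaned["FONTCOLOR"].lstrip("#")
    # An empty FONTSTYLE underruns minLength 1; omit the attribute instead.
    if cleaned.get("FONTSTYLE") == "":
        del cleaned["FONTSTYLE"]
    return cleaned
```

For the timestamp in the first error, to_xs_datetime("Tue Sep  3 21:30:12 2019\n") gives "2019-09-03T21:30:12".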

Aazhar commented 4 years ago

hello could you please share the pdfalto options you used to generate the xml file? thanks

mauvilsa commented 4 years ago

@Aazhar no options. The exact command is in the first comment right after "Steps to reproduce".

jtlz2 commented 4 years ago

I just came across this too. Tesseract (for example) produces alto v3.0 with a header that starts like this:

<?xml version="1.0" encoding="UTF-8"?>
<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v3# http://www.loc.gov/alto/v3/alto-3-0.xsd">

A pdfalto header I generated started:

<?xml version="1.0" encoding="UTF-8"?>
<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#"

I ran into this when trying to subsequently convert the alto to txt using alto-ocr-text by @cneud , who can perhaps assist here.

That in turn was to use ocreval by @eddieantonio, since it doesn't ingest alto data.

Thanks for all help and a super-useful tool

mauvilsa commented 4 years ago

I have created a pull request (https://github.com/kermitt2/pdfalto/pull/72) with changes that allow generating alto xmls that are valid according to the schema. Even with these changes, the pdfalto tool does not generate a valid alto by default. For the xmls to be valid the option -blocks is required, i.e. call pdfalto -blocks file.pdf file_alto.xml.

I have tested the changes with many weird pdfs (the ones at https://github.com/mozilla/pdf.js/tree/master/test/pdfs) to make sure that many edge cases are taken care of.

The pdfalto tool has several options and many of them I guess would make the xmls invalid due to other causes. In my opinion the tool should generate by default valid alto files, so if -blocks is not given the alto xml should have a single text block that includes all text lines. Other options that cause the xmls to be invalid should clearly state this in the help.
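
To illustrate the "single text block" idea, here is a post-processing sketch in Python that wraps TextLine elements found outside a TextBlock into a new TextBlock. It is only an illustration of the restructuring, not how pdfalto does or should do it: the function name and the generated ID values are invented, and the schema may require further attributes on TextBlock (such as position and size) that this sketch does not compute.

```python
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v3#"


def wrap_loose_text_lines(root):
    """Wrap TextLine children of non-TextBlock elements in one TextBlock each.

    Returns the number of TextBlock elements that were created.
    """
    line_tag = f"{{{ALTO_NS}}}TextLine"
    block_tag = f"{{{ALTO_NS}}}TextBlock"
    created = 0
    # Snapshot the elements first, because the tree is modified in the loop.
    for parent in list(root.iter()):
        if parent.tag == block_tag:
            continue
        lines = [child for child in parent if child.tag == line_tag]
        if not lines:
            continue
        position = list(parent).index(lines[0])
        # Hypothetical ID scheme, chosen only for this sketch.
        block = ET.Element(block_tag, {"ID": f"wrapped_block_{created}"})
        for line in lines:
            parent.remove(line)
            block.append(line)
        parent.insert(position, block)
        created += 1
    return created
```

After this step every TextLine has a TextBlock parent, which addresses the "This element is not expected" errors; whether the result validates still depends on the remaining required attributes.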

kermitt2 commented 4 years ago

Many thanks @mauvilsa for the issue and the PR !

I've started to review the options. You're absolutely right about the -blocks option, it does not make sense to have a default mode that does not validate with the alto schema. I think the best would be to simply remove the -blocks option and always output the block information. This is more consistent with the goal of the tool and there is no particular reason actually to produce a file without block information. It's probably the same for the -noText option.

Aazhar commented 4 years ago

@kermitt2 yes, basically the -blocks option is an adaptation for Grobid, but indeed pdfalto should produce a valid document, and then it would be up to the parsers to handle particularities. Regarding the PR, it looks fine to me. Nevertheless, it would be nice to remove this option too; otherwise I'll open another PR. WDYT?

mauvilsa commented 4 years ago

I am fine with merging https://github.com/kermitt2/pdfalto/pull/72 and having another pull request for the -blocks option.

kermitt2 commented 4 years ago

So I plan the following:

I will try to manage the releases better, with release notes. For instance, we are in the middle of updating the binaries of pdfalto in GROBID and we are mixing different versions with different output depending on the platform, which is not good :D

I have also compiled a large set of PDFs (including pathological ones and the ones you pointed to, @mauvilsa: https://github.com/mozilla/pdf.js/tree/master/test/pdfs) for more systematic tests.

Again many thanks for your contributions!

giancarlobi commented 3 years ago

@kermitt2 Thanks a lot for this really useful code. I notice that the xml output (release 0.4.0) starts with this:

<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#" xsi:schemaLocation="http://www.loc.gov/standards/alto/v3/alto.xsd">

as the result of this command:

pdfalto -noLineNumbers -noImage -noImageInline -readingOrder pg_0012.pdf 12.xml

That XML is not valid, and I need to add xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" to make it validate. Am I missing anything? Thanks again
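
The effect of the missing xmlns:xsi declaration can be reproduced with any namespace-aware parser: an attribute with an undeclared prefix is rejected before schema validation even starts. Below is a minimal Python sketch with the standard library; the helper name parses is made up, and the root elements are shortened to the attributes quoted above.

```python
import xml.etree.ElementTree as ET

NS = 'xmlns="http://www.loc.gov/standards/alto/ns-v3#"'
XSI = 'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"'
LOCATION = 'xsi:schemaLocation="http://www.loc.gov/standards/alto/v3/alto.xsd"'


def parses(xml_text):
    """Return True if a namespace-aware parser accepts the document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        # Raised, among other cases, when a prefix such as xsi is unbound.
        return False


# Root element as quoted above, without the xsi declaration.
without_xsi = f"<alto {NS} {LOCATION}></alto>"
# Same element with xmlns:xsi added.
with_xsi = f"<alto {NS} {XSI} {LOCATION}></alto>"
```

Here parses(without_xsi) is False and parses(with_xsi) is True. This only checks that the prefix is bound; it says nothing about schema validity.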

kermitt2 commented 3 years ago

Thank you @giancarlobi ! Indeed the xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" is required for the xsi namespace (it looks obvious when I look at it now, but you know xml...).

> xmllint --schema http://www.loc.gov/standards/alto/v3/alto.xsd ~/tmp/in/020C2FBB3A8426B56756229614573FAB07D5F5AC.xml
...
/home/lopez/tmp/in/020C2FBB3A8426B56756229614573FAB07D5F5AC.xml validates

Fixed with commit 39545c932e60374e54b61f12745de4b71f63918d

Sorry for overlooking this!

giancarlobi commented 3 years ago

@kermitt2 Thanks to you and for this really useful code!!!

RoxPoNinja commented 2 years ago

Hi @kermitt2, I'm trying to create ALTO XML files from born-digital PDF files to see if we can integrate them in our METS/ALTO viewer for digitized content. Unfortunately the ALTO XML I get is still not valid. This is what Altova XMLSpy says when trying to validate:

File 2020-07-01_01-00005-alto.xml is not valid.
    Schema location value 'http://www.loc.gov/standards/alto/v3/alto.xsd' must contain pairs of URIs.
        Error location: alto / @xsi:schemaLocation
        Details
            schemaLocation_pairs: Schema location value 'http://www.loc.gov/standards/alto/v3/alto.xsd' must contain pairs of URIs.

I did read the above comments, but I'm not sure what I'm doing wrong. This is the command I used (on Ubuntu): /pdfalto -noImage -noLineNumbers -f 1 -l 1 2020-07-01_01-00145.pdf

Thank you for your help!

RoxPoNinja commented 2 years ago

Solved it by replacing

const char *ALTO_LOCATION = "http://www.loc.gov/standards/alto/v3/alto.xsd";

in src/ConstantsXML.cc with

const char *ALTO_LOCATION = "http://www.loc.gov/standards/alto/ns-v3# http://www.loc.gov/standards/alto/v3/alto.xsd";
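
The reason this works is that xsi:schemaLocation holds whitespace-separated pairs of namespace URI and schema location, so a single URL is rejected. A short Python sketch of that rule follows; the function name is hypothetical.

```python
def schema_location_pairs(value):
    """Split an xsi:schemaLocation value into (namespace, location) pairs.

    Raises ValueError if the value does not contain an even number of URIs.
    """
    tokens = value.split()
    if len(tokens) % 2 != 0:
        raise ValueError("schemaLocation must contain pairs of URIs")
    return list(zip(tokens[0::2], tokens[1::2]))
```

The original constant raises ValueError here, while the replacement yields one pair: the namespace http://www.loc.gov/standards/alto/ns-v3# and the location http://www.loc.gov/standards/alto/v3/alto.xsd.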
