kermitt2 / pdfalto

PDF to XML ALTO file converter
GNU General Public License v2.0

Invalid alto xml #136

Closed e-tornike closed 2 years ago

e-tornike commented 2 years ago

Hi there,

I seem to be getting invalid ALTO XML files.

I am using a forked version, which has a couple of changed flags in install_deps.sh, and installing via Docker. The Dockerfile is the following:

FROM ubuntu:20.04

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update \
    && apt-get install -y wget cmake clang git pkg-config

# See installation instructions @ https://github.com/kermitt2/pdfalto
RUN mkdir /home/pdfalto \
    && cd /home/pdfalto \
    && git clone https://github.com/e-tornike/pdfalto \
    && cd pdfalto \
    && ./install_deps.sh \
    && git submodule update --init --recursive \
    && cmake ./ \
    && make

USER root

ENTRYPOINT ["tail", "-f", "/dev/null"]

To reproduce:

docker build -f /PATH/TO/Dockerfile -t pdfalto:latest .
docker run -d -v /PATH/TO/INPUT/DIR:/home/pdf/input -v /PATH/TO/OUTPUT/DIR:/home/pdf/output --name pdfaltocontainer pdfalto:latest
docker exec -it pdfaltocontainer /home/pdfalto/pdfalto/pdfalto /home/pdf/input/INPUT.pdf /home/pdf/output/OUTPUT.xml

Evaluating the validity fails (in Python):

from xmlschema import XMLSchema

input_file = "/PATH/TO/OUTPUT/DIR/OUTPUT.xml"
schema = XMLSchema("http://www.loc.gov/standards/alto/v4/alto.xsd")
schema.is_valid(input_file)

I appreciate any help :) Cheers!

kermitt2 commented 2 years ago

Hi @e-tornike!

It might be a mismatch between the schema version you use in your Python script (v4) and the one declared in the ALTO file (v3), so try changing:

schema = XMLSchema("http://www.loc.gov/standards/alto/v4/alto.xsd")

into

schema = XMLSchema("http://www.loc.gov/standards/alto/v3/alto.xsd")
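To avoid this kind of mismatch, the schema could be picked from the namespace that the ALTO file itself declares. Here is a minimal sketch using only the standard library. `alto_schema_url` is a hypothetical helper (not part of pdfalto or xmlschema), and the v4 namespace is assumed to follow the same pattern as the v3 one:

```python
import xml.etree.ElementTree as ET

# Namespace -> schema URL. The v3 namespace and both schema URLs appear in
# this thread; the v4 namespace is an assumption made by analogy with v3.
SCHEMAS = {
    "http://www.loc.gov/standards/alto/ns-v3#": "http://www.loc.gov/standards/alto/v3/alto.xsd",
    "http://www.loc.gov/standards/alto/ns-v4#": "http://www.loc.gov/standards/alto/v4/alto.xsd",
}


def alto_schema_url(xml_bytes):
    """Return the XSD URL matching the namespace of the root element."""
    root = ET.fromstring(xml_bytes)
    # ElementTree writes qualified tags as "{namespace}localname".
    namespace = root.tag[1:].partition("}")[0] if root.tag.startswith("{") else ""
    if namespace not in SCHEMAS:
        raise ValueError(f"unrecognized ALTO namespace: {namespace!r}")
    return SCHEMAS[namespace]
```

The returned URL could then be passed to XMLSchema(...) or to xmllint --schema instead of a hard-coded version.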

I double-checked with a current build of master: with the right schema (v3), it validates:

lopez@work:~/pdfalto$ ./pdfalto /media/lopez/data2/training-grobid-working/training-0.7.1/acl/pdf/2020.conll-1.26.pdf
lopez@work:~/pdfalto$ xmllint --schema http://www.loc.gov/standards/alto/v3/alto.xsd  /media/lopez/data2/training-grobid-working/training-0.7.1/acl/pdf/2020.conll-1.26.xml
/media/lopez/data2/training-grobid-working/training-0.7.1/acl/pdf/2020.conll-1.26.xml validates

If I manually change the ALTO schema version declared in the XML header (v3 to v4), it also validates with v4:

lopez@work:~/pdfalto$ xmllint --schema http://www.loc.gov/standards/alto/v4/alto.xsd  /media/lopez/data2/training-grobid-working/training-0.7.1/acl/pdf/2020.conll-1.26-v4.xml
/media/lopez/data2/training-grobid-working/training-0.7.1/acl/pdf/2020.conll-1.26-v4.xml validates

e-tornike commented 2 years ago

Hi @kermitt2,

thanks for the quick reply. I've tested both v3 and v4, but got the same results.

I am able to produce a valid XML using the PDF you used (2020.conll-1.26) or using some others (e.g., 2021.acl-long.1), but validation fails on certain PDFs (e.g., arXiv:2112.11446, arXiv:2112.11176, 2021.emnlp-main.4).

For the non-validating PDFs, the error message is the same. For arXiv:2112.11446, it's:

$ xmllint --schema http://www.loc.gov/standards/alto/v3/alto.xsd ../example_pdfs/output/2112.11176.xml

../example_pdfs/output/2112.11176.xml:2: element TextLine: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v3#}TextLine': Missing child element(s). Expected is one of ( {http://www.loc.gov/standards/alto/ns-v3#}Shape, {http://www.loc.gov/standards/alto/ns-v3#}String ).

kermitt2 commented 2 years ago

Thank you @e-tornike for providing some example cases!

Validation succeeds for me on 2 of the 3 PDFs you indicated, with current master:

xmllint --schema http://www.loc.gov/standards/alto/v3/alto.xsd  /media/lopez/data2/training-grobid-working/training-0.7.1/acl/pdf/2021.emnlp-main.4.xml
/media/lopez/data2/training-grobid-working/training-0.7.1/acl/pdf/2021.emnlp-main.4.xml validates

xmllint --schema http://www.loc.gov/standards/alto/v3/alto.xsd  /media/lopez/data2/training-grobid-working/training-0.7.1/arxiv/pdf/2112.11176.xml
/media/lopez/data2/training-grobid-working/training-0.7.1/arxiv/pdf/2112.11176.xml validates

But I reproduced the problem for the third one (arXiv:2112.11446), with the same error:

/media/lopez/data2/training-grobid-working/training-0.7.1/arxiv/pdf/2112.11446.xml:2: element TextLine: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v3#}TextLine': Missing child element(s). Expected is one of ( {http://www.loc.gov/standards/alto/ns-v3#}Shape, {http://www.loc.gov/standards/alto/ns-v3#}String ).

Indeed, there is a case where an empty TextLine element can be produced, related to line number identification, so I made a quick fix in 8bb209c0c21476ee904ac2518b807828dd6db732 to avoid this.
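For anyone stuck on an older build, the offending elements can be located before validating. This is a rough sketch with the standard library only. `empty_text_lines` is a hypothetical helper, and it only catches TextLine elements with no children at all, which is the case reported by xmllint above; it does not replace schema validation:

```python
import xml.etree.ElementTree as ET


def empty_text_lines(xml_bytes):
    """Return the ID attribute (None if absent) of each childless TextLine."""
    root = ET.fromstring(xml_bytes)
    found = []
    for element in root.iter():
        # Compare local names so the check works for any ALTO namespace version.
        local_name = element.tag.rpartition("}")[2]
        if local_name == "TextLine" and len(element) == 0:
            found.append(element.get("ID"))
    return found
```

A non-empty result points at the lines that trigger the "Missing child element(s)" error.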

All cases validate now, and this should solve the issue in general.

e-tornike commented 2 years ago

Thank you, @kermitt2, for the quick fix! I can verify that the XML files now validate. Cheers!