Closed e-tornike closed 2 years ago
Hi @e-tornike !
It might be a mismatch between the schema version you use in your Python script (v4) and the one used in the ALTO file (v3), so try changing:
schema = XMLSchema("http://www.loc.gov/standards/alto/v4/alto.xsd")
into
schema = XMLSchema("http://www.loc.gov/standards/alto/v3/alto.xsd")
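One way to avoid hard-coding the version is to read it from the namespace declared on the root element of the ALTO file and derive the schema URL from that. The following is only a minimal sketch using the Python standard library; the helper name alto_schema_url is hypothetical, and it assumes the namespace follows the ns-vN# pattern seen in the error messages in this thread:

```python
import re
import xml.etree.ElementTree as ET

# Matches a root tag such as "{http://www.loc.gov/standards/alto/ns-v3#}alto".
# Assumption: the file uses the versioned "ns-vN#" namespace pattern.
ALTO_ROOT = re.compile(r"^\{http://www\.loc\.gov/standards/alto/ns-v(\d+)#\}alto$")


def alto_schema_url(xml_text: str) -> str:
    """Return the alto.xsd URL matching the namespace of the root element.

    Hypothetical helper, not part of pdfalto or any library.
    """
    root = ET.fromstring(xml_text)
    match = ALTO_ROOT.match(root.tag)
    if match is None:
        raise ValueError(f"not a versioned ALTO root element: {root.tag}")
    return f"http://www.loc.gov/standards/alto/v{match.group(1)}/alto.xsd"


sample = '<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#"><Layout/></alto>'
print(alto_schema_url(sample))
```

The returned URL can then be passed to XMLSchema(...) or to xmllint --schema, so the schema always follows whatever version the file itself declares.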
I double-checked with the current master build: with the right schema (v3), it validates:
lopez@work:~/pdfalto$ ./pdfalto /media/lopez/data2/training-grobid-working/training-0.7.1/acl/pdf/2020.conll-1.26.pdf
lopez@work:~/pdfalto$ xmllint --schema http://www.loc.gov/standards/alto/v3/alto.xsd /media/lopez/data2/training-grobid-working/training-0.7.1/acl/pdf/2020.conll-1.26.xml
/media/lopez/data2/training-grobid-working/training-0.7.1/acl/pdf/2020.conll-1.26.xml validates
If I manually change the ALTO schema version declared in the XML header (v3 to v4), it also validates with v4:
lopez@work:~/pdfalto$ xmllint --schema http://www.loc.gov/standards/alto/v4/alto.xsd /media/lopez/data2/training-grobid-working/training-0.7.1/acl/pdf/2020.conll-1.26-v4.xml
/media/lopez/data2/training-grobid-working/training-0.7.1/acl/pdf/2020.conll-1.26-v4.xml validates
Hi @kermitt2,
thanks for the quick reply. I've tested both v3 and v4, but got the same results.
I am able to produce a valid XML using the PDF you used (2020.conll-1.26) or using some others (e.g., 2021.acl-long.1), but validation fails on certain PDFs (e.g., arXiv:2112.11446, arXiv:2112.11176, 2021.emnlp-main.4).
For the non-validating PDFs, the error message is the same. For arXiv:2112.11446, it's:
$ xmllint --schema http://www.loc.gov/standards/alto/v3/alto.xsd ../example_pdfs/output/2112.11176.xml
../example_pdfs/output/2112.11176.xml:2: element TextLine: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v3#}TextLine': Missing child element(s). Expected is one of ( {http://www.loc.gov/standards/alto/ns-v3#}Shape, {http://www.loc.gov/standards/alto/ns-v3#}String ).
Thank you @e-tornike for providing some example cases!
With current master, validation succeeds for 2 of the 3 PDFs you indicated:
xmllint --schema http://www.loc.gov/standards/alto/v3/alto.xsd /media/lopez/data2/training-grobid-working/training-0.7.1/acl/pdf/2021.emnlp-main.4.xml
/media/lopez/data2/training-grobid-working/training-0.7.1/acl/pdf/2021.emnlp-main.4.xml validates
xmllint --schema http://www.loc.gov/standards/alto/v3/alto.xsd /media/lopez/data2/training-grobid-working/training-0.7.1/arxiv/pdf/2112.11176.xml
/media/lopez/data2/training-grobid-working/training-0.7.1/arxiv/pdf/2112.11176.xml validates
But I reproduced the problem for the third one (arXiv:2112.11446), with the same error:
/media/lopez/data2/training-grobid-working/training-0.7.1/arxiv/pdf/2112.11446.xml:2: element TextLine: Schemas validity error : Element '{http://www.loc.gov/standards/alto/ns-v3#}TextLine': Missing child element(s). Expected is one of ( {http://www.loc.gov/standards/alto/ns-v3#}Shape, {http://www.loc.gov/standards/alto/ns-v3#}String ).
Indeed, there is a case, related to line number identification, where an empty line element can be produced, so I made a quick fix in 8bb209c0c21476ee904ac2518b807828dd6db732 to avoid this.
All cases validate now, and this should solve the issue in general.
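To locate the offending elements in an ALTO file without running a full schema validation, one can scan for TextLine elements that have no child elements, which is the situation the error message above describes. This is only a rough sketch with the Python standard library; the function name empty_text_lines and the sample document are made up for illustration, and it does not replace validating against alto.xsd:

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the "{namespace}" prefix so the check works for any ALTO version."""
    return tag.rsplit("}", 1)[-1]


def empty_text_lines(xml_text: str) -> list[str]:
    """Return the IDs of TextLine elements that have no child elements.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_text)
    empty = []
    for elem in root.iter():
        if local_name(elem.tag) == "TextLine" and len(elem) == 0:
            empty.append(elem.get("ID", "<no ID>"))
    return empty


# Made-up sample: the second TextLine is empty and would fail validation.
sample = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine ID="l1"><String CONTENT="ok"/></TextLine>
    <TextLine ID="l2"/>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""
print(empty_text_lines(sample))
```

An empty result means no TextLine is completely empty; the file may still fail validation for other reasons.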
Thank you, @kermitt2, for the quick fix! I can verify that the XML files now validate. Cheers!
Hi there,
I seem to be getting invalid ALTO XML files.
I am using a forked version, which has a couple of changed flags in install_deps.sh, and installing via Docker. The Dockerfile is the following:
To reproduce:
Evaluating the validity fails (in Python):
I appreciate any help :) Cheers!