Open mauvilsa opened 4 years ago
Hello, could you please share the pdfalto options you used to generate the xml file? Thanks.
@Aazhar no options. The exact command is in the first comment right after "Steps to reproduce".
I just came across this too. Tesseract (for example) produces alto v3.0 with a header that starts like this:
<?xml version="1.0" encoding="UTF-8"?>
<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v3# http://www.loc.gov/alto/v3/alto-3-0.xsd">
A pdfalto header I generated started:
<?xml version="1.0" encoding="UTF-8"?>
<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#"
I ran into this when trying to subsequently convert the alto to txt using alto-ocr-text by @cneud , who can perhaps assist here.
That in turn was to use ocreval by @eddieantonio, since it doesn't ingest alto data.
Thanks for all help and a super-useful tool
I have created a pull request (https://github.com/kermitt2/pdfalto/pull/72) with changes to allow generation of alto xmls that are valid according to the schema. With these changes the pdfalto tool by default would not generate a valid alto. For the xmls to be valid the option -blocks is required, i.e. call pdfalto -blocks file.pdf file_alto.xml.
I have tested the changes with many weird pdfs (the ones at https://github.com/mozilla/pdf.js/tree/master/test/pdfs) to make sure that many edge cases are taken care of.
The pdfalto tool has several options, and I guess many of them would make the xmls invalid due to other causes. In my opinion the tool should generate valid alto files by default, so if -blocks is not given the alto xml should have a single text block that includes all text lines. Other options that cause the xmls to be invalid should clearly state this in the help.
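To illustrate the "single text block" suggestion, here is a minimal sketch in Python. It is not pdfalto code: it assumes, purely for illustration, that the lines appear as TextLine elements directly under PrintSpace, and the function name and the block ID "block_1" are hypothetical.

```python
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v3#"


def wrap_lines_in_single_block(print_space, block_id="block_1"):
    """Move all direct TextLine children of print_space into one TextBlock.

    Hypothetical helper: the input layout (TextLine directly under
    PrintSpace) and the block ID are assumptions made for this sketch.
    Returns the new TextBlock, or None if there were no lines to wrap.
    """
    line_tag = f"{{{ALTO_NS}}}TextLine"
    lines = [child for child in list(print_space) if child.tag == line_tag]
    if not lines:
        return None
    # Put the new block where the first line used to be.
    position = list(print_space).index(lines[0])
    block = ET.Element(f"{{{ALTO_NS}}}TextBlock", {"ID": block_id})
    for line in lines:
        print_space.remove(line)
        block.append(line)
    print_space.insert(position, block)
    return block


# Tiny hand-written input, not real pdfalto output.
sample = ET.fromstring(
    f'<PrintSpace xmlns="{ALTO_NS}">'
    '<TextLine ID="l1"/><TextLine ID="l2"/>'
    "</PrintSpace>"
)
wrap_lines_in_single_block(sample)
print([child.tag.split("}")[1] for child in sample])
```

The sketch only shows the structural idea; a real fix would also need to give the block valid position and size attributes.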
Many thanks @mauvilsa for the issue and the PR !
I've started to review the options. You're absolutely right about the -blocks option, it does not make sense to have a default mode that does not validate with the alto schema. I think the best would be to simply remove the -blocks option and always output the block information. This is more consistent with the goal of the tool and there is no particular reason actually to produce a file without block information. It's probably the same for the -noText option.
@kermitt2 yes, basically the -blocks option is an adaptation for Grobid, but indeed pdfalto should produce a valid document, and then it would be up to the parsers to handle particularities. Regarding the PR, it looks fine to me; nevertheless, it would be nice to remove this option too, otherwise I'll open another PR. WDYT?
I am fine with merging https://github.com/kermitt2/pdfalto/pull/72 and having another pull request for the -blocks option.
So I plan the following:
[...] 0.3, because there were quite a few important fixes lately (in particular #54) and some additions like sub/superscript information. As we are changing the output format and the options here, it's better to move to a new working version.
[...] -blocks, -noText, etc., thanks @deepseek, it is helpful to know that -noText is used and needs to be kept)
[...] 0.4 version with the current open issues, in particular the reading order problems

I will try to manage the releases better, with release notes. For instance, we are in the middle of updating the binaries of pdfalto in GROBID and we are mixing different versions with different output depending on the platform, which is not good :D
I have also compiled a large set of PDF (including pathological ones and the ones you pointed @mauvilsa https://github.com/mozilla/pdf.js/tree/master/test/pdfs) for more systematic tests.
Again many thanks for your contributions!
@kermitt2 Thanks a lot for this really useful code. I notice that the xml output (release 0.4.0) is starting with this:
<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#" xsi:schemaLocation="http://www.loc.gov/standards/alto/v3/alto.xsd">
as the result of this command: pdfalto -noLineNumbers -noImage -noImageInline -readingOrder pg_0012.pdf 12.xml
That XML is not valid, and I need to add xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" to make it validate.
Am I missing anything?
Thanks again
Thank you @giancarlobi ! Indeed the xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" declaration is required for the xsi namespace (it looks obvious when I look at it now, but you know xml...).
> xmllint --schema http://www.loc.gov/standards/alto/v3/alto.xsd ~/tmp/in/020C2FBB3A8426B56756229614573FAB07D5F5AC.xml
...
/home/lopez/tmp/in/020C2FBB3A8426B56756229614573FAB07D5F5AC.xml validates
Fixed with commit 39545c932e60374e54b61f12745de4b71f63918d
Sorry for overlooking this!
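For anyone wanting to catch this class of problem before running a schema validator, here is a small sketch in Python (standard library only). It relies on the fact that a namespace-aware parser rejects an attribute whose prefix is never declared; the helper name is_well_formed is made up for this example, and the two headers are shortened to a single empty element.

```python
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v3#"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"


def is_well_formed(xml_text):
    """Return True if xml_text parses with a namespace-aware parser."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        # An undeclared prefix such as xsi: ends up here ("unbound prefix").
        return False


# Header as reported for release 0.4.0: xsi: is used but never declared.
without_xsi = (
    f'<alto xmlns="{ALTO_NS}" '
    'xsi:schemaLocation="http://www.loc.gov/standards/alto/v3/alto.xsd"/>'
)
# Same header with the xsi prefix bound to its namespace.
with_xsi = (
    f'<alto xmlns="{ALTO_NS}" xmlns:xsi="{XSI_NS}" '
    'xsi:schemaLocation="http://www.loc.gov/standards/alto/v3/alto.xsd"/>'
)

print(is_well_formed(without_xsi))  # False
print(is_well_formed(with_xsi))     # True
```

This only checks that the prefix is bound; it says nothing about whether the document validates against the alto schema.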
@kermitt2 Thanks to you and for this really useful code!!!
Hi @kermitt2, I'm trying to create ALTO XML files from born-digital PDF file to see if we can integrate them in our METS/ALTO viewer for digitized content. Unfortunately the ALTO XML I get is still not valid. This is what Altova XMLSpy says when trying to validate:
File 2020-07-01_01-00005-alto.xml is not valid.
Schema location value 'http://www.loc.gov/standards/alto/v3/alto.xsd' must contain pairs of URIs.
Error location: alto / @xsi:schemaLocation
Details
schemaLocation_pairs: Schema location value 'http://www.loc.gov/standards/alto/v3/alto.xsd' must contain pairs of URIs.
I did read the above comments, but I'm not sure what I'm doing wrong. This is the command I used (on Ubuntu): /pdfalto -noImage -noLineNumbers -f 1 -l 1 2020-07-01_01-00145.pdf
Thank you for your help!
Solved it by replacing
const char *ALTO_LOCATION = "http://www.loc.gov/standards/alto/v3/alto.xsd";
in src/ConstantsXML.cc
with
const char *ALTO_LOCATION = "http://www.loc.gov/standards/alto/ns-v3# http://www.loc.gov/standards/alto/v3/alto.xsd";
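The reason this works is that xsi:schemaLocation is a whitespace-separated list of (namespace URI, schema URL) pairs, so a single URL on its own is rejected. A rough Python check for that rule, using only the standard library, could look like the sketch below; schema_location_pairs and make_header are names invented for this example.

```python
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v3#"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"


def schema_location_pairs(xml_text):
    """Return (namespace, schema URL) pairs from the root's xsi:schemaLocation.

    Raises ValueError when the whitespace-separated list has an odd length.
    """
    root = ET.fromstring(xml_text)
    value = root.get(f"{{{XSI_NS}}}schemaLocation", "")
    tokens = value.split()
    if len(tokens) % 2 != 0:
        raise ValueError(f"schemaLocation must contain pairs of URIs: {value!r}")
    return list(zip(tokens[0::2], tokens[1::2]))


def make_header(location):
    """Build a minimal alto root element with the given schemaLocation value."""
    return (
        f'<alto xmlns="{ALTO_NS}" xmlns:xsi="{XSI_NS}" '
        f'xsi:schemaLocation="{location}"/>'
    )


old_value = "http://www.loc.gov/standards/alto/v3/alto.xsd"
new_value = ALTO_NS + " http://www.loc.gov/standards/alto/v3/alto.xsd"

print(schema_location_pairs(make_header(new_value)))
try:
    schema_location_pairs(make_header(old_value))
except ValueError as err:
    print(err)
```

With the replacement value the list has two tokens, the namespace followed by the schema URL, which is the shape XMLSpy asks for.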
I have cloned the repository, successfully compiled the pdfalto tool as instructed in the readme and processed a pdf file to get a few files as output, including an xml that appears to be an alto xml file. However, if I validate the generated xml against the alto schema, it fails due to multiple reasons.
This seems to be an important issue since the whole objective of the pdfalto tool is generating alto files.
Steps to reproduce:
Use the pdfalto tool to convert a pdf to an alto xml, i.e. pdfalto file.pdf file_alto.xml
wget https://raw.githubusercontent.com/kermitt2/pdfalto/master/schema/alto.xsd
xmllint --noout --schema alto.xsd file_alto.xml
There are several reasons for the xml to be invalid:
xmlns="http://www.loc.gov/standards/alto/v3/alto.xsd" however, it must be http://www.loc.gov/standards/alto/ns-v3#
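A quick way to see which namespace a generated file actually declares, without running the full schema validation, is to read it off the root element. The following is a minimal Python sketch using the standard library; root_namespace is a name chosen for this example, and the inputs are one-element stand-ins for a real file.

```python
import xml.etree.ElementTree as ET

EXPECTED_NS = "http://www.loc.gov/standards/alto/ns-v3#"


def root_namespace(xml_text):
    """Return the namespace URI of the root element, or "" if it has none."""
    tag = ET.fromstring(xml_text).tag
    # ElementTree spells qualified names as "{namespace}localname".
    if tag.startswith("{"):
        return tag[1:].split("}", 1)[0]
    return ""


wrong = '<alto xmlns="http://www.loc.gov/standards/alto/v3/alto.xsd"/>'
right = f'<alto xmlns="{EXPECTED_NS}"/>'

print(root_namespace(wrong) == EXPECTED_NS)  # False
print(root_namespace(right) == EXPECTED_NS)  # True
```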
For one example that I did, the specific errors are: