Open gjreda opened 1 year ago
I debugged this and I'm reporting what I found, although I'm not sure what the proper way to deal with it is. An unlucky combination of factors makes this PDF fail.
1) Part of the text on page 2 is tagged incorrectly ("3. We suggest ....." at the beginning of the second column). It's really bad luck, because from the text's point of view it looks as if two notes come in sequence at the bottom of the page:
2. We 2. 2 2. 2. 2. BLOCKIN PAGEIN SAMEFONT SAMEFONTSIZE 0 0 ALLCAP CONTAINSDIGITS 0 0 0 0 0 0 0 0 0 4 . 1 8 0 1 0 0 1 <body>
FACTSCORE with factscore F FA FAC FACT BLOCKIN PAGEIN SAMEFONT SAMEFONTSIZE 0 0 ALLCAP NODIGIT 0 0 0 0 0 0 0 0 0 4 ,- 2 8 0 1 0 0 1 <body>
lowing evaluation lowing l lo low lowi BLOCKIN PAGEIN SAMEFONT SAMEFONTSIZE 0 0 NOCAPS NODIGIT 0 0 1 0 0 0 0 0 0 4 no 0 8 0 1 0 0 1 <body>
without manual without w wi wit with BLOCKEND PAGEIN SAMEFONT SAMEFONTSIZE 0 0 NOCAPS NODIGIT 0 0 1 0 0 0 0 0 0 4 . 1 6 0 1 0 0 1 <body>
2 perplexity.ai 2 2 2 2 2 BLOCKSTART PAGEIN SAMEFONT LOWERFONT 0 0 NOCAPS ALLDIGIT 1 0 0 0 0 0 0 0 0 5 . 1 10 0 1 0 0 1 I-<footnote>
3. We 3. 3 3. 3. 3. BLOCKSTART PAGEIN SAMEFONT HIGHERFONT 0 0 ALLCAP CONTAINSDIGITS 0 0 0 0 0 0 0 0 0 5 . 1 10 0 0 0 0 1 <footnote>
for a for f fo for for BLOCKIN PAGEIN SAMEFONT SAMEFONTSIZE 0 0 NOCAPS NODIGIT 0 0 1 0 0 0 0 0 0 5 (..,- 5 9 0 0 0 0 1 <footnote>
ended generation) ended e en end ende BLOCKIN PAGEIN SAMEFONT SAMEFONTSIZE 0 0 NOCAPS NODIGIT 0 0 1 0 0 0 0 0 0 5 ) 1 9 0 0 0 0 1 <footnote>
estimator. estimator. estimator. e es est esti BLOCKEND PAGEIN SAMEFONT SAMEFONTSIZE 0 0 NOCAPS NODIGIT 0 0 0 0 0 0 0 0 0 5 . 1 2 0 0 0 0 1 <footnote>
2 Related 2 2 2 2 2 BLOCKSTART PAGEIN NEWFONT HIGHERFONT 0 0 NOCAPS ALLDIGIT 1 0 0 0 0 0 0 0 0 6 no 0 10 0 0 0 0 1 I-<body>
The other footnote on page 2 is correctly recognised, but has the same label, 3:
that we that t th tha that BLOCKIN PAGEIN SAMEFONT SAMEFONTSIZE 0 0 NOCAPS NODIGIT 0 0 1 0 0 0 0 0 1 9 -- 2 8 0 1 0 0 1 <body>
rather than rather r ra rat rath BLOCKEND PAGEIN SAMEFONT SAMEFONTSIZE 0 0 NOCAPS NODIGIT 0 0 1 0 0 0 0 0 1 9 --. 3 9 0 1 0 0 1 <body>
3 Consisting 3 3 3 3 3 BLOCKSTART PAGEIN NEWFONT LOWERFONT 0 0 NOCAPS ALLDIGIT 1 0 0 0 0 0 0 0 1 11 -(), 4 9 0 1 0 0 1 I-<footnote>
18-29 in 18-29 1 18 18- 18-2 BLOCKEND PAGEEND SAMEFONT SAMEFONTSIZE 0 0 ALLCAP CONTAINSDIGITS 0 0 0 0 0 0 0 0 1 11 -.().(). 8 10 0 1 0 0 1 <footnote>
Model-based Evaluation. model-based M Mo Mod Mode BLOCKSTART PAGESTART NEWFONT HIGHERFONT 0 0 INITCAP NODIGIT 0 0 1 0 0 0 0 0 1 0 -. 2 8 0 0 0 0 1 I-<body>
learned models learned l le lea lear BLOCKIN PAGEIN SAMEFONT SAMEFONTSIZE 0 0 NOCAPS NODIGIT 0 0 1 0 0 0 0 0 1 0 no 0 8 0 0 0 0 1 <body>
for (Note note : notesSamePage) {
    Optional<LayoutToken> matching = clusterTokens
        .stream()
        .filter(t -> t.getText().equals(note.getLabel()) && t.isSuperscript())
        .findFirst();
This uses the label of the first note, which has the value "3", to match the layout token text, and it matches the right anchor in the text:
The same then happens with the second label (also 3), so the label ends up matching two different notes. We then get two intervals referring to the same note.
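One possible way to avoid the duplicate match, as a minimal sketch and not the actual Grobid fix: remember which tokens have already been claimed as an anchor and skip them when matching the next note. The `Note` and `Token` records and the `matchAnchors` method below are hypothetical stand-ins for illustration, not Grobid's real classes.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class NoteAnchorMatcher {

    // Hypothetical simplified types; Grobid's own Note and LayoutToken differ.
    record Note(String id, String label) {}
    record Token(int position, String text, boolean superscript) {}

    // Maps each note id to the position of its anchor token.
    // A token that already anchors one note is never reused for another,
    // so two notes sharing a label cannot produce two intervals on the same token.
    static Map<String, Integer> matchAnchors(List<Note> notes, List<Token> tokens) {
        Map<String, Integer> anchors = new LinkedHashMap<>();
        Set<Integer> claimed = new HashSet<>();
        for (Note note : notes) {
            tokens.stream()
                .filter(t -> t.superscript() && t.text().equals(note.label()))
                .filter(t -> !claimed.contains(t.position()))
                .findFirst()
                .ifPresent(t -> {
                    claimed.add(t.position());
                    anchors.put(note.id(), t.position());
                });
        }
        return anchors;
    }

    public static void main(String[] args) {
        // Two notes share the label "3", but only one superscript "3" exists in the text.
        List<Note> notes = List.of(new Note("n1", "3"), new Note("n2", "3"));
        List<Token> tokens = List.of(
            new Token(5, "3", false),   // plain digit, not an anchor
            new Token(12, "3", true));  // the only real anchor
        System.out.println(matchAnchors(notes, tokens));
    }
}
```

With this input only the first note gets an anchor and the second is left unmatched, which seems preferable to two overlapping intervals, although the mislabelled note itself would still be wrong.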
Thank you for looking into this @lfoppiano! Sounds like it's just a poorly structured PDF and there's not much that Grobid can do.
(re-opening to try to fix the exception properly in this case)
I'm running Grobid via the lfoppiano/grobid:0.7.3-arm docker container on an M1 MacBook Air with macOS 13.3.1.
When trying to run Grobid against a particular PDF, I receive an unexpected exception due to an IllegalArgumentException.
Is there something particular about this PDF that causes Grobid to fail? I'd love to understand the exception more so I can watch out for similar ones within the project I'm working on.
Steps to reproduce
Download this PDF and, using the python client, call the Grobid server:
It seems pdfalto_server runs correctly against the PDF and creates the appropriate lxml files:
Below is the stacktrace that gets returned to the python client.