Open lfoppiano opened 2 days ago
I'm delving into this issue, looking at
it seems that tables are validated before being post-processed; however, tables that fail validation are neither marked nor handled in any way. I wonder whether they should simply be marked as paragraphs and returned as fulltext. 🤔
My plan is to take those tables and reset their classification to fulltext. It's probably better to have a mangled table inside a `<paragraph>` than to have text missing from the output 🤔
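A rough sketch of that plan, in Python for brevity and not taken from the project's code. It assumes the labelled sequence is a list of `(text, label)` pairs and that segment starts carry the `I-` prefix. The names `relabel_invalid_tables` and `is_valid_table` are hypothetical, and the validator is passed in by the caller because the actual validation rule is not shown here.

```python
from typing import Callable, List, Tuple

Token = Tuple[str, str]  # (text, label), e.g. ("Table", "I-<table>")


def base_label(label: str) -> str:
    """Strip the "I-" prefix that marks the first token of a segment."""
    return label[2:] if label.startswith("I-") else label


def relabel_invalid_tables(
    tokens: List[Token],
    is_valid_table: Callable[[List[str]], bool],  # hypothetical validator
) -> List[Token]:
    """Turn <table> segments that fail validation into <paragraph> segments."""
    out: List[Token] = []
    i = 0
    while i < len(tokens):
        if base_label(tokens[i][1]) != "<table>":
            out.append(tokens[i])
            i += 1
            continue
        # Collect one contiguous <table> segment; a new "I-<table>" ends it.
        j = i + 1
        while j < len(tokens) and tokens[j][1] == "<table>":
            j += 1
        segment = tokens[i:j]
        if is_valid_table([text for text, _ in segment]):
            out.extend(segment)
        else:
            # Keep every token, only the label changes, so no text is lost.
            for k, (text, _) in enumerate(segment):
                out.append((text, "I-<paragraph>" if k == 0 else "<paragraph>"))
        i = j
    return out
```

The point of the design is that rejected tables stay in the sequence under a different label instead of being dropped, so the output always contains the same text as the input.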
Regarding this issue, another worrying thing is that `<table>` does not have the initial `I-` prefix:
( ( ( ( ( ( ( ( ( ( BLOCKIN LINEIN LINEINDENT SAMEFONT SAMEFONTSIZE 0 0 ALLCAP NODIGIT 1 OPENBRACKET 9 2 0 NUMBER 0 0 I-<citation_marker>
1 1 1 1 1 1 1 1 1 1 BLOCKIN LINEIN LINEINDENT SAMEFONT SAMEFONTSIZE 0 0 NOCAPS ALLDIGIT 1 NOPUNCT 9 2 0 NUMBER 1 0 <table>
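To spot this kind of sequence automatically, a small check along these lines might help. It is only a sketch in Python, and it assumes every segment is supposed to open with an `I-` prefixed label, as the first line of the excerpt above does. The function name `missing_start_prefix` is hypothetical.

```python
from typing import List, Optional


def base_label(label: str) -> str:
    """Strip the "I-" prefix that marks the first token of a segment."""
    return label[2:] if label.startswith("I-") else label


def missing_start_prefix(labels: List[str]) -> List[int]:
    """Return the indices where a new label begins without its "I-" prefix."""
    problems: List[int] = []
    previous: Optional[str] = None
    for index, label in enumerate(labels):
        starts_segment = label.startswith("I-")
        # A label that differs from the previous one must open a new segment.
        if not starts_segment and base_label(label) != previous:
            problems.append(index)
        previous = base_label(label)
    return problems
```

Applied to the two labels in the excerpt, `I-<citation_marker>` followed by `<table>`, it flags the second position.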
A few cases of text disappearing from the fulltext have been reported to me.
I've identified two issues related to figures and tables.
In the first case, paragraphs are misclassified as tables by the `fulltext` model. Subsequently, the table model classifies all the text as `<content>`, and the incomplete table is then tossed away.
I wonder whether it would be possible to detect false-positive tables by their related classes and convert them to `<paragraph>`.
PDF: pub.1158465915.pdf
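One possible form of the false-positive check, sketched in Python. It rests on an assumption of mine that would need checking against real data: a genuine table gets at least one table-model label other than `<content>`, so a candidate labelled only `<content>` is suspicious. The function name is hypothetical, and `<figure_head>` in the usage example only stands in for "some other table-model class".

```python
from typing import Iterable


def looks_like_false_positive_table(table_labels: Iterable[str]) -> bool:
    """True when the table model produced nothing but <content> labels.

    Assumption: a real table also yields some other class. An empty
    candidate is treated as a false positive too.
    """
    seen = {
        label[2:] if label.startswith("I-") else label
        for label in table_labels
    }
    return seen <= {"<content>"}
```

For example, `looks_like_false_positive_table(["I-<content>", "<content>"])` is true, while a candidate that also carries a (hypothetical) `<figure_head>` label is not flagged. Candidates flagged this way could then be converted to `<paragraph>` rather than discarded.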