Closed lfoppiano closed 3 years ago
Here is another case. Four figures are correctly identified by the fulltext parser, but only one appears in the figure training data:
One thing I forgot to mention: to reproduce it on the data I provided, you need to run it from the branch features/add-training-data
because otherwise the segmentation and fulltext models on master do not extract the figure correctly.
FYI I just merged the branch features/add-training-data into bugfix/fix-figure-table-training-data-generation so that the issue can be reproduced.
The same issue occurs for tables; the PR attempts to fix both cases.
I'm working on adding some more training data. At the moment I'm focusing on fulltext and segmentation, but I'm targeting figure and table segmentation as well.
PR05514733-CC.pdf
In this case, although the fulltext model correctly recognises the figures, no training data for the figure model is produced in the output.
I've started digging and it seems the problem comes from FulltextParser.processTrainingDataFigures: if there is only one figure (one I-<figure> block), e.g. when a line such as
"vs\tvs\tv\tvs\tvs\tvs\ts\tvs\tvs\tvs\tBLOCKIN\tLINEIN\tLINEINDENT\tSAMEFONT\tSAMEFONTSIZE\t0\t0\tNOCAPS\tNODIGIT\t0\tNOPUNCT\t10\t3\t0\tNUMBER\t0\t0\t<figure>\n" +
does not have the I- prefix in the label, the output is empty.
Here is a test that tries to reproduce the problem:
And the result is quite strange: if there are <figure> blocks, then the output is:
which does not look correct. It could be that I did not re-create a real case (I took an excerpt from the figure in the PDF I'm attaching).
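To make the role of the I- prefix concrete, here is a minimal sketch of how grouping labelled lines into figure blocks can silently drop a block whose first line is labelled <figure> instead of I-<figure>. This is not the code of FulltextParser.processTrainingDataFigures: the class FigureBlockGrouper and its methods are hypothetical names, and the assumption that the label is the last tab-separated field is taken only from the example line above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of GROBID.
class FigureBlockGrouper {

    // Assumption: the label is the last tab-separated field of a feature line.
    static String labelOf(String line) {
        String trimmed = line.trim();
        int idx = trimmed.lastIndexOf('\t');
        return idx < 0 ? trimmed : trimmed.substring(idx + 1);
    }

    // Strict grouping: only an "I-<figure>" line can open a block, so
    // "<figure>" lines seen before any "I-<figure>" line are dropped.
    static List<List<String>> groupStrict(List<String> lines) {
        return group(lines, false);
    }

    // Tolerant grouping: a "<figure>" line with no open block also opens one.
    static List<List<String>> groupTolerant(List<String> lines) {
        return group(lines, true);
    }

    private static List<List<String>> group(List<String> lines, boolean tolerant) {
        List<List<String>> blocks = new ArrayList<>();
        List<String> current = null;
        for (String line : lines) {
            String label = labelOf(line);
            if (label.equals("I-<figure>")) {
                // Explicit block start.
                current = new ArrayList<>();
                blocks.add(current);
                current.add(line);
            } else if (label.equals("<figure>")) {
                if (current == null && tolerant) {
                    // Continuation label without a preceding start: open a block anyway.
                    current = new ArrayList<>();
                    blocks.add(current);
                }
                if (current != null) {
                    current.add(line);
                }
            } else {
                // Any other label closes the current block.
                current = null;
            }
        }
        return blocks;
    }
}
```

With an input whose figure lines all carry the plain <figure> label, groupStrict returns no blocks at all, which matches the empty output described above, while groupTolerant returns a single block. Whether the real method should tolerate a missing I- prefix, or whether the labels should be fixed upstream, is a separate design question.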