peterwilliams97 commented 4 years ago

This is a major update to the text extraction code that works with text arranged in columns.

extractor/text.go is now split across multiple text_*.go files.
The new design is summarised in the extractor README.

Here are new PDFs and text extraction reference files for extractor/text_test.go.

reference.zip
+ eu.page005.txt
+ [Productivity.page001.txt](https://github.com/unidoc/unipdf/files/4735832/Productivity.page001.txt)
+ we-dms.page001.txt
+ radar-eng.page002.txt
+ Nuance.page001.txt

pdfs.zip
+ eu.pdf
+ Productivity.pdf
+ we-dms.pdf
+ radar-eng.pdf
+ Nuance.pdf

You can also run pdf_extract_text.go to see the extraction. There is an updated version of this test here that makes it easier to test a corpus of PDFs.

peterwilliams97 commented 4 years ago

The PR is updated as discussed. Please let me know what I need to fix or discuss. I had to comment out an extraction test in creator_test.go to get testing to pass. https://github.com/peterwilliams97/unipdf/blob/columns/extractor/README.md explains how the code works at a high level.

codecov[bot] commented 4 years ago

Codecov Report

Merging #366 into development will increase coverage by 6.39%. The diff coverage is 86.77%.

@@               Coverage Diff               @@
##           development     #366      +/-   ##
===============================================
+ Coverage        56.28%   62.67%   +6.39%     
===============================================
  Files              239      248       +9     
  Lines            46261    47146     +885     
===============================================
+ Hits             26039    29550    +3511     
- Misses           16866    16920      +54     
+ Partials          3356      676    -2680

| Impacted Files | Coverage Δ | |
|---|---|---|
| internal/textencoding/simple.go | `89.60% <0.00%> (+6.39%)` | :arrow_up: |
| model/internal/fonts/ttfparser.go | `72.09% <0.00%> (+6.55%)` | :arrow_up: |
| extractor/text.go | `68.73% <75.75%> (+4.53%)` | :arrow_up: |
| extractor/text_bag.go | `79.38% <79.38%> (ø)` | |
| extractor/text_para.go | `80.76% <80.76%> (ø)` | |
| extractor/text_table.go | `86.87% <86.87%> (ø)` | |
| extractor/text_word.go | `87.77% <87.77%> (ø)` | |
| extractor/extractor.go | `57.14% <90.00%> (-26.20%)` | :arrow_down: |
| extractor/text_mark.go | `93.39% <93.39%> (ø)` | |
| extractor/text_page.go | `93.43% <93.43%> (ø)` | |

... and 173 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 54e9657...fe35826.

peterwilliams97 commented 4 years ago

extractor/text.go, line 101 at r6 (raw file):

Previously, gunnsth (Gunnsteinn Hall) wrote…

Should this use Trace? Or do you specifically want to look at this? Trace has a lot of stuff.

I have needed to look at the operators a few times while developing the columns code. Eventually this code will be known to be bug free, but I am not 100% sure that it is now.

peterwilliams97 commented 4 years ago

extractor/text.go, line 492 at r14 (raw file):

Previously, gunnsth (Gunnsteinn Hall) wrote…

If the font is not supported, is there anything that makes sense to do? We probably need to collect such cases and look at them.

This case doesn't happen.

peterwilliams97 commented 4 years ago

extractor/text_test.go, line 84 at r14 (raw file):

Previously, gunnsth (Gunnsteinn Hall) wrote…

Any reason for changing from -10 to -25?

It should always have been -25 to match the unrotated case. I can't recall why I set it to -10 for the old text extraction code. The new text extraction code correctly treats the -10 case as overlapping text and the test is expecting non-overlapping text.

peterwilliams97 commented 4 years ago

extractor/text_test.go, line 653 at r14 (raw file):

Previously, gunnsth (Gunnsteinn Hall) wrote…

Should we remove it, if it's corrupt?

Done

peterwilliams97 commented 4 years ago

extractor/text_utils.go, line 41 at r14 (raw file):

Previously, gunnsth (Gunnsteinn Hall) wrote…

Is the x-direction the same as the reading direction, and the y-direction the depth direction? Or is it purely x/y at this level?

This only gets used for table cell detection so it is x/y.

peterwilliams97 commented 4 years ago

model/internal/fonts/ttfparser.go, line 212 at r14 (raw file):

Previously, gunnsth (Gunnsteinn Hall) wrote…

check

Sorry. I don't understand that.

peterwilliams97 commented 4 years ago

extractor/text_mark.go, line 26 at r6 (raw file):

Previously, gunnsth (Gunnsteinn Hall) wrote…

Yes this can be useful

Done.

peterwilliams97 commented 4 years ago

extractor/text_test.go, line 84 at r14 (raw file):

Previously, peterwilliams97 (Peter Williams) wrote…

It should always have been -25 to match the unrotated case. I can't recall why I set it to -10 for the old text extraction code. The new text extraction code correctly treats the -10 case as overlapping text and the test is expecting non-overlapping text.

Done.

peterwilliams97 commented 4 years ago

extractor/text_test.go, line 602 at r14 (raw file):

Previously, gunnsth (Gunnsteinn Hall) wrote…

In that case, should we remove the commented-out test code?

Done.

peterwilliams97 commented 4 years ago

internal/textencoding/simple.go, line 58 at r14 (raw file):

Previously, gunnsth (Gunnsteinn Hall) wrote…

Needs to work with Go 1.12.

Done.

peterwilliams97 commented 4 years ago

model/const.go, line 24 at r14 (raw file):

Previously, gunnsth (Gunnsteinn Hall) wrote…

Needs to work with Go 1.12.

Done.

unidoc / unipdf

Text extraction code for columns. #366