peterwilliams97 closed 4 years ago
The PR is updated as discussed. Please let me know what I need to fix or discuss. I had to comment out an extraction test in creator_test.go to get testing to pass. https://github.com/peterwilliams97/unipdf/blob/columns/extractor/README.md explains how the code works at a high level.
Merging #366 into development will increase coverage by 6.39%. The diff coverage is 86.77%.
@@ Coverage Diff @@
## development #366 +/- ##
===============================================
+ Coverage 56.28% 62.67% +6.39%
===============================================
Files 239 248 +9
Lines 46261 47146 +885
===============================================
+ Hits 26039 29550 +3511
- Misses 16866 16920 +54
+ Partials 3356 676 -2680
Impacted Files | Coverage Δ | |
---|---|---|
internal/textencoding/simple.go | 89.60% <0.00%> (+6.39%) | :arrow_up: |
model/internal/fonts/ttfparser.go | 72.09% <0.00%> (+6.55%) | :arrow_up: |
extractor/text.go | 68.73% <75.75%> (+4.53%) | :arrow_up: |
extractor/text_bag.go | 79.38% <79.38%> (ø) | |
extractor/text_para.go | 80.76% <80.76%> (ø) | |
extractor/text_table.go | 86.87% <86.87%> (ø) | |
extractor/text_word.go | 87.77% <87.77%> (ø) | |
extractor/extractor.go | 57.14% <90.00%> (-26.20%) | :arrow_down: |
extractor/text_mark.go | 93.39% <93.39%> (ø) | |
extractor/text_page.go | 93.43% <93.43%> (ø) | |
... and 173 more | | |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 54e9657...fe35826.
extractor/text.go, line 101 at r6 (raw file):
Should this use Trace, or do you specifically want to look at this? Trace has a lot of stuff.
I have needed to look at the operators a few times while developing the columns code. Eventually this code will be known to be bug free, but I am not 100% sure that it is now.
extractor/text.go, line 492 at r14 (raw file):
If the font is not supported, is there anything that makes sense to do? We probably need to collect such cases and look at them.
This case doesn't happen.
extractor/text_test.go, line 84 at r14 (raw file):
Any reason for changing from -10 to -25?
It should always have been -25 to match the unrotated case. I can't recall why I set it to -10 for the old text extraction code. The new text extraction code correctly treats the -10 case as overlapping text and the test is expecting non-overlapping text.
extractor/text_test.go, line 653 at r14 (raw file):
Should we remove it if it's corrupt?
Done
extractor/text_utils.go, line 41 at r14 (raw file):
Is the x-direction the reading direction and the y-direction the depth direction, or is this purely x/y at this level?
This only gets used for table cell detection so it is x/y.
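To make the "purely x/y" answer concrete, here is a minimal sketch of axis-aligned overlap tests of the kind a table cell detector might use. The `rect` type and the `overlapsX`/`overlapsY` helpers are hypothetical names chosen for illustration; they are not taken from extractor/text_utils.go.

```go
package main

import "fmt"

// rect is a hypothetical axis-aligned bounding box in page coordinates.
// (llx, lly) is the lower-left corner and (urx, ury) the upper-right corner.
type rect struct {
	llx, lly, urx, ury float64
}

// overlapsX reports whether a and b share any horizontal extent.
func overlapsX(a, b rect) bool {
	return a.llx <= b.urx && b.llx <= a.urx
}

// overlapsY reports whether a and b share any vertical extent.
func overlapsY(a, b rect) bool {
	return a.lly <= b.ury && b.lly <= a.ury
}

func main() {
	header := rect{10, 90, 50, 100}
	below := rect{12, 70, 48, 80}   // stacked under header: same column
	right := rect{60, 90, 100, 100} // beside header: same row

	fmt.Println(overlapsX(header, below), overlapsY(header, below)) // true false
	fmt.Println(overlapsX(header, right), overlapsY(header, right)) // false true
}
```

Because the tests use raw x and y rather than reading and depth directions, a sketch like this only makes sense for unrotated text.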
model/internal/fonts/ttfparser.go, line 212 at r14 (raw file):
check
Sorry. I don't understand that.
extractor/text_mark.go, line 26 at r6 (raw file):
Yes this can be useful
Done.
extractor/text_test.go, line 84 at r14 (raw file):
It should always have been -25 to match the unrotated case. I can't recall why I set it to -10 for the old text extraction code. The new text extraction code correctly treats the -10 case as overlapping text and the test is expecting non-overlapping text.
Done.
extractor/text_test.go, line 602 at r14 (raw file):
In that case, should we remove the commented-out test code?
Done.
internal/textencoding/simple.go, line 58 at r14 (raw file):
Needs to work with Go 1.12.
Done.
model/const.go, line 24 at r14 (raw file):
Needs to work with Go 1.12.
Done.
This is a major update to the text extraction code that works with text arranged in columns.
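As a rough illustration of what column-aware extraction has to do, the sketch below splits words into columns wherever a wide horizontal gap appears. This is a deliberately simplified stand-in, not the algorithm in this PR (the README linked above describes the real one); the `word` type, `splitColumns`, and the `minGap` threshold are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// word is a hypothetical extracted word with its horizontal extent on the page.
type word struct {
	text     string
	llx, urx float64
}

// splitColumns sorts words by left edge and starts a new column whenever the
// gap between the running right edge of the current column and the next word's
// left edge exceeds minGap.
func splitColumns(words []word, minGap float64) [][]word {
	if len(words) == 0 {
		return nil
	}
	sorted := append([]word(nil), words...) // copy so the input is not reordered
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].llx < sorted[j].llx })

	cols := [][]word{{sorted[0]}}
	right := sorted[0].urx
	for _, w := range sorted[1:] {
		if w.llx-right > minGap {
			cols = append(cols, nil) // wide gap: open a new column
			right = w.urx
		} else if w.urx > right {
			right = w.urx // extend the current column's right edge
		}
		last := len(cols) - 1
		cols[last] = append(cols[last], w)
	}
	return cols
}

func main() {
	words := []word{
		{"The", 10, 30}, {"quick", 34, 70}, {"fox", 11, 28},
		{"jumps", 120, 160}, {"over", 164, 190},
	}
	for i, col := range splitColumns(words, 20) {
		fmt.Printf("column %d: %d words\n", i, len(col))
	}
}
```

A real extractor would then order the words within each column by line and reading direction; this sketch stops at the column split.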
Here are new PDFs and text extraction reference files for extractor/text_test.go.
reference.zip + eu.page005.txt + [Productivity.page001.txt](https://github.com/unidoc/unipdf/files/4735832/Productivity.page001.txt) + we-dms.page001.txt + radar-eng.page002.txt + Nuance.page001.txt
pdfs.zip + eu.pdf + Productivity.pdf + we-dms.pdf + radar-eng.pdf + Nuance.pdf
You can also run pdf_extract_text.go to see the extraction. There is an updated version of this test here that makes it easier to test a corpus of PDFs.