Closed: tendrillion closed this issue 3 years ago
Please be aware that the code in this repository, in particular the code in the unit tests in the src/test folder, targets specific problems, mostly formulated in Stack Overflow questions. Thus, the code in the tests does not represent finished implementations of a task for generic inputs; it only fixes (or tries to fix) the specific issue the original poster has, often for specific inputs. Usually this is also reflected in the JavaDoc comments.
For example, the ExtractLinesWithDir test class focuses on the question "PDFBox - Line / Rectangle extraction". Here the asker uses a specific class (the LineCatcher posted by @Tilman Hausherr in this answer to a different question) and has issues with the coordinates returned by that class.
My improved version of the OP's code, in the ExtractLinesWithDir test method testExtractLineRotationTestWithDir, shows how to interpret the coordinates returned by that LineCatcher class so that they also cope with page rotation and the OP's preferred coordinate system (compare the method comment: this method attempts to extract the coordinates as the OP wants them, i.e. the coordinates of the top left point of the line bounding boxes on the rotated page, with the origin in the upper left corner of the page).
But I had not checked how effective the LineCatcher class is at catching all lines, let alone improved that effectiveness. At first glance it becomes clear that the class only looks for lines drawn by stroking a path, and then returns the bounding box of that path.
In the file demo.pdf you applied the code to, though, some lines are not drawn by stroking a path but by filling a path, namely a long and slim rectangle. For example, the top line in demo.pdf is a filled rectangle at (98.318, 1514.968), 1020.599 units wide and 0.266 units high.
You can extend the LineCatcher class to also catch such lines by adding the bounding box of the linePath to the rectList in fillPath and fillAndStrokePath, as is already done in strokePath. But beware, you will only want to collect such paths if they indeed only fill a line, i.e. something long in one direction and slim in the other.
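The "long in one direction and slim in the other" check could be sketched as a small helper like the one below. This is only an illustration: LineLikeFilter, isLineLike, and both thresholds are hypothetical names and values, not part of PDFBox or of the LineCatcher class. The idea would be to call it in fillPath and fillAndStrokePath on the bounding box of linePath (assuming linePath is a java.awt.geom.GeneralPath, so that getBounds2D() is available) before adding that box to rectList.

```java
import java.awt.geom.Rectangle2D;

/** Hypothetical helper; not part of PDFBox or of the LineCatcher class. */
final class LineLikeFilter {
    private LineLikeFilter() {}

    /**
     * Returns true if the box is slim in one direction and long in the other.
     * maxThickness (in user space units) and minAspectRatio are illustrative
     * thresholds that need tuning for the documents at hand.
     */
    static boolean isLineLike(Rectangle2D bounds, double maxThickness, double minAspectRatio) {
        double shortSide = Math.min(bounds.getWidth(), bounds.getHeight());
        double longSide = Math.max(bounds.getWidth(), bounds.getHeight());
        if (longSide <= 0) {
            return false; // empty box, nothing to report
        }
        if (shortSide > maxThickness) {
            return false; // too thick to count as a line
        }
        // A box with zero thickness is accepted as a degenerate line.
        return shortSide == 0 || longSide / shortSide >= minAspectRatio;
    }
}
```

With thresholds such as a maximum thickness of 2 units and a minimum aspect ratio of 20, the filled rectangle forming the top line of demo.pdf (1020.599 by 0.266 units) would pass, while a filled square or a small filled dot would not.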
Additionally lines in PDFs may be drawn using other techniques, too. E.g. a bitmap containing a single pixel might be drawn as a line by stretching it using the current transformation matrix. Or there might be an actual bitmap of a line drawn regularly. This bitmap does not even need to be slim as there might be empty space around the line in the bitmap. So those options also have to be considered. And there surely are more funny ways to draw lines used by some people.
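For the stretched bitmap case, a rough sketch of how the drawn area could be determined: a PDF image is painted into the unit square of its own coordinate space, which the current transformation matrix maps onto the page. The helper below (ImageBounds and ofUnitSquare are hypothetical names) computes where that unit square ends up. It assumes you have already obtained the current transformation matrix as a java.awt.geom.AffineTransform, for example in the stream engine's image drawing callback; that wiring is not shown here.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;

/** Hypothetical helper; not part of PDFBox or of the LineCatcher class. */
final class ImageBounds {
    private ImageBounds() {}

    /** Bounding box of the unit square (0,0)-(1,1) after applying the given matrix. */
    static Rectangle2D ofUnitSquare(AffineTransform ctm) {
        Rectangle2D unitSquare = new Rectangle2D.Double(0, 0, 1, 1);
        return ctm.createTransformedShape(unitSquare).getBounds2D();
    }
}
```

The resulting box can then be put through the same long-and-slim test as a filled path. This only covers the single-pixel case; a bitmap with empty space around the line would still need its pixel content inspected.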
Concerning "How to extract all the lines in PDF", therefore, you'll have some work ahead of you, adding all such options to the code. If you need to catch all kinds of lines and at the same time ignore invisible ones (e.g. lines drawn white on white), you probably should render the PDF pages as bitmaps and apply image analysis to those bitmaps.
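To give an idea of what such image analysis might look like at its very simplest, here is a sketch that scans a rendered page for long horizontal runs of dark pixels. HorizontalRunFinder, its parameters, and the luminance threshold are hypothetical, and a real solution would more likely use a proper line detection technique. The input image is assumed to come from rendering a page, e.g. with PDFBox's PDFRenderer.renderImageWithDPI; that step is omitted so the example stays self-contained.

```java
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper illustrating a naive scan for horizontal lines. */
final class HorizontalRunFinder {
    private HorizontalRunFinder() {}

    /** Returns {x, y, length} for every horizontal run of dark pixels of at least minLength. */
    static List<int[]> find(BufferedImage img, int minLength, int darkThreshold) {
        List<int[]> runs = new ArrayList<>();
        for (int y = 0; y < img.getHeight(); y++) {
            int start = -1; // start column of the current dark run, -1 if none
            // Iterate one column past the edge so a run touching the border is closed.
            for (int x = 0; x <= img.getWidth(); x++) {
                boolean dark = x < img.getWidth() && luminance(img.getRGB(x, y)) < darkThreshold;
                if (dark && start < 0) {
                    start = x;
                } else if (!dark && start >= 0) {
                    if (x - start >= minLength) {
                        runs.add(new int[] {start, y, x - start});
                    }
                    start = -1;
                }
            }
        }
        return runs;
    }

    /** Approximate perceived brightness of an RGB pixel, 0 (black) to 255 (white). */
    private static int luminance(int rgb) {
        int r = (rgb >> 16) & 0xFF;
        int g = (rgb >> 8) & 0xFF;
        int b = rgb & 0xFF;
        return (r * 299 + g * 587 + b * 114) / 1000;
    }
}
```

Because this works on the rendered pixels, a white line on a white background simply does not show up. On the other hand, a line thicker than one pixel yields one run per pixel row, so adjacent runs would still have to be merged, and vertical or slanted lines need a separate pass.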
If you have more specific questions, e.g. how to extract a specific type of line from a specific PDF, please ask on Stack Overflow, probably referring to the original question or this issue here, and do share the PDF in question.
Not all lines are extracted
pdfbox 2.0.21, linux 18.04, java: jdk1.8.0
test file: src/test/resources/mkl/testarea/pdfbox2/extract/demo.pdf
test code: src/test/java/mkl/testarea/pdfbox2/extract/ExtractLinesWithDir.java