mkl-public / testarea-pdfbox2

Test area for public PDFBox v2 issues on stackoverflow etc
Apache License 2.0

How to extract all the lines in PDF? #7

Closed tendrillion closed 3 years ago

tendrillion commented 3 years ago

Not all lines are extracted

PDFBox 2.0.21, Linux 18.04, Java: jdk1.8.0

test file: src/test/resources/mkl/testarea/pdfbox2/extract/demo.pdf
test code: src/test/java/mkl/testarea/pdfbox2/extract/ExtractLinesWithDir.java

[Figure: demo]

  1. This method extracts some lines, but not all of them. In the figure, the green lines are the extracted ones, and the red "?" marks indicate lines that were not extracted.
  2. How to extract all the lines in PDF?

mkl-public commented 3 years ago

Please be aware that the code in this repository, in particular the code in unit tests in the src/test folder, targets specific problems, mostly formulated in Stack Overflow questions. Thus, the code in the tests does not represent finished implementations of a task for generic inputs. It only fixes (or tries to fix) the specific issue the original poster has, often for specific inputs. Usually this is also reflected in the JavaDoc comments.

For example the ExtractLinesWithDir test class focusses on the question "PDFBox - Line / Rectangle extraction". Here the asker uses a specific class (the LineCatcher posted by @Tilman Hausherr in this answer to a different question) and has issues with the coordinates returned by that class.

My improved version of the OP's code, the test method testExtractLineRotationTestWithDir in the ExtractLinesWithDir class, shows how to interpret the coordinates returned by that LineCatcher class so that they also account for page rotation and for the OP's preferred coordinate system. Compare the method comment: this method attempts to extract the coordinates as the OP wants it, i.e. the coordinates of the top left point of the line bounding boxes on the rotated page, the origin in the upper left corner of the page.

But I had not checked how effective the LineCatcher class is at catching all lines, let alone improved that effectiveness. At first glance it becomes clear that the class only looks for lines drawn by stroking a path, and then returns the bounding box of that path.

In the context of the file demo.pdf you applied the code to, though, some lines are not drawn by stroking a path but instead by filling a path, a long and slim rectangle. For example the top line in demo.pdf is a filled rectangle at (98.318, 1514.968), 1020.599 units wide, 0.266 units high.

You can extend the LineCatcher class to also catch such lines by adding the bounding box of the linePath to the rectList in fillPath and fillAndStrokePath, like it's already done in strokePath. But beware, you will only want to collect such paths if they indeed only fill a line, i.e. something long in one direction and slim in the other.
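The "long and slim" check could look roughly like the following sketch. It is not part of the LineCatcher class or of PDFBox: the class name LineLikeFilter, its method names, and both thresholds are assumptions made up for illustration. The idea is to call addIfLineLike from fillPath and fillAndStrokePath with the bounding box of the linePath, instead of adding that box unconditionally.

```java
import java.awt.geom.Rectangle2D;
import java.util.List;

/** Hypothetical helper; neither PDFBox nor the original LineCatcher contains it. */
class LineLikeFilter {
    // Assumed thresholds, to be tuned for the documents at hand.
    static final double MAX_THICKNESS = 2.0;      // in user space units
    static final double MIN_ASPECT_RATIO = 10.0;  // length must be at least 10x the thickness

    /** Returns true if the box is long in one direction and slim in the other. */
    static boolean isLineLike(Rectangle2D box) {
        double thickness = Math.min(box.getWidth(), box.getHeight());
        double length = Math.max(box.getWidth(), box.getHeight());
        return length > 0
                && thickness <= MAX_THICKNESS
                && length >= MIN_ASPECT_RATIO * thickness;
    }

    /** Adds the box to the list only if it passes the line-like check. */
    static void addIfLineLike(Rectangle2D box, List<Rectangle2D> rectList) {
        if (isLineLike(box)) {
            rectList.add(box);
        }
    }
}
```

With these assumed thresholds, the filled rectangle at the top of demo.pdf (1020.599 units wide, 0.266 units high) passes the check, while a filled square or a large filled background area does not.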

Additionally lines in PDFs may be drawn using other techniques, too. E.g. a bitmap containing a single pixel might be drawn as a line by stretching it using the current transformation matrix. Or there might be an actual bitmap of a line drawn regularly. This bitmap does not even need to be slim as there might be empty space around the line in the bitmap. So those options also have to be considered. And there surely are more funny ways to draw lines used by some people.
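For the stretched-bitmap case, a rough sketch of how one might find the area an image covers: a PDF image is painted into the unit square, which the current transformation matrix maps onto the page. The class and method names below are hypothetical. The sketch also assumes that the matrix is already available as a java.awt.geom.AffineTransform, e.g. obtained from the graphics state in a drawImage override. It does not help with bitmaps that have empty space around the line.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;

/** Hypothetical helper; the names are made up for illustration. */
class StretchedImageBox {
    /**
     * Maps the unit square, into which PDF images are painted, through the
     * given transformation and returns the bounding box of the result.
     */
    static Rectangle2D imageBox(AffineTransform ctm) {
        Rectangle2D unitSquare = new Rectangle2D.Double(0, 0, 1, 1);
        return ctm.createTransformedShape(unitSquare).getBounds2D();
    }
}
```

The resulting box can then be tested for being long and slim just like the bounding box of a filled path.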

Concerning "How to extract all the lines in PDF", therefore, you'll have some work ahead of you, adding all such options to the code. If you need to catch all kinds of lines and at the same time ignore invisible ones (e.g. lines drawn white on white), you probably should render the PDF pages as bitmaps and apply image analysis to those bitmaps.
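To give an idea of the image analysis route, here is a deliberately naive sketch that looks for long horizontal runs of dark pixels in a rendered page. It assumes the page has already been rendered to a BufferedImage (in PDFBox 2.x, for example, via PDFRenderer.renderImageWithDPI). The class name HorizontalRunFinder, the brightness threshold of 128, and the result format are assumptions for illustration. Real-world code would more likely use an image processing library.

```java
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, naive detector for horizontal lines in a rendered page. */
class HorizontalRunFinder {
    /** Treats a pixel as dark if its average RGB brightness is below 128 (assumed threshold). */
    static boolean isDark(int rgb) {
        int r = (rgb >> 16) & 0xFF;
        int g = (rgb >> 8) & 0xFF;
        int b = rgb & 0xFF;
        return (r + g + b) / 3 < 128;
    }

    /**
     * Scans each pixel row and returns every run of dark pixels that is at
     * least minLength pixels long, as {y, xStart, xEnd} with xEnd inclusive.
     */
    static List<int[]> findRuns(BufferedImage image, int minLength) {
        List<int[]> runs = new ArrayList<>();
        int width = image.getWidth();
        for (int y = 0; y < image.getHeight(); y++) {
            int start = -1;
            // x == width acts as a sentinel that closes a run reaching the right edge
            for (int x = 0; x <= width; x++) {
                boolean dark = x < width && isDark(image.getRGB(x, y));
                if (dark && start < 0) {
                    start = x;
                } else if (!dark && start >= 0) {
                    if (x - start >= minLength) {
                        runs.add(new int[] {y, start, x - 1});
                    }
                    start = -1;
                }
            }
        }
        return runs;
    }
}
```

A line several pixels thick yields one run per pixel row, so adjacent runs would still have to be merged, and vertical lines need the same scan along columns. Text and other dark content can produce false positives, which is why minLength should be chosen generously. A line drawn white on white never shows up as dark pixels and is therefore ignored.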

If you have more specific questions, e.g. how to extract a specific type of line from a specific PDF, please ask on Stack Overflow, preferably referring to the original question or this issue here, and do share the PDF in question.