tabulapdf / tabula-java

Extract tables from PDF files
MIT License
1.85k stars 429 forks

Extraction of tables might include digital watermark #517

Open skwskwskwskw opened 1 year ago

skwskwskwskw commented 1 year ago

I am working with a PDF file where the extracted table may include a watermark. The watermark can appear at different locations. I am considering two approaches, but I am not sure how to implement either:

  1. Don't extract words that are rotated.
  2. Exclude the watermark by its absolute location as seen in the PDF. However, tabula reports the watermark at a different location than where it appears on the page.

The watermark looks like this (the number that is rotated):

image
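
Approach 1 could look roughly like the sketch below. It is only an illustration and not tabula-java code: `Glyph`, `rotationDegrees`, and `keepUpright` are hypothetical names, and the `a` and `b` values stand for the first two entries of a text chunk's PDF text matrix (`a = s*cos(theta)`, `b = s*sin(theta)` for text scaled by `s` and rotated by `theta`). How to obtain those values from the PDF library is left out and would need to be checked against that library's API.

```java
import java.util.ArrayList;
import java.util.List;

public class RotatedTextFilter {

    /** Hypothetical stand-in for one extracted text chunk and part of its text matrix. */
    static final class Glyph {
        final String text;
        final double a; // matrix entry a = scale * cos(theta)
        final double b; // matrix entry b = scale * sin(theta)

        Glyph(String text, double a, double b) {
            this.text = text;
            this.a = a;
            this.b = b;
        }
    }

    /** Rotation of the chunk in degrees, in the range (-180, 180]. */
    static double rotationDegrees(Glyph g) {
        return Math.toDegrees(Math.atan2(g.b, g.a));
    }

    /** Keeps only chunks whose rotation is within the tolerance of 0 degrees. */
    static List<Glyph> keepUpright(List<Glyph> glyphs, double toleranceDegrees) {
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphs) {
            if (Math.abs(rotationDegrees(g)) <= toleranceDegrees) {
                kept.add(g);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = List.of(
                new Glyph("Total", 12.0, 0.0),   // upright table text
                new Glyph("12345", 0.0, 12.0),   // rotated 90 degrees, watermark-like
                new Glyph("DRAFT", 8.5, 8.5));   // rotated 45 degrees
        for (Glyph g : keepUpright(glyphs, 1.0)) {
            System.out.println(g.text);          // prints "Total"
        }
    }
}
```

The small tolerance is there because scanned or slightly skewed pages may not report exactly 0 degrees for normal text. This only helps when the watermark is real rotated text; a watermark embedded as an image would not show up as text at all.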

germainepym commented 1 year ago

Hey, just wondering whether you managed to find a solution or workaround for this problem? I have a similar PDF that has a text watermark at the side too.