mkl-public / testarea-pdfbox2

Test area for public PDFBox v2 issues on stackoverflow etc
Apache License 2.0

One case fails to remove invisible texts or symbols #3

Closed zhongguogu closed 5 years ago

zhongguogu commented 5 years ago

Hi, there is one case in which PDFVisibleTextStripper.java fails to remove invisible text. It occurs on page one of the PDF.

00000000000005fw6q.pdf invisible

mkl-public commented 5 years ago

Please consider all classes in this repository in their respective context. The context usually is apparent from each class's JavaDoc. For PDFVisibleTextStripper this says

/**
 * <a href="https://stackoverflow.com/questions/47358127/remove-invisible-text-from-pdf-using-pdfbox">
 * remove invisible text from pdf using pdfbox
 * </a>
 * <br/>
 * <a href="https://drive.google.com/file/d/1F8vrzcABwxVGdN5W-7etQggY5xKtGplU/view">
 * RevTeaser09072016.pdf
 * </a>
 * <p>
 * This class extends the {@link PDFTextStripper} to ignore text hidden by
 * clipping or by covering with a filled path.
 * </p>
 * <p>
 * The {@link PDFTextStripper} does not implement call backs for path related
 * instructions but the {@link PageDrawer} does. So we borrow code from there
 * to implement path related behavior here.
 * </p>
 * 
 * @author mkl
 */

So this class only ignores text hidden by clipping or by covering with a filled path, not other hidden text. For an extractor that recognizes all kinds of invisible text as ignorable there is much more to implement, cf. the answers to many similar questions on SO, in particular those by Dmitry K.

The "hidden text" in question, though, is neither hidden by a clip path nor covered by some filled path; it simply is a glyph without any visible content. On the other hand, there is a ToUnicode entry for it, mapping it to U+DBD0 during text extraction. That is a High Private Use Surrogate, which in general makes no sense by itself; after text extraction it therefore usually shows up as a '�', i.e. a replacement character.
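If the goal is merely to keep such unpaired surrogates out of the extracted text, one option is to post-process the extracted string. The following is a minimal sketch using only the Java standard library; the class name `LoneSurrogateFilter` and the method `dropLoneSurrogates` are hypothetical and not part of PDFBox or of this repository. It assumes the text has already been extracted, e.g. via `PDFTextStripper.getText`.

```java
public class LoneSurrogateFilter {
    /**
     * Returns the text with unpaired surrogate code units removed.
     * Valid high/low surrogate pairs are kept untouched.
     */
    public static String dropLoneSurrogates(String text) {
        StringBuilder result = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isHighSurrogate(c)
                    && i + 1 < text.length()
                    && Character.isLowSurrogate(text.charAt(i + 1))) {
                // A proper pair: copy both code units and skip the second one.
                result.append(c).append(text.charAt(++i));
            } else if (!Character.isSurrogate(c)) {
                // An ordinary BMP character.
                result.append(c);
            }
            // Otherwise: an unpaired surrogate such as U+DBD0, which is dropped.
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Stand-in for text extracted from the PDF in question (assumed sample).
        String extracted = "visible\uDBD0 text";
        System.out.println(dropLoneSurrogates(extracted));
    }
}
```

This only cleans up the symptom in the output. It does not decide whether the glyph is actually invisible on the page.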

Adobe most likely considers this "hidden text" because it is an empty glyph with a ToUnicode mapping to a codepoint outside the 'Separator, Space' category.
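The presumed criterion can be written down as a small predicate. This is only a sketch of the guess above, not Adobe's actual logic and not something PDFVisibleTextStripper implements. The class `EmptyGlyphHeuristic` and its method are hypothetical, and how to determine that a glyph is empty (the `glyphIsEmpty` argument) is deliberately left open here.

```java
public class EmptyGlyphHeuristic {
    /**
     * Guesses whether a glyph counts as "hidden text": the glyph draws nothing,
     * yet its ToUnicode mapping contains a code point that is not in the
     * Unicode category "Separator, Space" (Zs).
     *
     * @param glyphIsEmpty whether the glyph has no visible content
     *                     (to be determined by the caller; an assumption here)
     * @param unicode      the ToUnicode mapping of the glyph, or null if none
     */
    public static boolean isLikelyHiddenText(boolean glyphIsEmpty, String unicode) {
        if (!glyphIsEmpty || unicode == null || unicode.isEmpty()) {
            return false;
        }
        // An unpaired surrogate such as U+DBD0 is reported as category
        // SURROGATE, which is not SPACE_SEPARATOR, so it matches.
        return unicode.codePoints()
                .anyMatch(cp -> Character.getType(cp) != Character.SPACE_SEPARATOR);
    }
}
```

Under this heuristic an empty glyph mapped to a plain space or a no-break space is treated as ordinary whitespace, while an empty glyph mapped to U+DBD0 is flagged.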