Closed: zhongguogu closed this issue 5 years ago
Please consider all classes in this repository in their respective context, which is usually apparent from the JavaDoc. For PDFVisibleTextStripper
the JavaDoc says:
/**
* <a href="https://stackoverflow.com/questions/47358127/remove-invisible-text-from-pdf-using-pdfbox">
* remove invisible text from pdf using pdfbox
* </a>
* <br/>
* <a href="https://drive.google.com/file/d/1F8vrzcABwxVGdN5W-7etQggY5xKtGplU/view">
* RevTeaser09072016.pdf
* </a>
* <p>
* This class extends the {@link PDFTextStripper} to ignore text hidden by
* clipping or by covering with a filled path.
* </p>
* <p>
* The {@link PDFTextStripper} does not implement call backs for path related
* instructions but the {@link PageDrawer} does. So we borrow code from there
* to implement path related behavior here.
* </p>
*
* @author mkl
*/
So this class only ignores text hidden by clipping or by covering with a filled path, not other hidden text. For an extractor that recognizes all kinds of invisible text as ignorable there is much more to implement, cf. the answers to many similar questions on SO, in particular those by Dmitry K.
The "hidden text" in question, though, is neither hidden by a clip path nor covered by some filled path; it simply is a glyph without any visible content. On the other hand, there is a ToUnicode entry for it that maps it to U+DBD0 during text extraction. That is a High Private Use Surrogate, which generally makes no sense by itself; after text extraction it therefore usually shows up as a '�', i.e. a replacement character.
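As a workaround on the extracted text (not something PDFVisibleTextStripper does), lone surrogates such as U+DBD0 could be dropped after extraction. The following is only a minimal sketch using plain JDK calls; the class name `LoneSurrogateFilter` and its methods are hypothetical and not part of this repository or of PDFBox.

```java
// Hypothetical helper, not part of this repository or of PDFBox.
final class LoneSurrogateFilter {
    private LoneSurrogateFilter() {}

    /** True if c lies in the High Private Use Surrogates block (U+DB80..U+DBFF). */
    static boolean isHighPrivateUseSurrogate(char c) {
        return Character.UnicodeBlock.of(c)
                == Character.UnicodeBlock.HIGH_PRIVATE_USE_SURROGATES;
    }

    /** Drops every surrogate that is not part of a valid high/low pair. */
    static String stripLoneSurrogates(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isHighSurrogate(c)
                    && i + 1 < text.length()
                    && Character.isLowSurrogate(text.charAt(i + 1))) {
                // Valid pair: keep both halves and skip the low surrogate.
                out.append(c).append(text.charAt(i + 1));
                i++;
            } else if (!Character.isSurrogate(c)) {
                // Ordinary BMP character: keep it.
                out.append(c);
            }
            // Otherwise c is an unpaired surrogate and is dropped.
        }
        return out.toString();
    }
}
```

Applied to the string returned by the stripper, `LoneSurrogateFilter.stripLoneSurrogates(text)` would remove the stray U+DBD0 while leaving properly paired supplementary characters intact. It does not address any other kind of invisible text.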
Adobe most likely considers this "hidden text" because it is an empty glyph with a ToUnicode mapping to a codepoint outside the 'Separator, Space' category.
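The presumed rule could be written down roughly as follows. This is only a guess at the behaviour, not Adobe's actual logic, and the names `EmptyGlyphHeuristic` and `looksLikeHiddenText` are hypothetical. How to decide that a glyph is empty (e.g. by inspecting its outline) is left to the caller and is an assumption of this sketch.

```java
// Hypothetical heuristic; a guess at the rule, not Adobe's actual logic.
final class EmptyGlyphHeuristic {
    private EmptyGlyphHeuristic() {}

    /**
     * @param glyphIsEmpty assumed to be determined by the caller, e.g. from the glyph outline
     * @param codePoint    the code point the ToUnicode map assigns to the glyph
     * @return true if an empty glyph maps to something other than a 'Separator, Space' (Zs) code point
     */
    static boolean looksLikeHiddenText(boolean glyphIsEmpty, int codePoint) {
        return glyphIsEmpty
                && Character.getType(codePoint) != Character.SPACE_SEPARATOR;
    }
}
```

Under this rule an empty glyph mapped to U+0020 or U+00A0 counts as an ordinary space, while one mapped to U+DBD0 (general category 'Surrogate') would be flagged as hidden text.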
Hi, there is one case in which PDFVisibleTextStripper.java fails to remove invisible text: page one of the following PDF.
00000000000005fw6q.pdf