pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF
https://pdfminersix.readthedocs.io
MIT License

Don't display suppressed glyphs in text output #967

Open norbusan opened 4 months ago

norbusan commented 4 months ago

Feature request

This feature is about PDF files generated by LaTeX in the OT1 encoding. In this encoding, the Polish slashed-l is created by combining a small slash glyph with an ordinary l.

The peculiar point is that the OT1 encoding of the original TeX fonts has the small slash at code point 0x20 (space)!

With a simple TeX file

\documentclass{article}
\begin{document}
Ha\l{}\l{}o
\end{document}

compiled using pdflatex, I get the following output from pdf2txt on the master branch:

Ha(cid:32)l(cid:32)lo

1

Mind the (cid:32).

Investigating the encoding of the embedded font in the PDF file (by uncompressing the PDF file and looking at the source):

stream
%!PS-AdobeFont-1.0: CMR10 003.002
...
/Encoding 256 array
0 1 255 {1 index exch /.notdef put} for
dup 72 /H put
dup 97 /a put
dup 108 /l put
dup 111 /o put
dup 49 /one put
dup 32 /suppress put
...

we see that the glyph 32 (space, hex 0x20) is marked as suppress.
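To make the observation concrete, here is a minimal sketch of how the suppressed code points could be read out of such an encoding array. It only scans the decompressed font program text for `dup <code> /<name> put` entries. The function name `find_suppressed_codes` is made up for illustration and is not part of pdfminer.

```python
import re

# Matches Type 1 encoding entries of the form: dup 32 /suppress put
ENCODING_ENTRY = re.compile(r"dup\s+(\d+)\s*/(\S+)\s+put")


def find_suppressed_codes(font_program: str) -> set[int]:
    """Return the character codes whose glyph name is 'suppress'.

    Hypothetical helper: assumes the font program has already been
    decompressed to text, as in the excerpt above.
    """
    return {
        int(code)
        for code, name in ENCODING_ENTRY.findall(font_program)
        if name == "suppress"
    }


sample = """\
/Encoding 256 array
0 1 255 {1 index exch /.notdef put} for
dup 72 /H put
dup 97 /a put
dup 108 /l put
dup 111 /o put
dup 49 /one put
dup 32 /suppress put
"""

print(sorted(find_suppressed_codes(sample)))  # [32]
```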

It seems that poppler understands this, since pdftotext (from the poppler utils) and okular (viewer) show only Hallo when doing copy/paste or extracting text.

Would it be possible to support this in pdfminer, too?
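Until something like this is supported, a possible stopgap is to post-process the extracted text and drop the `(cid:N)` markers for codes known to be suppressed. The sketch below assumes the set of suppressed codes (here just 32) has been determined by hand from the font's encoding. `drop_suppressed_cids` is a hypothetical helper, not a pdfminer function.

```python
import re

# pdf2txt prints glyphs it cannot map to Unicode as "(cid:N)"
CID_MARKER = re.compile(r"\(cid:(\d+)\)")


def drop_suppressed_cids(text: str, suppressed=frozenset({32})) -> str:
    """Remove (cid:N) markers whose code N is in `suppressed`.

    Other (cid:N) markers are left untouched, since they may stand
    for real glyphs that simply lack a Unicode mapping.
    """
    def replace(match: re.Match) -> str:
        code = int(match.group(1))
        return "" if code in suppressed else match.group(0)

    return CID_MARKER.sub(replace, text)


print(drop_suppressed_cids("Ha(cid:32)l(cid:32)lo"))  # Hallo
```

This works purely on the text output, so it also strips markers for code 32 coming from fonts that do not mark that code as suppress. A proper fix inside pdfminer would have to look at each font's encoding.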