compiled using pdflatex I get the following output from pdf2txt in the master branch:
Ha(cid:32)l(cid:32)lo
1
Mind the (cid:32).
Investigating the encoding of the embedded font in the PDF file (by uncompressing
the PDF file and looking at the source):
stream
%!PS-AdobeFont-1.0: CMR10 003.002
...
/Encoding 256 array
0 1 255 {1 index exch /.notdef put} for
dup 72 /H put
dup 97 /a put
dup 108 /l put
dup 111 /o put
dup 49 /one put
dup 32 /suppress put
...
we see that the glyph 32 (space, hex 0x20) is marked as suppress.
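To make the observation above concrete, here is a minimal sketch of reading those `dup N /name put` entries out of an uncompressed font program and listing the codes that map to `suppress`. This is not pdfminer code: the names `parse_type1_encoding`, `suppressed_codes` and `SUPPRESSED_GLYPHS` are hypothetical, and treating `suppress` as "emit no text" is an assumption based on what poppler appears to do.

```python
import re
from typing import Dict, Set

# Matches Type 1 encoding entries such as "dup 32 /suppress put".
ENTRY_RE = re.compile(r"dup\s+(\d+)\s*/([^\s/]+)\s+put")

# Assumption: glyph names that should produce no text on extraction.
SUPPRESSED_GLYPHS = {"suppress"}


def parse_type1_encoding(font_program: str) -> Dict[int, str]:
    """Return {character code: glyph name} for every 'dup N /name put' entry."""
    return {int(code): name for code, name in ENTRY_RE.findall(font_program)}


def suppressed_codes(encoding: Dict[int, str]) -> Set[int]:
    """Return the character codes whose glyph name is in SUPPRESSED_GLYPHS."""
    return {code for code, name in encoding.items() if name in SUPPRESSED_GLYPHS}


# The encoding fragment quoted in this issue.
sample = """
/Encoding 256 array
0 1 255 {1 index exch /.notdef put} for
dup 72 /H put
dup 97 /a put
dup 108 /l put
dup 111 /o put
dup 49 /one put
dup 32 /suppress put
"""

encoding = parse_type1_encoding(sample)
print(suppressed_codes(encoding))  # {32}
```

A text extractor could consult such a set and skip those codes instead of printing `(cid:32)`.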
It seems that poppler understands this, since `pdftotext` (from the poppler utils) and `okular` (viewer) show only `Hallo` when doing copy/paste or extracting text.
Would it be possible to support this in pdfminer, too?
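As a stopgap, the `pdf2txt` output can be post-processed to drop the `(cid:32)` markers. The following is only a sketch of that workaround, not a proposal for how pdfminer should implement it. The helper name `strip_suppressed_cids` is hypothetical, and the default of code 32 is an assumption that only holds for fonts like the one in this issue.

```python
import re
from typing import Iterable


def strip_suppressed_cids(text: str, codes: Iterable[int] = (32,)) -> str:
    """Remove '(cid:N)' markers for the given character codes from extracted text."""
    code_list = sorted(set(codes))
    if not code_list:
        return text
    alternatives = "|".join(str(code) for code in code_list)
    # Match the literal marker text that pdf2txt prints for unmapped glyphs.
    pattern = re.compile(r"\(cid:(?:" + alternatives + r")\)")
    return pattern.sub("", text)


print(strip_suppressed_cids("Ha(cid:32)l(cid:32)lo"))  # Hallo
```

This also removes the slash information, so `ł` comes out as a plain `l`, which matches what `pdftotext` shows.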
Feature request
This feature is about PDF files generated by LaTeX in the OT1 encoding. In this encoding, the Polish slashed-l (ł) is created by combining a plain `l` with a separate small slash glyph.
The peculiar point is that the OT1 encoding of the original TeX fonts has the small slash in code point 0x20 (space)!