pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

misc #632

Closed harveyspecter09 closed 4 years ago

harveyspecter09 commented 4 years ago

Hi @JorjMcKie, I hope you are doing well and having a good time. Congratulations on the font replacement functionality, it's awesome.

I have a few questions and would be grateful for your take on them.

1). When I replace Base-14 font names with 'keep' for certain fonts in the CSV, the final output shows the names as NimbusMonoPS (for cobo), NimbusRoman-regular (for times-roman), NimbusRoman-bold (for tibo) and Nimbus Sans Regular (for helvetica). I am unsure why the Nimbus names come into play; I thought the exact names would show up. test0-formatted.pdf test1-formatted.pdf test2-formatted.pdf

Please correct me if I am wrong: I believe we have to replace 'keep' only with Base-14 supported fonts (https://pymupdf.readthedocs.io/en/latest/app4.html#base-14-fonts).

2). I noticed that for non-embedded fonts there are no glyph details or typefaces. Is that the reason why the code cannot replace non-embedded fonts with a new Base-14 supported font?

3). In repl-font.py I could not understand the lines of code below:

'''
print("Building font subsets:")
for fontname in font_subsets.keys():
    msg = "Used %i glyphs of font '%s'." % (len(font_subsets[fontname]), fontname)
    old_buffer = font_buffers[fontname]
    new_buffer = build_subset(old_buffer, font_subsets[fontname])
    if new_buffer is not None:
        s = round((len(old_buffer) - len(new_buffer)) / 1024)
        msg += " %g KB saved." % s
        font_buffers[fontname] = new_buffer
    else:
        msg += " Cannot subset!"
    print(msg)
    del old_buffer
'''

I tested with a few sample PDFs, and in every scenario the message " Cannot subset!" comes up. When will this else branch not be taken? I know it is skipped when the if condition is satisfied, but I am unsure when that happens, since I have no clue about the font subset information.

4). I am unclear about the significance of the build_subset method.

It would be generous of you to share your insights and comments.

harveyspecter09 commented 4 years ago

It would be helpful if you could provide additional information on the build_subset method: its role and significance in the script.

JorjMcKie commented 4 years ago

As a lead-in comment: this script is still somewhat in beta - so I am grateful for comments and experience reports!

Confirming your comments on Base-14 stuff:

I am still working on this script. I just took an old test PDF which used the Droid Sans Fallback font. It used to have a size of 1.6 MB just for a few lines of English / Chinese text. Then I executed the script on it, replacing this font with itself, et voilà: the size came down to below 6 KB!

JorjMcKie commented 4 years ago

It would be helpful if you could provide additional information on the build_subset method: its role and significance in the script.

I agree. As I said: this script is new and somewhat beta still.

JorjMcKie commented 4 years ago

Maybe a more general comment is adequate:

Font subsetting happens all over the place with converter software doing "XXX to PDF" conversions. They can do so because they know the overall set of characters (glyphs) ever used in the document to convert. If you use PyMuPDF methods to insert text, you don't know that: you don't know whether this is the first or the last such insert for this text / font combination. Only when you have a finalized PDF can you be sure about the final set of characters used. Therefore the font replacement script is in the same position as those PDF converters. I scan through all text and build sets of used unicodes per font to be replaced. For every replacement font, I build a corresponding subset and use this subset when I re-insert the respective text.

As mentioned before, font replacement doesn't need to be taken literally: if you take the same font again to replace itself, you will end up with a new PDF that looks exactly the same (hopefully), but is smaller ...
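The scan described above can be sketched in a few lines. This is only an illustration, not the code of the replacement script: the function name `collect_used_codepoints` is made up here, and the input is assumed to follow the nested blocks / lines / spans / chars layout that PyMuPDF's "rawdict" text extraction produces.

```python
from collections import defaultdict


def collect_used_codepoints(page_dicts):
    """Return {font name: set of Unicode code points} over all given pages.

    Hypothetical helper. Each item of page_dicts is assumed to be a dict
    shaped like PyMuPDF's "rawdict" output:
    blocks -> lines -> spans -> chars, where a span has a "font" name
    and every char has its character under the key "c".
    """
    used = defaultdict(set)
    for page in page_dicts:
        for block in page.get("blocks", []):
            if block.get("type", 0) != 0:  # type 0 = text, skip image blocks
                continue
            for line in block.get("lines", []):
                for span in line.get("spans", []):
                    for char in span.get("chars", []):
                        used[span["font"]].add(ord(char["c"]))
    return dict(used)


# Hand-made sample in the assumed layout (not taken from a real PDF).
sample_page = {
    "blocks": [
        {"type": 0, "lines": [{"spans": [
            {"font": "Noto Sans Bold",
             "chars": [{"c": "H"}, {"c": "e"}, {"c": "l"}, {"c": "l"}, {"c": "o"}]},
            {"font": "Space Mono Regular",
             "chars": [{"c": "4"}, {"c": "2"}]},
        ]}]},
        {"type": 1},  # an image block carries no text
    ]
}

subsets = collect_used_codepoints([sample_page])
for fontname, codepoints in sorted(subsets.items()):
    print("Used %i distinct characters of font '%s'." % (len(codepoints), fontname))
```

With PyMuPDF installed, the page dictionaries could presumably be obtained per page via the "rawdict" option of the text extraction method. Only after every page has been visited is the per-font set final, which is why the subsets can only be built on a finished PDF.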

JorjMcKie commented 4 years ago

Please correct me if I am wrong: I believe we have to replace 'keep' only with Base-14 supported fonts.

You can, but you do not need to. You can decide to replace Helvetica... by "helv"/"heit", etc. The outcome will be a PDF with (non-subsettable) "Nimbusxxx" fonts. Or just choose another font of your liking. If you do replace them, this is only possible with embedded fonts - whatever they are and wherever they come from.

harveyspecter09 commented 4 years ago

Thanks for the lightning-fast reply @JorjMcKie. I wish every learner had a mentor like you when starting their professional career; anyone would be grateful for the support you are providing.

Last but not least, I just came across CMap: can we analyze a font's CMap and return the proper Unicode values for that font? I know I am sounding vague, but I hope that with your expertise you get my query.

  1. Is there a way to convert encoding/type from one format to another (e.g. Identity-H encoding to ANSI, TrueType to Type1)?
JorjMcKie commented 4 years ago

Very nice feedback, indeed, thank you!

can we analyze font CMap

There is a fitz.Font method which returns all unicodes defined in the font. E.g.

>>> font = fitz.Font("spacemo")  # my favorite monospaced font
>>> vuc = font.valid_codepoints()  # all characters this font has
>>> len(vuc)
613
>>> chr(vuc[100])
'¤'
>>> chr(vuc[200])
'ĉ'
>>> chr(8217)
'’'
>>> vuc.index(8217)
526
>>> font.unicode_to_glyph_name(8217)
'quoteright'
>>> 
JorjMcKie commented 4 years ago

Is there a way to convert encoding/type from one format to another (e.g. Identity-H encoding to ANSI, TrueType to Type1)?

I suppose there is, I have seen some such tables on the web site of the Unicode Consortium, I believe. It is just not part of PyMuPDF.

JorjMcKie commented 4 years ago

The method font.valid_codepoints() represents the "right hand side" of the CMAP table. Here is an example:

>>> page=doc[0]
>>> for f in page.getFontList(): print(f)  # list of fonts on the page

(15, 'ttf', 'Type0', 'Noto Sans Bold', 'F0', 'Identity-H')  # just take this example
(21, 'ttf', 'Type0', 'Noto Sans Regular', 'F1', 'Identity-H')
(27, 'ttf', 'Type0', 'Space Mono Regular', 'F101', 'Identity-H')
(32, 'otf', 'Type0', 'Noto Sans Symbols Regular', 'F10', 'Identity-H')
>>> print(doc.xrefObject(15))
<<
  /Type /Font
  /Subtype /Type0
  /BaseFont /Noto#20Sans#20Bold
  /Encoding /Identity-H
  /ToUnicode 16 0 R  % this is the CMAP, a "stream" object
  /DescendantFonts [ 20 0 R ]
>>
>>> print(doc.xrefStream(16).decode())  # a bytes object, so decode with utf-8
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo <</Registry(Adobe)/Ordering(UCS)/Supplement 0>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
8 beginbfrange  % 0x30, 0x31, 0x43, 0x44, ... appear in font.valid_codepoints()
<0003> <0004> <0030>
<0005> <0008> <0043>
<000a> <000b> <004c>
<000c> <000d> <004f>
<000f> <0015> <0063>
<0016> <001b> <006b>
<001c> <0020> <0072>
<0021> <0022> <0078>
endbfrange
4 beginbfchar
<0001> <0020>
<0002> <002e>
<0009> <0049>
<000e> <0061>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end

>>> 

As a side note: This PDF had been created with the font replacing script 😎, therefore just a few unicode values appear in the remaining CMAP ...
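To make the "right hand side" idea concrete, here is a minimal sketch that reads the bfchar and bfrange sections of such a ToUnicode stream into a dictionary. It is not part of PyMuPDF; the function name `parse_tounicode` is hypothetical, and the parser assumes only the simple forms shown above, where every destination is a single code point (no array-style ranges, no multi-character destinations).

```python
import re

HEX = r"<([0-9A-Fa-f]+)>"


def parse_tounicode(cmap_text):
    """Map character codes to Unicode code points (hypothetical helper).

    Assumption: each destination is one code point written as a single
    hex string; array-style bfrange entries and multi-character
    destinations are not handled.
    """
    text = re.sub(r"%[^\n]*", "", cmap_text)  # drop '%' comments
    mapping = {}
    # bfchar: one source code maps to one destination
    for section in re.findall(r"beginbfchar(.*?)endbfchar", text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, section):
            mapping[int(src, 16)] = int(dst, 16)
    # bfrange: codes lo..hi map to dst, dst+1, ...
    for section in re.findall(r"beginbfrange(.*?)endbfrange", text, re.S):
        for lo, hi, dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, section):
            lo, hi, dst = int(lo, 16), int(hi, 16), int(dst, 16)
            for offset in range(hi - lo + 1):
                mapping[lo + offset] = dst + offset
    return mapping


# A shortened excerpt of the stream printed above.
sample = """
begincmap
2 beginbfrange  % first two ranges only
<0003> <0004> <0030>
<0005> <0008> <0043>
endbfrange
2 beginbfchar
<0001> <0020>
<0009> <0049>
endbfchar
endcmap
"""

mapping = parse_tounicode(sample)
decoded = "".join(chr(mapping[code]) for code in (9, 1, 3))
print(sorted(hex(u) for u in mapping.values()))
```

The values of `mapping` are the unicodes that remain reachable through this font, while the keys are the character codes that actually appear in the page's text operators.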

JorjMcKie commented 4 years ago

The font in the above example, "Noto Sans Bold", has a compressed size of over 270 KB. After subsetting to those few characters actually written with the font, the subset in the PDF is only about 6 KB compressed. This is the value of subsetting.