pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

How do I get the outline of a text #1080

Closed zdw1011781461 closed 3 years ago

zdw1011781461 commented 3 years ago

outline.zip

import fitz

doc = fitz.open("outline.pdf")
page = doc[0]

text_dict = page.get_text('dict')
text_html = page.get_text('html')
print('text_dict:', text_dict)
with open('outline.html', 'w') as wf:
    wf.write(text_html)

With 'dict' I only get the text, fonts and bounding boxes. How do I get the outline of the text?

JorjMcKie commented 3 years ago

What do you mean by "outline"?

zdw1011781461 commented 3 years ago

outlines

The HTML file in the zip package has no red outline

JorjMcKie commented 3 years ago

Ah, I understand you now. You are referring to a PDF feature that lets you have different colors for the character border and the interior. This is achieved by the PDF command 2 Tr ("set text rendering mode" to 2). It shows the characters including a border, drawn with the currently active border color and border width. The default, 0 Tr, only shows characters with their fill color. This technique is often used to simulate bold text without having to use a different font.

PyMuPDF fully supports this technique, but for text output only (Page.insert_text() and TextWriter.write_text()).

In text extraction, we only get the fill color, as the span["color"] dictionary key. We get neither the border color nor the currently active border width. Unfortunately, I have no way to extend the amount of information extracted by the base library MuPDF.
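For reference, here is a minimal sketch of turning that extracted fill color into RGB components. It assumes that span["color"] in the 'dict' output is an sRGB integer, as the PyMuPDF documentation describes. The helper name srgb_int_to_rgb is made up for illustration and is not part of PyMuPDF.

```python
def srgb_int_to_rgb(color):
    """Split an sRGB integer (0xRRGGBB) into float components in 0..1."""
    red = (color >> 16) & 0xFF
    green = (color >> 8) & 0xFF
    blue = color & 0xFF
    return (red / 255, green / 255, blue / 255)


# Example with a made-up span dictionary (only the "color" key matters here).
span = {"text": "example", "color": 0x1F1A17}
print(srgb_int_to_rgb(span["color"]))
```

An integer such as 0x1F1A17 corresponds roughly to the dark gray fill (0.12, 0.10, 0.09) used on the example page. The border color is still not available this way.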

In addition, text extraction in the HTML, XML and XHTML formats is a thin wrapper around native MuPDF code, and I have no way to influence it.

I suggest you submit a feature request to MuPDF / Artifex.

You can try to find this information using low-level code like below ... very tedious and error prone! I have marked the relevant commands with % <== comments for your better understanding.

>>> cont=page.read_contents().decode()
>>> print(cont)
1.00000 0.00000 0.00000 1.00000 0.0000 0.0000 cm
/GS9 gs
0 J
0 j
0.5669 w
22.925585626053735 M
[] 0 d
0.12 0.10 0.09 RG
0.0000 841.8898 m
595.2756 841.8898 l
595.2756 0.0000 l
0.0000 0.0000 l
h
s
/GS9 gs
0 J
0 j
0.2160 w  % <== set border width
22.925585626053735 M
[] 0 d
0.85 0.15 0.11 RG  % <== set RGB border color (some red)
0.12 0.10 0.09 rg  % <== set RGB fill color (some dark gray)
BT  % <== begin text object
2 Tr  % <== set text rendering mode 2
116.5669 459.0434 TD
/F10 66.5330 Tf
<6d4b>Tj
70.4693 0.0000 TD
<8bd5>Tj
70.4696 0.0000 TD
<63cf>Tj
70.4696 0.0000 TD
<8fb9>Tj
70.4693 0.0000 TD
<6587>Tj
70.4693 0.0000 TD
<672c>Tj
ET
q
100 641.89 100 100 re
0 w
h
0.117647 0.0980392 0.0862745 rg f
Q
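
The manual inspection above can be roughly automated. The following is only a sketch under several assumptions: it splits the content stream on whitespace, tracks just the RG, rg, w, Tr, q and Q operators, and records the state in effect at each ET. It ignores other color operators, graphics state dictionaries (gs), and string literals that contain spaces or a % sign, so treat it as a starting point. The function name text_object_states is hypothetical and not part of PyMuPDF.

```python
def text_object_states(content):
    """Return the tracked graphics state at the end of each BT ... ET object."""
    state = {"stroke_rgb": None, "fill_rgb": None,
             "line_width": 1.0, "render_mode": 0}
    saved = []      # states pushed by the q operator
    operands = []   # numbers seen since the last operator
    found = []
    for raw_line in content.splitlines():
        line = raw_line.split("%", 1)[0]  # drop comments (naive, see lead-in)
        for token in line.split():
            try:
                operands.append(float(token))
                continue
            except ValueError:
                pass  # not a number, so treat the token as an operator
            if token == "RG" and len(operands) >= 3:
                state["stroke_rgb"] = tuple(operands[-3:])
            elif token == "rg" and len(operands) >= 3:
                state["fill_rgb"] = tuple(operands[-3:])
            elif token == "w" and operands:
                state["line_width"] = operands[-1]
            elif token == "Tr" and operands:
                state["render_mode"] = int(operands[-1])
            elif token == "q":
                saved.append(dict(state))
            elif token == "Q" and saved:
                state = saved.pop()
            elif token == "ET":
                found.append(dict(state))
            operands = []
    return found


# With PyMuPDF, the input would be: page.read_contents().decode()
sample = """
0.2160 w
0.85 0.15 0.11 RG
0.12 0.10 0.09 rg
BT
2 Tr
/F10 66.5330 Tf
<6d4b>Tj
ET
"""
for entry in text_object_states(sample):
    print(entry)
```

A text object whose render_mode is 2 has its characters both filled and stroked, so stroke_rgb and line_width describe the border.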

zdw1011781461 commented 3 years ago

Thanks for your help

JorjMcKie commented 3 years ago

@zdw1011781461 FYI: I am developing a way to extract the information you wanted to see. I hope I can make it part of the next version. It behaves similarly to the text extraction page.get_text("rawdict"), but delivers additional details on the text.

Here is a printout of your example page. The basic fact to note is that for text with bordered characters, two separate spans are produced: one span for the inner part of the text using the fill color, and another span with the same text but with the border color. As the printout shows, these two spans are identical Python dictionaries except for two keys: "color" and "type".

The characters of a span are in a list which contains, for each character, the (int) unicode, the (int) glyph id, the origin (2-tuple) and the character width at fontsize 1.

>>> pprint(page._getTexttrace())
[{'ascender': 1.05810546875,  # font property
  'bidi': 0,  # technical info, ignore this
  'chars': ((27979, 2169, (116.56690216064453, 459.04339599609375), 1.0),
            (35797, 2721, (187.03619384765625, 459.04339599609375), 1.0),
            (25551, 1876, (257.50579833984375, 459.04339599609375), 1.0),
            (36793, 2822, (327.97540283203125, 459.04339599609375), 1.0),
            (25991, 1926, (398.4447021484375, 459.04339599609375), 1.0),
            (26412, 1998, (468.91400146484375, 459.04339599609375), 1.0)),
  'color': (0.11999999731779099, 0.10000000149011612, 0.09000000357627869),
  'colorspace': 3,  # indicates RGB
  'descender': -0.26171875,  # font property
  'font': 'IQDUKX+MicrosoftYaHei',  # font name
  'linewidth': 0.5669000148773193,  # border width of the characters
  'opacity': 1.0,  # transparency ... finally!
  'origin': (116.56690216064453, 459.04339599609375),  # origin of the span (equal to 1st char)
  'scissor': (1.0, 1.0, -1.0, -1.0),  # technical, ignore this
  'size': 66.53299713134766,  # fontsize
  'transform': (1.0, 0.0, 0.0, -1.0, 0.0, 841.8897705078125),  # technical, ignore this
  'type': 0,  # fill text
  'wmode': 0},  # writing mode: horizontal
 {'ascender': 1.05810546875,
  'bidi': 0,
  'chars': ((27979, 2169, (116.56690216064453, 459.04339599609375), 1.0),
            (35797, 2721, (187.03619384765625, 459.04339599609375), 1.0),
            (25551, 1876, (257.50579833984375, 459.04339599609375), 1.0),
            (36793, 2822, (327.97540283203125, 459.04339599609375), 1.0),
            (25991, 1926, (398.4447021484375, 459.04339599609375), 1.0),
            (26412, 1998, (468.91400146484375, 459.04339599609375), 1.0)),
  'color': (0.8500000238418579, 0.15000000596046448, 0.10999999940395355),
  'colorspace': 3,
  'descender': -0.26171875,
  'font': 'IQDUKX+MicrosoftYaHei',
  'linewidth': 0.5669000148773193,
  'opacity': 1.0,
  'origin': (116.56690216064453, 459.04339599609375),
  'scissor': (1.0, 1.0, -1.0, -1.0),
  'size': 66.53299713134766,
  'transform': (1.0, 0.0, 0.0, -1.0, 0.0, 841.8897705078125),
  'type': 1,  # stroke text (the text border)
  'wmode': 0}]
>>> 

Information like the above will help me determine not only transparency, but also whether text is hidden: hidden text will have yet another "type" value.
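
As a rough illustration of how such output could be used, the sketch below pairs each fill span (type 0) with the stroke span (type 1) that shows the same characters and reports the border color and width. It assumes dictionaries shaped like the printout above. The function name pair_fill_and_stroke is hypothetical, and the key layout may differ in a released version.

```python
def pair_fill_and_stroke(spans):
    """Find spans that occur both as fill (type 0) and as stroke (type 1)."""
    groups = {}
    for span in spans:
        # Identify a run of text by font, size, and each character's code and origin.
        chars = tuple((c[0], tuple(c[2])) for c in span["chars"])
        key = (span["font"], span["size"], chars)
        groups.setdefault(key, {})[span["type"]] = span
    outlined = []
    for by_type in groups.values():
        if 0 in by_type and 1 in by_type:
            fill, stroke = by_type[0], by_type[1]
            outlined.append({
                "text": "".join(chr(c[0]) for c in fill["chars"]),
                "fill_color": fill["color"],
                "border_color": stroke["color"],
                "border_width": stroke["linewidth"],
            })
    return outlined


# Shortened stand-in for the two spans printed above (two characters only).
chars = ((27979, 2169, (116.5669, 459.0434), 1.0),
         (35797, 2721, (187.0362, 459.0434), 1.0))
base = {"font": "IQDUKX+MicrosoftYaHei", "size": 66.533,
        "chars": chars, "linewidth": 0.5669}
trace = [dict(base, type=0, color=(0.12, 0.10, 0.09)),
         dict(base, type=1, color=(0.85, 0.15, 0.11))]
for item in pair_fill_and_stroke(trace):
    print(item)
```

Matching on font, size and character origins is a design choice of this sketch. Text that is only filled or only stroked produces a single span and is skipped.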

zdw1011781461 commented 3 years ago

I understand your meaning and look forward to your updated version. Thank you