Closed zdw1011781461 closed 3 years ago
What do you mean by "outline"?
The HTML file in the zip package has no red outline
Ah, understand you now.
You are referring to a PDF feature that lets you have different colors for the character border and the interior. This is achieved by the PDF command `2 Tr` ("set text rendering mode" to 2). This will show the characters including a border with the currently active border color and border width.
The default `0 Tr` only shows characters with their fill color.
This technique is often used to simulate bold text without having to use a different font.
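For reference, here is a small sketch of the text rendering modes that the `Tr` operator can select, as listed in the PDF specification. The dictionary and the two helper functions are my own illustration and are not part of PyMuPDF.

```python
# Text rendering modes selectable with the PDF operator "Tr".
TEXT_RENDER_MODES = {
    0: "fill",
    1: "stroke",
    2: "fill, then stroke",
    3: "invisible",
    4: "fill and add to clipping path",
    5: "stroke and add to clipping path",
    6: "fill, then stroke, and add to clipping path",
    7: "add to clipping path",
}


def has_fill(mode: int) -> bool:
    """True if the glyph interiors are painted with the fill color."""
    return mode in (0, 2, 4, 6)


def has_border(mode: int) -> bool:
    """True if the glyph outlines are stroked with the border color."""
    return mode in (1, 2, 5, 6)


for mode, meaning in TEXT_RENDER_MODES.items():
    print(mode, meaning, "| fill:", has_fill(mode), "| border:", has_border(mode))
```

Mode 2 is the one used in your file: it is the only non-clipping mode that paints both the interior and the border.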
PyMuPDF fully supports this technique on text output only (`Page.insert_text()` and `TextWriter.write_text()`).
In text extraction, we only get the fill color, as the `span["color"]` dictionary key. We get neither the border color nor the currently active border width.
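As a side note, `span["color"]` is a single integer in sRGB notation (`0xRRGGBB`), whereas PDF content streams use floats between 0 and 1. A minimal sketch for converting one into the other, so the two can be compared; the helper name is made up for this example.

```python
def srgb_int_to_rgb(color: int) -> tuple:
    """Split an sRGB integer 0xRRGGBB into an (r, g, b) tuple of floats in 0..1."""
    red = (color >> 16) & 0xFF
    green = (color >> 8) & 0xFF
    blue = color & 0xFF
    return (red / 255, green / 255, blue / 255)


print(srgb_int_to_rgb(0xFF0000))  # pure red
```

Expect small rounding differences when comparing against content stream values, because the integer only has 8 bits per component.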
I unfortunately have no way to extend the amount of information extracted by the base library MuPDF.
In addition, text extraction in the HTML, XML and XHTML formats is a thin wrapper around native MuPDF code, so I have no way to influence it.
I suggest you submit a feature request to MuPDF / Artifex.
You can try to find this information using low-level code like below ... very tedious and error prone!
I have marked the relevant commands with `% <==` comments for your better understanding.
>>> cont=page.read_contents().decode()
>>> print(cont)
1.00000 0.00000 0.00000 1.00000 0.0000 0.0000 cm
/GS9 gs
0 J
0 j
0.5669 w
22.925585626053735 M
[] 0 d
0.12 0.10 0.09 RG
0.0000 841.8898 m
595.2756 841.8898 l
595.2756 0.0000 l
0.0000 0.0000 l
h
s
/GS9 gs
0 J
0 j
0.2160 w % <== set border width
22.925585626053735 M
[] 0 d
0.85 0.15 0.11 RG % <== set RGB border color (some red)
0.12 0.10 0.09 rg % <== set RGB fill color (some dark gray)
BT % <== begin text object
2 Tr % <== set text rendering mode 2
116.5669 459.0434 TD
/F10 66.5330 Tf
<6d4b>Tj
70.4693 0.0000 TD
<8bd5>Tj
70.4696 0.0000 TD
<63cf>Tj
70.4696 0.0000 TD
<8fb9>Tj
70.4693 0.0000 TD
<6587>Tj
70.4693 0.0000 TD
<672c>Tj
ET
q
100 641.89 100 100 re
0 w
h
0.117647 0.0980392 0.0862745 rg f
Q
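If you want to try the low-level route, here is a rough sketch of how a contents string like the one above could be scanned for the relevant operators. It is a naive illustration under several assumptions: it only tracks `w`, `RG`, `rg` and `Tr`, ignores `q` / `Q` state saving, gray and CMYK colors, nested parentheses in strings and inline images, and it reports the state found at each `ET`. The function name and the result layout are made up for this example.

```python
import re

# Tokens: literal strings, hex strings, array brackets, everything else.
TOKEN = re.compile(r"\((?:\\.|[^\\()])*\)|<[0-9A-Fa-f\s]*>|[\[\]]|[^\s\[\]<>()]+")
SHOW_OPS = {"Tj", "TJ", "'", '"'}  # text-showing operators


def scan_text_styles(stream: str) -> list:
    """Return one dict per text object with the border/fill state at its ET."""
    state = {
        "render_mode": 0,                # PDF default: fill only
        "border_width": 1.0,             # PDF default line width
        "border_color": (0.0, 0.0, 0.0),
        "fill_color": (0.0, 0.0, 0.0),
    }
    operands, strings, found = [], [], []
    # Strip "%" comments (naive: also hits "%" inside literal strings).
    for tok in TOKEN.findall(re.sub(r"%.*", "", stream)):
        try:
            operands.append(float(tok))
            continue
        except ValueError:
            pass
        if tok[0] in "/(<[]":            # names, strings, array brackets
            operands.append(tok)
            continue
        # Everything else is treated as an operator.
        if tok == "w" and operands:
            state["border_width"] = operands[-1]
        elif tok == "RG" and len(operands) >= 3:
            state["border_color"] = tuple(operands[-3:])
        elif tok == "rg" and len(operands) >= 3:
            state["fill_color"] = tuple(operands[-3:])
        elif tok == "Tr" and operands:
            state["render_mode"] = int(operands[-1])
        elif tok == "BT":
            strings = []
        elif tok in SHOW_OPS:
            strings += [o for o in operands if isinstance(o, str) and o[0] in "(<"]
        elif tok == "ET" and strings:
            found.append(dict(state, strings=strings))
        operands = []
    return found


sample = """
0.2160 w % <== set border width
0.85 0.15 0.11 RG
0.12 0.10 0.09 rg
BT
2 Tr
116.5669 459.0434 TD
/F10 66.5330 Tf
<6d4b>Tj
70.4693 0.0000 TD
<8bd5>Tj
ET
"""
result = scan_text_styles(sample)
print(result)
```

The collected strings are raw character codes. Turning them into readable text depends on the font's encoding, which is another reason why this approach is tedious.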
Thanks for your help
@zdw1011781461 FYI: I am developing a way to extract the information you wanted to see.
I hope I can make it a part of the next version.
It behaves similarly to the text extraction `page.get_text("rawdict")`, but delivers additional details on the text. Here is a printout for your example page. The basic fact to be aware of is that for text with bordered characters, two separate spans are produced: one span with the inner part of the text using the fill color, and another span with the same text, but with the border color. These two spans are identical Python dictionaries except for two differences:
- The fill span has `"type": 0` and the fill color in the `"color"` key.
- The border span has `"type": 1` and the border color in the `"color"` key.
The characters in a span are in a list which contains the (int) unicode, (int) glyph id, origin (2-tuple) and character width at fontsize 1.
>>> pprint(page._getTexttrace())
[{'ascender': 1.05810546875, # font property
'bidi': 0, # technical info, ignore this
'chars': ((27979, 2169, (116.56690216064453, 459.04339599609375), 1.0),
(35797, 2721, (187.03619384765625, 459.04339599609375), 1.0),
(25551, 1876, (257.50579833984375, 459.04339599609375), 1.0),
(36793, 2822, (327.97540283203125, 459.04339599609375), 1.0),
(25991, 1926, (398.4447021484375, 459.04339599609375), 1.0),
(26412, 1998, (468.91400146484375, 459.04339599609375), 1.0)),
'color': (0.11999999731779099, 0.10000000149011612, 0.09000000357627869),
'colorspace': 3, # indicates RGB
'descender': -0.26171875, # font property
'font': 'IQDUKX+MicrosoftYaHei', # font name
'linewidth': 0.5669000148773193, # border width of the characters
'opacity': 1.0, # transparency ... finally!
'origin': (116.56690216064453, 459.04339599609375), # origin of the span (equal to 1st char)
'scissor': (1.0, 1.0, -1.0, -1.0), # technical, ignore this
'size': 66.53299713134766, # fontsize
'transform': (1.0, 0.0, 0.0, -1.0, 0.0, 841.8897705078125), # technical ignore this
'type': 0, # fill text
'wmode': 0}, # writing mode: horizontal
{'ascender': 1.05810546875,
'bidi': 0,
'chars': ((27979, 2169, (116.56690216064453, 459.04339599609375), 1.0),
(35797, 2721, (187.03619384765625, 459.04339599609375), 1.0),
(25551, 1876, (257.50579833984375, 459.04339599609375), 1.0),
(36793, 2822, (327.97540283203125, 459.04339599609375), 1.0),
(25991, 1926, (398.4447021484375, 459.04339599609375), 1.0),
(26412, 1998, (468.91400146484375, 459.04339599609375), 1.0)),
'color': (0.8500000238418579, 0.15000000596046448, 0.10999999940395355),
'colorspace': 3,
'descender': -0.26171875,
'font': 'IQDUKX+MicrosoftYaHei',
'linewidth': 0.5669000148773193,
'opacity': 1.0,
'origin': (116.56690216064453, 459.04339599609375),
'scissor': (1.0, 1.0, -1.0, -1.0),
'size': 66.53299713134766,
'transform': (1.0, 0.0, 0.0, -1.0, 0.0, 841.8897705078125),
'type': 1, # stroke text (the text border)
'wmode': 0}]
>>>
Information like the above will help me determine not only transparency, but also whether text is hidden: hidden text will get yet another `"type"` value.
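To get one record per piece of text with both colors, the fill and border spans can be paired up afterwards. Here is a minimal sketch, assuming the list has the shape of the printout above; that output comes from a private, still unreleased method and may change. The helper name and the shortened sample list are only for illustration.

```python
def pair_fill_and_border(trace: list) -> list:
    """Merge fill (type 0) and border (type 1) spans that show the same text."""
    merged = {}
    for span in trace:
        chars = span["chars"]
        if not chars:
            continue
        text = "".join(chr(c[0]) for c in chars)
        # Same text, same start point and same size: treat as one piece of text.
        key = (text, tuple(chars[0][2]), span["size"])
        entry = merged.setdefault(
            key, {"text": text, "fill": None, "border": None, "linewidth": None}
        )
        if span["type"] == 0:
            entry["fill"] = span["color"]
        elif span["type"] == 1:
            entry["border"] = span["color"]
            entry["linewidth"] = span["linewidth"]
    return list(merged.values())


# Shortened version of the printout above (first two characters only).
chars = (
    (27979, 2169, (116.56690216064453, 459.04339599609375), 1.0),
    (35797, 2721, (187.03619384765625, 459.04339599609375), 1.0),
)
trace = [
    {"type": 0, "chars": chars, "size": 66.53299713134766,
     "color": (0.11999999731779099, 0.10000000149011612, 0.09000000357627869),
     "linewidth": 0.5669000148773193},
    {"type": 1, "chars": chars, "size": 66.53299713134766,
     "color": (0.8500000238418579, 0.15000000596046448, 0.10999999940395355),
     "linewidth": 0.5669000148773193},
]
paired = pair_fill_and_border(trace)
print(paired)
```

Text without a border produces only a type 0 span, so its `"border"` and `"linewidth"` entries stay `None`.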
I understand your meaning and look forward to your updated version. Thank you
outline.zip
```python
import fitz

doc = fitz.open("outline.pdf")
page = doc[0]
text_dict = page.get_text("dict")
text_html = page.get_text("html")
print("text_dict:", text_dict)
with open("outline.html", "w") as wf:
    wf.write(text_html)
```
I use 'dict' to get only the text, fonts and boxes. I want to get the outline of the text. What should I do?