Closed LeonMelis closed 4 years ago
I'm trying to create the PDF file based on the info you provided, by hand. Can you provide the font identified by g_d0_f1 (this is an auto-generated name, the actual font may be different)? In the PDF, the font is identified by /F2. Look for something along the lines of <</Font<</F2<<...>>>>
@Rob--W recreating the PDF by hand... man, I really admire your help! It really sucks I'm in no position to share the PDF :(
Anyway, from the Page object I understand that font F2 is object 7, gen 0:
3 0 obj
<</Type /Page
/Parent 2 0 R
/MediaBox [0 0 595 842]
/Contents 9 0 R
/Resources <</Font <</F1 6 0 R
/F2 7 0 R>>
/XObject <</I1 8 0 R>>
/ProcSet [/PDF /Text /ImageC]>>
/Annots [11 0 R]>>
endobj
That should be this part:
7 0 obj
<</Type /Font
/Subtype /TrueType
/BaseFont /PXAAAB+Helvetica,Bold
/FirstChar 1
/LastChar 41
/Widths [277 277 333 277 556 556 556 556 333 975 722 833 777 666 610 666 556 610 556 610 556 333 610 610 277 277 556 277 889 610 610 610 389 556 333 610 556 777 556 500 744]
/FontDescriptor 18 0 R
/ToUnicode 19 0 R>>
endobj
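As a side note on reading the font object above: in a simple PDF font, /Widths holds one entry per character code from /FirstChar to /LastChar, in 1/1000 text-space units. The sketch below only illustrates that indexing with the values quoted above; the names fontF2 and glyphWidth are made up for this example and are not part of PDF.js.

```javascript
// Values copied from the quoted "7 0 obj"; the object shape itself is hypothetical.
const fontF2 = {
  firstChar: 1,
  lastChar: 41,
  widths: [
    277, 277, 333, 277, 556, 556, 556, 556, 333, 975, 722, 833, 777, 666,
    610, 666, 556, 610, 556, 610, 556, 333, 610, 610, 277, 277, 556, 277,
    889, 610, 610, 610, 389, 556, 333, 610, 556, 777, 556, 500, 744,
  ],
};

// Returns the advance width for a character code, or undefined when the code
// is outside [firstChar, lastChar]. A real PDF reader would fall back to the
// font descriptor's /MissingWidth in that case.
function glyphWidth(font, charCode) {
  if (charCode < font.firstChar || charCode > font.lastChar) {
    return undefined;
  }
  return font.widths[charCode - font.firstChar];
}

// The array length should match LastChar - FirstChar + 1.
console.log(fontF2.widths.length === fontF2.lastChar - fontF2.firstChar + 1);
console.log(glyphWidth(fontF2, 10));
```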
Extracted via getTextContent():
"g_d0_f1": {
"fontFamily": "sans-serif",
"ascent": 0.77001953125,
"descent": -0.22998046875
},
"g_d0_f2": {
"fontFamily": "sans-serif",
"ascent": 0.77001953125,
"descent": -0.22998046875
}
Again, I'm really sorry I can't supply the PDF!
I ran into PDF files where bold text (or at least thicker than rest of the text) seems to be created by placing the same words over each other with a slight offset on the X axis. I have never seen this behaviour in other PDFs, so I only assume it's done to create the 'bold' effect.
To me, this would suggest that the bug is in the PDF file itself, i.e. the PDF generator is creating a broken file, since that isn't how you're supposed to create bold text. The usual way it's done is by using different fonts for the regular and bold text. (We've also seen a case where different text rendering modes were used to create "bold" text.) I'm assuming that you don't have control over what software was used to create the PDF file, but if you do, I'd advise you to use a better PDF generator.
Again, I'm not sure if this is a bug or intended behaviour.
In this case, it's probably intended behaviour, but let me try to elaborate.
Please note that the primary use-case for getTextContent is to enable text-selection, copying, and searching in e.g. the default PDF.js viewer. Since many PDF files unfortunately use separate showText commands, with moveText/setTextMatrix commands in between to simulate spaces, this often leads to less desirable text-selection behaviour.
In order to avoid this, we're using a couple of heuristics to attempt to combine (same line) showText commands. Most likely this explains what you're describing above, please see e.g. https://github.com/mozilla/pdf.js/blob/e908b71309bddd9656cfd6bfc86054eaab3c1974/src/core/evaluator.js#L1419 and https://github.com/mozilla/pdf.js/blob/e908b71309bddd9656cfd6bfc86054eaab3c1974/src/core/evaluator.js#L1452.
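To make the general idea concrete, here is a toy model of merging same-line text runs. It is not the PDF.js code from the evaluator.js links above; the function name, the run shape ({ str, x, y, width }), both thresholds, and the width of 60 are assumptions made up for illustration.

```javascript
// Toy model only: merges a run into the previous item when both sit on the
// same line and the horizontal gap between them is small (or negative, i.e.
// the runs overlap). The thresholds are arbitrary illustration values.
function combineSameLineRuns(runs, maxGap = 2, spaceGap = 0.5) {
  const items = [];
  for (const run of runs) {
    const last = items[items.length - 1];
    if (last && last.y === run.y) {
      // Distance from the end of the previous item to the start of this run.
      const gap = run.x - (last.x + last.width);
      if (gap <= maxGap) {
        // A visible gap is treated as a simulated space.
        if (gap >= spaceGap) {
          last.str += " ";
        }
        last.str += run.str;
        last.width = Math.max(last.width, run.x + run.width - last.x);
        continue;
      }
    }
    items.push({ ...run });
  }
  return items;
}

// Two overlaid copies of the same word (coordinates from the report,
// width is a placeholder) collapse into a single item.
const overlaid = combineSameLineRuns([
  { str: "Omschrijving", x: 80.7501, y: 439.8596, width: 60 },
  { str: "Omschrijving", x: 80.9301, y: 439.8596, width: 60 },
]);
console.log(overlaid.length, overlaid[0].str);
```

In this model an overlapping second run counts as "close enough" to merge, which is one way a fake-bold word could end up concatenated with itself.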
18 0 R and 19 0 R are also part of the font definition, please post them.
@Snuffleupagus Thanks for the explanation, the whole 'fakespaces' thing I wondered about in the source makes sense now. I have removed the 'optimization' parts from OPS.moveText and OPS.setTextMatrix as you suggested and now I get the expected result!
@Rob--W Looks like you'll need objects 18 through 22. Objects 19 and 20 are compressed (FlateDecode) streams, so I'm afraid pasting that binary data here will not work, but I'll give it a try anyway. I have also attached the decompressed objects 19 and 20 to this post.
18 0 obj
<</Type /FontDescriptor
/FontName /Helvetica,Bold
/Ascent 770
/Descent -229
/ItalicAngle 0
/StemV 0
/CapHeight 1474
/Flags 4
/FontBBox [-1017 -480 1436 1159]
/FontFile2 20 0 R>>
endobj
19 0 obj
<</Length 21 0 R
/Filter /FlateDecode>>
stream
[405 bytes of FlateDecode-compressed binary data; not representable as pasted text]
endstream
endobj
21 0 obj
405
endobj
20 0 obj
<</Length 22 0 R
/Filter /FlateDecode
/Length1 15332>>
stream
[10225 bytes of FlateDecode-compressed binary data; not representable as pasted text]
endstream
endobj
22 0 obj
10225
endobj
Here is the PDF file that produces the reported issue: issue7445.pdf
@LeonMelis For your information, PR #7475 added a parameter (disableCombineTextItems) that, when set, disables the heuristics that attempt to combine text runs. I hope this is helpful in your use-case!
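A minimal sketch of how that might be used, assuming a PDF.js build that includes PR #7475 and assuming the parameter is passed in the options object of getTextContent(). The helper name getUncombinedTextItems is made up for this example, and page is expected to be a PDF.js page object.

```javascript
// Assumption: `page` is a PDF.js page object exposing getTextContent(options),
// and the build in use still supports disableCombineTextItems.
async function getUncombinedTextItems(page) {
  const textContent = await page.getTextContent({
    // Ask PDF.js not to merge neighbouring text runs into one item.
    disableCombineTextItems: true,
  });
  return textContent.items;
}
```

With the heuristics disabled, the two overlaid copies of a word should come back as separate items, each with its own transform.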
This may be solved by implementing something like this: https://github.com/danigm/poppler/blob/0011805e22193b690b99a53dcb9986ce04eb3eb4/poppler/TextOutputDev.cc#L1210
First of all, I'm not sure if this is a bug or intended behaviour, I searched the issues and can't find anything related. I read the guidelines but I really can't attach the PDF in question, it contains sensitive information which is under NDA. I am allowed (I think) to post snippets of the content, so I hope that that is enough information to reproduce the result. I can also disclose that the PDF in question appears to be created with 'Prince 9.0 rev 5'.
Configuration:
I use PDF.js to extract text data with getTextContent(). I ran into PDF files where bold text (or at least thicker than the rest of the text) seems to be created by placing the same words over each other with a slight offset on the X axis. I have never seen this behaviour in other PDFs, so I only assume it's done to create the 'bold' effect.
The following content contains the word 'Omschrijving' (Dutch for 'Description'), which as you can see is placed twice, with a slight (0.18px) offset on the X axis.
Running getTextContent() returns this as a single object with str = 'OmschrijvingOmschrijving':
Also note that the width returned by getTextContent() is correct for a single string 'Omschrijving', so that too is a bit confusing.
The expected result would be two text items, with identical text, identical Y coordinate (439.8596) but different X coordinate (80.7501 vs 80.9301), like this:
Again, I'm not sure if this is a bug or intended behaviour. In case of the latter, any hints to what part of the pdfjs code I should look into to achieve the preferred behaviour would be really helpful!
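For anyone who ends up with the two separate items described above and wants plain text without the doubled word, one possible post-processing step is sketched here. It assumes each text item carries a six-element transform array with X at index 4 and Y at index 5; the function name and the 0.5 threshold are hypothetical, and the scale entries in the sample data are placeholders.

```javascript
// Drops items that repeat an already-kept item's text at the same Y position
// with only a tiny X offset (the "fake bold" pattern from the report).
function dropOverlaidDuplicates(items, maxOffsetX = 0.5) {
  const kept = [];
  for (const item of items) {
    const x = item.transform[4];
    const y = item.transform[5];
    const isDuplicate = kept.some(
      (other) =>
        other.str === item.str &&
        other.transform[5] === y &&
        Math.abs(other.transform[4] - x) <= maxOffsetX
    );
    if (!isDuplicate) {
      kept.push(item);
    }
  }
  return kept;
}

// Coordinates taken from the report; the scale entries are placeholders.
const sampleItems = [
  { str: "Omschrijving", transform: [1, 0, 0, 1, 80.7501, 439.8596] },
  { str: "Omschrijving", transform: [1, 0, 0, 1, 80.9301, 439.8596] },
];
console.log(dropOverlaidDuplicates(sampleItems).length);
```

The threshold needs tuning per document: too large and genuinely repeated words that sit close together on one line would be dropped as well.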