mozilla / pdf.js

PDF Reader in JavaScript
https://mozilla.github.io/pdf.js/
Apache License 2.0

getTextContent() concatenates overlapping strings #7445

Closed LeonMelis closed 4 years ago

LeonMelis commented 8 years ago

First of all, I'm not sure whether this is a bug or intended behaviour; I searched the issues and couldn't find anything related. I read the guidelines, but I really can't attach the PDF in question: it contains sensitive information that is under NDA. I am allowed (I think) to post snippets of the content, so I hope that is enough information to reproduce the result. I can also disclose that the PDF appears to have been created with 'Prince 9.0 rev 5'.

Configuration:

I use PDF.js to extract text data with getTextContent(). I ran into PDF files where bold text (or at least text thicker than the rest) seems to be created by placing the same words over each other with a slight offset on the X axis. I have never seen this behaviour in other PDFs, so I can only assume it's done to create the 'bold' effect.

The following content contains the word 'Omschrijving' (Dutch for 'Description'), which, as you can see, is placed twice with a slight offset (0.18 units) on the X axis.

BT
80.7501 439.8596 Td
/F2 9.0000 Tf
<0D1D22131821191A25191E17> Tj
0.1800 0.0000 Td
<0D1D22131821191A25191E17> Tj
ET

Running getTextContent() returns this as a single object with str = 'OmschrijvingOmschrijving':

{
"str": "OmschrijvingOmschrijving",
"dir": "ltr",
"width": 57.63600000000001,
"height": 9,
"transform": [9, 0, 0, 9, 80.7501, 439.8596],
"fontName": "g_d0_f1"
}

Also note that the width returned by getTextContent() is correct for a single string 'Omschrijving', so that too is a bit confusing.

The expected result would be two text items with identical text and an identical Y coordinate (439.8596) but different X coordinates (80.7501 vs 80.9301), like this:

{
"str": "Omschrijving",
"dir": "ltr",
"width": 57.63600000000001,
"height": 9,
"transform": [9, 0, 0, 9, 80.7501, 439.8596],
"fontName": "g_d0_f1"
},
{
"str": "Omschrijving",
"dir": "ltr",
"width": 57.63600000000001,
"height": 9,
"transform": [9, 0, 0, 9, 80.9301, 439.8596],
"fontName": "g_d0_f1"
}
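To see where the two expected X coordinates come from, the following minimal sketch replays the Td and Tj operators from the content stream above. The `replayTextOps` helper and its operator objects are hypothetical and only model Td and Tj starting from an identity text matrix; this is not how PDF.js is implemented.

```javascript
// Replay a tiny subset of PDF text operators (Td and Tj only).
// Hypothetical helper for illustration; not part of PDF.js.
function replayTextOps(ops) {
  let lineX = 0; // start of the current text line
  let lineY = 0;
  const runs = [];
  for (const op of ops) {
    if (op.name === "Td") {
      // Td moves relative to the start of the current line,
      // not relative to the end of the text shown so far.
      lineX += op.args[0];
      lineY += op.args[1];
    } else if (op.name === "Tj") {
      runs.push({ hex: op.args[0], x: lineX, y: lineY });
    }
  }
  return runs;
}

const runs = replayTextOps([
  { name: "Td", args: [80.7501, 439.8596] },
  { name: "Tj", args: ["0D1D22131821191A25191E17"] },
  { name: "Td", args: [0.18, 0] },
  { name: "Tj", args: ["0D1D22131821191A25191E17"] },
]);

for (const run of runs) {
  console.log(run.x.toFixed(4), run.y.toFixed(4));
}
```

Because Td is relative to the line start, the second run starts only 0.18 units to the right of the first one, which gives the 80.7501 and 80.9301 positions quoted above.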

Again, I'm not sure if this is a bug or intended behaviour. In case of the latter, any hints as to which part of the PDF.js code I should look into to achieve the preferred behaviour would be really helpful!

Rob--W commented 8 years ago

I'm trying to recreate the PDF file by hand, based on the info you provided. Can you provide the font identified by g_d0_f1? (This is an auto-generated name; the actual font may be different.) In the PDF, the font is identified by /F2. Look for something along the lines of <</Font<</F2<<...>>>>

LeonMelis commented 8 years ago

@Rob--W recreating the PDF by hand... man, I really admire your help! It really sucks I'm in no position to share the PDF :(

Anyway, from the Page object I understand that font F2 is object 7, gen 0:

3 0 obj
<</Type /Page
/Parent 2 0 R
/MediaBox [0 0 595 842]
/Contents 9 0 R
/Resources <</Font <</F1 6 0 R
/F2 7 0 R>>
/XObject <</I1 8 0 R>>
/ProcSet [/PDF /Text /ImageC]>>
/Annots [11 0 R]>>
endobj

That should be this part:

7 0 obj
<</Type /Font
/Subtype /TrueType
/BaseFont /PXAAAB+Helvetica,Bold
/FirstChar 1
/LastChar 41
/Widths [277 277 333 277 556 556 556 556 333 975 722 833 777 666 610 666 556 610 556 610 556 333 610 610 277 277 556 277 889 610 610 610 389 556 333 610 556 777 556 500 744]
/FontDescriptor 18 0 R
/ToUnicode 19 0 R>>
endobj
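As a cross-check, the advance width of the string can be computed from the /Widths array in this font object. The sketch below assumes single-byte character codes, no character or word spacing, and 100% horizontal scaling; `stringWidth` is a hypothetical helper, not a PDF.js function.

```javascript
// /FirstChar and /Widths copied from font object 7 above.
const firstChar = 1;
const widths = [
  277, 277, 333, 277, 556, 556, 556, 556, 333, 975, 722, 833, 777, 666,
  610, 666, 556, 610, 556, 610, 556, 333, 610, 610, 277, 277, 556, 277,
  889, 610, 610, 610, 389, 556, 333, 610, 556, 777, 556, 500, 744,
];

// Sum the glyph advances (in 1/1000 text-space units) for a hex string
// of single-byte character codes, then scale by the font size.
function stringWidth(hex, fontSize) {
  let total = 0;
  for (let i = 0; i < hex.length; i += 2) {
    const code = parseInt(hex.slice(i, i + 2), 16);
    total += widths[code - firstChar];
  }
  return (total * fontSize) / 1000;
}

const single = stringWidth("0D1D22131821191A25191E17", 9);
console.log(single.toFixed(3)); // width of one run
console.log((single + 0.18).toFixed(3)); // one run plus the 0.18 Td offset
```

Under these assumptions a single 'Omschrijving' comes out at 57.456, and adding the 0.18 offset gives 57.636. That suggests the width reported by getTextContent() covers both overlapping runs rather than one.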

Extracted via getTextContent():

"g_d0_f1": {
  "fontFamily": "sans-serif",
  "ascent": 0.77001953125,
  "descent": -0.22998046875
},
"g_d0_f2": {
  "fontFamily": "sans-serif",
  "ascent": 0.77001953125,
  "descent": -0.22998046875
}

Again, I'm really sorry I can't supply the PDF!

Snuffleupagus commented 8 years ago

> I ran into PDF files where bold text (or at least thicker than rest of the text) seems to be created by placing the same words over each other with a slight offset on the X axis. I have never seen this behaviour in other PDF's, so I only assume it's done to create the 'bold' effect.

To me, this would suggest that the bug is in the PDF file itself, i.e. the PDF generator is creating a broken file, since that isn't how you're supposed to create bold text. The usual way is to use different fonts for the regular and bold text. (We've also seen a case where different text rendering modes were used to create "bold" text.) I'm assuming that you don't have control over what software was used to create the PDF file, but if you do, I'd advise you to use a better PDF generator.

> Again, I'm not sure if this is a bug or intended behaviour.

In this case, it's probably intended behaviour, but let me try to elaborate. Please note that the primary use-case for getTextContent is to enable text selection, copying, and searching in e.g. the default PDF.js viewer. Since many PDF files unfortunately use separate showText commands, with moveText/setTextMatrix commands in between to simulate spaces, this often leads to less desirable text-selection behaviour. To avoid this, we use a couple of heuristics that attempt to combine showText commands on the same line. Most likely this explains what you're describing above; please see e.g. https://github.com/mozilla/pdf.js/blob/e908b71309bddd9656cfd6bfc86054eaab3c1974/src/core/evaluator.js#L1419 and https://github.com/mozilla/pdf.js/blob/e908b71309bddd9656cfd6bfc86054eaab3c1974/src/core/evaluator.js#L1452.
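The effect of such a heuristic on the reported content stream can be modelled with a short sketch. This is a simplified, hypothetical model and not the code at the links above: `extractItems`, its operator objects, and the `maxSameLineAdvance` threshold are all assumptions made for illustration.

```javascript
// Hypothetical, simplified model of a "combine same-line runs" heuristic.
// maxSameLineAdvance is an assumed threshold, not a PDF.js constant.
function extractItems(ops, { combine = true, maxSameLineAdvance = 1 } = {}) {
  const items = [];
  let x = 0;
  let y = 0;
  let current = null; // item that is still open for appending
  for (const op of ops) {
    if (op.name === "Td") {
      const [tx, ty] = op.args;
      x += tx;
      y += ty;
      const smallSameLineMove = ty === 0 && tx > 0 && tx <= maxSameLineAdvance;
      // A small same-line move keeps the current item open;
      // any other move (or combine = false) closes it.
      if (!(combine && current && smallSameLineMove)) {
        current = null;
      }
    } else if (op.name === "Tj") {
      if (current) {
        current.str += op.args[0];
      } else {
        current = { str: op.args[0], x, y };
        items.push(current);
      }
    }
  }
  return items;
}

const ops = [
  { name: "Td", args: [80.7501, 439.8596] },
  { name: "Tj", args: ["Omschrijving"] },
  { name: "Td", args: [0.18, 0] },
  { name: "Tj", args: ["Omschrijving"] },
];

const combined = extractItems(ops);
const separate = extractItems(ops, { combine: false });
console.log(combined.map((item) => item.str));
console.log(separate.map((item) => item.str));
```

In this model the 0.18 move is indistinguishable from a tiny same-line advance, so the second run is appended to the first. With combining switched off, the same operators yield two separate items.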

Rob--W commented 8 years ago

18 0 R and 19 0 R are also part of the font definition; please post them.

LeonMelis commented 8 years ago

@Snuffleupagus Thanks for the explanation; the whole 'fakespaces' thing I wondered about in the source makes sense now. I have removed the 'optimization' parts from OPS.moveText and OPS.setTextMatrix as you suggested, and now I get the expected result!

@Rob--W Looks like you'll need objects 18 through 22. Objects 19 and 20 are compressed (FlateDecode) streams, so I'm afraid pasting that binary data here will not work, but I'll give it a try anyway. I have also attached the decompressed objects 19 and 20 to this post.

18 0 obj
<</Type /FontDescriptor
/FontName /Helvetica,Bold
/Ascent 770
/Descent -229
/ItalicAngle 0
/StemV 0
/CapHeight 1474
/Flags 4
/FontBBox [-1017 -480 1436 1159]
/FontFile2 20 0 R>>
endobj

19 0 obj
<</Length 21 0 R
/Filter /FlateDecode>>
stream
[405 bytes of FlateDecode-compressed binary data (ToUnicode stream); not representable as text. Decompressed copy attached as 19_0_decompressed.txt.]
endstream
endobj

21 0 obj
405
endobj

20 0 obj
<</Length 22 0 R
/Filter /FlateDecode
/Length1 15332>>
stream
[10225 bytes of FlateDecode-compressed binary data (embedded TrueType font program); not representable as text. Decompressed copy attached as 20_0_decompressed.txt.]
endobj

22 0 obj
10225
endobj

19_0_decompressed.txt 20_0_decompressed.txt
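For anyone who wants to produce such decompressed copies themselves, FlateDecode streams are zlib-compressed, so Node's built-in zlib module can inflate them. This is a minimal sketch: `decodeFlateStream` is a hypothetical helper, and the sample data is a stand-in rather than the real contents of object 19.

```javascript
const zlib = require("zlib");

// Inflate the raw bytes found between `stream` and `endstream` of a
// /FlateDecode object. `raw` must be a Buffer holding the original bytes
// read from the PDF file, not text copied from a web page.
function decodeFlateStream(raw) {
  return zlib.inflateSync(raw);
}

// Round-trip demonstration with stand-in data (not the real object 19).
const sample = Buffer.from("/CIDInit /ProcSet findresource begin", "latin1");
const compressed = zlib.deflateSync(sample);
const restored = decodeFlateStream(compressed);
console.log(restored.toString("latin1"));
```

The bytes have to be sliced out of the PDF file itself, using the /Length value to find the end of the stream; text pasted into a comment has already lost or altered some of the bytes.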

Rob--W commented 8 years ago

Here is the PDF file that produces the reported issue: issue7445.pdf

Snuffleupagus commented 8 years ago

@LeonMelis For your information, PR #7475 added a parameter (disableCombineTextItems) that when set disables the heuristics that attempt to combine text runs. I hope this is helpful in your use-case!
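A minimal sketch of how that parameter might be used, assuming `page` is a PDF.js page object and that the installed PDF.js version accepts `disableCombineTextItems` in the options passed to getTextContent(). `getUncombinedItems` is a hypothetical wrapper name.

```javascript
// Collect text items with the run-combining heuristics disabled.
// `page` is expected to be a PDF.js page object; whether the option is
// honoured depends on the PDF.js version in use.
async function getUncombinedItems(page) {
  const textContent = await page.getTextContent({
    disableCombineTextItems: true, // parameter named in the comment above
  });
  return textContent.items.map((item) => ({
    str: item.str,
    x: item.transform[4], // horizontal position
    y: item.transform[5], // vertical position
  }));
}
```

With the PDF from this issue, the hope is that this returns two 'Omschrijving' items at slightly different X positions instead of one concatenated item.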

timvandermeij commented 8 years ago

This may be solved by implementing something like this: https://github.com/danigm/poppler/blob/0011805e22193b690b99a53dcb9986ce04eb3eb4/poppler/TextOutputDev.cc#L1210
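One way to read "something like this" is a de-duplication pass that drops text drawn a second time at almost the same position. The sketch below is not a port of the linked Poppler code: `dropOverprintedItems` and its tolerance are assumptions, and it expects text items that were extracted without combining.

```javascript
// Drop a text item when an already kept item has the same string and font
// and starts within `tolerance` units of it (fake-bold overprinting).
// Hypothetical post-processing helper; the tolerance is an assumed value.
function dropOverprintedItems(items, tolerance = 0.5) {
  const kept = [];
  for (const item of items) {
    const x = item.transform[4];
    const y = item.transform[5];
    const duplicate = kept.some(
      (k) =>
        k.str === item.str &&
        k.fontName === item.fontName &&
        Math.abs(k.transform[4] - x) <= tolerance &&
        Math.abs(k.transform[5] - y) <= tolerance
    );
    if (!duplicate) {
      kept.push(item);
    }
  }
  return kept;
}

// The two expected items from the original report.
const items = [
  { str: "Omschrijving", fontName: "g_d0_f1", transform: [9, 0, 0, 9, 80.7501, 439.8596] },
  { str: "Omschrijving", fontName: "g_d0_f1", transform: [9, 0, 0, 9, 80.9301, 439.8596] },
];
console.log(dropOverprintedItems(items).length);
```

A tolerance tied to the font size would probably be more robust than a fixed value, since a larger font may use a larger fake-bold offset.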