ol-th / pdf-img-convert.js

Simple node package to convert a PDF into images.
MIT License
156 stars, 33 forks

Current rendering implementation fails to recognise some characters/fonts #3

Open ol-th opened 3 years ago

ol-th commented 3 years ago

I'm putting this on here as I haven't been able to find a solution as of now.

When rendering a PDF such as https://gahp.net/wp-content/uploads/2017/09/sample.pdf, the numbers on lists are not recognised by the PDF.js engine, which results in strange characters appearing:

output6 output7

manoadamro commented 1 year ago

Same issue when testing with the W3C dummy PDF from here, using 1.0.6.

Screenshot 2022-09-12 at 13 37 45

GitHubRulesOK commented 11 months ago

The problem with links is that they die, so for reference: the OP's issue is that the Computer Modern bullet points have their own encoding, so there is nothing to directly replace them with other than a Unicode square.

image

Here is the offending symbol, just before the next line of text:

BT
/F38 11.96 Tf 0 0 Td[()]TJ
ET
1 0 0 1 5.97 0 cm
1 1 1 1 k 1 1 1 1 K
1 0 0 1 5.86 0 cm
BT
/F44 11.96 Tf 0 0 Td[(doc/latex/general/latex2e.dvi)]TJ/F28 11.96 Tf 211.01 0 Td[(and)]TJ
ET

Here is the description of the symbols (/bullet). It would have been best if that name had been placed "inline" as an indicator of the author's intent:

211 0 obj <</Ascent 750/CapHeight 683/Descent 0/FontName 210 0 R/ItalicAngle -14/StemV 85/XHeight 430/FontBBox[-29 -960 1116 775]/Flags 4/CharSet(/bullet/greaterequal/arrowright/arrowdblright/element/negationslash/backslash/radical)/FontFile 205 0 R>> endobj
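
To check which glyphs a font descriptor like the one above declares, the /CharSet string can be split into glyph names. This is only a minimal sketch, assuming the descriptor is available as plain text; `parseCharSet` is a hypothetical helper and not part of pdf-img-convert.js or PDF.js.

```javascript
// Hypothetical helper (not a PDF.js or pdf-img-convert.js API):
// extract the glyph names listed in a /CharSet(...) entry.
function parseCharSet(descriptorText) {
  // Capture everything between the parentheses that follow /CharSet.
  const match = /\/CharSet\s*\(([^)]*)\)/.exec(descriptorText);
  if (!match) return [];
  // Names are written as /name/name/..., so split on "/" and drop empties.
  return match[1]
    .split('/')
    .map((name) => name.trim())
    .filter((name) => name.length > 0);
}

// Shortened copy of the descriptor quoted in this thread.
const descriptor =
  '<</Ascent 750/Flags 4/CharSet(/bullet/greaterequal/arrowright' +
  '/arrowdblright/element/negationslash/backslash/radical)/FontFile 205 0 R>>';

const glyphNames = parseCharSet(descriptor);
console.log(glyphNames);
```

For the descriptor above the first name returned is `bullet`, which matches the list marker that goes missing in the rendered output.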

Here is the printer definition (the embedded PostScript Type 1 font program):

205 0 obj
<</Length 2964/Length1 207 0 R/Length2 208 0 R/Length3 209 0 R>>
stream
%!PS-AdobeFont-1.1: CMSY10 1.0
%%CreationDate: 1991 Aug 15 07:20:57
% Copyright (C) 1997 American Mathematical Society. All Rights Reserved.
11 dict begin
/FontInfo 7 dict dup begin
/version (1.0) readonly def
/Notice (Copyright (C) 1997 American Mathematical Society. All Rights Reserved) readonly def
/FullName (CMSY10) readonly def
/FamilyName (Computer Modern) readonly def
/Weight (Medium) readonly def
/ItalicAngle -14.035 def
/isFixedPitch false def
end readonly def
/FontName /YLJAAA+CMSY10 def
/PaintType 0 def
/FontType 1 def
/FontMatrix [0.001 0 0 0.001 0 0] readonly def
/Encoding 256 array
dup 15 /bullet put
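
The `dup 15 /bullet put` line assigns character code 15 to the glyph named bullet. Presumably the string in the `/F38` text operator above holds that single byte (0x0F), a control character with no printable form, which would explain why it looks empty when pasted as text. The sketch below shows one way such an encoding could be read and mapped to Unicode. It assumes the font program is available as plain text; `parseType1Encoding`, `decodeByte` and the small `GLYPH_TO_UNICODE` table are hypothetical names for illustration, not how PDF.js does it.

```javascript
// Hand-picked subset of glyph-name -> Unicode mappings (illustrative only;
// a real renderer would use a full glyph list).
const GLYPH_TO_UNICODE = new Map([
  ['bullet', 0x2022],
  ['greaterequal', 0x2265],
  ['arrowright', 0x2192],
  ['arrowdblright', 0x21d2],
  ['element', 0x2208],
  ['backslash', 0x005c],
  ['radical', 0x221a],
]);

// Hypothetical helper: collect "dup <code> /<name> put" entries
// from the cleartext part of a Type 1 font program.
function parseType1Encoding(fontProgramText) {
  const entries = new Map();
  const pattern = /dup\s+(\d+)\s*\/([A-Za-z0-9._]+)\s+put/g;
  let m;
  while ((m = pattern.exec(fontProgramText)) !== null) {
    entries.set(Number(m[1]), m[2]);
  }
  return entries;
}

// Map one character code to a Unicode string, falling back to
// U+FFFD (replacement character) when the code or name is unknown.
function decodeByte(code, encoding) {
  const name = encoding.get(code);
  const codePoint = name === undefined ? undefined : GLYPH_TO_UNICODE.get(name);
  return codePoint === undefined ? '\uFFFD' : String.fromCodePoint(codePoint);
}

const fontProgram = '/Encoding 256 array\ndup 15 /bullet put\n';
const encoding = parseType1Encoding(fontProgram);
console.log(encoding.get(15));          // glyph name for code 15
console.log(decodeByte(15, encoding));  // mapped Unicode character
```

The fallback to a replacement character mirrors the point made earlier in the thread: when the name cannot be resolved, there is nothing sensible to substitute other than a placeholder.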

Hmm interesting /ItalicAngle -14.035 def ! rugby ball ovoid ? more like a void. 🙂

Copy of the file for analysis. NOTE: it should be similar to https://github.com/ol-th/pdf-img-convert.js/blob/master/examples/test_pdfs/sample.pdf 🙂

sample (4).pdf

GitHubRulesOK commented 11 months ago

The second sample is different: apart from poor spacing, it should work well on a Windows device, and on any other device if embedded fonts are used.

image