ol-th / pdf-img-convert.js

Simple node package to convert a PDF into images.
MIT License
161 stars, 38 forks

Converts PDF without Fonts/Text #42

Open dms-ts opened 1 year ago

dms-ts commented 1 year ago

I'm trying to convert some shipping labels to PNG. It converts the barcodes and images, but no text/fonts. I already installed Font fix, but it doesn't work.

tabetommy commented 1 year ago

I'm having the same issue as @dms-ts. Any ideas, please?

Jussinevavuori commented 1 year ago

Same issue, but only for some types of PDFs. Regular PDF files uploaded from the user's device convert fine as they are; however, for some reason this library fails to convert PDFs created with React PDF.

ojtramp commented 1 year ago

I'm having the same issue - any advice? I've installed Microsoft Fonts and have checked that Arial is installed on my EC2 Ubuntu system running node but still no luck.

I'm looking for a package that doesn't save to the file system and can import a PDF from a URL and export an array of images. I'm very happy with this package apart from the missing text (obviously a big problem), but I'm happy to switch to an alternative if anyone has any advice.

ojtramp commented 1 year ago

I changed the verbosity of the PDF.js command to 1 so that I could get the following error messages; the ones relating to Helvetica match the text that is missing. These are my error messages:

Warning: fetchStandardFontData: failed to fetch file "FoxitSans.pfb" with "UnknownErrorException: The standard font "baseUrl" parameter must be specified, ensure that the "standardFontDataUrl" API parameter is provided.".
Warning: fetchStandardFontData: failed to fetch file "FoxitSansBold.pfb" with "UnknownErrorException: The standard font "baseUrl" parameter 
Warning: getPathGenerator - ignoring character: "Error: Requesting object that isn't resolved yet Helvetica_path_T.".
Warning: getPathGenerator - ignoring character: "Error: Requesting object that isn't resolved yet Helvetica_path_h.".

I think my system is saying that it would substitute the Helvetica with Arial:

fc-match Helvetica
Arial.ttf: "Arial" "Regular"

So I'm not sure what's going on... I'll keep trying to find a solution and post back if I find something.

ojtramp commented 1 year ago

Think I found a fix that is legit:

I changed line 100 in the file pdf-img-convert.js:

var loadingTask = pdfjs.getDocument({data: pdfData, disableFontFace: false, verbosity: 0});

It looks like this should be okay from the 2018 answer here.

ojtramp commented 1 year ago

So that didn't work. As mentioned in the earlier part of that 2018 thread, that change will break other documents' fonts.

deathemperor commented 11 months ago

I was able to resolve this issue by following the instructions in https://github.com/mozilla/pdf.js/issues/4244#issuecomment-1232548915

final version:

diff --git a/pdf-img-convert.js b/pdf-img-convert.js
index 01e8c64c9ffa13ea226a689fa08e78d97213dabe..97939693584b700a985fe3ef3a2fe054a26ddf41 100644
--- a/pdf-img-convert.js
+++ b/pdf-img-convert.js
@@ -29,6 +29,7 @@ const Canvas = require("canvas");
 const assert = require("assert").strict;
 const fs = require("fs");
 const util = require('util');
+const path = require('path');

 const readFile = util.promisify(fs.readFile);

@@ -95,9 +96,9 @@ module.exports.convert = async function (pdf, conversion_config = {}) {

   // At this point, we want to convert the pdf data into a 2D array representing
   // the images (indexed like array[page][pixel])
-
+  let packagePath = path.dirname(require.resolve("pdfjs-dist/package.json"));
   var outputPages = [];
-  var loadingTask = pdfjs.getDocument({data: pdfData, disableFontFace: true, verbosity: 0});
+  var loadingTask = pdfjs.getDocument({data: pdfData, disableFontFace: true, verbosity: 0, standardFontDataUrl: packagePath + '/standard_fonts/'});

   var pdfDocument = await loadingTask.promise

@ol-th would you accept a PR for this?
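
For anyone calling PDF.js directly rather than patching the library, the same idea can be sketched as a small helper. This is only an illustration of the change in the diff above: the helper names `standardFontsUrl` and `buildDocumentOptions` are made up for this example, and the `require` path for PDF.js shown in the usage comment is an assumption that may differ between pdfjs-dist versions.

```javascript
const path = require("path");

// Build the standard-fonts URL from the location of pdfjs-dist's package.json.
// The value passed as standardFontDataUrl should end with a trailing slash.
function standardFontsUrl(pdfjsPackageJson) {
  return path.dirname(pdfjsPackageJson) + "/standard_fonts/";
}

// Options object for pdfjs.getDocument(); mirrors the patched call in the diff.
function buildDocumentOptions(pdfData, pdfjsPackageJson) {
  return {
    data: pdfData,
    disableFontFace: true,
    verbosity: 0,
    standardFontDataUrl: standardFontsUrl(pdfjsPackageJson),
  };
}

// Usage (assumes pdfjs-dist is installed; the build path is an assumption):
//   const pdfjs = require("pdfjs-dist/legacy/build/pdf.js");
//   const opts = buildDocumentOptions(
//     pdfData,
//     require.resolve("pdfjs-dist/package.json")
//   );
//   const pdfDocument = await pdfjs.getDocument(opts).promise;
```

Resolving the folder from `pdfjs-dist/package.json` keeps the path correct wherever `node_modules` ends up, which is why the diff does it that way instead of hard-coding a directory.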

YoricWatterott commented 9 months ago

I would also like to bump this issue. I will have to look for another library if this doesn't get solved. Has anyone looked at @deathemperor's response? Could it work?

Love the simplicity of using this library, just hope this issue can get resolved. All the best.

deathemperor commented 9 months ago

Hope you find it useful. That patch successfully converts our 300+ PDFs daily.

YoricWatterott commented 9 months ago

How can I implement your change, @deathemperor? Has it been patched into the latest version, or do you mean you made the change yourself in the lib files?

I can't edit the file directly, because I have a pipeline that runs npm install.

If I do have to implement that change myself, I'll have to add a script to my pipeline to edit the file after the fact.

I'd prefer not to do that, so if you have an alternative suggestion that would be great.

Thanks for your response though, @deathemperor. I appreciate your time.

ol-th commented 9 months ago

@deathemperor if you could send a PR for this fix that would be great. I'll test it out and add it to a new release if all good.

deathemperor commented 9 months ago

I use https://www.npmjs.com/package/patch-package to maintain patches like these until the repo officially supports the fix.
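
To illustrate what such a post-install patch step amounts to, here is a rough sketch of an idempotent source rewrite for this particular fix. It is not how patch-package is implemented (that tool replays a saved diff). The function name `patchSource` is hypothetical, and the original line is taken from the diff earlier in this thread, so it will not match if the library's source has changed since.

```javascript
// The unpatched call, as it appears in the diff in this thread.
const ORIGINAL =
  "pdfjs.getDocument({data: pdfData, disableFontFace: true, verbosity: 0});";

// The same call with standardFontDataUrl added (inlined so no extra
// variable needs to be declared elsewhere in the file).
const PATCHED =
  "pdfjs.getDocument({data: pdfData, disableFontFace: true, verbosity: 0, " +
  "standardFontDataUrl: require('path').dirname(" +
  "require.resolve('pdfjs-dist/package.json')) + '/standard_fonts/'});";

// Return the patched source text. Safe to run more than once:
// an already-patched file is returned unchanged.
function patchSource(source) {
  if (source.includes("standardFontDataUrl")) {
    return source; // already patched
  }
  if (!source.includes(ORIGINAL)) {
    // Fail loudly instead of silently shipping an unpatched library.
    throw new Error("Expected getDocument call not found; source may have changed");
  }
  return source.replace(ORIGINAL, PATCHED);
}

// Usage sketch (the file path is an assumption about the install layout):
//   const fs = require("fs");
//   const file = require.resolve("pdf-img-convert");
//   fs.writeFileSync(file, patchSource(fs.readFileSync(file, "utf8")));
```

A hand-rolled rewrite like this breaks whenever the upstream line changes, which is the main reason to prefer a maintained tool for the job.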

deathemperor commented 9 months ago

sure, here's the PR https://github.com/ol-th/pdf-img-convert.js/pull/50

YoricWatterott commented 9 months ago

Hi guys, has this been merged into latest? I'd love to start using this, thanks

YoricWatterott commented 9 months ago

Hi @deathemperor, thank you so much for leading me to https://www.npmjs.com/package/patch-package

I managed to implement it successfully and can continue using the library seamlessly.

Much appreciated.

deathemperor commented 9 months ago

I'm glad it helped!