shebinleo / pdf2html

pdf2html is a module that converts PDF files to HTML pages using Apache Tika. It can also generate a thumbnail image of a PDF file using Apache PDFBox.
https://www.npmjs.com/package/pdf2html
Apache License 2.0

Converted HTML returns blank html #38

Closed Ekansh19 closed 1 year ago

Ekansh19 commented 2 years ago

Hi,

After converting a PDF to HTML, I am getting the same HTML code for every file (the files are all different), and the body is almost blank. Sample file: jpg2pdf.pdf

Result HTML:

(Screenshot 2022-05-27 at 3 19 38 PM)

reregaga commented 2 years ago

This library cannot extract images (such as figures or graphs) from a PDF, but you can create an image (thumbnail) of a whole page:

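const pdf2html = require('pdf2html')

// Render page 1 of the PDF as a 160x226 PNG thumbnail
const options = { page: 1, imageType: 'png', width: 160, height: 226 }
pdf2html.thumbnail('sample.pdf', options, (err, thumbnailPath) => {
    if (err) {
        console.error('Conversion error: ' + err)
    } else {
        // Path of the generated thumbnail image
        console.log(thumbnailPath)
    }
})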

For more advanced manipulations, use node-poppler or Mozilla's pdf.js.
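
If you need a thumbnail for several pages, the callback-style call above can be wrapped in a promise and run once per page. This is only a sketch: `makeThumbnailer` and `thumbnailPages` are hypothetical names, not part of pdf2html, and the code assumes the thumbnail function has the `(file, options, callback)` shape shown in the snippet above.

```javascript
const { promisify } = require('util')

// Hypothetical helper (not part of pdf2html): takes any callback-style
// function with the shape (file, options, callback) and returns an
// async function that renders one thumbnail per requested page.
function makeThumbnailer(thumbnailFn) {
    const thumbnailAsync = promisify(thumbnailFn)

    return async function thumbnailPages(pdfPath, pages, baseOptions = {}) {
        const paths = []
        for (const page of pages) {
            // Pages are rendered one at a time, in the order given
            const path = await thumbnailAsync(pdfPath, { ...baseOptions, page })
            paths.push(path)
        }
        return paths
    }
}

// Possible usage, assuming pdf2html is installed:
// const pdf2html = require('pdf2html')
// const thumbnailPages = makeThumbnailer((file, opts, cb) => pdf2html.thumbnail(file, opts, cb))
// thumbnailPages('sample.pdf', [1, 2, 3], { imageType: 'png', width: 160, height: 226 })
//     .then((paths) => console.log(paths))
//     .catch((err) => console.error('Conversion error: ' + err))
```

Passing the thumbnail function in as an argument keeps the helper independent of the library, so it can be tried with a stub before running it against real PDFs.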

shebinleo commented 1 year ago

As @reregaga mentioned, this library doesn't extract images. Please feel free to open a PR if you would like to add this.
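
For anyone hitting the same symptom: if a PDF holds only images (which the file name jpg2pdf.pdf suggests, though that is a guess), the converted HTML will have little or no text in its body. A quick way to detect this case after conversion is sketched below. `isBodyBlank` is a hypothetical helper, not part of pdf2html, and it uses a simple regex-based tag strip rather than a real HTML parser.

```javascript
// Hypothetical helper (not part of pdf2html): returns true when the
// <body> of an HTML string contains no visible text.
function isBodyBlank(html) {
    // Use the <body> contents if present, otherwise the whole string
    const match = /<body[^>]*>([\s\S]*?)<\/body>/i.exec(html)
    const body = match ? match[1] : html

    const text = body
        .replace(/<[^>]*>/g, ' ')   // drop tags
        .replace(/&nbsp;/gi, ' ')   // treat non-breaking spaces as whitespace
        .trim()

    return text.length === 0
}

// Example with literal strings standing in for converted output:
console.log(isBodyBlank('<html><body><div class="page"><p/></div></body></html>'))
console.log(isBodyBlank('<html><body><p>Hello</p></body></html>'))
```

When the check reports a blank body, a page thumbnail or one of the tools mentioned above is probably the better route for that file.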