opengovsg / pdf2md

A PDF to Markdown converter
https://www.npmjs.com/package/@opendocsg/pdf2md
MIT License
210 stars 40 forks

Extracting visuals and/or embedded images as images #58

Open dylans opened 2 years ago

dylans commented 2 years ago

Mozilla's PDF.js generates a canvas view, which makes it easier to retain styles and layout. This is not really what a Markdown converter should do.

That said, I've been wondering whether there's a decent way to either extract embedded images and include them as inline encoded images in the Markdown output, or optionally use a headless version of the canvas render to embed images of the original pages from the PDF. Either could be included inline as base64 images.
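To make the "inline as base64" idea concrete, here is a minimal sketch of the last step only: turning image bytes into a Markdown image that carries its data in a `data:` URI. It assumes the bytes have already been obtained somehow (from an embedded image or a rendered page); `toBase64` and `inlineImageMarkdown` are hypothetical names for illustration and are not part of pdf2md or PDF.js.

```typescript
// Hypothetical helpers for illustration; not part of pdf2md or PDF.js.
const B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Standard base64 (RFC 4648) encoding of a byte array, with "=" padding.
function toBase64(bytes: Uint8Array): string {
  let out = "";
  for (let i = 0; i < bytes.length; i += 3) {
    const has1 = i + 1 < bytes.length;
    const has2 = i + 2 < bytes.length;
    const b0 = bytes[i];
    const b1 = has1 ? bytes[i + 1] : 0;
    const b2 = has2 ? bytes[i + 2] : 0;
    out += B64[b0 >> 2];
    out += B64[((b0 & 0x03) << 4) | (b1 >> 4)];
    out += has1 ? B64[((b1 & 0x0f) << 2) | (b2 >> 6)] : "=";
    out += has2 ? B64[b2 & 0x3f] : "=";
  }
  return out;
}

// Wraps image bytes in a Markdown image whose source is a base64 data URI.
// Square brackets are dropped from the alt text so they cannot break the tag.
function inlineImageMarkdown(
  bytes: Uint8Array,
  mimeType: string,
  altText: string = ""
): string {
  const safeAlt = altText.replace(/[\[\]]/g, "");
  return `![${safeAlt}](data:${mimeType};base64,${toBase64(bytes)})`;
}

// Usage: the first four bytes of a PNG signature, standing in for a real image.
const sample = new Uint8Array([0x89, 0x50, 0x4e, 0x47]);
console.log(inlineImageMarkdown(sample, "image/png", "page 1"));
```

In Node, `Buffer.from(bytes).toString("base64")` would normally replace the hand-written encoder; it is spelled out here only to keep the sketch dependency-free. Inline data URIs can make the output very large, and some Markdown renderers refuse to display them, so this would probably need to be opt-in.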

  1. Would this be useful here or is this outside the scope of what this project wants to do?
  2. Is there a better way to achieve what I'm describing?
  3. If there's interest in what I've described, I'm happy to do the bulk of the work to make it happen, but I'd appreciate some guidance so we end up with a PR that meets the project's expectations.

LoneRifle commented 2 years ago

This would indeed be useful and within the project scope. I don't think there is a better approach, and you are free to work through this!

galleon commented 10 months ago

Hi @dylans, any update on this?

rightpossible commented 1 month ago

Any updates??