rockyzhengwu / rspdf

PDF library in Rust
Apache License 2.0

Feature request: extract plaintext / markdown #1

Open joepio opened 9 months ago

joepio commented 9 months ago

Hi there! Thanks for creating and sharing this :)

One quite common use case for PDF libraries is getting the text from a PDF. This is often used for things like indexing documents in a search engine. There is a Rust project that does this, called pdf-extract, but I'd love to see an alternative to it (for a couple of reasons).

I noticed rspdf has a way to extract XML text from a PDF. I was wondering whether it would also be possible to extract content as plaintext? Or even better: extract it as markdown!

Perhaps this is completely out of scope for the project. Maybe I could help out with this someday (have some plans in this regard) if you think it may be a good fit.

Cheers!

rockyzhengwu commented 9 months ago

Thanks, I'm glad you're interested in this project.

This project is primarily centered on extracting text and images and converting them to other formats; for now, that means plain text. Markdown support is a potential future addition.

However, there are numerous bugs to address, particularly related to fonts. Consequently, the timeline for completion is uncertain.