smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

Poll: Do you want an executable to get text from a PDF in the terminal? #636

Open k00ni opened 10 months ago

k00ni commented 10 months ago

I am curious whether there is a need for a (standalone) executable that extracts the text from a given PDF.

It would still be a PHP script, but it could be called from the terminal for shell-related tasks. Maybe something like the following?

# show text of PDF file
$ ./pdfparser/bin/get_text /foo/Bar.pdf

This is example text ...

or

# write raw text of PDF file into a file
$ ./pdfparser/bin/get_text /foo/Bar.pdf > pdf_text.txt

Running this command writes the extracted text of /foo/Bar.pdf to pdf_text.txt. One could also pipe the output straight into grep or similar tools to search it.
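
To illustrate the grep idea, here is a rough sketch. Since the executable does not exist yet, `get_text` below is only a hypothetical shell function standing in for `./pdfparser/bin/get_text`, and the text it prints is made up for the example.

```shell
# Hypothetical stand-in for the proposed ./pdfparser/bin/get_text script.
# It ignores its argument and prints fixed sample text (an assumption, not real output).
get_text() {
    printf '%s\n' 'This is example text ...' 'Another line without the keyword'
}

# Search the extracted text directly, without writing a temporary file.
get_text /foo/Bar.pdf | grep -i 'example'

# Count the matching lines instead of printing them.
get_text /foo/Bar.pdf | grep -c 'text'
```

With a real executable, the function definition would go away and the two pipelines would stay the same.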


If you need or want something like this, please react with :+1:, otherwise with :-1:. Comments and ideas are welcome.

Thank you for taking the time.

GreyWyvern commented 10 months ago

I switched from pdftotext to PdfParser specifically so that my search engine (which scans HTML and PDF files) could have an all-PHP solution instead of requiring a binary. But a binary might be useful in other situations.

I think the key argument for or against would be: can PdfParser do a better job than pdftotext? pdftotext is a pretty mature product: https://www.xpdfreader.com/pdftotext-man.html

Reqrefusion commented 9 months ago

It can really make things easier in some areas. A few times I have run into situations where I needed something like this.