jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

command not found issue #272

Closed Arwa200 closed 4 years ago

Arwa200 commented 4 years ago

Hi

I just tried the command line:

curl "https://raw.githubusercontent.com/jsvine/pdfplumber/stable/examples/pdfs/background-checks.pdf" > background-checks.pdf

then:

pdfplumber < background-checks.pdf > background-checks.csv

but I got "command not found: pdfplumber" even though I have already installed pdfplumber.

Please see the attachment (Screen Shot 2020-09-14 at 11 59 19 AM).

samkit-jain commented 4 years ago

Hi @Arwa200 Could you please share the output of python -c "import pdfplumber"?


On a side note, it looks like you are using Python 2 without a virtual environment. Unless this is intentional, I would recommend switching to Python 3 (3.6+) and using a virtual environment.

Arwa200 commented 4 years ago

I'm using Python 3 (Screen Shot 2020-09-14 at 12 42 37 PM). I will try the virtual environment.

Arwa200 commented 4 years ago

Screen Shot 2020-09-14 at 1 01 23 PM

samkit-jain commented 4 years ago

Would it be possible for you to share the steps to reproduce this issue? Also, please share the OS, Python version, and library version.

For me, the steps are

$ python3.8 -m venv venv  # create a py38 virtual environment by the name "venv"
$ source venv/bin/activate  # activate the virtual environment
(venv) $ pip install pdfplumber  # install the latest version of the library
(venv) $ curl "https://raw.githubusercontent.com/jsvine/pdfplumber/stable/examples/pdfs/background-checks.pdf" > background-checks.pdf && pdfplumber < background-checks.pdf > background-checks.csv  # run the command

and it works fine

Arwa200 commented 4 years ago

I still get an empty file.

Screen Shot 2020-09-14 at 2 48 46 PM

Another question: where can I enter a keyword so that I can pull the whole row of information related to it?

thank you

samkit-jain commented 4 years ago

Could you please run with background-checks.pdf and not 2020-shipping-section.pdf because there is no such file at the path https://github.com/jsvine/pdfplumber/tree/stable/examples/pdfs?

where can I enter a keyword so that I can pull the whole row of information related to it?

Could you please elaborate a bit more on this?

Arwa200 commented 4 years ago

Sure, it worked, but my question is how to make it work with my own PDF. I mean, what if I want to search a PDF file for a specific keyword, like "ABC", and get the information related to that word?

samkit-jain commented 4 years ago

Hi @Arwa200 Could your question be rephrased as "How do I get the coordinates of a particular word in a PDF?" If yes, you can use the .extract_words() function for that. If you could share the PDF and the expected output, that'd be helpful.

Arwa200 commented 4 years ago

Yes, this is my question: how do I get the coordinates of a particular word in a PDF and save them to a CSV? I would rather not share the PDF since there are many, so would you please tell me how to add the PDF files to the path and how to use .extract_words() from the command line?

appreciate your efforts..

samkit-jain commented 4 years ago

@Arwa200 Try using the following

import pdfplumber

pdf = pdfplumber.open("file.pdf")
p = pdf.pages[0]

for word_dict in p.extract_words():
    if word_dict["text"] == "WordToFind":
        print(word_dict)
        break

You can find the coordinates in word_dict. Replace WordToFind with the word you want to find.
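
The loop above stops at the first match on the first page. As a minimal sketch, assuming you want every occurrence across all pages, the helper below collects the matching word dicts and tags each with its page number. The names find_word_matches, pages_words, and ignore_case are made up for illustration and are not part of pdfplumber; the only pdfplumber calls used are pdfplumber.open, pdf.pages, and page.extract_words.

```python
def find_word_matches(pages_words, keyword, ignore_case=True):
    """Return every word dict whose text equals keyword, tagged with its page number.

    pages_words is a list with one entry per page; each entry is the list of
    dicts returned by page.extract_words() for that page.
    """
    target = keyword.lower() if ignore_case else keyword
    matches = []
    for page_number, words in enumerate(pages_words, start=1):
        for word in words:
            text = word["text"].lower() if ignore_case else word["text"]
            if text == target:
                # Copy the word dict and add a hypothetical "page" key.
                matches.append({"page": page_number, **word})
    return matches


if __name__ == "__main__":
    # Third-party import kept here so the helper can be imported without pdfplumber.
    import pdfplumber

    # "file.pdf" and "WordToFind" are placeholders, as in the snippet above.
    with pdfplumber.open("file.pdf") as pdf:
        pages_words = [page.extract_words() for page in pdf.pages]

    for match in find_word_matches(pages_words, "WordToFind"):
        print(match)
```

Note that this compares whole words only; a keyword that spans several words would need a different approach.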

Arwa200 commented 4 years ago

I got these errors (Screen Shot 2020-09-15 at 9 48 43 AM).

Also, how do I save the results to a CSV in this case?

Arwa200 commented 4 years ago

I'm wondering where I can find the code above in the test folder or the pdfplumber folder.

Arwa200 commented 4 years ago

Also, I changed the URL, but there is still an error (Screen Shot 2020-09-15 at 10 27 01 AM).

samkit-jain commented 4 years ago

The error in https://github.com/jsvine/pdfplumber/issues/272#issuecomment-692504902 means you don't have pdfminer installed. Are you sure you have activated the correct virtual environment and running pip install pdfplumber didn't throw any error? Have you tried reinstalling pdfplumber? Have you tried running pip install pdfminer.six instead?

For https://github.com/jsvine/pdfplumber/issues/272#issuecomment-692523459, could you confirm if python -c "import pdfplumber" is running without any error?
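
Much of the confusion in this thread comes down to which interpreter and environment are actually in use. As a rough sketch (not an official pdfplumber tool), the standard-library-only script below reports the running Python version, whether a virtual environment is active, whether a pdfplumber executable is on PATH, and whether the pdfplumber and pdfminer packages can be found. The function name describe_environment is hypothetical.

```python
import importlib.util
import shutil
import sys


def describe_environment(modules=("pdfplumber", "pdfminer")):
    """Collect basic facts about the interpreter that is running this script."""
    report = {
        "python_version": sys.version.split()[0],
        "executable": sys.executable,
        # Inside a venv, sys.prefix points at the venv and differs from sys.base_prefix.
        "in_virtualenv": sys.prefix != sys.base_prefix,
        # None here means the shell would also report "command not found".
        "cli_on_path": shutil.which("pdfplumber"),
    }
    for name in modules:
        # find_spec returns None when the top-level package cannot be located.
        report["has_" + name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Run it with the same command you use for your own scripts (for example python or python3); if the two commands print different versions or executables, they are pointing at different installations.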

Arwa200 commented 4 years ago

Yes, pdfplumber is already there (Screen Shot 2020-09-15 at 12 42 44 PM).

python -c "import pdfplumber": Screen Shot 2020-09-15 at 12 43 01 PM

samkit-jain commented 4 years ago

Could you please switch to Python 3.6 or higher? It looks like you are on 2.7 and, if I am correct, support for it was dropped in pdfplumber 0.5.15. The latest version is 0.5.23.

Arwa200 commented 4 years ago

It's Python 3 (Screen Shot 2020-09-15 at 1 03 53 PM).

samkit-jain commented 4 years ago

Okay. Please run the following steps in order and share the output of each.

  1. Start a new terminal session.
  2. Create a new folder. mkdir test_folder.
  3. Switch to the new folder. cd test_folder.
  4. Create a new Python 3.8 virtual environment by the name "venv". python -m venv venv. I am assuming python by default refers to Python 3.8 on your system. If not, replace python with the correct keyword.
  5. Activate the virtual environment. source venv/bin/activate.
  6. Verify Python installation. python -V.
  7. Install pdfplumber. python -m pip install pdfplumber --no-cache-dir.
  8. Verify pdfplumber installation. python -c "import pdfplumber"
  9. Create a new Python file main.py and save the code in https://github.com/jsvine/pdfplumber/issues/272#issuecomment-692121431 in it (updating the file paths).
  10. Run the script python main.py.

Arwa200 commented 4 years ago

Hi

I just tried what you wrote. I got this error in python3; I do not know how to open the file.

Screen Shot 2020-09-16 at 11 12 45 AM

Screen Shot 2020-09-16 at 11 11 07 AM

samkit-jain commented 4 years ago

Thank you for trying out the steps, @Arwa200. Did you follow step 9? If you did and created a file main.py, was it in the root folder test_folder or somewhere else? The last step is failing because there is no file main.py in the folder test_folder.

Don't forget to update file.pdf with the correct PDF path.

Arwa200 commented 4 years ago

Thank you for being patient.. I tried it, and got this..

Screen Shot 2020-09-16 at 3 16 17 PM

Also, I put file.pdf in the same folder as main.py.

main.py is in the test_folder Screen Shot 2020-09-16 at 3 09 02 PM

Screen Shot 2020-09-16 at 3 13 28 PM

samkit-jain commented 4 years ago

No problem @Arwa200 I am here to help only :)

Could you run python main.py again, but this time use background-checks.pdf and not your file (don't forget to update the path)? If it works and throws no error, there might be a problem with your PDF. It could be corrupted. Perhaps repairing it via Ghostscript would help. You can repair the PDF by running

gs -o "output.pdf" -sDEVICE=pdfwrite input.pdf 

You would need to have Ghostscript installed. Without the PDF, there's not much I can do. Maybe you can redact sensitive information from the PDF and then share it?

samkit-jain commented 4 years ago

Searching around for PDFSyntaxError: No /Root object! - Is this really a PDF?, it looks like a lot of people had this issue and were using Windows and were able to solve by opening the file in binary mode.
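
Before repairing anything, it can help to confirm that the file is a PDF at all. A minimal sketch, assuming the only goal is a quick sanity check: open the file in binary mode and look for the %PDF- header marker near the start. The function name looks_like_pdf and the 1024-byte scan window are illustrative choices, and this is not the check pdfminer itself performs.

```python
def looks_like_pdf(path, scan_bytes=1024):
    """Return True if the PDF header marker appears near the start of the file."""
    # "rb" reads the raw bytes; text mode would try to decode them first.
    with open(path, "rb") as handle:
        head = handle.read(scan_bytes)
    return b"%PDF-" in head


if __name__ == "__main__":
    # "file.pdf" is a placeholder path.
    print(looks_like_pdf("file.pdf"))
```

A False result usually means the file is something else saved with a .pdf extension, for example an HTML error page written out by a download command that pointed at the wrong URL. A True result does not prove the file is undamaged.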

samkit-jain commented 4 years ago

Found a similar issue on pdfminer.six, https://github.com/pdfminer/pdfminer.six/issues/476. I repaired the PDF with Ghostscript and it worked fine.

Arwa200 commented 4 years ago

The result after trying your PDF

Screen Shot 2020-09-16 at 3 48 43 PM

samkit-jain commented 4 years ago

In the code below, you have to replace WordToFind (not text) with Iowa.

for word_dict in p.extract_words():
    if word_dict["text"] == "WordToFind":
        print(word_dict)
        break

It would become

for word_dict in p.extract_words():
    if word_dict["text"] == "Iowa":
        print(word_dict)
        break

Also, the fact that the code ran this far confirms that the issue stems from the PDF. I would recommend trying to repair it; that should resolve your issue.

Arwa200 commented 4 years ago

it worked with your file :) for both functions (word_dict/WordToFind )

Screen Shot 2020-09-16 at 4 05 26 PM

I will check the ghostscript you just mentioned..

But how can I save the result to a CSV?

Thank you again!

samkit-jain commented 4 years ago

For saving the data into a CSV file, you should look at the csv module in Python. A good tutorial can also be found here.

I am closing this issue now. If repairing the PDF does not solve the issue (in which case opening an issue on pdfminer.six GitHub would be more suited since pdfplumber uses it for parsing PDFs internally) or you need further assistance, feel free to reopen it.
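
For the CSV question, here is a minimal sketch using Python's csv module, assuming the rows are word dicts like those returned by .extract_words(), with text, x0, x1, top, and bottom keys. The function name write_words_csv, the column selection, and the optional page key are illustrative assumptions; keys not listed in FIELDS are ignored and missing ones are left blank.

```python
import csv

# Assumed column choice; adjust to the keys present in your word dicts.
FIELDS = ["page", "text", "x0", "x1", "top", "bottom"]


def write_words_csv(words, csv_path, fields=FIELDS):
    """Write one CSV row per word dict and return the number of rows written."""
    rows = list(words)
    # newline="" lets the csv module control line endings itself.
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        # extrasaction="ignore" skips dict keys that are not in fields.
        writer = csv.DictWriter(handle, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


if __name__ == "__main__":
    # Hand-written sample row standing in for real extract_words() output.
    sample = [{"text": "Iowa", "x0": 1.5, "x1": 2.5, "top": 3.0, "bottom": 4.0}]
    print(write_words_csv(sample, "words.csv"))
```

In main.py, you would collect the matching word_dict values in a list instead of printing them, then pass that list to the function.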