Closed Arwa200 closed 4 years ago
Hi @Arwa200, could you please share the output of python -c "import pdfplumber"?
Also, it looks like you are using Python 2 without a virtual environment. Unless this is intentional, I would recommend switching to Python 3 (3.6+) and using a virtual environment.
I'm using Python 3. I will try the virtual environment.
Would it be possible for you to share steps to reproduce this issue, along with the OS, Python version and library version?
For me, the steps are
$ python3.8 -m venv venv # create a py38 virtual environment by the name "venv"
$ source venv/bin/activate # activate the virtual environment
(venv) $ pip install pdfplumber # install the latest version of the library
(venv) $ curl "https://raw.githubusercontent.com/jsvine/pdfplumber/stable/examples/pdfs/background-checks.pdf" > background-checks.pdf && pdfplumber < background-checks.pdf > background-checks.csv # run the command
and it works fine
I still get the empty file.
Another question: where can I enter a keyword so that I can pull the whole row of information related to it?
Thank you.
Could you please run it with background-checks.pdf and not 2020-shipping-section.pdf? There is no such file at the path https://github.com/jsvine/pdfplumber/tree/stable/examples/pdfs.
where can I enter the keyword that I want to pull a whole raw information that has related to that row
Could you please elaborate a bit more on this?
Sure, it worked, but the question is how to make it work with my PDF. I mean, what if I want to search a PDF file for a specific keyword like "ABC" and get the information related to that word?
Hi @Arwa200, could your question be rephrased as "How do I get the coordinates of a particular word in a PDF?" If yes, you can use the .extract_words() function for that. If you could share the PDF and the expected output, that'd be helpful.
Yes, this is my question: how do I get the coordinates of a particular word in a PDF and save them into a CSV? I would rather not share the PDFs since there are many of them, so would you please tell me how to add the PDF files to the path and how to use .extract_words() from the command line?
I appreciate your efforts.
@Arwa200 Try using the following
import pdfplumber
pdf = pdfplumber.open("file.pdf")
p = pdf.pages[0]
for word_dict in p.extract_words():
    if word_dict["text"] == "WordToFind":
        print(word_dict)
        break
You can find the coordinates in word_dict. Replace WordToFind with the word you want to find.
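For the earlier question about pulling the whole row related to a keyword, one possible approach is to collect every word whose vertical position is close to that of the matched word. The sketch below is only an illustration: the helper name words_on_same_row and the tolerance value are hypothetical, and it assumes each word dict carries the "text", "x0" and "top" keys that .extract_words() returns. The sample list is made up so the snippet runs without a PDF.

```python
def words_on_same_row(words, keyword, tolerance=3.0):
    """Return the texts of all words on the same row as the first match of keyword.

    words is a list of dicts shaped like the output of page.extract_words().
    tolerance (in PDF points) is an assumed value; tune it for your document.
    """
    # Find the first word whose text equals the keyword, ignoring case.
    match = next((w for w in words if w["text"].lower() == keyword.lower()), None)
    if match is None:
        return []
    # Keep the words whose top coordinate is within the tolerance of the match.
    row = [w for w in words if abs(w["top"] - match["top"]) <= tolerance]
    # Order the row from left to right.
    row.sort(key=lambda w: w["x0"])
    return [w["text"] for w in row]


# Made-up sample data in the shape of extract_words() output.
sample_words = [
    {"text": "1234", "x0": 80.0, "x1": 110.0, "top": 100.5, "bottom": 110.5},
    {"text": "Iowa", "x0": 10.0, "x1": 40.0, "top": 100.0, "bottom": 110.0},
    {"text": "5678", "x0": 150.0, "x1": 180.0, "top": 99.8, "bottom": 109.8},
    {"text": "Idaho", "x0": 10.0, "x1": 45.0, "top": 120.0, "bottom": 130.0},
    {"text": "42", "x0": 80.0, "x1": 95.0, "top": 120.0, "bottom": 130.0},
]

if __name__ == "__main__":
    print(words_on_same_row(sample_words, "iowa"))
```

With a real PDF, sample_words would be replaced by p.extract_words() from the snippet above. Rows in scanned or loosely aligned documents may need a larger tolerance.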
I got these errors.
Also, how do I save the results to a CSV in this case?
I'm wondering where I can find the code above in the test folder or the pdfplumber folder.
I also changed the URL, but there is still an error.
The error in https://github.com/jsvine/pdfplumber/issues/272#issuecomment-692504902 means you don't have pdfminer installed. Are you sure you have activated the correct virtual environment and that running pip install pdfplumber didn't throw any error? Have you tried reinstalling pdfplumber? Have you tried running pip install pdfminer.six instead?
For https://github.com/jsvine/pdfplumber/issues/272#issuecomment-692523459, could you confirm that python -c "import pdfplumber" runs without any error?
Yes, pdfplumber is already there.
python -c "import pdfplumber":
Could you please switch to Python 3.6 or higher? Looks like you are on 2.7 and if I am correct, the support for it was dropped in pdfplumber 0.5.15. The latest version is 0.5.23
It's Python 3.
Okay. Please run the following steps in order and share the output of each.
mkdir test_folder
cd test_folder
python -m venv venv (I am assuming python by default refers to Python 3.8 on your system. If not, replace python with the correct keyword.)
source venv/bin/activate
python -V
python -m pip install pdfplumber --no-cache-dir
python -c "import pdfplumber"
Create a file main.py and save the code in https://github.com/jsvine/pdfplumber/issues/272#issuecomment-692121431 in it (updating the file paths).
python main.py
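Since much of this thread turns on which Python is actually running, a small check like the following may help confirm the interpreter and whether a virtual environment is active. This is a minimal sketch using only the standard library; the function name describe_interpreter is hypothetical. It detects environments created with python -m venv by comparing sys.prefix with sys.base_prefix.

```python
import sys


def describe_interpreter():
    """Collect basic facts about the interpreter that runs this script."""
    # Inside a venv, sys.prefix points at the environment and
    # sys.base_prefix at the Python installation it was created from.
    in_venv = sys.prefix != getattr(sys, "base_prefix", sys.prefix)
    return {
        "executable": sys.executable,
        "version": "%d.%d.%d" % sys.version_info[:3],
        "in_virtualenv": in_venv,
        "python3_6_or_newer": sys.version_info >= (3, 6),
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print("%s: %s" % (key, value))
```

Running it with the same python command used for main.py shows whether that command really points at the Python 3 interpreter inside venv.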
Hi
I just tried what you wrote and got this error in python3. I do not know how to open the file.
Thank you for trying out the steps, @Arwa200. Did you follow step 9? If you did and created a file main.py, was it in the root folder test_folder or somewhere else? The last step is failing because there is no file main.py in the folder test_folder.
Don't forget to update file.pdf with the correct PDF path.
Thank you for being patient. I tried it and got this.
Also, I put file.pdf in the same folder as main.py.
main.py is in test_folder.
No problem @Arwa200 I am here to help only :)
Could you run python main.py again, but this time use background-checks.pdf and not your file (don't forget to update the path)? If it works and throws no error, there might be a problem with your PDF. It could be corrupted. Perhaps repairing it via Ghostscript would help. You can repair the PDF by running
gs -o "output.pdf" -sDEVICE=pdfwrite input.pdf
You would need to have Ghostscript installed. Without the PDF, there's not much I can do. Maybe you can redact sensitive information from the PDF and then share it?
Searching around for "PDFSyntaxError: No /Root object! - Is this really a PDF?", it looks like a lot of people who had this issue were using Windows and were able to solve it by opening the file in binary mode.
Found a similar issue on pdfminer.six, https://github.com/pdfminer/pdfminer.six/issues/476, and I repaired the PDF with Ghostscript and it worked fine.
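Before attempting a repair, it can be worth checking whether the file is a PDF at all, for example an HTML page saved with a .pdf extension would trigger the same kind of error. The sketch below opens the file in binary mode and looks for the %PDF- marker near the start. The helper name looks_like_pdf is hypothetical, and passing this check does not guarantee that pdfminer can parse the file; a damaged PDF can still have a valid header.

```python
def looks_like_pdf(path, search_bytes=1024):
    """Return True if the %PDF- header appears in the first bytes of the file."""
    # Binary mode matters: text mode can alter or fail on raw PDF bytes.
    with open(path, "rb") as handle:
        head = handle.read(search_bytes)
    # The marker normally sits at offset 0, but some files carry a few
    # stray bytes before it, so search the whole chunk instead.
    return b"%PDF-" in head


if __name__ == "__main__":
    import sys

    for name in sys.argv[1:]:
        print(name, looks_like_pdf(name))
```

It can be run as python check_pdf.py file.pdf, where check_pdf.py is whatever name the snippet was saved under.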
The result after trying your PDF
In the code below, you have to replace WordToFind, and not text, with Iowa.
for word_dict in p.extract_words():
    if word_dict["text"] == "WordToFind":
        print(word_dict)
        break
It would become
for word_dict in p.extract_words():
    if word_dict["text"] == "Iowa":
        print(word_dict)
        break
Also, the fact that the code ran this far confirms that the issue stems from the PDF. I would recommend giving the repair a shot. That should resolve your issue.
It worked with your file :) for both functions (word_dict/WordToFind).
I will check the Ghostscript command you just mentioned.
But how can I save the result to a CSV?
Thank you again!
For saving the data into a CSV file, you should look at the csv module in Python. A good tutorial can also be found here.
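As a rough illustration of combining the csv module with the word search from earlier in this thread, the sketch below writes one CSV row per matching word. The function names (matching_rows, write_rows, export_keyword), the column order and the file names are all assumptions made for this example. Only pdfplumber.open, pdf.pages and .extract_words() come from the library, and pdfplumber is imported inside export_keyword so the two helpers can be tried without it.

```python
import csv

# Column order for the output file (an arbitrary choice for this sketch).
FIELDS = ["page", "text", "x0", "top", "x1", "bottom"]


def matching_rows(words, keyword, page_number):
    """Turn the word dicts that equal keyword into flat rows for the CSV."""
    return [
        {
            "page": page_number,
            "text": w["text"],
            "x0": w["x0"],
            "top": w["top"],
            "x1": w["x1"],
            "bottom": w["bottom"],
        }
        for w in words
        if w["text"] == keyword
    ]


def write_rows(rows, csv_path):
    """Write the rows to csv_path with a header line."""
    # newline="" is what the csv module documentation recommends for writers.
    with open(csv_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def export_keyword(pdf_path, keyword, csv_path):
    """Search every page of the PDF and save the matches; return their count."""
    import pdfplumber  # imported here so the helpers above work without it

    rows = []
    with pdfplumber.open(pdf_path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            rows.extend(matching_rows(page.extract_words(), keyword, number))
    write_rows(rows, csv_path)
    return len(rows)
```

A call such as export_keyword("background-checks.pdf", "Iowa", "iowa.csv") would then produce the CSV, where iowa.csv is a placeholder output name.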
I am closing this issue now. If repairing the PDF does not solve the issue (in which case opening an issue on the pdfminer.six GitHub would be more suitable, since pdfplumber uses it for parsing PDFs internally) or you need further assistance, feel free to reopen it.
Hi,
I just tried the command line: curl "https://raw.githubusercontent.com/jsvine/pdfplumber/stable/examples/pdfs/background-checks.pdf" > background-checks.pdf then: pdfplumber < background-checks.pdf > background-checks.csv
but I got "command not found: pdfplumber" even though I have already installed pdfplumber.
Please see the attachment.
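A "command not found" message after a successful install often means the package went into a different environment than the one the shell is using, or that the folder holding installed scripts is not on PATH. Whether that is the cause here cannot be confirmed without the attachment. The sketch below, using only the standard library, reports where the running interpreter installs scripts and whether a given command is visible on PATH; the function names are hypothetical.

```python
import shutil
import sysconfig


def locate_command(name):
    """Return the full path of a command found on PATH, or None if absent."""
    return shutil.which(name)


def scripts_directory():
    """Return the folder where this interpreter installs console scripts."""
    return sysconfig.get_path("scripts")


if __name__ == "__main__":
    found = locate_command("pdfplumber")
    print("pdfplumber on PATH:", found if found else "not found")
    print("scripts are installed to:", scripts_directory())
```

If the command is not found but the scripts folder contains a pdfplumber executable, activating the virtual environment, or adding that folder to PATH, is a reasonable next thing to try.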