pdfminer can't extract text from some PDF files but pypdf can?

pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF

https://pdfminersix.readthedocs.io

MIT License

pdfminer can't extract text from some PDF files but pypdf can? #841

Open ramtalentrecruit opened 1 year ago

ramtalentrecruit commented 1 year ago

Feature request

Thanks for your suggestion on improving pdfminer.six. To help us discuss and implement this request, please make sure to include the following information:

There are a few types of PDF files that contain very detailed information and come in different styles.
These PDF files contain images, but their text can be extracted without OCR, which is why pypdf can extract information from them. pdfminer.six cannot extract the text from these same files.
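
The comparison described above could be reproduced with a short script along these lines. This is only a sketch, assuming both pdfminer.six and pypdf are installed; `compare_extractors` and the two wrapper functions are hypothetical helper names, not part of either library. It uses `pdfminer.high_level.extract_text` and `pypdf.PdfReader` and reports how many non-whitespace characters each one returns for the same file.

```python
import sys
from typing import Callable, Dict, Optional


def extract_with_pdfminer(path: str) -> str:
    # Imported lazily so the comparison helper works even if pdfminer.six is missing.
    from pdfminer.high_level import extract_text
    return extract_text(path)


def extract_with_pypdf(path: str) -> str:
    # Imported lazily for the same reason as above.
    from pypdf import PdfReader
    reader = PdfReader(path)
    # Treat a page with no extractable text as an empty string.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def compare_extractors(
    path: str, extractors: Dict[str, Callable[[str], str]]
) -> Dict[str, Optional[int]]:
    """Count the non-whitespace characters each extractor returns (hypothetical helper).

    A value of None means the extractor raised an exception.
    """
    counts: Dict[str, Optional[int]] = {}
    for name, extract in extractors.items():
        try:
            text = extract(path)
        except Exception as exc:  # report the failure and continue with the next extractor
            print(f"{name} failed: {exc!r}")
            counts[name] = None
            continue
        counts[name] = sum(1 for ch in text if not ch.isspace())
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    print(compare_extractors(
        sys.argv[1],
        {"pdfminer.six": extract_with_pdfminer, "pypdf": extract_with_pypdf},
    ))
```

Running it with the path of a problem file as the only argument prints one count per library. A count of zero from one library and a large count from the other would support the report, and posting those numbers (plus the pdfminer.six version) gives maintainers something concrete even when the file itself cannot be shared.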

vilabho commented 1 year ago

Could you provide these PDF files here? Also, did those PDFs contain only images and no text? If so, how did you conclude that OCR was not used and the text was still extracted?

mrm202 commented 1 year ago

Thanks for your response. As I said, pypdf extracted text from those files; they contain both images and text. The task is to extract the text, not the images. I can't provide those files here, but I will be very happy to share them by email. You can send an email here

vilabho commented 1 year ago

I have sent an email. Kindly share your files there.

mrm202 commented 1 year ago

I didn't get your email id. Can you send again please at this email id? mularamiit@gmail.com

vilabho commented 1 year ago

I have sent the reply again to the email address mentioned above. Please check your Spam/Junk folder as well.