jlsutherland / doc2text

Detect text blocks and OCR poorly scanned PDFs in bulk. Python module available via pip.
MIT License

Error on doc.process() #14

Open rsteca opened 8 years ago

rsteca commented 8 years ago

When doing:

import doc2text
doc = doc2text.Document()
doc.read('something.pdf')
doc.process()

I get:

Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 23
dst is not a numpy array, neither a scalar
Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 197
dst is not a numpy array, neither a scalar
Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 77
dst is not a numpy array, neither a scalar

And then, when I do:

doc.extract_text()

I get:

AttributeError                            Traceback (most recent call last)
<ipython-input-5-57184997370d> in <module>()
----> 1 doc.extract_text()

/usr/local/lib/python2.7/dist-packages/doc2text/__init__.pyc in extract_text(self)
     89             for page in self.processed_pages:
     90                 new = page
---> 91                 text = new.extract_text()
     92                 self.page_content.append(text)
     93         else:

/usr/local/lib/python2.7/dist-packages/doc2text/page.pyc in extract_text(self)
     36     def extract_text(self):
     37         temp_path = 'text_temp.png'
---> 38         cv2.imwrite(temp_path, self.image)
     39         self.text = pytesseract.image_to_string(Image.open(temp_path))
     40         os.remove(temp_path)

AttributeError: Page instance has no attribute 'image'

remi-pr commented 8 years ago

If I am not mistaken, this is caused by using OpenCV 2.x rather than 3.0. In that case, cv2.resize() interprets the third positional argument as a destination array rather than as the interpolation method, which is what was intended. One fix, included in pull request #13, is to use keyword arguments (cv2.resize(foo, bar, interpolation=cv2.INTER_AREA) on line 77). This should make the code compatible with both OpenCV versions.
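
To illustrate the argument-binding problem without requiring OpenCV, here is a minimal sketch. resize_stub is a hypothetical stand-in that only mirrors the documented parameter order cv2.resize(src, dsize[, dst[, fx[, fy[, interpolation]]]]); it does not resize anything, and the placeholder image and size are assumptions for the example.

```python
# Hypothetical stand-in mirroring the documented parameter order of
# cv2.resize(src, dsize[, dst[, fx[, fy[, interpolation]]]]).
# It only reports how its arguments were bound; it performs no resizing.
INTER_LINEAR = 1  # same numeric value as cv2.INTER_LINEAR
INTER_AREA = 3    # same numeric value as cv2.INTER_AREA


def resize_stub(src, dsize, dst=None, fx=0, fy=0, interpolation=INTER_LINEAR):
    return {"dst": dst, "interpolation": interpolation}


image = [[0, 0], [0, 0]]  # placeholder for an image array
size = (1, 1)

# Positional third argument: the flag lands in the `dst` slot.
positional = resize_stub(image, size, INTER_AREA)

# Keyword argument: the flag reaches `interpolation` regardless of position.
keyword = resize_stub(image, size, interpolation=INTER_AREA)

print(positional)
print(keyword)
```

With the positional call, the integer flag is bound to dst, which matches the "dst is not a numpy array, neither a scalar" message above. Naming the argument sidesteps the ordering question entirely.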

The second error you are getting is probably a consequence of the first (self.image is not produced because the downscale_image call failed).
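
The chain from the first error to the second can be sketched as follows. This Page class is a hypothetical stand-in, not doc2text's actual code; it assumes, based on the "Error in ... on line ..." output above, that processing reports a failure and carries on rather than re-raising it.

```python
class Page:
    """Hypothetical stand-in, not doc2text's actual Page class."""

    def process(self):
        try:
            # The attribute is only assigned if the downscale step succeeds.
            self.image = self._downscale()
        except Exception as exc:
            # Assumed behaviour: report the failure and continue.
            print("Error:", exc)

    def _downscale(self):
        # Simulates the failing resize call from the first error.
        raise TypeError("dst is not a numpy array, neither a scalar")

    def extract_text(self):
        # Reads an attribute that was never set when process() failed.
        return self.image


page = Page()
page.process()

try:
    page.extract_text()
except AttributeError as exc:
    message = str(exc)
    print(message)
```

Because the first failure is only printed, the missing attribute surfaces later as an AttributeError in a different method, which is why the two errors look unrelated at first glance.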

jlsutherland commented 8 years ago

@remi-pr is correct. Merging #13 should fix part of the issue and make it easier to locate the source of errors in stack traces.