virantha / pypdfocr

Python script to do PDF OCR conversion using Tesseract
Apache License 2.0

Fails to run on Mac OS High Sierra #80

Open christmasjumper opened 6 years ago

christmasjumper commented 6 years ago

Hi, getting the following when trying to run on HS:

Traceback (most recent call last):
  File "/usr/local/bin/pypdfocr", line 9, in <module>
    load_entry_point('pypdfocr==0.9.1', 'console_scripts', 'pypdfocr')()
  File "/Library/Python/2.7/site-packages/pypdfocr/pypdfocr.py", line 492, in main
    script.go(sys.argv[1:])
  File "/Library/Python/2.7/site-packages/pypdfocr/pypdfocr.py", line 474, in go
    self._convert_and_file_email(self.pdf_filename)
  File "/Library/Python/2.7/site-packages/pypdfocr/pypdfocr.py", line 480, in _convert_and_file_email
    ocr_pdffilename = self.run_conversion(pdf_filename)
  File "/Library/Python/2.7/site-packages/pypdfocr/pypdfocr.py", line 363, in run_conversion
    ocr_pdf_filename = self.pdf.overlay_hocr_pages(img_dpi, hocr_filenames, pdf_filename)
  File "/Library/Python/2.7/site-packages/pypdfocr/pypdfocr_pdf.py", line 145, in overlay_hocr_pages
    text_pdf_filename = self.overlay_hocr_page(dpi, hocr_filename, img_filename)
  File "/Library/Python/2.7/site-packages/pypdfocr/pypdfocr_pdf.py", line 245, in overlay_hocr_page
    self.add_text_layer(pdf,hocr_basename,pg_num,height,dpi)
  File "/Library/Python/2.7/site-packages/pypdfocr/pypdfocr_pdf.py", line 349, in add_text_layer
    para.drawOn(pdf, x*72/dpi, height - y*72/dpi)
  File "/Library/Python/2.7/site-packages/reportlab/platypus/flowables.py", line 113, in drawOn
    self._drawOn(canvas)
  File "/Library/Python/2.7/site-packages/reportlab/platypus/flowables.py", line 94, in _drawOn
    self.draw()#this is the bit you overload
  File "/Library/Python/2.7/site-packages/pypdfocr/pypdfocr_pdf.py", line 72, in draw
    Paragraph.draw(self)
  File "/Library/Python/2.7/site-packages/reportlab/platypus/paragraph.py", line 1717, in draw
    self.drawPara(self.debug)
  File "/Library/Python/2.7/site-packages/reportlab/platypus/paragraph.py", line 2093, in drawPara
    blPara = self.blPara

I have followed all the instructions to install the dependencies using brew, etc.

rmspeers commented 6 years ago

I see this same issue but it only happens on some PDFs. I am including a (very similar) stack trace below.

Traceback (most recent call last):
  File "/usr/local/bin/pypdfocr", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python2.7/dist-packages/pypdfocr/pypdfocr.py", line 492, in main
    script.go(sys.argv[1:])
  File "/usr/local/lib/python2.7/dist-packages/pypdfocr/pypdfocr.py", line 474, in go
    self._convert_and_file_email(self.pdf_filename)
  File "/usr/local/lib/python2.7/dist-packages/pypdfocr/pypdfocr.py", line 480, in _convert_and_file_email
    ocr_pdffilename = self.run_conversion(pdf_filename)
  File "/usr/local/lib/python2.7/dist-packages/pypdfocr/pypdfocr.py", line 363, in run_conversion
    ocr_pdf_filename = self.pdf.overlay_hocr_pages(img_dpi, hocr_filenames, pdf_filename)
  File "/usr/local/lib/python2.7/dist-packages/pypdfocr/pypdfocr_pdf.py", line 145, in overlay_hocr_pages
    text_pdf_filename = self.overlay_hocr_page(dpi, hocr_filename, img_filename)
  File "/usr/local/lib/python2.7/dist-packages/pypdfocr/pypdfocr_pdf.py", line 245, in overlay_hocr_page
    self.add_text_layer(pdf,hocr_basename,pg_num,height,dpi)
  File "/usr/local/lib/python2.7/dist-packages/pypdfocr/pypdfocr_pdf.py", line 349, in add_text_layer
    para.drawOn(pdf, x*72/dpi, height - y*72/dpi)
  File "/usr/local/lib/python2.7/dist-packages/reportlab/platypus/flowables.py", line 113, in drawOn
    self._drawOn(canvas)
  File "/usr/local/lib/python2.7/dist-packages/reportlab/platypus/flowables.py", line 94, in _drawOn
    self.draw()#this is the bit you overload
  File "/usr/local/lib/python2.7/dist-packages/pypdfocr/pypdfocr_pdf.py", line 72, in draw
    Paragraph.draw(self)
  File "/usr/local/lib/python2.7/dist-packages/reportlab/platypus/paragraph.py", line 1717, in draw
    self.drawPara(self.debug)
  File "/usr/local/lib/python2.7/dist-packages/reportlab/platypus/paragraph.py", line 2093, in drawPara
    blPara = self.blPara
AttributeError: RotatedPara instance has no attribute 'blPara'

rmspeers commented 6 years ago

CC @virantha is this something you could take a look at?

This is not limited to macOS; it also occurs on Ubuntu, at least. Based on the trace, I don't believe it is related to the OS.

I believe self.wrap() needs to be called earlier in the call chain, but I am unsure where or with which arguments.
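
To illustrate why the call order would matter, here is a simplified stand-in, not reportlab's actual code. `SketchParagraph` and its internals are hypothetical; the only assumption taken from the trace is that `blPara` is created during wrapping and read during drawing.

```python
class SketchParagraph:
    """Simplified stand-in for a flowable; NOT reportlab's implementation."""

    def __init__(self, text):
        self.text = text

    def wrap(self, avail_width, avail_height):
        # Layout happens here: the broken-line data is stored on the instance.
        self.blPara = self.text.split()
        return avail_width, avail_height

    def draw(self):
        # Drawing reads the layout data; it fails if wrap() never stored it.
        return " ".join(self.blPara)


para = SketchParagraph("hello world")

# Drawing before wrapping raises AttributeError, as in the trace above.
try:
    para.draw()
    failed_without_wrap = False
except AttributeError:
    failed_without_wrap = True

# After wrap() has run, draw() finds the attribute it needs.
para.wrap(100, 20)
drawn_after_wrap = para.draw()
```

If this model is right, any code path that reaches draw() without a successful wrap() would produce the same AttributeError.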

mrpg commented 6 years ago

I'm having the same problem on Arch Linux. From looking at the sources, no immediate fix comes to mind. Unfortunately, it also seems like this project is no longer being actively maintained.

Luuk3333 commented 6 years ago

I am also seeing a very similar stack trace on Debian 9.5:

  File "/usr/local/bin/pypdfocr", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python2.7/dist-packages/pypdfocr/pypdfocr.py", line 492, in main
    script.go(sys.argv[1:])
  File "/usr/local/lib/python2.7/dist-packages/pypdfocr/pypdfocr.py", line 474, in go
    self._convert_and_file_email(self.pdf_filename)
  File "/usr/local/lib/python2.7/dist-packages/pypdfocr/pypdfocr.py", line 480, in _convert_and_file_email
    ocr_pdffilename = self.run_conversion(pdf_filename)
  File "/usr/local/lib/python2.7/dist-packages/pypdfocr/pypdfocr.py", line 363, in run_conversion
    ocr_pdf_filename = self.pdf.overlay_hocr_pages(img_dpi, hocr_filenames, pdf_filename)
  File "/usr/local/lib/python2.7/dist-packages/pypdfocr/pypdfocr_pdf.py", line 145, in overlay_hocr_pages
    text_pdf_filename = self.overlay_hocr_page(dpi, hocr_filename, img_filename)
  File "/usr/local/lib/python2.7/dist-packages/pypdfocr/pypdfocr_pdf.py", line 245, in overlay_hocr_page
    self.add_text_layer(pdf,hocr_basename,pg_num,height,dpi)
  File "/usr/local/lib/python2.7/dist-packages/pypdfocr/pypdfocr_pdf.py", line 349, in add_text_layer
    para.drawOn(pdf, x*72/dpi, height - y*72/dpi)
  File "/usr/local/lib/python2.7/dist-packages/reportlab/platypus/flowables.py", line 113, in drawOn
    self._drawOn(canvas)
  File "/usr/local/lib/python2.7/dist-packages/reportlab/platypus/flowables.py", line 94, in _drawOn
    self.draw()#this is the bit you overload
  File "/usr/local/lib/python2.7/dist-packages/pypdfocr/pypdfocr_pdf.py", line 72, in draw
    Paragraph.draw(self)
  File "/usr/local/lib/python2.7/dist-packages/reportlab/platypus/paragraph.py", line 1717, in draw
    self.drawPara(self.debug)
  File "/usr/local/lib/python2.7/dist-packages/reportlab/platypus/paragraph.py", line 2093, in drawPara
    blPara = self.blPara
AttributeError: RotatedPara instance has no attribute 'blPara'

So far, every file I have tested has resulted in this error.

sirdavidwong commented 5 years ago

If you take a look at the release history of reportlab, a new release lines up with the start of this issue. I downgraded reportlab and got it working.

https://pypi.org/project/reportlab/#history

I only tried 3.4.0; it worked, so I didn't test any other versions.

Luuk3333 commented 5 years ago

Can confirm it's working with reportlab 3.4.0. I also tried 3.5.0, 3.5.1, 3.5.2, and 3.5.4 without success (same error as above).
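
For anyone who wants to check an environment before running, here is a rough sketch of a version test based only on the results reported in this thread. The helper names are hypothetical, it assumes Python 3.8+ for `importlib.metadata`, and it treats everything up to 3.4.0 as fine even though only 3.4.0 itself was tested here.

```python
from importlib import metadata

LAST_KNOWN_GOOD = (3, 4, 0)  # reported as working in this thread


def parse_version(version):
    """Turn '3.5.4' into (3, 5, 4); assumes purely numeric dotted parts."""
    return tuple(int(part) for part in version.split("."))


def is_known_good(version):
    """True if the version is at or below the last release reported to work."""
    return parse_version(version) <= LAST_KNOWN_GOOD


def installed_reportlab_version():
    """Return the installed reportlab version string, or None if absent."""
    try:
        return metadata.version("reportlab")
    except metadata.PackageNotFoundError:
        return None


installed = installed_reportlab_version()
if installed is None:
    print("reportlab is not installed")
elif is_known_good(installed):
    print("reportlab", installed, "is within the range reported to work")
else:
    print("reportlab", installed, "is newer than 3.4.0 and may hit this error")
```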

douglascrp commented 5 years ago

Same problem here on Ubuntu 16.04, and downgrading reportlab to 3.4.0 fixed it. I had to uninstall the installed version and then install the old version, as follows:

pip uninstall reportlab
pip install reportlab==3.4.0

f0rdprefect commented 5 years ago

Can confirm this fix, too. So the big question is what changed in reportlab that broke this. @christmasjumper, you should consider renaming your issue.

amixedcolor commented 2 years ago

In reportlab/platypus/paragraph.py, version 3.5.59:

I commented out the lines below (lines 1803 to 1807). After that, I could use Paragraph!

def wrap(self, availWidth, availHeight):
    # if availWidth<_FUZZ:
    #     #we cannot fit here
    #     return 0, 0x7fffffff
    # work out widths array for breaking

amixedcolor commented 2 years ago

_FUZZ is defined in reportlab/rl_settings; its value is 1e-6. I call wrap indirectly, via table.wrapOn. Come to think of it, most explanations describe availWidth, the second argument of wrap, as a coordinate of the table. That may be what causes the issue.

BobStein commented 1 year ago

This was happening on p.drawOn() after I had called p.wrap() with availWidth=0. In my case the paragraph text was an empty string anyway, which might have contributed to the zero width. This condition was silently ignored in version 3.4.0. With 3.6.12, it causes this freaky error:

File "/usr/local/lib/python3.10/dist-packages/reportlab/platypus/paragraph.py", line 2464, in drawPara
  blPara = self.blPara
AttributeError: 'Paragraph' object has no attribute 'blPara'

Thanks @amixedcolor for your explorations, which helped me get to the bottom of this.
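
For anyone hitting the same thing, a guard along these lines might help. It is only a sketch: `wrap_and_draw` is a hypothetical helper, `FUZZ` just mirrors the `_FUZZ` value quoted earlier in this thread, and it assumes the flowable exposes `wrap(availWidth, availHeight)` and `drawOn(canvas, x, y)`.

```python
FUZZ = 1e-6  # mirrors the _FUZZ value quoted earlier in this thread


def wrap_and_draw(flowable, canvas, x, y, avail_width, avail_height):
    """Wrap, then draw; skip drawing when the width is too small to lay out.

    Hypothetical helper. Returns True if the flowable was drawn.
    """
    if avail_width < FUZZ:
        # wrap() would bail out early for such a width, so a later
        # drawOn() would fail with the missing-blPara AttributeError.
        return False
    flowable.wrap(avail_width, avail_height)
    flowable.drawOn(canvas, x, y)
    return True
```

Skipping the draw is a design choice that suits empty paragraphs like mine; raising a clear ValueError instead may be better if a zero width indicates a layout bug.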