virantha / pypdfocr

Python script to do PDF OCR conversion using Tesseract
Apache License 2.0

xml parse error on converting a non-searchable pdf to searchable pdf #8

Closed rajatdutta closed 10 years ago

rajatdutta commented 10 years ago

Hi, I am trying to convert a PDF and I get the following error:

pypdfocr -d test.pdf
Starting conversion of test.pdf
Running ghostscript on test.pdf to create test.tiff
gs -q -dNOPAUSE -sDEVICE=tiff24nc -r300 -sOutputFile="test.tiff" "test.pdf" -c quit
Created test.tiff
gs -q -dNOPAUSE -sDEVICE=jpeg -dJPEGQ=75 -r200 -sOutputFile="test%d.jpg" "test.pdf" -c quit
Running OCR on test.tiff to create test.html
tesseract "test.tiff" "test" hocr
Tesseract Open Source OCR Engine v3.02 with Leptonica
Page 0
Page 1
Created test.html
hocr_filename:test.html, hocr_dir:, hocr_basename:test.html
Overlaying hocr and creating final test_ocr.pdf
Analyzing OCR and applying text to PDF...
Searching for test*.jpg
Adding page image test1.jpg
Page width=612.000000, height=792.000000
Adding text to page 1
Traceback (most recent call last):
  File "/usr/local/bin/pypdfocr", line 8, in <module>
    load_entry_point('pypdfocr==0.5.4', 'console_scripts', 'pypdfocr')()
  File "/usr/local/lib/python2.7/dist-packages/pypdfocr/pypdfocr.py", line 399, in main
    script.go(sys.argv[1:])
  File "/usr/local/lib/python2.7/dist-packages/pypdfocr/pypdfocr.py", line 389, in go
    ocr_pdffilename = self.run_conversion(self.pdf_filename)
  File "/usr/local/lib/python2.7/dist-packages/pypdfocr/pypdfocr.py", line 300, in run_conversion
    ocr_pdf_filename = self.pdf.overlay_hocr(tiff_dpi, hocr_filename)
  File "/usr/local/lib/python2.7/dist-packages/pypdfocr/pypdfocr_pdf.py", line 95, in overlay_hocr
    self.add_text_layer(pdf, hocr_basename,pg_num,height,dpi)
  File "/usr/local/lib/python2.7/dist-packages/pypdfocr/pypdfocr_pdf.py", line 131, in add_text_layer
    hocr.parse(hocrfile)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 656, in parse
    parser.feed(data)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1643, in feed
    self._raiseerror(v)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1507, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 233, column 1051

I have checked the generated HTML, and there is indeed a special character on the line mentioned. How do I handle that? Or is there a better solution?

virantha commented 10 years ago

Can you send me an example pdf that generates this error? I'll take a look, but this error is from an external xml library, so it might not be trivial to work around.

rajatdutta commented 10 years ago

Thanks a lot for the quick reply. Here is the link to the test PDF: "http://www.filedropper.com/test_20"

virantha commented 10 years ago

Hmm, that file works fine for me on Mac OS X. Can you tell me which OS, and Tesseract version you're using?

rajatdutta commented 10 years ago

I am on Linux (Ubuntu 12.10), with "Tesseract Open Source OCR Engine v3.02".

virantha commented 10 years ago

OK, I'm going to have to boot up a VM for this later today; I'll let you know how it goes. Can you also paste in a few lines around the offending character in your html/hocr file?
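For readers who want to pull out those lines themselves, here is a minimal sketch (Python 3, standard library only; the thread itself runs Python 2.7). It re-parses the hOCR text and uses the position carried by `ParseError` to show what sits near the reported line and column. The function name `show_parse_error_context` and the `window` parameter are made up for illustration and are not part of pypdfocr. The reported column may not line up exactly with a character index when the line contains multi-byte characters, so treat the snippet as a neighbourhood and not an exact pointer.

```python
import xml.etree.ElementTree as ET


def show_parse_error_context(xml_text, window=20):
    """Return (line, column, snippet) for the first parse error, or None.

    The snippet is the repr() of the text near the reported column, so
    control characters are shown as visible escape sequences.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line_no, column = err.position  # the line number is 1-based
        # Split on "\n" only; str.splitlines() would also split on other
        # separators and could shift the line count.
        lines = xml_text.split("\n")
        bad_line = lines[line_no - 1] if 0 < line_no <= len(lines) else ""
        start = max(0, column - window)
        snippet = bad_line[start:column + window]
        return line_no, column, repr(snippet)
    return None
```

Calling it on the contents of `test.html` would return the same line and column that appear in the traceback, plus the surrounding characters in escaped form.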

rajatdutta commented 10 years ago

The following is the HTML snippet where it fails:

#####################################################

DATE or POSTING: ll 2 1 /2 g / <90>gk^B . 1â<80><9d> ,2â<80><99> '- Gg^B 15-"? â<80><9d> "57

Link to a test HTML file with special chars: "http://www.filedropper.com/check"

#####################################################
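If the `^B` in the snippet is a literal control character (0x02), that alone would explain the "invalid token" error, since XML 1.0 does not allow it. As a possible workaround on the parsing side, and not something pypdfocr is known to do, here is a rough Python 3 sketch that drops characters outside the XML 1.0 character range before handing the text to ElementTree. The helper names `strip_invalid_xml_chars` and `parse_hocr_leniently` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Characters allowed by the XML 1.0 "Char" production; anything else is removed.
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text):
    """Remove characters that an XML 1.0 parser would reject."""
    return _INVALID_XML_CHARS.sub("", text)


def parse_hocr_leniently(raw_bytes):
    """Decode hOCR bytes, strip invalid characters, and parse the result."""
    # Bytes that are not valid UTF-8 become U+FFFD instead of raising.
    text = raw_bytes.decode("utf-8", errors="replace")
    return ET.fromstring(strip_invalid_xml_chars(text))
```

This discards the bad characters instead of repairing them, so the recognized text for that word stays as garbled as the OCR made it; it only keeps the parser from failing.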

virantha commented 10 years ago

Can you run 'tesseract -v' and confirm that you are running 3.02.02? Earlier versions, including 3.02.01 and 3.02, can generate invalid characters in the hOCR.
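That version check can also be scripted. The sketch below (Python 3) pulls the first dotted version number out of a banner string and compares it with 3.02.02. It assumes the banner contains the version in that dotted form and that the Tesseract version is the first such number printed; the function names are invented for this example.

```python
import re
import subprocess

MIN_VERSION = (3, 2, 2)  # 3.02.02, the release named in this thread


def parse_tesseract_version(banner):
    """Return the first dotted version number in banner as a tuple of ints."""
    match = re.search(r"(\d+(?:\.\d+)+)", banner)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def installed_tesseract_version():
    """Run 'tesseract -v' and parse whatever banner it prints."""
    # Some releases print the version to stderr, so read both streams.
    proc = subprocess.run(["tesseract", "-v"], capture_output=True, text=True)
    return parse_tesseract_version(proc.stdout + proc.stderr)


def is_new_enough(version, minimum=MIN_VERSION):
    """True if version is known and at least the minimum."""
    return version is not None and version >= minimum
```

With this comparison, a banner reporting plain 3.02 parses to `(3, 2)`, which sorts below `(3, 2, 2)` and is therefore flagged as too old.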

On Wed, Jan 15, 2014 at 1:26 PM, rajatdutta notifications@github.com wrote:

Following is the html snippet where it fails :

#####################################################

DATE or POSTING: ll 2 1 /2 g / gk^B . ,2â '- Gg^B\ 15-"? â "57

#####################################################


rajatdutta commented 10 years ago

My version of tesseract is 3.02. I'll try updating it to 3.02.02 and then process the PDF again.

virantha commented 10 years ago

Any updates? I can't reproduce this on Windows 7 64-bit with your test PDF; everything seems fine on my end.

rajatdutta commented 10 years ago

Hi, it worked after I changed the tesseract version to 3.02.02. Thanks for the help.

On Friday, January 17, 2014, virantha notifications@github.com wrote:

Any updates? I can't reproduce running on Windows 7 64-bit on your test pdf; everything seems fine on my end.


virantha commented 10 years ago

Great! I'll mark this issue as closed now.