Closed rajatdutta closed 10 years ago
Can you send me an example pdf that generates this error? I'll take a look, but this error is from an external xml library, so it might not be trivial to work around.
Thanks a lot for the quick reply. Here is the link for the test PDF: "http://www.filedropper.com/test_20"
Hmm, that file works fine for me on Mac OS X. Can you tell me which OS and Tesseract version you're using?
I am on Linux (Ubuntu 12.10) with "Tesseract Open Source OCR Engine v3.02".
OK, I'm going to have to boot up a VM for this later today; I'll let you know how it goes. Can you also paste in a few lines around the offending character in your html/hocr file?
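One way to pull out the text around an offending character is to slice the reported line around the reported column. The following is a minimal Python 3 sketch, not part of pypdfocr; the helper name `context_around` is hypothetical, and it assumes the parser reports a 1-based line number and a 0-based column counted in bytes, which may not hold exactly for multi-byte text.

```python
def context_around(data, line, column, width=40):
    """Return the bytes surrounding (line, column) in an hOCR document.

    `data` is the raw file content as bytes. `line` is assumed to be
    1-based and `column` 0-based, as in an ElementTree ParseError message.
    """
    lines = data.split(b"\n")
    bad_line = lines[line - 1]
    start = max(0, column - width)
    return bad_line[start:column + width]
```

Printing `repr(context_around(data, 233, 1051))` on the raw file contents makes control characters such as `\x02` visible instead of leaving them to be rendered by the terminal.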
Following is the html snippet where it fails :
#####################################################
DATE or POSTING: ll 2 1 /2 g / <90>gk^B . 1â<80><9d> ,2â<80><99> '- Gg^B 15-"? â<80><9d> "57
Link to a test HTML file with the special characters: "http://www.filedropper.com/check"
#####################################################
Can you run 'tesseract -v' and confirm that you are running 3.02.02? Earlier versions, including 3.02.01 and 3.02 can generate invalid characters in the hocr.
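The version check can also be scripted. This is a minimal Python 3 sketch under stated assumptions: the helper names are hypothetical, the input is whatever text `tesseract -v` (or the OCR banner) printed, and the first dotted number in that text is taken to be the Tesseract version. The 3.02.02 threshold comes from the comment above.

```python
import re


def parse_tesseract_version(banner):
    """Extract a version tuple such as (3, 2, 2) from banner text."""
    match = re.search(r"(\d+(?:\.\d+)+)", banner)
    if match is None:
        raise ValueError("no version number found in %r" % banner)
    return tuple(int(part) for part in match.group(1).split("."))


def has_hocr_fix(banner, minimum=(3, 2, 2)):
    """Return True if the reported version is at least `minimum`."""
    version = parse_tesseract_version(banner)
    # Pad short versions with zeros so that "3.02" compares as 3.02.00.
    padded = version + (0,) * (len(minimum) - len(version))
    return padded >= minimum
```

With the banner from this report, `has_hocr_fix("Tesseract Open Source OCR Engine v3.02 with Leptonica")` is false, while `has_hocr_fix("tesseract 3.02.02")` is true.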
My Tesseract version is 3.02. I'll try updating it to 3.02.02 and then process the PDF again.
Any updates? I can't reproduce running on Windows 7 64-bit on your test pdf; everything seems fine on my end.
Hi, it worked after I updated Tesseract to 3.02.02. Thanks for the help.
Great! I'll mark this issue as closed now.
Hi, I am trying to convert a PDF and encounter the following error:
pypdfocr -d test.pdf
Starting conversion of test.pdf
Running ghostscript on test.pdf to create test.tiff
gs -q -dNOPAUSE -sDEVICE=tiff24nc -r300 -sOutputFile="test.tiff" "test.pdf" -c quit
Created test.tiff
gs -q -dNOPAUSE -sDEVICE=jpeg -dJPEGQ=75 -r200 -sOutputFile="test%d.jpg" "test.pdf" -c quit
Running OCR on test.tiff to create test.html
tesseract "test.tiff" "test" hocr
Tesseract Open Source OCR Engine v3.02 with Leptonica
Page 0
Page 1
Created test.html
hocr_filename:test.html, hocr_dir:, hocr_basename:test.html
Overlaying hocr and creating final test_ocr.pdf
Analyzing OCR and applying text to PDF...
Searching for test*.jpg
Adding page image test1.jpg
Page width=612.000000, height=792.000000
Adding text to page 1
Traceback (most recent call last):
File "/usr/local/bin/pypdfocr", line 8, in <module>
load_entry_point('pypdfocr==0.5.4', 'console_scripts', 'pypdfocr')()
File "/usr/local/lib/python2.7/dist-packages/pypdfocr/pypdfocr.py", line 399, in main
script.go(sys.argv[1:])
File "/usr/local/lib/python2.7/dist-packages/pypdfocr/pypdfocr.py", line 389, in go
ocr_pdffilename = self.run_conversion(self.pdf_filename)
File "/usr/local/lib/python2.7/dist-packages/pypdfocr/pypdfocr.py", line 300, in run_conversion
ocr_pdf_filename = self.pdf.overlay_hocr(tiff_dpi, hocr_filename)
File "/usr/local/lib/python2.7/dist-packages/pypdfocr/pypdfocr_pdf.py", line 95, in overlay_hocr
self.add_text_layer(pdf, hocr_basename,pg_num,height,dpi)
File "/usr/local/lib/python2.7/dist-packages/pypdfocr/pypdfocr_pdf.py", line 131, in add_text_layer
hocr.parse(hocrfile)
File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 656, in parse
parser.feed(data)
File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1643, in feed
self._raiseerror(v)
File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1507, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 233, column 1051
I have checked the generated HTML and, yes, there is a special character on the line mentioned. How do I handle that? Or is there a better solution?
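If upgrading Tesseract is not an option, one possible workaround is to clean the hOCR before handing it to ElementTree. The sketch below is Python 3 and is not how pypdfocr itself handles the file; `parse_hocr_leniently` is a hypothetical helper. It assumes the hOCR is meant to be UTF-8, replaces undecodable bytes, and drops the control characters that XML 1.0 does not allow (everything below 0x20 except tab, line feed, and carriage return).

```python
import re
import xml.etree.ElementTree as ET

# Control characters that are not legal in an XML 1.0 document.
_XML_INVALID_CONTROLS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def parse_hocr_leniently(raw_bytes):
    """Parse hOCR bytes after removing characters that break the XML parser."""
    # Undecodable bytes become U+FFFD, which is a legal XML character.
    text = raw_bytes.decode("utf-8", errors="replace")
    cleaned = _XML_INVALID_CONTROLS.sub("", text)
    # Re-encode so any UTF-8 encoding declaration in the file stays accurate.
    return ET.fromstring(cleaned.encode("utf-8"))
```

A caller would read the file in binary mode, for example `parse_hocr_leniently(open("test.html", "rb").read())`. The OCR text loses the stripped characters, which is usually acceptable since they were recognition noise to begin with.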