tshrinivasan / OCR4wikisource

OCR for WikiSource using Google Drive OCR
GNU General Public License v2.0

empty text #99

Closed bodhisattwawiki closed 6 years ago

bodhisattwawiki commented 6 years ago

OCR of this book delivered empty text. I am using Ubuntu 16.04. Same happened with Ravi's fork.

bodhisattwawiki commented 6 years ago

I installed this script on another computer running Ubuntu 14.04 and ran OCR on this test file. It still delivers empty text.

jayantanth commented 6 years ago

do_ocr_2018-02-25-02-35-19_log.txt all_text_for_Testocrbengali.djvu.txt

Same issue from my machine.

jayantanth commented 6 years ago

I have checked a Bengali book and a Tamil book. The txt file is not exported properly in either case: the txt file is empty. Please fix this issue ASAP, because OCR work has stopped until it is fixed.

jayantanth commented 6 years ago

For Bengali Wikisource, PDF files are not working. If we use an image file (png, jpg, jpeg) instead of a PDF, it works fine.

jayantanth commented 6 years ago

@tshrinivasan please create the script for JPG and I will test it on a small PDF/DJVU file. When I use the ImageMagick convert command on a 50 MB file, it crashes, but below 50 MB it converts well. I am trying this here: https://github.com/jayantanth/OCR4wikisource/blob/master/doocr.py , but it is not working.

tshrinivasan commented 6 years ago

I am also trying with imagemagick.

But, it crashes the system.

I am trying to add a time delay after each page conversion.

Will update soon.
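
A rough sketch of page-by-page conversion with a pause between pages, in Python. It assumes ImageMagick's `convert` is on the PATH and relies on its `file.pdf[N]` page-index syntax. The function names, the 300 dpi density, the 2-second delay, and the `page_00001.jpg` naming are illustrative assumptions, not what do_ocr.py does.

```python
import subprocess
import time


def page_command(pdf_path, page_index, density=300):
    """Build an ImageMagick command that renders one PDF page to JPG.

    page_index is zero-based, as in ImageMagick's file.pdf[N] syntax.
    The output name pattern (page_00001.jpg, ...) is an assumption.
    """
    out_name = "page_%05d.jpg" % (page_index + 1)
    return ["convert", "-density", str(density),
            "%s[%d]" % (pdf_path, page_index), out_name]


def convert_with_delay(pdf_path, page_count, delay_seconds=2.0,
                       run=subprocess.check_call, sleep=time.sleep):
    """Convert pages one at a time, pausing after each page.

    run and sleep are parameters only so the loop can be exercised
    without ImageMagick installed.
    """
    for index in range(page_count):
        run(page_command(pdf_path, index))
        sleep(delay_seconds)
```

Converting one page per process keeps only a single page in memory at a time, which may help with the crashes on large files, though that has not been measured here.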

jayantanth commented 6 years ago

Hi @tshrinivasan could you please try

gs -q -DNOPAUSE -DBATCH -r400 -SDEVICE=a4 -sOutputFile=abcd%d.jpg abcd.pdf

The rate is about 20 pages per minute, and the system did not crash on a 200 MB file for me.

tshrinivasan commented 6 years ago

gs -q -DNOPAUSE -DBATCH -r400 -SDEVICE=a4 -sOutputFile=a%d.jpg a.pdf
Unknown device: a4
./base/gsicc_manage.c:1088: gsicc_open_search(): Could not find default_gray.icc
| ./base/gsicc_manage.c:1708: gsicc_set_device_profile(): cannot find device profile
Unrecoverable error: unknownerror in .special_op
Operand stack: defaultdevice
Unrecoverable error: undefined in .uninstallpagedevice
Operand stack: defaultdevice

Got above error.

But converting to png works

gs -q -DNOPAUSE -DBATCH -r400 -SDEVICE=png16m -sOutputFile=a%d.png a.pdf

this is working fine.

Will test more for performance and crashing.

jayantanth commented 6 years ago

Sorry, the correct command is below:

gs -q -DNOPAUSE -DBATCH -r400 -SDEVICE=jpeg -sPAPERSIZE=a4 -sOutputFile=abcd%d.jpg abcd.pdf
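
For calling this from a Python script, a minimal sketch follows. It assumes `gs` is on the PATH; the function names and the default `page_%05d.jpg` pattern are hypothetical and not taken from do_ocr.py. The options mirror the command above, written with the lowercase `-d`/`-s` spellings that the Ghostscript documentation normally shows.

```python
import subprocess


def gs_jpeg_command(pdf_path, output_pattern, resolution=400):
    """Return the Ghostscript argument list for PDF -> one JPEG per page.

    Mirrors the command above: jpeg device, A4 paper size, 400 dpi.
    output_pattern needs a %d-style placeholder for the page number.
    """
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH",
        "-r%d" % resolution,
        "-sDEVICE=jpeg",
        "-sPAPERSIZE=a4",
        "-sOutputFile=%s" % output_pattern,
        pdf_path,
    ]


def pdf_to_jpegs(pdf_path, output_pattern="page_%05d.jpg"):
    """Run Ghostscript; raises CalledProcessError if gs exits non-zero."""
    subprocess.check_call(gs_jpeg_command(pdf_path, output_pattern))
```

Passing the arguments as a list, rather than as one shell string, means the PDF name does not need shell quoting.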

tshrinivasan commented 6 years ago

Great.

This works nicely.

Will incorporate this to do_ocr.py

bodhisattwawiki commented 6 years ago

Any update?

tshrinivasan commented 6 years ago

Added the gs commands given to convert the pdf files to jpg files.

do_ocr_jpg.txt

download the above file.

mv do_ocr_jpg.txt do_ocr_jpg.py

and run the do_ocr_jpg.py file and share the results.

tshrinivasan commented 6 years ago

do_ocr_jpg.py-v2.txt

Use this file. I fixed a few spelling errors.

mv do_ocr_jpg.py-v2.txt do_ocr_jpg-v2.py

and run the do_ocr_jpg-v2.py file and share the results.

Shreeshrii commented 6 years ago

with v2 version of script

Downloading the OCRed text

INFO:__main__:Running gdput.py -t ocr -f 1sEdaoQ2YBzXcckYFqK_JQYXYx_cFU1MW page_00118.jpg | tee page_00118.log
sed: -e expression #1, char 12: unterminated `s' command

tshrinivasan commented 6 years ago

do_ocr_jpg.py-v3.txt

Try this file. I removed the feature that adds one more empty line between paragraphs. Let us fix it later.

mv do_ocr_jpg.py-v3.txt do_ocr_jpg-v3.py

and run the do_ocr_jpg-v3.py file and share the results.

Thanks @Shreeshrii for the tests. Found the OCR is not working for PDF and it works for JPG. In this issue, I am converting the PDF to JPG to get OCR running and to give results.

Please check that the OCR returns text, and not empty pages, for the languages you know.

Once this is fixed, we can work on optimal file io.

Test v3 and share the OCR results.
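
When the paragraph-spacing feature is brought back, one option is to do it in Python instead of building a sed expression on the command line. A minimal sketch, assuming a paragraph break is a run of one or more blank lines and that the goal is one extra empty line at each break; `add_paragraph_gap` is a hypothetical name, not a function in do_ocr.py.

```python
import re


def add_paragraph_gap(text):
    """Add one extra empty line at every paragraph break.

    A paragraph break is taken to be two or more consecutive newlines
    (that is, at least one blank line). Single newlines are left alone.
    """
    return re.sub(r"\n{2,}", lambda match: match.group(0) + "\n", text)
```

This works on the text after it has been read into memory, so nothing passes through the shell.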

Shreeshrii commented 6 years ago

Thanks. I will try with new version.

Shreeshrii commented 6 years ago

> Found the OCR is not working for PDF and it works for JPG.

Yes, I was getting OCR files of 3 bytes when converting with pdf and was wondering what was wrong, and found your posted solution here. Thanks for your prompt response.
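
A quick way to spot this symptom across a whole book is to list the output files that are suspiciously small. The sketch below is in Python; the function name and the 4-byte threshold are assumptions chosen so that the 3-byte files described here are flagged.

```python
import os


def find_empty_ocr_outputs(directory, min_bytes=4):
    """Return sorted names of .txt files smaller than min_bytes.

    The 3-byte files reported in this thread are caught by the
    default threshold; the threshold itself is an assumption.
    """
    empty = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if name.endswith(".txt") and os.path.getsize(path) < min_bytes:
            empty.append(name)
    return empty
```

Running `find_empty_ocr_outputs(".")` in the working directory after an OCR run would give the pages that need to be retried.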

tshrinivasan commented 6 years ago

Google is removing PDF support for OCR. Hence trying with JPG.

Let us work on one problem at a time.

Test the OCR workability in this issue.

Raise a new issue for filenames with spaces.

Thanks.
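
As a starting point for that separate issue, here is a minimal sketch of quoting filenames before they go into a shell pipeline. It assumes Python 3 (`shlex.quote`) and a hypothetical helper name; the `gdput.py -t ocr -f` flags and the `| tee` pattern are copied from the log lines in this thread, not from the script source.

```python
import shlex


def gdput_pipeline(folder_id, image_name):
    """Build the 'gdput.py ... | tee ...' shell line with quoted filenames.

    The log file name is the image name with its extension replaced
    by .log, matching the log output shown in this thread.
    """
    stem = image_name.rsplit(".", 1)[0]
    log_name = stem + ".log"
    return "gdput.py -t ocr -f %s %s | tee %s" % (
        shlex.quote(folder_id),
        shlex.quote(image_name),
        shlex.quote(log_name),
    )
```

Names without spaces or special characters pass through unchanged, so existing page files such as page_00002.jpg would produce the same command line as before.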

Shreeshrii commented 6 years ago

I have checked with files in Devanagari and Gujarati scripts. Conversion is happening from jpg.

With version 3, no sed error. Output file is greater than 3 bytes :)

Downloading the OCRed text

INFO:__main__:Running gdput.py -t ocr -f   1z5wcJsHDFYd16UQhs28U5cfUEbGV8JkN page_00002.jpg | tee page_00002.log

File location: /home/sanskrit/OCR4wikisource/page_00002.txt
File size in bytes: 3044
INFO:__main__:
  Creating temp file touch page_00002.upload

INFO:__main__:

tshrinivasan commented 6 years ago

Merged do_ocr_jpg.py-v3.txt into do_ocr.py, which fixes this issue.