tshrinivasan / OCR4wikisource

OCR for WikiSource using Google Drive OCR
GNU General Public License v2.0

Page order issue #29

Closed jayantanth closed 8 years ago

jayantanth commented 8 years ago

Please look at the screenshot. I skipped page 2 by pressing Ctrl+C.

text_for_page_00002.txt created from the content of Page No 3
text_for_page_00003.txt created from the content of Page No 4
text_for_page_00004.txt created from the content of Page No 5
text_for_page_00005.txt not created

screenshot from 2016-01-14 09 26 46

do_ocr_2016-01-14-09-21-48_log.txt

jayantanth commented 8 years ago

Proposal: we need a script to rename the files.

jayantanth commented 8 years ago

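Found a script:

j=99; for i in *.txt; do mv -- "$i" "$(printf 'text_for_page_%05d.txt' "$j")"; j=$((j+1)); done

if the start page is 99. The printf padding keeps the number five digits wide once it passes 99.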
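The same renumbering can be sketched in Python, the language of the project's own scripts. This is only an illustration, not part of OCR4wikisource: the function name `renumber_pages` and its parameters are hypothetical, and it assumes that sorting the existing file names gives the intended page order. It renames in two passes so that a new name can never overwrite a file that has not been renamed yet.

```python
from pathlib import Path


def renumber_pages(folder, start_page, prefix="text_for_page_", width=5):
    """Rename every .txt file in `folder` to prefix + zero-padded page number.

    Files are taken in sorted name order and numbered from `start_page`.
    Returns the list of new file names, in order.
    """
    folder = Path(folder)
    sources = sorted(p for p in folder.glob("*.txt") if p.is_file())

    # Pass 1: move everything to temporary names so that a target name can
    # never collide with a source file that has not been renamed yet.
    staged = []
    for index, src in enumerate(sources):
        tmp = src.with_name(f".renumber_tmp_{index}")
        src.rename(tmp)
        staged.append(tmp)

    # Pass 2: assign the final zero-padded names.
    new_names = []
    for offset, tmp in enumerate(staged):
        target = folder / f"{prefix}{start_page + offset:0{width}d}.txt"
        tmp.rename(target)
        new_names.append(target.name)
    return new_names
```

For example, `renumber_pages(".", 99)` would name the first file text_for_page_00099.txt and the second text_for_page_00100.txt.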

tshrinivasan commented 8 years ago

Will work on this from Monday.

jayantanth commented 8 years ago

1

jayantanth commented 8 years ago

I am sharing all my test results for this issue. On the first run, if the user interrupts with Ctrl+C, the pages come out misnumbered (1, 2, 3, 4). On the very next run, only the remaining pages are uploaded to Google Drive (GD), and the files match the correct order (1, 2, 3, 4, 5).

But if I leave the machine to run the script unattended, I never know when an upload gets stuck. I often run do_ocr overnight, and in the morning I find it stuck at page 55 or page 250. The whole run is lost unless I interrupt it with Ctrl+C, so I have to stay awake to watch for stuck uploads. It would be very helpful if the script automatically skipped to the next page when an upload to GD gets stuck.

Finally, to fix the issue shown in the screenshot above: once the final run of do_ocr.py has created all the txt files (i.e. 1, 2, 3, 4, 5), all the pdf, log, and .upload files should be moved to a temp folder.
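One way the requested auto-skip could work is to run each upload under a time limit and move on when it expires. The sketch below is an assumption, not how do_ocr.py is written: `run_with_timeout`, `upload_all`, and the `make_command` callback (which builds the upload command for one file) are hypothetical names, and the real upload command is not shown in this thread.

```python
import subprocess


def run_with_timeout(command, timeout_seconds):
    """Run `command`; return True on success, False on failure or timeout."""
    try:
        result = subprocess.run(command, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this.
        return False
    return result.returncode == 0


def upload_all(pdf_names, make_command, timeout_seconds=120):
    """Try each upload once; return the names that were skipped.

    `make_command(name)` is a hypothetical callback returning the argument
    list of whatever upload command is in use.
    """
    skipped = []
    for name in pdf_names:
        if not run_with_timeout(make_command(name), timeout_seconds):
            skipped.append(name)
    return skipped
```

The returned list of skipped names could then be retried on a later run instead of blocking the whole book.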

jayantanth commented 8 years ago

Tested about 25 books. Now I feel that fixing this issue is what we need most. As I mentioned, the file for page no. 2 should not be created on the first run, because on the next run the pages are sometimes not reordered properly.

jayantanth commented 8 years ago

Hi, I have observed that page_0001.txt, page_0002.txt, and so on are created in the proper order. That is, if page 2 is skipped or not processed for any reason, the following files are created:

page_0001.txt, page_0003.txt, page_0004.txt, page_0005.txt

tshrinivasan commented 8 years ago

Sorry for the long delay on this project.

I have resumed work on fixing these issues.

tshrinivasan commented 8 years ago

Fixed the skipped uploads in version 1.38.

The number of individual page PDFs should equal the number of corresponding text files.

If an upload is skipped, manually or automatically, the script won't proceed further.

We have to rerun the script to upload the pending files.

It will process the text files only after all the PDF files have been uploaded and have received their text content.

Check this and share the results.
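The check described above (every page PDF must have a matching text file before processing continues) can be sketched as follows. This is a minimal illustration, not the code in do_ocr.py: the function name `find_missing_text_files` is hypothetical, and it assumes the page_NNNNN.pdf / page_NNNNN.txt naming seen elsewhere in this thread.

```python
from pathlib import Path


def find_missing_text_files(folder):
    """Return sorted names of page_*.pdf files lacking a matching page_*.txt."""
    folder = Path(folder)
    pdf_stems = {p.stem for p in folder.glob("page_*.pdf")}
    txt_stems = {p.stem for p in folder.glob("page_*.txt")}
    # A PDF is pending when no text file shares its base name.
    return sorted(stem + ".pdf" for stem in pdf_stems - txt_stems)
```

An empty result means the counts match and text processing can go ahead; otherwise the returned names are the PDFs that still need to be uploaded.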

jayantanth commented 8 years ago

do_ocr_2016-02-03-22-45-51_log.txt do_ocr_2016-02-04-09-08-59_log.txt do_ocr_2016-02-04-09-09-52_log.txt

I have run it again twice, but every time it said:

=========ERROR===========

INFO:main:

Text files are not equal to PDF files. Some PDF files not OCRed. Run this script again to complete OCR all the PDF files

tshrinivasan commented 8 years ago

Yes.

It means a few PDF files were not uploaded and did not receive their text files.

Run again and again until the error is gone.

jayantanth commented 8 years ago

Sorry Shrini, how many times do I have to re-run it? I have re-run it about 5 times and nothing happened;
no file was being uploaded to GD.

jayantanth commented 8 years ago

I have checked manually, and only five files are missing. This is a 1167-page book, and only 1162 pages were OCRed.

tshrinivasan commented 8 years ago

Does rerunning upload the 5 missing files?

Text splitting won't run until the number of PDFs equals the number of text files received.

This is just to make sure that no page misses OCR.

Rerun it a few more times and watch whether the missing files are being uploaded.

jayantanth commented 8 years ago

So I did a few things manually: I copied all the "page_00001.txt" files to a new folder, batch-renamed them to text_for_page_00001.txt and so on, and then ran "python mediawiki_uploader.py" to upload them to Wikisource. The remaining 5 files will have to be done manually. :-(

screenshot from 2016-02-04 21 47 40
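The manual workaround above (copy the page_*.txt files to a new folder and give them the text_for_ prefix) can be sketched in Python. This is an illustration only: `copy_with_prefix` is a hypothetical helper, and it assumes the renamed files only need the prefix added, so each file keeps its original page number even when some pages are missing.

```python
import shutil
from pathlib import Path


def copy_with_prefix(src_folder, dst_folder, prefix="text_for_"):
    """Copy page_*.txt files into dst_folder as text_for_page_*.txt."""
    src_folder, dst_folder = Path(src_folder), Path(dst_folder)
    dst_folder.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(src_folder.glob("page_*.txt")):
        # Keep the page number from the source name; only add the prefix.
        target = dst_folder / (prefix + src.name)
        shutil.copy2(src, target)
        copied.append(target.name)
    return copied
```

Because the originals are copied rather than moved, the source folder stays untouched if the upload needs to be repeated.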

jayantanth commented 8 years ago

OK, I have tried again about seven times and nothing happened. The script should be trying to upload the remaining 5 files to GD, but it does not. My internet connection was fine during that time.

do_ocr_2016-02-04-21-30-08_log.txt do_ocr_2016-02-04-21-54-27_log.txt do_ocr_2016-02-04-21-54-52_log.txt do_ocr_2016-02-04-21-55-15_log.txt do_ocr_2016-02-04-21-55-37_log.txt do_ocr_2016-02-04-21-56-04_log.txt do_ocr_2016-02-04-21-57-24_log.txt do_ocr_2016-02-04-21-58-31_log.txt

but every time it said:

=========ERROR===========

INFO:main:

Text files are not equal to PDF files. Some PDF files not OCRed. Run this script again to complete OCR all the PDF files

tshrinivasan commented 8 years ago

If you are online now, can you show me this issue by screen sharing?

jayantanth commented 8 years ago

Using v1.42, after running a 708-page book, this message appeared:

=========ERROR===========

INFO:main:Missing page_00064.txt
INFO:main:page_00064.pdf should be reuploaded
INFO:main:Missing page_00420.txt
INFO:main:page_00420.pdf should be reuploaded
INFO:main:Missing page_00493.txt
INFO:main:page_00493.pdf should be reuploaded
INFO:main:Missing page_00544.txt
INFO:main:page_00544.pdf should be reuploaded
INFO:main:Missing page_00627.txt
INFO:main:page_00627.pdf should be reuploaded
INFO:main:

Text files are not equal to PDF files. Some PDF files not OCRed. Run this script again to complete OCR all the PDF files

"THIS IS GREAT" THAT WAS MY WISH :+1:

After the second run, only the remaining files were uploaded and OCRed.

Moving all temp files to OCR-স্ত্রী-রোগ.djvu-temp-2016-02-07-00-33-32

INFO:main:Running mv folder_.log currentfile.pdf docdata.txt pg.pdf page* txt* 'OCR-স্ত্রী-রোগ.djvu-temp-2016-02-07-00-33-32' INFO:main:

Done. Check the text files start with text_for_page INFO:main:

The PDF files and result text files are equval. Now run the mediawiki_uploader.py script

"THIS IS GREAT" MY WISH FULFILL :+1: :+1: :+1: do_ocr_2016-02-06-00-38-44_log.txt do_ocr_2016-02-07-00-33-32_log.txt

tshrinivasan commented 8 years ago

Shall we close this issue?

Can you check the other related reported issues so we can close them too?