tshrinivasan / OCR4wikisource

OCR for WikiSource using Google Drive OCR
GNU General Public License v2.0
33 stars 24 forks

Check for already OCRed and uploaded books #56

Closed ravidreams closed 3 months ago

ravidreams commented 8 years ago

As we are getting more hands to use this tool, we are observing two people trying to upload the same book or uploading already existing books. This results in two issues:

  1. Unnecessary processing time, after which the user wonders why their edits are not visible.
  2. Some pages being overwritten when there are small differences in the OCR output from Google. Srikanth Lakshmanan faced this issue. Since this is a minor variance and the new output is not always better, it is not worth overwriting the pages with new OCR text.

I suggest that before do_ocr.py runs, it should first check whether a page like Page:Book-name.pdf/1 exists. If it exists, the tool should inform the user and suggest trying another book.
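
A minimal sketch of such a check, assuming the standard MediaWiki `action=query` API is used; this is not the actual code in do_ocr.py. The function names, the `api_url` parameter and the User-Agent string are hypothetical. In the default JSON response format, a page that does not exist is returned with a `missing` key.

```python
import json
import urllib.parse
import urllib.request


def page_title(book_filename, page_number):
    """Build a title such as 'Page:Book-name.pdf/1'."""
    return f"Page:{book_filename}/{page_number}"


def build_query_url(api_url, title):
    """Build an action=query URL asking the wiki about one title."""
    params = {"action": "query", "titles": title, "format": "json"}
    return api_url + "?" + urllib.parse.urlencode(params)


def page_exists(response):
    """Interpret a decoded action=query response (default JSON format).

    Missing pages carry a 'missing' key and invalid titles an 'invalid' key.
    """
    pages = response.get("query", {}).get("pages", {})
    return any(
        "missing" not in page and "invalid" not in page
        for page in pages.values()
    )


def fetch_page_exists(api_url, title):
    """Ask the wiki over the network whether the title exists."""
    request = urllib.request.Request(
        build_query_url(api_url, title),
        headers={"User-Agent": "OCR4wikisource-sketch/0.1"},  # hypothetical
    )
    with urllib.request.urlopen(request) as reply:
        return page_exists(json.load(reply))
```

With this, the tool could call `fetch_page_exists("https://ta.wikisource.org/w/api.php", page_title("Book-name.pdf", 1))` and warn the user before any OCR work starts.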

jayantanth commented 8 years ago

@ravidreams please consider the other Indic Wikisources with regard to this issue. Most of the other Wikisources have started manual proofreading from pages 1, 2, 3... but have not completed it, so their proofreading stats are not zero. Checking page 1 is therefore OK for TAWS, but not for all the other wikis.

You need to coordinate with each other. In BNWS, three of us are working on OCR, and we have discussed among ourselves who will do which set.

ravidreams commented 8 years ago

@jayantanth In the future, the tool can be used by anyone for any book on any language Wikisource. So constant coordination is not possible, and the tool should eliminate manual errors.

Will it be OK if we check for the last page of the book instead of the first page? That check can happen only after the book is downloaded and sliced into single pages. It should still be done before OCR starts, since OCR is the time-consuming part. The tool can still give an option to continue with OCR if the user is sure about what (s)he is doing.

Or, please suggest any other logic which will avoid duplicate effort and overwriting of existing OCRed pages.
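
The last-page check with a user override, as described above, could be sketched like this. It is an illustration only: `last_page_title`, `should_run_ocr`, and the `force` flag are hypothetical names, and `last_page_exists` is assumed to come from whatever wiki lookup the tool performs after slicing the book.

```python
def last_page_title(book_filename, page_count):
    """Title of the final Page: entry once the book is sliced into pages."""
    if page_count < 1:
        raise ValueError("book has no pages")
    return f"Page:{book_filename}/{page_count}"


def should_run_ocr(last_page_exists, force=False):
    """Return (proceed, message) for the check that runs before OCR."""
    if not last_page_exists:
        return True, "Last page not found on the wiki; starting OCR."
    if force:
        # The user confirmed they know the book already has OCRed pages.
        return True, "Last page already exists; continuing at the user's request."
    return False, "Last page already exists; please try another book."
```

Keeping the decision separate from the network lookup lets the override be tested without contacting a wiki.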

ravidreams commented 8 years ago

Also, this check is important when the tool moves to the next step of OCRing multiple files in one go, instead of changing config.ini every time for the next book.
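
For the multi-file case, one possible shape is to filter the list of books before any OCR begins. This is a sketch under assumptions: `books_needing_ocr` is a hypothetical helper, and `already_ocred` stands for whichever existence check the tool ends up using.

```python
def books_needing_ocr(book_filenames, already_ocred):
    """Split a batch into books to OCR and books to skip.

    already_ocred is a callable taking a filename and returning True
    when the book's pages already exist on the wiki.
    """
    pending, skipped = [], []
    for name in book_filenames:
        if already_ocred(name):
            skipped.append(name)
        else:
            pending.append(name)
    return pending, skipped
```

The skipped list can be reported to the user so they know which books were left alone.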

jayantanth commented 8 years ago

@ravidreams Agreed with you.

bodhisattwawiki commented 8 years ago

@ravidreams, purging the Index file after OCR is complete will make the file appear red in the list of Index pages. Users can then easily check the OCR status of an Index file from that list. (For example, in Bengali Wikisource, https://bn.wikisource.org/w/index.php?title=%E0%A6%AC%E0%A6%BF%E0%A6%B6%E0%A7%87%E0%A6%B7:IndexPages&limit=500&offset=0&key=&order= )
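
A purge of this kind could be requested through the MediaWiki `action=purge` API, which is sent as a POST. The sketch below only builds the request; `build_purge_request`, the `Index:Book-name.pdf` title and the User-Agent string are illustrative assumptions, and any login or token handling the tool needs is left out.

```python
import urllib.parse
import urllib.request


def build_purge_request(api_url, index_title):
    """Build a POST request that asks the wiki to purge one Index page."""
    data = urllib.parse.urlencode(
        {"action": "purge", "titles": index_title, "format": "json"}
    ).encode("ascii")
    # Supplying data makes urllib send the request as POST.
    return urllib.request.Request(
        api_url,
        data=data,
        headers={"User-Agent": "OCR4wikisource-sketch/0.1"},  # hypothetical
    )
```

The request would then be sent with `urllib.request.urlopen(request)` once the last page of the book has been uploaded.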

Issue #74 - Purge the index file after OCR is completed