tshrinivasan / OCR4wikisource

OCR for WikiSource using Google Drive OCR
GNU General Public License v2.0

Handling existing pages #17

Open ravidreams opened 8 years ago

ravidreams commented 8 years ago

I intentionally tried uploading text for pages that already exist, since many test books have already been partially proofread.

It gives the following message:

Moving the file text_for_page_00010.txt to the folder temp-2016-01-04-19-09-23

Uploading content for text_for_page_00011.txt Uploaded at https://ta.wikisource.org/wiki/Page:கலைக்_களஞ்சியம்_அம்மாலன-அரேபியா.pdf/11


However, the tool actually skips the page if it already exists. Maybe the message should read:

Page https://ta.wikisource.org/wiki/Page:கலைக்_களஞ்சியம்_அம்மாலன-அரேபியா.pdf/11 already exists. Skipping upload.


I am still not sure whether overwriting a page is possible. If it is, an option for it would help when we repeat uploads after an error (for example, a page number mismatch).

Consider this the lowest priority, as the tool is working anyway ;)

tshrinivasan commented 8 years ago

Do you want to do the following?

  1. Check whether the Wikisource page already exists.
  2. If it already exists with some content, don't upload.
  3. If it does not exist yet, upload the content.

Do you mean this, or something else?

ravidreams commented 8 years ago

It already does not upload when the page exists (or the code or the wiki does not allow overwriting an existing page). But the terminal message says it was uploaded. Only the message needs to be changed.
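A minimal sketch of the requested behaviour: report a skip when the page already exists, and report an upload only when one actually happened. The names `status_message`, `upload_page`, `page_exists` and `save_page` are hypothetical and are not taken from the OCR4wikisource code; the two callables stand in for whatever wiki client the tool really uses.

```python
def status_message(page_url, already_exists):
    """Return the terminal message that matches what actually happened."""
    if already_exists:
        return "Page {} already exists. Skipping upload.".format(page_url)
    return "Uploaded at {}".format(page_url)


def upload_page(page_url, text, page_exists, save_page):
    """Upload text unless the page exists, and return the message to print.

    page_exists(url) -> bool and save_page(url, text) are hypothetical
    callables supplied by the caller (assumption, not the tool's real API).
    """
    exists = page_exists(page_url)
    if not exists:
        # Only touch the wiki when there is nothing to overwrite.
        save_page(page_url, text)
    return status_message(page_url, exists)
```

With an in-memory dictionary standing in for the wiki, the first call for a URL stores the text and returns the "Uploaded at" message, while a second call for the same URL leaves the stored text untouched and returns the "already exists" message.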

Shreeshrii commented 6 years ago

I just installed ocr4wikisource and find it very convenient for updating the OCRed text on Wikisource. Thank you for the tool.

I wanted to know whether there is an option that allows overwriting pages that are already on Wikisource.

tshrinivasan commented 6 years ago

It overwrites by default.

Can you explain with examples?

Shreeshrii commented 6 years ago

I uploaded https://commons.wikimedia.org/wiki/File:%E0%A4%B6%E0%A5%8D%E0%A4%B0%E0%A5%80%E0%A4%A4%E0%A4%A4%E0%A5%8D%E0%A4%B5%E0%A4%A8%E0%A4%BF%E0%A4%A7%E0%A4%BF.pdf

and then uploaded the Google Drive OCRed pages using OCR4wikisource linked at https://sa.wikisource.org/wiki/%E0%A4%85%E0%A4%A8%E0%A5%81%E0%A4%95%E0%A5%8D%E0%A4%B0%E0%A4%AE%E0%A4%A3%E0%A4%BF%E0%A4%95%E0%A4%BE:%E0%A4%B6%E0%A5%8D%E0%A4%B0%E0%A5%80%E0%A4%A4%E0%A4%A4%E0%A5%8D%E0%A4%B5%E0%A4%A8%E0%A4%BF%E0%A4%A7%E0%A4%BF.pdf

However, later I noticed that some pages are in landscape format or skewed. So I want to OCR and upload them again.

Also, I had OCRed a few of these pages on the Wikisource website using its Google OCR button, and I was wondering whether they would get overwritten.

Shreeshrii commented 6 years ago

OK, I found a related problem. When I generated the index page with <pagelist /> on sa.wikisource, it generated the page numbers in Devanagari digits.

I edited some pages, e.g. 1 and 107, using the Wikisource edit and OCR feature.

When I uploaded pages using OCR4wikisource, it created the page numbers using 0-9 and not Devanagari ०-९.

Hence, there are two versions of page 107: https://sa.wikisource.org/w/index.php?title=%E0%A4%AA%E0%A5%83%E0%A4%B7%E0%A5%8D%E0%A4%A0%E0%A4%AE%E0%A5%8D:%E0%A4%B6%E0%A5%8D%E0%A4%B0%E0%A5%80%E0%A4%A4%E0%A4%A4%E0%A5%8D%E0%A4%B5%E0%A4%A8%E0%A4%BF%E0%A4%A7%E0%A4%BF.pdf/%E0%A5%A7%E0%A5%A6%E0%A5%AD&action=history

https://sa.wikisource.org/w/index.php?title=%E0%A4%AA%E0%A5%83%E0%A4%B7%E0%A5%8D%E0%A4%A0%E0%A4%AE%E0%A5%8D:%E0%A4%B6%E0%A5%8D%E0%A4%B0%E0%A5%80%E0%A4%A4%E0%A4%A4%E0%A5%8D%E0%A4%B5%E0%A4%A8%E0%A4%BF%E0%A4%A7%E0%A4%BF.pdf/107&action=history

and the page did not get overwritten.
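One possible way to avoid the duplicate titles described above is to convert the page number to Devanagari digits before building the page title. The sketch below is only an illustration: `localize_page_number` and `page_title` are hypothetical names, the namespace and file name are placeholders, and whether OCR4wikisource builds titles this way is an assumption.

```python
# Map ASCII digits 0-9 to Devanagari digits ०-९.
DEVANAGARI_DIGITS = str.maketrans("0123456789", "०१२३४५६७८९")


def localize_page_number(number):
    """Return the page number written with Devanagari digits."""
    return str(number).translate(DEVANAGARI_DIGITS)


def page_title(namespace, filename, number, use_devanagari=True):
    """Build a 'Namespace:File/number' title (hypothetical helper)."""
    suffix = localize_page_number(number) if use_devanagari else str(number)
    return "{}:{}/{}".format(namespace, filename, suffix)


print(page_title("Page", "Book.pdf", 107))
```

Making the digit conversion optional keeps the ASCII behaviour available for wikis whose page titles use 0-9.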