tshrinivasan / OCR4wikisource

OCR for WikiSource using Google Drive OCR
GNU General Public License v2.0
33 stars, 24 forks

Index making is necessary for the notes by the uploading person #77

Closed: tha-uzhavan closed this issue 8 years ago

tha-uzhavan commented 8 years ago

Before the text is uploaded to Wikisource, an Index page needs to be created so that the uploader can record notes. For example, see the index https://ta.wikisource.org/s/spz . From the 35th page onward, page rotation is needed. The uploads also need to be patrolled. URL notes: Page:உமர்_கயாம்_வாழ்வும்_இலக்கியமும்.pdf/35 (for a page), Index:{{PAGENAME}} (for a book)
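The two title patterns mentioned above can be sketched in Python. This is only an illustration of the naming convention: the helper names `index_title` and `page_title` are hypothetical and are not part of the OCR4wikisource script.

```python
def index_title(file_name: str) -> str:
    """Return the Index: title for a book file, e.g. 'Index:Book.pdf'."""
    return f"Index:{file_name}"


def page_title(file_name: str, page_number: int) -> str:
    """Return the Page: title for one scanned page of the book."""
    if page_number < 1:
        raise ValueError("page numbers start at 1")
    return f"Page:{file_name}/{page_number}"


# File name taken from the example in this comment.
book = "உமர்_கயாம்_வாழ்வும்_இலக்கியமும்.pdf"
print(index_title(book))
print(page_title(book, 35))
```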

bodhisattwawiki commented 8 years ago

Other Wikisources, including Bengali, create Index pages right after the file is uploaded, and only then go on to OCR. This is the standard procedure; what Tamil is doing is not. As @Ravidreams said earlier, Tamil does not create Index pages first: it does the OCR and then creates the Index pages. I think what @tha-uzhavan is asking for is a Tamil-specific need and not a general one, so I do not think the script needs this feature at all.

tshrinivasan commented 8 years ago

Can anyone explain the workflow after uploading a pdf to commons?

How is the file on Commons displayed in Wikisource with an index?

I have heard about the Proofread Page extension, but I would like to know how the Index pages are generated in other languages and why this is not done on Tamil Wikisource.

bodhisattwawiki commented 8 years ago

Can anyone explain the workflow after uploading a pdf to commons?

The workflow is like this:

1) Upload the file to Commons
2) Create the Index page in Wikisource
3) Check whether the file has missing pages, duplicate pages or misoriented pages, and give page numbers at the Index page
4) Start OCR
5) Proofread
6) Validate
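For step 3, page numbers on an Index page are normally given through the Proofread Page `<pagelist />` tag. A minimal sketch of generating such a tag follows; the helper name `build_pagelist` and the sample numbering are assumptions for illustration, and labels containing spaces would need quoting, which this sketch does not handle.

```python
def build_pagelist(ranges):
    """Build a <pagelist /> tag from (first_scan, last_scan, label) tuples.

    first_scan and last_scan are positions in the scanned file; label is
    the page label to show, e.g. "Cover", "-" or a starting number.
    """
    parts = []
    for first, last, label in ranges:
        if first < 1 or first > last:
            raise ValueError(f"invalid scan range: {first}-{last}")
        key = str(first) if first == last else f"{first}to{last}"
        parts.append(f"{key}={label}")
    if not parts:
        return "<pagelist />"
    return "<pagelist " + " ".join(parts) + " />"


# Scan 1 is the cover, scans 2-6 are unnumbered, scan 7 is printed page 1.
print(build_pagelist([(1, 1, "Cover"), (2, 6, "-"), (7, 7, "1")]))
# -> <pagelist 1=Cover 2to6=- 7=1 />
```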

jayantanth commented 8 years ago

Sometimes we do step 3 offline on our own PC before uploading to Commons.

You can find more details at https://en.wikisource.org/wiki/Help:Beginner's_guide_to_proofreading

tha-uzhavan commented 8 years ago

I agree with Bodhi. My point is that Index making should be automated. For example, to begin with, the following information is enough to create an Index page: https://ta.wikisource.org/w/index.php?title=Index:%E0%AE%AA%E0%AF%86%E0%AE%B0%E0%AE%BF%E0%AE%AF_%E0%AE%AA%E0%AF%81%E0%AE%B0%E0%AE%BE%E0%AE%A3%E0%AE%AE%E0%AF%8D_%E0%AE%93%E0%AE%B0%E0%AF%8D_%E0%AE%86%E0%AE%AF%E0%AF%8D%E0%AE%B5%E0%AF%81-2.pdf&action=edit . Other information can then be taken from the book's description page on Commons, if available. We are going to do this through an outreach programme, so Index making and maintenance should be possible through automation, but in simple steps. I think most of the Indian Wikisource projects do this manually. Of course that is good, but for future maintenance, automation is best.
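As a rough sketch of the first step in such automation, the edit URL for a book's Index page can be derived from the file name alone. The function name `index_edit_url` is hypothetical; the URL shape simply follows the link in this comment, and the content of the Index form itself is not covered here.

```python
from urllib.parse import quote


def index_edit_url(file_name: str, host: str = "ta.wikisource.org") -> str:
    """Return the edit URL of the Index page for an uploaded book file."""
    # MediaWiki titles use underscores in place of spaces.
    title = "Index:" + file_name.replace(" ", "_")
    # Percent-encode non-ASCII characters as UTF-8, keeping the colon readable.
    return f"https://{host}/w/index.php?title={quote(title, safe=':')}&action=edit"


print(index_edit_url("Some book-2.pdf"))
# -> https://ta.wikisource.org/w/index.php?title=Index:Some_book-2.pdf&action=edit
```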

ravidreams commented 8 years ago

It is better to keep this tool focused only on OCR, so that it stays suited to standard Wikisource practices.

However, there is a need to automate index creation when we do bulk file uploads. So, this feature can be added to the pdf upload tool. Reported here - https://github.com/tshrinivasan/tools-for-wiki/issues/12

tshrinivasan commented 8 years ago

Created an index maker for all files in a given category: https://github.com/tshrinivasan/tools-for-wiki/tree/master/index-maker
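The general idea of going from a category to a list of Index pages can be sketched as below. This is not the code of the linked index-maker, only an illustration under stated assumptions: it uses the standard MediaWiki `list=categorymembers` query, the function names are hypothetical, and the HTTP request and the actual page creation are left out.

```python
def categorymembers_params(category: str, limit: int = 50) -> dict:
    """Query parameters for listing the files in a Commons category."""
    return {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmtype": "file",
        "cmlimit": str(limit),
        "format": "json",
    }


def index_titles_from_response(response: dict) -> list:
    """Map 'File:X.pdf' titles in an API response to 'Index:X.pdf' titles."""
    members = response.get("query", {}).get("categorymembers", [])
    titles = []
    for member in members:
        title = member["title"]
        # Only scanned books (PDF or DjVu) get an Index page.
        if title.startswith("File:") and title.lower().endswith((".pdf", ".djvu")):
            titles.append("Index:" + title[len("File:"):])
    return titles


# Hand-written sample in the shape the API returns; not real data.
sample = {"query": {"categorymembers": [
    {"pageid": 1, "ns": 6, "title": "File:Book one.pdf"},
    {"pageid": 2, "ns": 6, "title": "File:Cover.jpg"},
]}}
print(index_titles_from_response(sample))
# -> ['Index:Book one.pdf']
```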