tshrinivasan / OCR4wikisource

OCR for WikiSource using Google Drive OCR
GNU General Public License v2.0

A wiki-markup is to be added #40

Closed tha-uzhavan closed 8 years ago

tha-uzhavan commented 8 years ago

When we upload text to Wikisource, we have to keep the sentence alignment as it is in the Google output file. For that, in text_For_page, the wiki markup <poem> should be added as the 2nd line, and </poem> should be added before the last line, because the first and last lines are column headers. Only then do we get a clear view, as in the image.

For example, see https://ta.wikisource.org/s/f75, then go to edit, remove the wiki markup, and preview. You will understand why the tag needs to be inserted. The <poem> tag is not only for poetry but also for prose.
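As a rough illustration of the change requested above, here is a minimal Python sketch. The function name `wrap_with_poem` is hypothetical and is not part of OCR4wikisource. It assumes the OCR text for one page is a single string whose first and last lines are the header lines mentioned above.

```python
def wrap_with_poem(page_text):
    """Insert <poem> as the 2nd line and </poem> before the last line.

    Hypothetical helper, not taken from the OCR4wikisource code base.
    """
    lines = page_text.splitlines()
    # With fewer than three lines there is no body between the first and
    # last lines, so the text is returned unchanged.
    if len(lines) < 3:
        return page_text
    return "\n".join([lines[0], "<poem>", *lines[1:-1], "</poem>", lines[-1]])


if __name__ == "__main__":
    sample = "Header\nfirst line\nsecond line\nFooter"
    print(wrap_with_poem(sample))
```

Note that `splitlines()` drops a trailing newline, so the caller would need to add it back if the upload step expects one.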

tshrinivasan commented 8 years ago

Is it common to all Indian languages?

@jayantanth @BodhisattwaMandal @omshivaprakash Share your thoughts.

bodhisattwawiki commented 8 years ago

I don't think we need to add specific wiki mark-up like <poem> using this script, at least in Bengali Wikisource. First of all, <poem> is not used in prose at all; secondly, when we do the manual proofreading, we add all the wiki mark-ups and templates as needed. So it's not a big issue for us. Besides, this is not an OCR issue.

jayantanth commented 8 years ago

Hi @tha-uzhavan, please don't mix this script up with proofreading. The main task of this script is to OCR the page properly and upload it to Wikisource. Wiki formatting for proofreading is a different task, which needs a human touch on every page.

tshrinivasan commented 8 years ago

If adding <poem> and </poem> improves the readability and gives proper line breaks, we can easily add them via the upload script.

But I want to know whether all the Indic-language Wikisources use the same tag or something else.

If adding the poem tag is an easy fix for better readability until someone proofreads and fixes the page, I think it is better to add it.

Just sharing my thoughts after seeing this page: https://ta.wikisource.org/s/f75

It has poem tags and looks good. When I removed the poem tags and previewed, everything looked like one single paragraph.

Anyhow, I don't have much experience in editing Wikisource. I will do what you all decide.

Share your thoughts.
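One way to handle the open question of whether every Indic-language Wikisource wants the same tag would be to make the wrapper opt-in per language. The sketch below is only an assumption about how that could look. `WRAP_TAG_BY_LANG` and `add_wrapper_tag` are hypothetical names, and only the `ta` entry comes from this thread.

```python
# Hypothetical per-language setting. Only the "ta" entry reflects this thread
# (the Tamil Wikisource page using <poem>); any other entry would be a guess.
WRAP_TAG_BY_LANG = {"ta": "poem"}


def add_wrapper_tag(page_text, lang, tags=WRAP_TAG_BY_LANG):
    """Wrap the page body in the tag configured for `lang`, if any."""
    tag = tags.get(lang)
    lines = page_text.splitlines()
    # Leave the text alone when the wiki has not opted in, or when there is
    # no body between the first and last lines.
    if tag is None or len(lines) < 3:
        return page_text
    body = lines[1:-1]
    return "\n".join([lines[0], f"<{tag}>", *body, f"</{tag}>", lines[-1]])
```

With this shape, a wiki that does not want the markup simply has no entry in the mapping, and its pages are uploaded unchanged.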

bodhisattwawiki commented 8 years ago

I don't think it is necessary to add this wiki mark-up to the OCR script, as I have already stated. <poem> is not used on every page, only in poems. We don't use it on every page in Bengali Wikisource while proofreading. We can add it manually or with AWB if needed. It is not a general need at all. I think it's not a major issue, and the discussion can be closed now.

ravidreams commented 8 years ago

I agree with Info-farmer that this improves readability. This is important when 1000s of pages are going to be left not proofread for years. But, I am also not comfortable with a specific tag like poem used for prose.

Using the pre tag adds a box around the text. Is there any other tag that will achieve the same effect? Or can we duplicate the poem tag?

bodhisattwawiki commented 8 years ago

Ravi, there are many wiki mark-ups used in WS other than <poem>, but all have their definite purpose. You can add them using AWB or another bot while proofreading, as needed. My personal opinion is that we are drifting from the OCR tool issue by prolonging this discussion.

ravidreams commented 8 years ago

ok ok.. cool :) Let's close this. The tool needs to serve general needs instead of over-customization.

tshrinivasan commented 8 years ago

Will create a separate script to mass edit and add specific tags.

tshrinivasan commented 8 years ago

Will create a tool for tawikisource here: https://github.com/tshrinivasan/tools-for-wiki/issues/11