tshrinivasan / OCR4wikisource

OCR for WikiSource using Google Drive OCR
GNU General Public License v2.0

Is it possible to automate the making of OCR4wikisource folder from a list of File URL? #34

Open tha-uzhavan opened 8 years ago

tha-uzhavan commented 8 years ago

The following category will be populated with more than 2,200 PDF files: https://commons.wikimedia.org/wiki/Category:The_PDF_files_in_Tamil_without_OCR_conversion If I make a list of the file URLs, is it possible to automate the OCR process? This is needed because, for every run, the only thing I change is 'file_url'.

tshrinivasan commented 8 years ago

This is on the future roadmap.

Once the OCR runs without a single issue, we can automate it completely.

For now, we have to verify that the script runs to completion and then run mediawiki_uploader.py manually.

Run a few hundred books manually. When execution shows no issues at all, we can automate it.
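
For reference, here is a minimal sketch of what a batch driver over a list of file URLs could look like once the manual runs are stable. It is not part of the project. It assumes the settings live in an INI-style file with a `file_url` key; the section name `settings`, the helper names `set_file_url`, `read_urls` and `run_batch`, and the command passed in are all hypothetical and would need to match the real setup. It stops at the first failing book so the problem can be inspected before continuing, in line with the "verify each run" approach above.

```python
import configparser
import subprocess
from pathlib import Path


def set_file_url(config_path, url, section="settings"):
    """Rewrite the file_url value in an INI-style config file.

    The section name is an assumption; adjust it to the real config.
    Note: configparser drops comments when it rewrites the file.
    """
    # interpolation=None keeps '%' in percent-encoded URLs literal
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(config_path, encoding="utf-8")
    if not parser.has_section(section):
        parser.add_section(section)
    parser.set(section, "file_url", url)
    with open(config_path, "w", encoding="utf-8") as handle:
        parser.write(handle)


def read_urls(list_path):
    """Return the URLs from a text file, one per line, skipping blanks and '#' lines."""
    lines = Path(list_path).read_text(encoding="utf-8").splitlines()
    stripped = (line.strip() for line in lines)
    return [line for line in stripped if line and not line.startswith("#")]


def run_batch(urls, config_path, command, runner=subprocess.run):
    """Process the URLs one at a time and stop at the first failure.

    `command` is the argument list for the OCR step (an assumption; pass
    whatever you currently run by hand). Returns (done, failed), where
    `failed` is the URL that stopped the batch, or None if all succeeded.
    """
    done = []
    for url in urls:
        set_file_url(config_path, url)
        result = runner(command)
        if result.returncode != 0:
            return done, url
        done.append(url)
    return done, None
```

Usage would be something like `run_batch(read_urls("urls.txt"), "config.ini", [...])`, with the last argument being the command line of the OCR script as it is run manually today. The upload step is deliberately left out, since for now it is run by hand after checking the output.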

tha-uzhavan commented 8 years ago

There is no doubt that I am going to upload more than 10,000 PDF files under a CC license. My main aim is to help people who are working to build a huge Tamil corpus, larger than the English one. Without a corpus, nothing is possible in computational linguistics. Also, who knows when the Google service is going to stop?

jayantanth commented 8 years ago

@tha-uzhavan 10,000 (ten thousand) books? This is great. Has anyone checked all the books to confirm that every page is OK? If a missing or duplicate page is found later, it will become a problem during proofreading.
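
A rough sketch of how such a page check could be scripted, under stated assumptions: each book has already been split into one file per page in a single folder, the files are named like `page_0001.jpg` (a hypothetical naming scheme), and the expected page count is known. It only detects byte-identical duplicates, so a page that was scanned twice will not be caught; that still needs a human eye.

```python
import hashlib
import re
from collections import defaultdict
from pathlib import Path

# Hypothetical naming scheme: page_0001.jpg, page_0002.jpg, ...
PAGE_NAME = re.compile(r"page_(\d+)\.\w+$")


def check_pages(page_dir, expected_count):
    """Return (missing, duplicates) for one book's per-page files.

    missing    : page numbers in 1..expected_count with no file
    duplicates : lists of page numbers whose files are byte-identical
    """
    seen = set()
    by_digest = defaultdict(list)
    for path in sorted(Path(page_dir).iterdir()):
        match = PAGE_NAME.match(path.name)
        if not match:
            continue  # ignore files that do not follow the naming scheme
        number = int(match.group(1))
        seen.add(number)
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        by_digest[digest].append(number)
    missing = [n for n in range(1, expected_count + 1) if n not in seen]
    duplicates = sorted(sorted(nums) for nums in by_digest.values() if len(nums) > 1)
    return missing, duplicates
```

Running this over every book before upload would give a list of books to look at first, instead of discovering the gaps during proofreading.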