Open tha-uzhavan opened 8 years ago
This is on the future roadmap.
Once the OCR runs without a single issue, we can automate it completely.
For now, we have to verify that the script runs to completion and then run mediawiki_uploader.py manually.
Run a few hundred books manually. When execution shows no issues at all, we can automate it.
There is no doubt that I am going to upload more than 10,000 PDF files with a CC license. My main aim is to help people who are working to build a huge Tamil corpus, larger than the English corpus. Without a corpus, nothing is possible in computational linguistics. Also, who knows when the Google service is going to stop?
@tha-uzhavan 10,000 (ten thousand) books? This is great. Has anyone checked all the books to confirm that every page is OK? If a missing or duplicate page is found later, it will cause problems during proofreading.
The following category is going to hold more than 2,200 PDF files: https://commons.wikimedia.org/wiki/Category:The_PDF_files_in_Tamil_without_OCR_conversion If I make a list of the file URLs, is it possible to automate the OCR process? This is necessary because each time I am changing only the 'file_url'.
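To make the request concrete, here is a minimal sketch of what a batch run over a URL list could look like. It is not the project's actual code: `read_url_list`, `run_batch`, and the `process_one` callback are hypothetical names, and `process_one` stands in for whatever currently happens for a single 'file_url' (the OCR run followed by mediawiki_uploader.py). Failures are collected rather than stopping the batch, so problem books can be checked by hand.

```python
from typing import Callable, Iterable, List, Tuple


def read_url_list(text: str) -> List[str]:
    """Parse one URL per line; skip blank lines, '#' comments, and repeats."""
    urls: List[str] = []
    seen = set()
    for line in text.splitlines():
        url = line.strip()
        if not url or url.startswith("#"):
            continue
        if url in seen:
            continue  # avoid running OCR twice on the same file
        seen.add(url)
        urls.append(url)
    return urls


def run_batch(
    urls: Iterable[str],
    process_one: Callable[[str], None],
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Call process_one(url) for each URL.

    process_one is a hypothetical stand-in for the existing single-file
    workflow. Returns (done, failed), where failed holds (url, error message)
    pairs so that failing books can be reviewed manually.
    """
    done: List[str] = []
    failed: List[Tuple[str, str]] = []
    for url in urls:
        try:
            process_one(url)
        except Exception as exc:  # keep going; one bad book should not stop the run
            failed.append((url, str(exc)))
        else:
            done.append(url)
    return done, failed
```

With something like this, the list of file URLs could live in a plain text file, one per line, and an empty `failed` list after a few hundred books would be the signal that full automation is safe.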