tshrinivasan / OCR4wikisource

OCR for WikiSource using Google Drive OCR
GNU General Public License v2.0

Proposal: run on Toolserver (http://tools.wmflabs.org) #7

Open jayantanth opened 8 years ago

jayantanth commented 8 years ago

Hi Shrini,

This is a proposal to run this script from http://tools.wmflabs.org, so it will be OS independent.

ravidreams commented 8 years ago

+1 for running in the cloud. It will also solve storage, bandwidth and disconnection issues.

tshrinivasan commented 8 years ago

Do we get shell access on the tools server?

Or do we need a web interface to be hosted on the tools server?

How do we ask for access?

jayantanth commented 8 years ago

Hi Shrini, please go through http://tools.wmflabs.org/ and follow the steps one by one.

Useful links

Tools project page on wikitech (find out more about the Tools project)
Create a Labs account (you must have a Labs account to access the Tools project)
Add a public SSH key (you’ll need this to access Labs servers using SSH)
Request access to the Tools project (Join us!)
Create New Tool
Source code repository of this web

As I mentioned to you on Facebook chat, English Wikisource uses https://tools.wmflabs.org/phetools/hocr_cgi.py. The user Phe (Philippe Elie) maintains this tool, and all his scripts can be found at https://github.com/phil-el/phetools. He is mostly active on French Wikisource; his user page is https://fr.wikisource.org/wiki/Utilisateur:Phe

Full help can be found at https://wikitech.wikimedia.org/wiki/Help:Tool_Labs and https://wikitech.wikimedia.org/wiki/Help:Access

omshivaprakash commented 8 years ago

+1 Agree with Jayanta's input.

bodhisattwawiki commented 8 years ago

+1 Totally agree, it will solve the bandwidth issue.

tshrinivasan commented 8 years ago

We need a web version of OCR4Wikisource to run on the tools server.

Looking for volunteers to make a web version.

samwilson commented 8 years ago

@tshrinivasan I'm not very familiar with the operation of OCR4wikisource, but could http://tools.wmflabs.org/ws-google-ocr/ be modified to help you?

tshrinivasan commented 8 years ago

@samwilson Thanks for the link. The tool you mentioned works on a single image.

But in OCR4Wikisource, we can give the URL of a full PDF from Commons. It downloads the PDF, splits it into single pages, uploads each page to Google Drive, downloads the result as text, and pastes the content into the relevant Wikisource proofread page.
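To make the sequence of steps concrete for anyone volunteering on a web version, here is a minimal sketch in Python. It is not the actual OCR4wikisource code: the function name `ocr_pipeline` and the four step callables (`download`, `split_pages`, `ocr_page`, `publish`) are hypothetical placeholders for the real download, PDF-splitting, Google Drive OCR and Wikisource-editing code.

```python
from typing import Callable, Iterable


def ocr_pipeline(
    pdf_url: str,
    download: Callable[[str], bytes],
    split_pages: Callable[[bytes], Iterable[bytes]],
    ocr_page: Callable[[bytes], str],
    publish: Callable[[int, str], None],
) -> int:
    """Run the steps in order and return the number of pages processed.

    All four callables are hypothetical stand-ins supplied by the caller:
      download(url)         -> raw PDF bytes fetched from Commons
      split_pages(pdf)      -> one chunk of bytes per page
      ocr_page(page)        -> recognised text for that page
      publish(number, text) -> writes the text to the proofread page
    """
    pdf_bytes = download(pdf_url)
    processed = 0
    # Pages are numbered from 1, matching how proofread pages are counted.
    for page_number, page in enumerate(split_pages(pdf_bytes), start=1):
        text = ocr_page(page)
        publish(page_number, text)
        processed += 1
    return processed
```

Keeping each step behind a plain callable would let a web front end reuse the same loop while swapping in its own storage or upload code.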

Looking for a web version. https://github.com/tshrinivasan/OCR4wikisource/issues/89

omshivaprakash commented 8 years ago

@samwilson I tried to run a file with it for Kannada (kn) and got this error:

Image size of https://upload.wikimedia.org/wikipedia/commons/4/46/%E0%B2%95%E0%B2%A8%E0%B3%8D%E0%B2%A8%E0%B2%A1_%E0%B2%AD%E0%B2%A4%E0%B3%83%E0%B2%B9%E0%B2%B0%E0%B2%BF_%E0%B2%B8%E0%B3%81%E0%B2%AD%E0%B2%BE%E0%B2%B7%E0%B2%BF%E0%B2%A4.djvu (11421144) exceeds permitted size (4194304)

Looks like there is some limit on the memory usage. Please check.

samwilson commented 8 years ago

The Vision API is limited to 4 MB per image.
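A small pre-flight check would catch this before a file is submitted. The sketch below is only an illustration: the names `VISION_MAX_BYTES` and `fits_vision_limit` are made up here, and it assumes the limit is exactly the 4194304 bytes shown in the error message, with sizes at or below that value accepted.

```python
# Limit taken from the error message reported above: 4194304 bytes = 4 MiB.
VISION_MAX_BYTES = 4 * 1024 * 1024


def fits_vision_limit(size_bytes: int, limit: int = VISION_MAX_BYTES) -> bool:
    """Return True if a file of size_bytes is within the per-image limit.

    Assumption: a file exactly at the limit is accepted, since the error
    only speaks of sizes that exceed the permitted size.
    """
    return size_bytes <= limit


if __name__ == "__main__":
    # Size of the Kannada DjVu file from the error message.
    reported_size = 11421144
    if not fits_vision_limit(reported_size):
        print(f"{reported_size} bytes exceeds the limit of {VISION_MAX_BYTES} bytes")
```

For a local file, the size could come from `os.path.getsize(path)`. A whole DjVu or PDF will usually be over the limit, which is another reason to split the file into single pages first.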

I'm replying with some other thoughts in https://phabricator.wikimedia.org/T120788

bodhisattwawiki commented 7 years ago

Niharika, Rohit and Psychoslave started working on this during the Wikimania 2016 hackathon, but there has been no update since then. https://tools.wmflabs.org/?tool=ocr4wikisource