tshrinivasan / OCR4wikisource

OCR for WikiSource using Google Drive OCR
GNU General Public License v2.0

Google pdf upload limit #27

Closed tshrinivasan closed 8 years ago

tshrinivasan commented 8 years ago

https://support.google.com/drive/answer/176692?hl=en

This page says:

File size limitations

The maximum size for images (.jpg, .gif, .png) and PDF files (.pdf) is 2 MB. For PDF files, we only look at the first 10 pages when searching for text to extract.

But when I manually uploaded an 80-page (1.8 MB) PDF and opened it as a Google Doc, it was OCRed and text was displayed for all 80 pages.

When I tried a 24 MB PDF, it could not be opened as a Google Doc.

Please try various Wikisource PDFs manually and find the upper limit on page count and file size for PDF files.

@ravidreams @jayantanth @BodhisattwaMandal @omshivaprakash

jayantanth commented 8 years ago

Last month I manually finished an OCR job on a 600-page book; each page was a PDF of about 1 MB. I have also tried a 450-page book (52 MB), with all pages converted to PNG at 300 dpi; each page was 700-800 KB.

tshrinivasan commented 8 years ago

I tried uploading the PDF as a single whole file and opening it with Google Docs, without splitting it or converting it into individual pages or images.

Please try uploading whole files and see whether we can avoid the step of splitting into individual pages.

jayantanth commented 8 years ago

After using it for a few days, I have an opinion:

  1. The current workflow is very good:
     a) Download the file from the Wikimedia server.
     b) If the file is in DjVu format, convert it to PDF.
     c) Split the PDF into single-page PDFs.
     d) Upload the PDFs one by one to Google Drive (GD).
     e) Download the OCRed text files one by one.
     f) Upload the OCRed text to the Wikisource Page: namespace.

So in my opinion this is good; don't go for multi-page PDF upload, because we have to upload to the Page: namespace of Wikisource one page at a time anyway.

My suggestions for some further improvements:

Stage a) If the downloaded file already exists in the script folder, the script should not try to download it again. It should use the existing file for the next step.
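
A minimal sketch of the Stage a) check, not the project's actual code. The name `fetch_if_missing` and the `download` callback are hypothetical; the callback stands in for whatever the script currently uses to fetch the file from the Wikimedia server.

```python
from pathlib import Path
from typing import Callable


def fetch_if_missing(url: str, target: Path,
                     download: Callable[[str, Path], None]) -> bool:
    """Download `url` to `target` unless a non-empty file is already there.

    Returns True if a download happened, False if the existing file was reused.
    `download` is a hypothetical callback, not part of the existing script.
    """
    # An empty file is treated as a failed earlier download and fetched again.
    if target.is_file() and target.stat().st_size > 0:
        return False
    download(url, target)
    return True
```

Checking for a non-empty file is only a cheap guard against a zero-byte leftover; it does not detect a partially downloaded file, which would need a size or checksum comparison.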

Stage d) We need an option to choose which pages to upload, e.g. pages 1-30 or 50-100. Also, as Ravi and I mentioned in "Network connection issues" #21 #22: if a file is not uploaded to GD during the upload step (due to an Internet connection issue, or because the user skipped it), the text file should not be created.

Example: if "page_0020.pdf" is not uploaded to GD, "text_for_page_0020.txt" should not be created. I have observed that "text_for_page_0020.txt" is in fact created from the OCRed text of page 21.
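
One way to avoid the shifted text files is to derive each output name from the PDF's own file name instead of a running counter, and to write nothing when a page fails. This is a sketch under assumptions: `ocr_one` is a hypothetical callback that uploads one page to GD and returns its OCRed text, or returns None (or raises) when the upload fails or is skipped. The file name patterns are the ones from the example above.

```python
import re
from pathlib import Path
from typing import Callable, Iterable, List, Optional


def text_name_for(pdf_path: Path) -> str:
    """Map 'page_0020.pdf' to 'text_for_page_0020.txt' using the number in the name."""
    match = re.fullmatch(r"page_(\d+)\.pdf", pdf_path.name)
    if match is None:
        raise ValueError(f"unexpected page file name: {pdf_path.name}")
    return f"text_for_page_{match.group(1)}.txt"


def ocr_pages(pdf_paths: Iterable[Path],
              ocr_one: Callable[[Path], Optional[str]],
              out_dir: Path) -> List[Path]:
    """OCR pages one by one; return the pages that failed so they can be retried."""
    failed: List[Path] = []
    for pdf_path in pdf_paths:
        try:
            text = ocr_one(pdf_path)  # hypothetical upload-and-OCR step
        except Exception:
            text = None
        if text is None:
            # No text file is written for a page that was not uploaded.
            failed.append(pdf_path)
            continue
        (out_dir / text_name_for(pdf_path)).write_text(text, encoding="utf-8")
    return failed
```

Because the page number comes from the input file name, a failed page 20 leaves a gap instead of letting page 21's text land in "text_for_page_0020.txt". The returned list can be fed back in for a retry once the connection is stable.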

Stage f) We want an override option, and the ability to upload only the pages the user wants, e.g. pages 1-30 or 50-100.
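
The page selection asked for in Stages d) and f) could be handled by one small parser. This is only a sketch: the spec format ("1-30,50-100") and the function names are assumptions, and the `page_0020.pdf` naming is taken from the example above.

```python
from typing import List


def parse_page_ranges(spec: str) -> List[int]:
    """Turn a spec such as '1-30,50-100' or '5' into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first_s, last_s = part.split("-", 1)
            first, last = int(first_s), int(last_s)
        else:
            first = last = int(part)
        if first < 1 or last < first:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(first, last + 1))
    return sorted(pages)


def page_file_name(page: int) -> str:
    """Build the single-page PDF name for a page number, e.g. 20 gives 'page_0020.pdf'."""
    return f"page_{page:04d}.pdf"
```

For example, `parse_page_ranges("1-3, 5")` gives `[1, 2, 3, 5]`, and mapping `page_file_name` over that list gives the files to upload. The same list could drive both the GD upload and the Wikisource upload, so the two stages stay in step.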

tshrinivasan commented 8 years ago

Thanks for the clarification.

I was a bit confused by this full-PDF upload and thought of rewriting the whole thing.

As you said, I will improve the existing code with the requested features.