tshrinivasan closed 8 years ago
Last month I finished an OCR job manually on a 600-page book; each page was a PDF of about 1 MB. I also tried a 450-page book (52 MB), converting all pages to PNG at 300 dpi; each page was 700-800 KB.
I tried uploading the PDF as a single whole file and opening it with Google Docs, without splitting it or converting it to individual pages or images.
Try uploading whole files and find out whether we can avoid the process of splitting them into individual pages.
After using it for a few days, I have an opinion....
In my opinion the current approach is good; don't go for multi-page PDF upload, because we have to upload the text one page at a time to the Page: namespace of Wikisource.
My suggestions for some further improvements:
Stage a) If the downloaded file already exists in the script folder, the script should not try to download it again. It should use the existing file for the next step.
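A minimal sketch of the Stage a) check, not the script's actual code. The function name `fetch_once` and its `downloader` parameter are hypothetical; the only assumption is that the script knows the URL and the local path it saves to.

```python
import os
import urllib.request


def fetch_once(url, dest_path, downloader=urllib.request.urlretrieve):
    """Download url to dest_path unless a non-empty file is already there.

    Returns True if a download happened, False if the existing file was reused.
    The downloader argument is a hypothetical hook so the check can be tested
    without network access.
    """
    # An empty file usually means an earlier download was interrupted,
    # so only a non-empty file counts as "already downloaded".
    if os.path.isfile(dest_path) and os.path.getsize(dest_path) > 0:
        return False
    downloader(url, dest_path)
    return True
```

Checking the size as well as the existence is a design choice: it avoids reusing a zero-byte file left behind by a broken connection.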
Stage d) We need an option to choose which page range to upload: pages 1-30, 50-100, etc. Also, as Ravi and I mentioned in "Network connection issues" #21 #22, if a file is not uploaded to GD during uploading (due to an Internet connection issue or because the user skipped it), its text file should not be created.
Example: if "page_0020.pdf" is not uploaded to GD, "text_for_page_0020.txt" should not be created. I have observed that "text_for_page_0020.txt" is actually created from the OCRed text of page 21.
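One way to avoid the shifted text files is to derive each output name from the page file that was actually uploaded, and to write nothing when the upload fails. The sketch below assumes the `page_NNNN.pdf` and `text_for_page_NNNN.txt` naming from the example; `upload_and_ocr` is a hypothetical callable standing in for the real upload step.

```python
import os
import re


def text_name_for(page_file):
    """Map 'page_0020.pdf' to 'text_for_page_0020.txt' using the page's own number."""
    match = re.fullmatch(r"page_(\d+)\.pdf", os.path.basename(page_file))
    if match is None:
        raise ValueError("unexpected page file name: %r" % page_file)
    return "text_for_page_%s.txt" % match.group(1)


def ocr_pages(page_files, upload_and_ocr, out_dir):
    """Write one text file per successfully uploaded page.

    upload_and_ocr (hypothetical) returns the OCR text for a page, or None
    when the upload failed or was skipped. Returns the pages with no text file.
    """
    failed = []
    for page_file in page_files:
        try:
            text = upload_and_ocr(page_file)
        except OSError:
            # Treat network errors like a skipped page.
            text = None
        if text is None:
            failed.append(page_file)
            continue
        out_path = os.path.join(out_dir, text_name_for(page_file))
        with open(out_path, "w", encoding="utf-8") as out:
            out.write(text)
    return failed
```

The returned list of failed pages could then be retried later without touching the text files that were already written.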
Stage f) We want an override option, and the ability to upload only the pages the user chooses, like pages 1-30 or 50-100.
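The page-range option for Stages d) and f) could be parsed roughly as follows. This is only a sketch: the function name `parse_page_ranges` and the comma-separated format (for example "1-30,50-100") are assumptions, not an existing option of the script.

```python
def parse_page_ranges(spec, last_page):
    """Turn a string like '1-30,50-100' into a sorted list of page numbers.

    Pages beyond last_page are dropped; a malformed range raises ValueError.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start_text, sep, end_text = part.partition("-")
        start = int(start_text)
        # A part without '-' is a single page, e.g. '42'.
        end = int(end_text) if sep else start
        if start < 1 or end < start:
            raise ValueError("bad page range: %r" % part)
        pages.update(range(start, min(end, last_page) + 1))
    return sorted(pages)
```

The resulting list can be used to filter the page files before the upload loop, so that only the requested pages are sent.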
Thanks for the clarification.
I was a bit confused about this full PDF upload and thought of rewriting the entire thing.
As you said, I will improve the existing code with the requested features.
https://support.google.com/drive/answer/176692?hl=en
This page says:
File size limitations
The maximum size for images (.jpg, .gif, .png) and PDF files (.pdf) is 2 MB. For PDF files, we only look at the first 10 pages when searching for text to extract.
But when I tried to upload an 80-page (1.8 MB) PDF manually and open it as a Google Doc, it OCRed it and displayed the text for all 80 pages.
When I tried a 24 MB PDF, it could not be opened as a Google Doc.
Try various PDFs for Wikisource manually and find the upper limit on page count and file size for PDF files.
@ravidreams @jayantanth @BodhisattwaMandal @omshivaprakash