tshrinivasan / OCR4wikisource

OCR for WikiSource using Google Drive OCR
GNU General Public License v2.0

Download of pdf file fails #107

Closed Shreeshrii closed 6 years ago

Shreeshrii commented 6 years ago
INFO:root:Operating System = "Ubuntu 16.04.4 LTS"

...

INFO:__main__:Original URL = http://sanskritdocuments.org/doc_trial/fortransfer/girdhari/shArikApUjA_devIrahasyaSCAN.pdf
INFO:__main__:File Name = shArikApUjA_devIrahasyaSCAN.pdf
INFO:__main__:File Type = pdf
INFO:__main__:Created Temp folder OCR-shArikApUjA_devIrahasyaSCAN.pdf-temp-2018-05-22-08-30-09

Downloading the file shArikApUjA_devIrahasyaSCAN.pdf

INFO:__main__:Downloading the file shArikApUjA_devIrahasyaSCAN.pdf
INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): sanskritdocuments.org
Traceback (most recent call last):
  File "do_ocr_jpg.py", line 137, in <module>
    total_length = int(r.headers.get('content-length'))
TypeError: int() argument must be a string or a number, not 'NoneType'
ubuntu@tesseract-ocr:~/OCR4wikisource$
Shreeshrii commented 6 years ago
INFO:__main__:Running do_ocr.py 1.54
INFO:root:Operating System = "Ubuntu 14.04.5 LTS"

...

INFO:__main__:Original URL = http://sanskritdocuments.org/doc_trial/fortransfer/girdhari/shArikApUjA_devIrahasyaSCAN.pdf
INFO:__main__:File Name = shArikApUjA_devIrahasyaSCAN.pdf
INFO:__main__:File Type = pdf
INFO:__main__:Created Temp folder OCR-shArikApUjA_devIrahasyaSCAN.pdf-temp-2018-05-22-02-23-30

Downloading the file shArikApUjA_devIrahasyaSCAN.pdf

INFO:__main__:Downloading the file shArikApUjA_devIrahasyaSCAN.pdf
INFO:urllib3.connectionpool:Starting new HTTP connection (1): sanskritdocuments.org
Traceback (most recent call last):
  File "do_ocr_jpg.py", line 137, in <module>
    total_length = int(r.headers.get('content-length'))
TypeError: int() argument must be a string or a number, not 'NoneType'
Shreeshrii commented 6 years ago

Tried on two different machines. Same error.

tshrinivasan commented 6 years ago

Try uploading the same file to Commons, run the script with the Commons URL, and share the result.

Shreeshrii commented 6 years ago

INFO:__main__:Original URL = https://commons.wikimedia.org/wiki/File:ShArikApUjA_devIrahasyaSCAN.pdf
INFO:__main__:File Name = File:ShArikApUjA_devIrahasyaSCAN.pdf
INFO:__main__:File Type = pdf
INFO:__main__:Created Temp folder OCR-File:ShArikApUjA_devIrahasyaSCAN.pdf-temp-2018-05-22-08-52-57

Downloading the file File:ShArikApUjA_devIrahasyaSCAN.pdf

INFO:__main__:Downloading the file File:ShArikApUjA_devIrahasyaSCAN.pdf
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): commons.wikimedia.org
Traceback (most recent call last):
  File "do_ocr_jpg.py", line 137, in <module>
    total_length = int(r.headers.get('content-length'))
TypeError: int() argument must be a string or a number, not 'NoneType'

tshrinivasan commented 6 years ago

Try with this URL:

https://upload.wikimedia.org/wikipedia/commons/1/17/ShArikApUjA_devIrahasyaSCAN.pdf

Shreeshrii commented 6 years ago

Thanks, this URL works fine.

It seems that some servers do NOT send a content-length header. Is it possible to modify the script to handle such cases, that is, skip the progress bar for them but still download the file?

Shreeshrii commented 6 years ago

I removed the progress bar, and it now works for my use case.

    # Download the file
    r = requests.get(url, stream=True)
    with open(filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)
                f.flush()
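
For anyone who would rather keep the progress display when the server does send content-length, here is a minimal sketch of one way to do it. It is not the project's actual fix. The helper names `content_length` and `save_response` are hypothetical, and `r` is assumed to behave like a `requests.Response` obtained with `stream=True`.

```python
import sys


def content_length(headers):
    """Return the content-length header as an int, or None if missing or invalid."""
    value = headers.get('content-length')
    try:
        return int(value)
    except (TypeError, ValueError):
        # int(None) raises TypeError (the error in this issue);
        # a malformed header value raises ValueError.
        return None


def save_response(r, filename, chunk_size=1024, out=sys.stderr):
    """Write a streamed response to `filename` and return the bytes written.

    Shows a percentage only when the total size is known; otherwise
    shows a running byte count.
    """
    total = content_length(r.headers)
    written = 0
    with open(filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=chunk_size):
            if not chunk:
                continue  # skip empty chunks
            f.write(chunk)
            written += len(chunk)
            if total:
                out.write('\r%d%% of %d bytes' % (100 * written // total, total))
            else:
                out.write('\r%d bytes' % written)
    out.write('\n')
    return written


# Usage (assumes `import requests` and the script's own url/filename):
#   r = requests.get(url, stream=True)
#   save_response(r, filename)
```

The percentage is only approximate when the server compresses the body, because content-length then describes the compressed size while `iter_content` yields decoded bytes.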