sanskrit-coders / doc_curation

Provide resume functionality #9

Closed: lokeshh closed this issue 4 years ago

lokeshh commented 4 years ago

When OCRing a huge book (around 2000 pages), it fails with some exception every time, for example this one:

...
INFO:2020-05-05 01:54:58,098:drive:37 Uploading yv_hindi_complete_tiny_splits/yv_hindi_complete_tiny_1951-1975.pdf
INFO:2020-05-05 01:54:58,102:discovery:894 URL being requested: POST https://www.googleapis.com/upload/drive/v3/files?alt=json&uploadType=resumable
---------------------------------------------------------------------------
HttpError                                 Traceback (most recent call last)
<ipython-input-14-a8ec3b44106e> in <module>()
----> 1 pdf.split_and_ocr_on_drive('yv_hindi_complete.pdf', 'key.json')

7 frames
/usr/local/lib/python3.6/dist-packages/doc_curation/pdf.py in split_and_ocr_on_drive(pdf_path, google_key, small_pdf_pages, start_page, end_page, pdf_compression_power)
     56     ocr_segments = sorted([pdf_segment + ".txt" for pdf_segment in pdf_segments])
     57     for pdf_segment in sorted(pdf_segments):
---> 58         drive_client.ocr_file(local_file_path=str(pdf_segment))
     59         os.remove(pdf_segment)
     60 

/usr/local/lib/python3.6/dist-packages/curation_utils/google/drive.py in ocr_file(self, local_file_path, ocr_file_path)
     67             logging.warning("Not OCRing: %s already exists", ocr_file_path)
     68         else:
---> 69             upload_result = self.upload(local_file_path=local_file_path)
     70             uploaded_file_id = upload_result["id"]
     71             self.download_text(local_file_path=ocr_file_path, file_id=uploaded_file_id)

/usr/local/lib/python3.6/dist-packages/curation_utils/google/drive.py in upload(self, local_file_path, mime)
     41                 'mimeType': mime
     42             },
---> 43             media_body=MediaFileUpload(local_file_path, mimetype=mime, resumable=True)
     44         ).execute()
     45         return result

/usr/local/lib/python3.6/dist-packages/googleapiclient/_helpers.py in positional_wrapper(*args, **kwargs)
    132                 elif positional_parameters_enforcement == POSITIONAL_WARNING:
    133                     logger.warning(message)
--> 134             return wrapped(*args, **kwargs)
    135 
    136         return positional_wrapper

/usr/local/lib/python3.6/dist-packages/googleapiclient/http.py in execute(self, http, num_retries)
    860             body = None
    861             while body is None:
--> 862                 _, body = self.next_chunk(http=http, num_retries=num_retries)
    863             return body
    864 

/usr/local/lib/python3.6/dist-packages/googleapiclient/_helpers.py in positional_wrapper(*args, **kwargs)
    132                 elif positional_parameters_enforcement == POSITIONAL_WARNING:
    133                     logger.warning(message)
--> 134             return wrapped(*args, **kwargs)
    135 
    136         return positional_wrapper

/usr/local/lib/python3.6/dist-packages/googleapiclient/http.py in next_chunk(self, http, num_retries)
   1043                 break
   1044 
-> 1045         return self._process_response(resp, content)
   1046 
   1047     def _process_response(self, resp, content):

/usr/local/lib/python3.6/dist-packages/googleapiclient/http.py in _process_response(self, resp, content)
   1074         else:
   1075             self._in_error_state = True
-> 1076             raise HttpError(resp, content, uri=self.uri)
   1077 
   1078         return (

HttpError: <HttpError 500 when requesting https://www.googleapis.com/upload/drive/v3/files?alt=json&uploadType=resumable returned "Internal Error">

Sometimes the problem is with the network, and sometimes the Google server itself returns a 500 error.

Can we have some functionality, or a hack, to resume from where it last stopped?

vvasuki commented 4 years ago

Rerunning the code should skip already-OCRed segments, shouldn't it? Look at the latest code in the repo for reference. If you have an idea for how it could be improved, send a pull request.
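Roughly, the skip-on-rerun behaviour amounts to something like the sketch below. It is not the actual repo code: `ocr_pending_segments` and `ocr_fn` are hypothetical names, and `ocr_fn` stands in for the real Drive OCR call. The only assumption taken from the traceback is that each segment's output is written next to it as `<segment>.txt`.

```python
import logging
import os


def ocr_pending_segments(pdf_segments, ocr_fn):
    """OCR each segment unless its .txt output already exists.

    ocr_fn is a hypothetical callable (segment_path, output_path) standing in
    for the real Drive OCR call; it must create output_path on success.
    Returns the segments that were actually processed in this run.
    """
    processed = []
    for pdf_segment in sorted(pdf_segments):
        ocr_file_path = pdf_segment + ".txt"
        if os.path.exists(ocr_file_path):
            # Output from an earlier run is present, so skip this segment.
            logging.warning("Not OCRing: %s already exists", ocr_file_path)
            continue
        ocr_fn(pdf_segment, ocr_file_path)
        processed.append(pdf_segment)
    return processed
```

With this shape, a run that dies partway through can simply be started again: segments whose `.txt` file is already on disk are skipped and only the remaining ones are sent for OCR.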

lokeshh commented 4 years ago

@vvasuki OK, I will try to fix it so that it skips PDF parts that are already done.

lokeshh commented 4 years ago

@vvasuki My mistake. It indeed skips the parts which are already OCRed.
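For the transient network and 500 errors, a retry wrapper with exponential backoff might save some manual reruns. This is only a minimal sketch: `call_with_retries` and all of its parameters are made-up names, not part of doc_curation or the Google client. The traceback also shows that `execute` takes a `num_retries` argument, which may be worth trying, though I have not tested it here.

```python
import logging
import time


def call_with_retries(fn, max_attempts=5, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fn() and retry on failure with exponential backoff.

    All names here are illustrative. retry_on should be narrowed to the
    transient errors you expect (for example the HttpError in the traceback).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on as err:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error.
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            delay = base_delay * 2 ** (attempt - 1)
            logging.warning("Attempt %d failed (%s); retrying in %.1fs",
                            attempt, err, delay)
            sleep(delay)
```

Assuming `drive_client` and `pdf_segment` as in the traceback, usage could look like `call_with_retries(lambda: drive_client.ocr_file(local_file_path=str(pdf_segment)))`. Combined with the existing skip of already-OCRed parts, a failed attempt would then only repeat the one segment that failed.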