A package for curating doc file collections. Prominent features:
pip install doc_curation -U -e.[all]
pip install git+https://github.com/sanskrit-coders/doc_curation/@master -U -e.[all]
from doc_curation.pdf import drive_ocr
pdf_file = '/home/file.pdf'
key_file = '/home/key.json'
drive_ocr.split_and_ocr_on_drive(pdf_path=pdf_file, google_key=key_file, small_pdf_pages=5)
Command line invocation:
# For help and details -
/usr/bin/python3 -m doc_curation.pdf.drive_ocr --help
/usr/bin/python3 -m doc_curation.pdf.drive_ocr --input_path=/some/Dir/Or/File --google_key=/some/path/service_account_key.json --small_pdf_pages=5
google_vision_pdf.py
to OCR pdf to txt files.Follow the instructions here: https://cloud.google.com/vision/docs/before-you-begin.
Make sure to set the environment variable for GOOGLE_APPLICATION_CREDENTIALS
to the path of json containing your service account key.
Example:
export GOOGLE_APPLICATION_CREDENTIALS="/home/user/Downloads/service-account-file.json"
Invoke the script passing in the input file. Eg:
python3 google_vision_pdf.py --input-file <input.pdf>
/usr/bin/python3 -m doc_curation.pdf.google_vision_pdf --input-file <input.pdf>
Have a problem or question? Please head to github.
python setup.py bdist_wheel
twine upload dist/* --skip-existing
cd docs; make html
Run pytest
in the root directory.