sanskrit-coders/doc_curation

doc curation

A package for curating doc file collections. Prominent features:

Scrape texts from various sites, such as Wikisource. See example here. (PS: Consider contributing to the raw_etexts repo.)
OCR a PDF with Google Drive. The PDF is automatically split into 25-page pieces, each of which is OCRed individually. See usage example here, function here.
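
The splitting step mentioned above can be pictured with a small sketch. This is not the package's own implementation; `page_chunks` is a hypothetical helper, and its default of 25 pages only mirrors the figure quoted in the feature list. It assumes pages are numbered from 1 and that the last piece may be shorter than the rest.

```python
from typing import List, Tuple


def page_chunks(total_pages: int, chunk_size: int = 25) -> List[Tuple[int, int]]:
    """Return 1-based, inclusive (first_page, last_page) ranges covering a document.

    Hypothetical helper for illustration only; not part of doc_curation.
    """
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        # The last range is clipped so it never runs past the final page.
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


if __name__ == "__main__":
    # A 60-page document yields two full pieces and one 10-page remainder.
    print(page_chunks(60))  # [(1, 25), (26, 50), (51, 60)]
```

Each range would then correspond to one small PDF that is uploaded and OCRed on its own.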

For users

Autogenerated Docs on readthedocs (might be broken).
Manually and periodically generated docs here.
For detailed examples and help, please see individual module files in this package.

Installation or upgrade

For the stable version: pip install doc_curation -U -e.[all]
For the latest code: pip install git+https://github.com/sanskrit-coders/doc_curation/@master -U -e.[all]
Web.

Usage

Google Drive API wrapper

Enable the Google Drive API and download a service account key file that has Google Drive API access. (See details in the split_and_ocr_on_drive function documentation, e.g. in the GitHub source.)

from doc_curation.pdf import drive_ocr
pdf_file = '/home/file.pdf'
key_file = '/home/key.json'
drive_ocr.split_and_ocr_on_drive(pdf_path=pdf_file, google_key=key_file, small_pdf_pages=5)

Command line invocation:

# For help and details - 
/usr/bin/python3 -m doc_curation.pdf.drive_ocr --help
/usr/bin/python3 -m doc_curation.pdf.drive_ocr --input_path=/some/Dir/Or/File --google_key=/some/path/service_account_key.json --small_pdf_pages=5
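
Since the command line accepts either a directory or a single file, a script that calls the Python function directly may need to resolve that choice itself. The sketch below shows one way to do so. `collect_pdfs` is a hypothetical helper, not part of doc_curation, and the recursive search and case-insensitive `.pdf` match are assumptions; the package's own command-line handling may behave differently.

```python
from pathlib import Path
from typing import List


def collect_pdfs(input_path: str) -> List[Path]:
    """Return the PDF files named by input_path (a single file or a directory).

    Hypothetical helper for illustration only; not part of doc_curation.
    """
    path = Path(input_path)
    if path.is_file():
        # A single file is accepted only if it looks like a PDF.
        return [path] if path.suffix.lower() == ".pdf" else []
    if path.is_dir():
        # Assumption: search subdirectories too, in a stable sorted order.
        return sorted(
            p for p in path.rglob("*")
            if p.is_file() and p.suffix.lower() == ".pdf"
        )
    raise FileNotFoundError(f"No such file or directory: {input_path}")
```

Each returned path could then be converted with str() and passed as pdf_path to drive_ocr.split_and_ocr_on_drive, as in the Python example above.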

Using `google_vision_pdf.py` to OCR PDFs to text files

Follow the instructions here: https://cloud.google.com/vision/docs/before-you-begin.
Make sure to set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of the JSON file containing your service account key.

Example:

export GOOGLE_APPLICATION_CREDENTIALS="/home/user/Downloads/service-account-file.json"
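
A missing or mistyped credentials path usually surfaces only as an authentication error at request time, so it can help to check the variable before starting a long OCR run. The following is a minimal sketch using only the standard library. `check_credentials` is a hypothetical helper, not something google_vision_pdf.py provides, and it only confirms that the file exists and parses as JSON, not that the key is valid.

```python
import json
import os
from pathlib import Path


def check_credentials(env_var: str = "GOOGLE_APPLICATION_CREDENTIALS") -> Path:
    """Return the key file path named by env_var, or raise if it is unusable.

    Hypothetical helper for illustration only; not part of doc_curation.
    """
    value = os.environ.get(env_var)
    if not value:
        raise RuntimeError(f"{env_var} is not set")
    key_path = Path(value)
    if not key_path.is_file():
        raise FileNotFoundError(f"{env_var} points to a missing file: {key_path}")
    with key_path.open(encoding="utf-8") as handle:
        # json.load raises json.JSONDecodeError if the file is not valid JSON.
        json.load(handle)
    return key_path
```

Calling check_credentials() at the top of a script fails fast with a readable message instead of partway through a batch.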

Invoke the script, passing in the input file. E.g.:

python3 google_vision_pdf.py --input-file <input.pdf>
/usr/bin/python3 -m doc_curation.pdf.google_vision_pdf --input-file <input.pdf>

For contributors

Contact

Have a problem or question? Please head to GitHub.

Packaging

~/.pypirc should have your pypi login credentials.

python setup.py bdist_wheel
twine upload dist/* --skip-existing

Build documentation

Sphinx HTML docs can be generated with: cd docs; make html

Testing

Run pytest in the root directory.

Auxiliary tools

pyup
