A repository for organizing contributions to the creation of an Odia Dictionary dataset for the Dictionary Augmented Translations project in C4GT'23.
To get a better idea of the code flow, refer to the demo Colab notebook.
This repository builds a parser that reads the Odia.Dictionary.pdf file and parses its definitions into a dataset. The Issues tab lists aspects of the current solution that need to be refined, or added in the future.
pip install ftfy
pip install opencv-python
conda install python-duckdb -c conda-forge
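The dependency list includes ftfy, presumably to repair mojibake in the OCR or GPT text output (an assumption; the repository's actual usage is not shown here). The class of error that `ftfy.fix_text` repairs automatically — UTF-8 text mis-decoded through a legacy codec — can be reproduced with the standard library alone:

```python
# A UTF-8 apostrophe (U+2019) mis-decoded as cp1252 turns "don't" into "donâ€™t".
broken = "donâ€™t"
# The manual repair: re-encode with the wrong codec, decode with the right one.
# ftfy.fix_text(broken) detects and performs this repair automatically.
fixed = broken.encode("cp1252").decode("utf-8")
print(fixed)  # don’t
```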
To parse the Odia Dictionary PDF:
python src/a-getting_page_images/pdf_to_imgs.py
- Converts each page of the Odia Dictionary PDF to a 300 DPI image. Outputs are stored in pages
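A minimal sketch of what a PDF-to-image step like pdf_to_imgs.py could look like, assuming the pdf2image library (a Poppler wrapper); the paths, file names, and helper below are illustrative, not the repository's actual code:

```python
import os

def page_size_px(width_in: float, height_in: float, dpi: int = 300) -> tuple:
    """Pixel dimensions of a page rendered at the given DPI."""
    return (round(width_in * dpi), round(height_in * dpi))

def pdf_to_images(pdf_path: str, out_dir: str, dpi: int = 300) -> None:
    """Render every page of pdf_path as a PNG in out_dir."""
    from pdf2image import convert_from_path  # requires poppler on the system
    os.makedirs(out_dir, exist_ok=True)
    for i, page in enumerate(convert_from_path(pdf_path, dpi=dpi)):
        page.save(os.path.join(out_dir, f"page_{i:03d}.png"))

# At 300 DPI, an A4 page (8.27 x 11.69 in) becomes roughly a 2481 x 3507 px image.
print(page_size_px(8.27, 11.69))  # (2481, 3507)
```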
Manually open the page images in the Paint application and blank out the unwanted letter-section separators by selecting the relevant portion and pressing the Delete key.

python src/b-cropping_page_images/cropper.py
- Crops out the columns of interest from pages 6-87. Outputs are stored in pages_processed

python src/c-images_to_pdfs_with_text/pdfmaker.py
- Runs Tesseract OCR on the images in pages_processed. Outputs PDFs to parsed_pdfs

python src/d-read_pdfs_with_text/reader.py
- Gets the unstructured OCR text output from the PDFs in parsed_pdfs. Outputs .txt files to parsed_texts

rm GPT_outputs/*
- The GPT_outputs folder must be emptied first, because the API will not be called to replace text files already present in the folder (refer to sender.py).

python src/e-gpt_api_sender/sender.py
- Calls the GPT API to structure the raw OCR text output files in parsed_texts. Outputs .txt files to GPT_outputs

python src/f-dataframe_maker/preprocess.py
- Moves the file pointer of every .txt file in GPT_outputs to the first occurrence of "|", so that every file starts with a CSV-style column header.

python src/f-dataframe_maker/maker.py
- Compiles the GPT_outputs into the desired .csv: parsed_dicts/parsed_dict_very_unclean.csv
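The manual Paint blanking and the cropper.py column crop both reduce to rectangular region operations on page-image arrays. A sketch using NumPy slicing (OpenCV images are NumPy arrays; the coordinates and function names here are made up for illustration, not the repository's real crop boxes):

```python
import numpy as np

def blank_region(img: np.ndarray, y1: int, y2: int, x1: int, x2: int) -> np.ndarray:
    """Paint a rectangle white, like deleting a selection in Paint."""
    out = img.copy()
    out[y1:y2, x1:x2] = 255
    return out

def crop_column(img: np.ndarray, y1: int, y2: int, x1: int, x2: int) -> np.ndarray:
    """Cut out one text column, as cropper.py does per page."""
    return img[y1:y2, x1:x2]

# Toy 10x10 grayscale "page": all black (0).
page = np.zeros((10, 10), dtype=np.uint8)
cleaned = blank_region(page, 0, 2, 0, 10)   # blank a separator strip at the top
column = crop_column(cleaned, 2, 10, 3, 7)  # keep an 8x4 column
print(cleaned[0, 0], column.shape)  # 255 (8, 4)
```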
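The two dataframe_maker steps above can be sketched as follows: trimming each GPT output so it starts at its first "|" (what preprocess.py is described as doing), then concatenating the pipe-delimited rows into a single CSV (the maker.py step). The delimiter handling and header logic below are assumptions, not the repository's actual code:

```python
import csv
import glob
import os

def trim_to_header(path: str) -> None:
    """Drop everything before the first '|' so the file starts with the
    CSV-style column header (a sketch of the preprocess.py step)."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    cut = text.find("|")
    if cut > 0:
        with open(path, "w", encoding="utf-8") as f:
            f.write(text[cut:])

def compile_csv(in_dir: str, out_csv: str) -> None:
    """Concatenate all pipe-delimited .txt files into one CSV, keeping a
    single header row (a sketch of the maker.py step)."""
    with open(out_csv, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        header_written = False
        for txt in sorted(glob.glob(os.path.join(in_dir, "*.txt"))):
            with open(txt, encoding="utf-8") as f:
                for lineno, line in enumerate(f):
                    if not line.strip():
                        continue  # skip blank lines
                    cells = [c.strip() for c in line.strip().strip("|").split("|")]
                    if lineno == 0:  # first line of each file is its header
                        if header_written:
                            continue
                        header_written = True
                    writer.writerow(cells)
```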