shrivastava95 / odia-dictionary

A repository for organizing contributions to the creation of an Odia Dictionary dataset for the Dictionary Augmented Translations project in C4GT'23.

demo notebook

To get a better idea of the code flow, refer to the demo Colab notebook.

Description

This repository builds a parser that reads the Odia.Dictionary.pdf file and parses its definitions into a dataset. The issues tab lists aspects of the current solution that need to be refined, as well as features to add in the future.

Dependencies / setup

  1. pdf2image setup
  2. Tesseract API setup
  3. PyPDF2 PDF Reader setup
  4. OpenAI API setup
  5. FTFY (fixes Unicode) - pip install ftfy
  6. OpenCV - pip install opencv-python
  7. Enchant setup
  8. DuckDB Python API setup - conda install python-duckdb -c conda-forge
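
Before running the pipeline, it can help to confirm that the Python packages are importable. The following is a minimal sketch and is not part of the repository. The import names, in particular `pytesseract` for Tesseract and `enchant` for Enchant, are assumptions about which Python bindings the scripts use.

```python
from importlib.util import find_spec

# Assumed import names for the dependencies listed above; adjust them if the
# scripts in src/ import different bindings.
ASSUMED_MODULES = [
    "pdf2image",
    "pytesseract",
    "PyPDF2",
    "openai",
    "ftfy",
    "cv2",       # provided by opencv-python
    "enchant",   # provided by pyenchant
    "duckdb",
]


def missing_modules(names):
    """Return the names from `names` that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(ASSUMED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All assumed dependencies are importable.")
```

This only checks the Python packages. System-level tools such as the Tesseract binary still need to be installed separately.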

How to run

To parse the Odia Dictionary pdf:

  1. python src/a-getting_page_images/pdf_to_imgs.py - Converts each page of the Odia Dictionary PDF to a 300 DPI image stored in pages.
  2. Open the PNG files generated in ./pages with the Paint application and blank out the unwanted letter-section separator by selecting the relevant portion and pressing the Delete key.
  3. python src/b-cropping_page_images/cropper.py - Crops out the columns of interest from pages 6-87. Outputs are stored in pages_processed.
  4. python src/c-images_to_pdfs_with_text/pdfmaker.py - Runs Tesseract OCR on the images in pages_processed. Outputs PDFs to parsed_pdfs.
  5. python src/d-read_pdfs_with_text/reader.py - Gets unstructured OCR text output from the PDFs in parsed_pdfs. Outputs .txt files to parsed_texts.
  6. rm GPT_outputs/* - Empties the GPT_outputs folder. This is required because sender.py does not call the API for text files that are already present in the folder.
  7. python src/e-gpt_api_sender/sender.py - Calls the GPT API to structure the raw OCR text output files in parsed_texts. Outputs .txt files to GPT_outputs
  8. python src/f-dataframe_maker/preprocess.py - Moves the start of every .txt file in GPT_outputs to the first occurrence of "|", so that every file in the folder starts with a CSV-style column header.
  9. python src/f-dataframe_maker/maker.py - Compiles GPT_outputs into the desired .csv file, parsed_dicts/parsed_dict_very_unclean.csv.
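
Steps 6 to 8 rely on two small pieces of file handling: skipping inputs whose output already exists, and trimming each GPT output so that it starts at the first "|". The sketch below illustrates both under stated assumptions. The function names `pending_inputs`, `trim_to_header`, and `trim_outputs` are hypothetical, output files are assumed to share the input file's name, and this is not the code in sender.py or preprocess.py.

```python
from pathlib import Path


def pending_inputs(input_dir, output_dir):
    """List .txt inputs with no same-named file in output_dir (assumed naming)."""
    done = {p.name for p in Path(output_dir).glob("*.txt")}
    return sorted(p for p in Path(input_dir).glob("*.txt") if p.name not in done)


def trim_to_header(text):
    """Drop everything before the first "|"; leave the text unchanged if there is none."""
    start = text.find("|")
    return text if start == -1 else text[start:]


def trim_outputs(output_dir):
    """Rewrite every .txt file in output_dir so that it begins at its first "|"."""
    for path in Path(output_dir).glob("*.txt"):
        original = path.read_text(encoding="utf-8")
        trimmed = trim_to_header(original)
        if trimmed != original:
            path.write_text(trimmed, encoding="utf-8")
```

With this behaviour, a file already present in the output folder is never regenerated, which is why the folder is emptied in step 6 when a fresh run is wanted.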

Structure

Folders

Files