lukehollis / iip-word-lists

Python utility for creating word lists from EpiDoc files

Introduction

The code in this repository is intended for use in the Inscriptions of Israel / Palestine project. It uses Python and lxml to generate word lists from EpiDoc XML files and includes a simple web interface for browsing the results.
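
As a rough illustration of what generating a word list from an EpiDoc file involves, here is a minimal sketch. It is not the project's actual code: it uses the standard library's xml.etree.ElementTree in place of lxml so that it runs without dependencies, the sample document is invented, and word_counts is a hypothetical helper name. It assumes the transcription lives in a TEI div with type="edition" and that words are separated by whitespace.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# TEI namespace used by EpiDoc documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample document, for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div type="edition" xml:lang="la">
      <p>pax vobis <lb/>pax</p>
    </div>
    <div type="translation"><p>peace to you, peace</p></div>
  </body></text>
</TEI>"""


def word_counts(xml_text):
    """Count whitespace-separated words inside every edition div."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for div in root.iterfind(".//tei:div[@type='edition']", TEI_NS):
        # itertext() walks the text of all descendants, so words that
        # follow inline markup such as <lb/> are included.
        text = "".join(div.itertext())
        counts.update(text.split())
    return counts


counts = word_counts(SAMPLE)
print(counts.most_common())
```

Real inscriptions need more care than this (abbreviations, gaps, supplied text, and words broken across lines), which a plain whitespace split does not handle.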

Setup

  1. Clone or download the repository.
  2. Enter the project directory with cd iip-word-lists
  3. Create a Python 3 virtual environment named environment by running virtualenv -p python3 environment. If you do not have virtualenv installed, install it using your system's package manager, or with pip by running pip install virtualenv.
  4. Activate the virtual environment by running source environment/bin/activate. The virtual environment must be active whenever you run the Python code or rebuild the site. When it is active, you should see (environment) before the prompt in your terminal.
  5. Install the necessary dependencies into the virtual environment by running pip install -r requirements.txt

To run the site locally

  1. Enter the docs directory with cd docs
  2. Start an HTTP server by running python3 -m http.server 8000 (or, in Python 2, python -m SimpleHTTPServer 8000)
  3. Open http://localhost:8000 in your web browser

(You can view the files without running the server, but some links will not work.)
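
The same kind of local server can also be started from a Python script. The following is a minimal sketch, assuming Python 3.7 or later; make_docs_server is a hypothetical helper name and is not part of this repository.

```python
import functools
import http.server


def make_docs_server(directory="docs", port=8000):
    """Build an HTTP server that serves static files from `directory`."""
    # The `directory` argument of SimpleHTTPRequestHandler requires Python 3.7+.
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    return http.server.ThreadingHTTPServer(("localhost", port), handler)


# Usage, from the root project directory (blocks until interrupted with Ctrl+C):
#     make_docs_server().serve_forever()
```

Passing the directory explicitly means the script can be run from the root project directory without first changing into docs.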

To build the site

  1. From the root project directory, run ./build_site.sh. Add -nu if you are updating the site and do not want to download the XML files.

Project structure

Functionality

Lemmatization

A word's lemma is its base form, as it would appear as a headword in a dictionary. For instance, the lemma of "thinking" is "think." The process of getting a lemma from a word is called "lemmatization." Lemmatization allows this project to recognize different strings as instances of the same word, which makes it possible to study the usage and distribution of specific words.

Lemmatization is currently performed only for Latin and Greek, using the lemmatizers provided by CLTK (the Classical Language Toolkit).
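
To show why lemmatization matters for word lists, here is a minimal sketch of grouping inflected forms under a shared lemma. It does not call CLTK: a small hand-written lookup table (TOY_LEMMAS, a hypothetical name) stands in for a real lemmatizer, and the sample words are chosen only for illustration.

```python
from collections import Counter

# Toy lookup table standing in for a real lemmatizer such as CLTK's.
TOY_LEMMAS = {
    "rex": "rex",
    "regis": "rex",
    "regem": "rex",
    "pax": "pax",
    "pacem": "pax",
}


def lemmatize(word):
    """Return the lemma for `word`, falling back to the lowercased word."""
    key = word.lower()
    return TOY_LEMMAS.get(key, key)


def lemma_counts(words):
    """Count occurrences per lemma rather than per surface form."""
    return Counter(lemmatize(word) for word in words)


counts = lemma_counts(["Rex", "regis", "pacem", "regem", "pax"])
print(counts)
```

Without the lemma step, the five input strings would be counted as five different words; with it, they collapse into two entries.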

Libraries

This project uses several libraries and toolkits.

Problems Encountered

Misc

Thank you to the Unicode Consortium for keeping us on our toes by including all these as separate characters: · ‧ ⋅ • ∙.
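
One way to cope with these look-alike dots is to map them all to a single character before further processing. The following is a minimal sketch, assuming the dots appear as word dividers in transcriptions and that the first character in the list above is chosen as the canonical form. The function names are hypothetical and not taken from this repository.

```python
# Canonical divider: the first dot from the list above.
MIDDLE_DOT = "·"
# The remaining four look-alike characters from the list above.
DOT_VARIANTS = "‧⋅•∙"

# Translation table mapping every variant to the canonical dot.
_DOT_TABLE = str.maketrans({ch: MIDDLE_DOT for ch in DOT_VARIANTS})


def normalize_dots(text):
    """Replace every look-alike dot with the canonical middle dot."""
    return text.translate(_DOT_TABLE)


def split_on_dots(text):
    """Split text on any of the dots, dropping empty pieces."""
    parts = normalize_dots(text).split(MIDDLE_DOT)
    return [part.strip() for part in parts if part.strip()]


print(split_on_dots("pax • vobis ∙ rex"))
```

Normalizing first keeps the splitting logic simple: everything downstream only needs to know about one divider character.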