The code in this repository is intended for use in the Inscriptions of Israel / Palestine project. It uses Python and LXML to generate word lists from epidoc files and includes a simple web interface.
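As a rough illustration of what generating a word list from an epidoc file involves, the sketch below counts the words in the edition `div` of a small TEI document. It is not the project's `wordlist.py`: it uses the standard library's `xml.etree.ElementTree` in place of LXML so that it runs without extra dependencies, and the sample XML and the `word_counts` helper are invented for this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# TEI namespace used by EpiDoc files.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Invented sample; real inscription files are larger and more varied.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div type="edition" xml:lang="la">
      <p>pax <lb/>et pax</p>
    </div>
  </body></text>
</TEI>"""


def word_counts(xml_string):
    """Count whitespace-separated words inside <div type="edition">."""
    root = ET.fromstring(xml_string)
    counts = Counter()
    for div in root.iter(TEI_NS + "div"):
        if div.get("type") != "edition":
            continue
        # itertext() yields the text of the div and all of its descendants.
        text = "".join(div.itertext())
        counts.update(text.lower().split())
    return counts


counts = word_counts(SAMPLE)
for word, n in sorted(counts.items()):
    print(word, n)
```

Splitting on whitespace is a simplification; real inscriptions need more care with markup that falls inside words.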
1. Change into the project directory: `cd iip-word-lists`
2. Create a virtual environment: `virtualenv -p python3 environment`. If you do not have virtualenv installed, install it using your system's package manager, or with pip by running `pip install virtualenv`.
3. Activate the virtual environment: `source environment/bin/activate`. (The virtual environment must be active whenever you run the Python code or rebuild the site. If it is active, you should see `(environment)` before the prompt in your terminal.)
4. Install the dependencies: `pip install -r requirements.txt`

To view the site locally, run `cd docs` and start a web server with `python -m SimpleHTTPServer 8000`, or in Python 3: `python3 -m http.server 8000`. Then open `localhost:8000` in your web browser. (You can view the files without running the server, but some links will not work.)

Run `./build_site.sh` to rebuild the site. Add `-nu` if you are updating the site and do not wish to download the XML files.

- `docs` contains the files for the GitHub Pages site.
- `texts` contains the files representing individual inscriptions.
  - `xml` contains these files in their original XML form.
  - `plain` contains plain text representations of the inscriptions.
  - `plain_lemma` contains the same as above, but using the lemma of each word instead of the actual text as it appeared in the inscription.
- `doubletreejs` contains code for the DoubleTreeJS visualization library.
- `src` contains the list creation script and the HTML and CSS templates for the site.
  - `python` contains the Python scripts for processing the data.
    - `wordlist.py` is the Python script that generates word lists. The basic usage is `./wordlist.py <epidoc files to process>`. By default, the list is printed to the terminal; other output formats can be specified with flags. Run `./wordlist.py --help` for information on usage.
  - `web` contains the CSS, JavaScript, and HTML templates used to build the site.
- `.gitignore` lists files that should not be included in the repository, such as lock files.
- `README.md` lists information about the project.
- `build_site.sh` is a bash script that rebuilds the site, outputting to the `docs` directory. It can be run by typing `./build_site.sh` in the terminal from the root project directory. To rebuild the site without re-downloading the epidoc files, run `./build_site.sh --use-existing`. To rebuild the site without updating the word lists (for example, when working on the frontend), run `./build_site.sh --no-update`. For help, run `./build_site.sh --help`.

A word's lemma is its "basic" form as it might appear in a dictionary. For instance, the lemma of "rethinking" is "rethink." The process of getting a lemma from a word is called "lemmatization." Lemmatization allows this project to recognize different strings as instances of the same word, which is very useful for learning about the usage and distribution of specific words.
Lemmatization is currently done only for Latin and Greek, as provided by CLTK.
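To make the benefit concrete, here is a minimal sketch of how counting by lemma groups different inflected forms under one entry. The `TOY_LEMMAS` table and the `lemma_counts` function are hypothetical stand-ins written for this example; the project itself relies on CLTK, whose API is not shown here.

```python
from collections import Counter

# Hypothetical lookup table standing in for a real lemmatizer such as CLTK.
TOY_LEMMAS = {
    "pacem": "pax",
    "pacis": "pax",
    "pax": "pax",
    "regis": "rex",
    "rex": "rex",
}


def lemma_counts(words, lemmas=TOY_LEMMAS):
    """Count words by lemma, falling back to the lowercased surface form."""
    return Counter(lemmas.get(w.lower(), w.lower()) for w in words)


counts = lemma_counts(["Pax", "pacem", "regis", "pacis", "et"])
for lemma, n in counts.most_common():
    print(lemma, n)
```

Without the lemma step, "Pax", "pacem", and "pacis" would be counted as three unrelated strings.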
This project uses several libraries and toolkits.
`include_trailing_linebreak`. However, this is not comprehensive. A complete list based on the epidoc spec should be added.

`<num>` elements always indicate the start of a new word?

Thank you to the Unicode Consortium for keeping us on our toes by including all these as separate characters: · ‧ ⋅ • ∙.
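One way to cope with these look-alike dots is to treat all of them as word separators. The sketch below assumes that is the desired behavior; `INTERPUNCTS` and `split_words` are names made up for this example and are not taken from the project's code.

```python
import re

# The five dot-like characters listed above, treated here as word dividers.
INTERPUNCTS = "·‧⋅•∙"

# Match any run of interpuncts and/or whitespace.
_SPLIT_RE = re.compile("[" + INTERPUNCTS + r"\s]+")


def split_words(text):
    """Split text on interpuncts and whitespace, dropping empty pieces."""
    return [w for w in _SPLIT_RE.split(text) if w]


words = split_words("PAX·ET ‧ BONVM∙VITA")
print(words)
```

Because a run of separators is matched as a single divider, a dot surrounded by spaces does not produce empty words.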