This is the backing repo for ⲣⲉⲙⲛ̀Ⲭⲏⲙⲓ, a project that aims to make the Coptic language more learnable.
We use:
NOTE: You can update the diagram by uploading it to draw.io.
Running `make install` should take care of most of the Python installations. If there are missing binaries that you need to download, `make install` will let you know. You might also want to alias `python` to the latest version.
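For example, you might add something like `alias python=python3` to your shell profile (an illustration; point it at whichever version is the latest on your machine).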
Our pipelines are defined in `Makefile`, and they correspond to the blue circles in the diagram. Other pipelines in `Makefile` are only used during development and testing, and are not relevant for output (re)generation.
Keep in mind that parameters are written with the assumption that commands are invoked from the repo's root directory, rather than from the directory where the script lives. You should do most of your development from within the root directory.
This file is the only `README.md` in the repo (and this is enforced by a pre-commit hook). Technical documentation is intentionally centralized. Besides this file, docs can be found in:
User-facing documentation shouldn't live in the repo; it should go on the website instead.
With the exception of `archive/`, `test/`, `data/`, and `pre-commit/`, each subdirectory of the root directory represents a major pipeline, or a category of pipelines, along with its associated data. You will also notice that shared code is (intentionally) minimized, and restricted to the pre-commits and some helpers and utility functions.
We use pre-commit hooks extensively, and they have helped us discover a lot of bugs and issues with our code, and also keep our repo organized. They are not optional, and many of our pipelines assume that the pre-commits have done their job. Their installation should be covered by `make install`. They are defined in `.pre-commit-config.yaml`. They run automatically before a commit, but you can also trigger them with Make recipes by typing `make add`, `make index`, or `make test` (the three are synonymous).
Until #120 is resolved, you
will need to pay some attention to when to trigger them manually. As a rule of
thumb, run them once after each pipeline, and before starting another
downstream pipeline.
## `data/` Subdirectories

Most of our projects have a `data` subdirectory. We have somewhat strict rules regarding its content. It usually (which, in our repo, means almost always) contains three subdirectories:
- `raw/`: Data that is copied from elsewhere. This would, for example, include the Marcion SQL tables copied as is, unmodified. The contents of this directory remain true to the original source.
- `input/`: Data that we either modified or created. If we want to fix typos in data that we copied, we don't touch the data under `raw/`, but we take the liberty to modify the copies that live under `input/`. This directory also includes the data that we created ourselves. You can show the delta between the raw and input data using `git diff --no-index` (see the example after this list). It's also good to be aware of the `--word-diff` flag.
- `output/`: This contains the data written by our pipelines, one subdirectory per format. If your pipeline writes both TSV and HTML, they should go respectively to `output/tsv/` and `output/html/`.
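For example, assuming a hypothetical file `crum.tsv` that exists under both directories: `git diff --no-index --word-diff data/raw/crum.tsv data/input/crum.tsv`.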
## `.env`

For now, run this once at the beginning of your coding session to export environment variables, which are necessary for some pipelines:

```sh
source .env_INFO
```

Equivalently:

```sh
. ./.env_INFO
```
Later on, you might need to create your own `.env` file. It is ignored by a rule in `.gitignore`, so there is no shared version. It is documented in `.env_INFO`, so this section is intentionally brief.
We use GitHub to track our plans and TODOs.
Issues need to be as specific and isolated as possible. Most of the time, they span a single component; they can often work mainly in one component and spill into others, and sometimes they are generic, spanning one aspect of multiple components (such as the conventions set for the whole repo). Issues mostly have exactly one How label, and usually one Why (see labels below). Issues should involve a local change or set of local changes.
High-priority issues are defined in two ways:
The project page offers alternative views of the issues, which can come in handy for planning purposes.
Milestones represent more complex pieces of work. Their size is undetermined; they could take weeks or years, but they are not simple enough to span just a few days. This is their main use case.
There is a second, somewhat unorthodox, use case for milestones as component backlogs: milestones for miscellaneous issues related to some component that don't belong to a goal that we've already defined and crystallized into a milestone.
Every issue must belong to a milestone.
Milestone priorities are assigned using due dates. Milestones help make long-term plans.
The number of milestones should remain "under control".
The platform component milestone refers to the development platform and tooling. Issues under this milestone are mainly developer-facing rather than user-facing, and their purpose is to improve the framework that developers use to drive the project forward. This component is about sharpening our saw so we can cut wood faster.
When work on a milestone is good enough, it's closed, the achievement is celebrated, and its remaining issues move to the corresponding component backlog milestone.
Component-specific milestones are often named as component versions. (For example, Site v1.0 is a milestone referring to the first release of the Site).
Backlog milestones are often named after the component, but without a version, and often with the prefix `Pipeline:`.
All issues should be labeled.
We assign the following categories of labels to issues:
**How**
- `architect`: Architecture and design.
- `diplomacy`: Diplomacy, connections, and outreach.
- `documentation`: Writing documentation.
- `labor`: Manual data collection.
- `freelance`: Hiring a freelancer.

**Who**
- `user`: A user-oriented improvement.
- `dev`: A developer-oriented, not user-visible, improvement.

**Why**
- `data collection`: Expand the data that we own.
- `maintenance`: Maintain existing territories, rather than expand into new ones.
- `rigor`: Improve the rigor (particularly parsing, or inflection generation).
- `UI`: Improve the user interface.
- `bug`: Fix a bug.

Minimize dependence on HTML, and implement behaviours in TypeScript when possible.
Add in-code assertions and checks. This is our first line of defense, and has been the champion when it comes to ensuring correctness and catching bugs.
We rely heavily on manual inspection of the output to verify correctness.
The `git diff --word-diff` command is helpful when our line-oriented diff is not readable. Keep this in mind when structuring your output data.
We force the existence of unit tests, at least one for each Python file. While these have so far been mere placeholders, the import of a package alone sometimes catches syntax errors, and the placeholders make it convenient to write tests whenever desired. A big benefit of unit tests is that they make us confident that a change is correct, so they speed up the development process.
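As an illustration, a placeholder test might look like the following (a minimal sketch; the module name `marcion` and the file path are hypothetical):

```python
# test/marcion_test.py (hypothetical path)
import unittest

# The import alone exercises the module and catches syntax errors.
import marcion


class TestImport(unittest.TestCase):
    def test_import(self) -> None:
        # A placeholder assertion; real tests can grow here when desired.
        self.assertIsNotNone(marcion)


if __name__ == "__main__":
    unittest.main()
```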
Do not let Python tempt you to use its built-in types instead of classes and objects. Don't forget about OOP!
Document the code.
We use `mypy` for static typing checks. While not required by `mypy` (which can often infer the types without hints, and would throw an error whenever an explicit type annotation is needed), it's still encouraged to use type hints extensively.
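For example, a small (hypothetical) helper annotated even where `mypy` could infer the types:

```python
def count_dialects(entries: list[dict[str, str]]) -> dict[str, int]:
    """Count how many entries mention each dialect."""
    counts: dict[str, int] = {}
    for entry in entries:
        dialect: str = entry.get("dialect", "unknown")
        counts[dialect] = counts.get(dialect, 0) + 1
    return counts
```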
Collect and print stats.
Color the outputs whenever you can. It keeps your programmers entertained!
Keep your code `grep`-able, especially when it comes to the constants used across directories.
Privatize methods whenever possible. Use the name mangling feature in Python.
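For instance (a hypothetical class, not one of ours), a double-underscore prefix triggers Python's name mangling, which discourages external access:

```python
class Lexicon:
    def __init__(self) -> None:
        # Mangled to _Lexicon__cache; external code can't collide with it casually.
        self.__cache: dict[str, str] = {}

    def __normalize(self, key: str) -> str:
        # A private helper, mangled to _Lexicon__normalize.
        return key.strip()

    def lookup(self, key: str) -> str | None:
        return self.__cache.get(self.__normalize(key))
```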
Our pipelines are primarily written in Python. There is minimal logic in Bash.
We have a strong bias for Python over Bash. Use Bash only if you expect an equivalent Python implementation to require significantly more lines of code.
We use TypeScript for static site logic. It then gets transpiled to JavaScript by running `make transpile`. We don't write JavaScript directly.
We expect to make a similar platform-specific expansion into another territory for the app.
In the past, we voluntarily used Java (for an archived project). Won't happen again! We also used VBA and JS for Microsoft Excel and Google Sheets macros (also archived at the moment) because they were required by the platform.
It is desirable to strike a balance between the benefits of focusing on a small number of languages and the different powers that different languages can uniquely exhibit. We won't compromise the latter for the former. Use the right language for a task. When two languages can do a job equally well, uncompromisingly choose the one that is more familiar.
## stats
## `dictionary/`

This directory contains the data and logic for processing our dictionaries.
### `marcion.sourceforge.net/`
There are many reasons we have decided to add pictures to our dictionary, and heavily invested in the image pipeline. They have become one of the integral pieces of our dictionary framework.
The meaning of a word is much more strongly and concretely conveyed by an image than by a word. Learning is not about knowing vocabulary or grammar. Learning is ultimately about creating the neural pathways that enable language to flow out of you naturally. A given word needs to settle and connect with nodes in your associative memory in order for you to be able to use it. If our goal is to create or strengthen the neural pathways between a Coptic word and related nodes in your brain, then it aids the learning process to achieve as much neural activation as possible during learning. This is much better achieved by an image than by a mere translation, given the way human brains work. After all, the visual processing areas of our brains are bigger, faster, and far more ancient and primordial (even reptiles can see) compared to the language processing areas. You will often find that, when you learn a new word, the associated images pop up in your brain more readily than the translation. Thus the use of images essentially revolutionizes the language learning process.
Oftentimes, the words describe an entity or concept that is unfamiliar to many users. Things like ancient crafts, plant or fish species, farmer's tools, and the like, are unfamiliar. Showing a user the English translation of a word doesn't suffice for the user to understand what it is, and they would often look up images themselves in order to find out what the word actually means. By embedding the pictures in the dictionary, we save users some time so they don't have to look it up themselves.
Translations are often taken lightly by users. Pictures are not. When a dictionary author translates a given Coptic word into several different English words, for example, the extra translations are often seen by users as auxiliary: tokens added to convey a meaning that the dictionary author couldn't convey using fewer words.
That's not the case for pictures. Pictures are taken seriously by users, and are more readily accepted as bearing a true, authentic, independent meaning of the word. Listing images (especially after we have started ascribing each image to a sense that the word conveys) is a way to recognize and legitimize those different senses and meanings that a word possesses.
It's for this reason that images must be deeply contemplated, and a word must be digested well, before we add explanatory images for it. Collecting images is tantamount to authoring a dictionary.
Our experience collecting images has taught us a few lessons. We tend to follow these guidelines when we search for pictures:
Each image ends up being resized to a width of 300 pixels and a height proportional to the original. We prefer images with a minimum width of 300 pixels, though widths down to 200 are acceptable.
As for image height, short images are rarely ugly, but long images usually are. So we set a generously low lower bound of 100 pixels on the resized height, but a stricter upper bound of 500 pixels, and we generally prefer the height to fall within a range of 200 to 400 pixels.
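The bounds above can be summarized in a short sketch (a paraphrase of the guidelines, not the actual `img_helper` logic):

```python
TARGET_WIDTH = 300  # Every image is resized to this width.


def resized_height(width: int, height: int) -> int:
    """The height after scaling to TARGET_WIDTH, preserving the aspect ratio."""
    return round(height * TARGET_WIDTH / width)


def acceptable(width: int, height: int) -> bool:
    """Hard bounds only; a 300+ width and a 200-400 resized height are preferred."""
    if width < 200:
        return False
    return 100 <= resized_height(width, height) <= 500
```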
Collecting sources is mandatory. We always record the URL that an image is retrieved from. Our `img_helper` script, which we use to process images, can be supplied with a URL, and it will download the image and store the source (and also resize the image to the final version). This simplifies the process.
We make extensive use of icons. They can capture the meaning of a word in situations when it's otherwise hard to describe a word using an image (example).
This hasn't been fully thought through, but when given a choice, prefer an ancient Egyptian explanatory image, followed by an old (not necessarily Egyptian) image, followed by a modern image (example). We prefer to keep the images as close as possible to their reflections in the mind of a native speaker. We also want to stress the fact that those Coptic words can be equally used to refer to entities from other cultures, or to modern entities.
This could be revisited later.
The following entries have no dialect specified in Crum, so they are treated as part of all dialects.
NOTE: Some undialected entries in this list have been removed because their dialect was inferred, e.g. all the entries under Ⳉ have been labeled as Akhmimic.
We are rethinking the current handling of undialected entries. See #237.
The following entries are absent from Crum's dictionary. They were added to our database from other sources:
### `copticocc.org/`

`dawoud-D100/` contains scans of Moawad Dawoud's dictionary. They are obtained from the PDF using the `imagemagick` command. (The density used is 100, hence the prefix `-D100`.)
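For reference, the scans can be produced by an invocation along these lines (illustrative only; the real scripts are linked below): `convert -density 100 dawoud.pdf dawoud-D100/%d.jpg`.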
The PDF / image processing scripts can be found under `archive/dictionary/copticocc.org`.
### `kellia.uni-goettingen.de/`

We had some plans to combine the strengths of KELLIA and Crum (#53, #6), but they have been abandoned.
### `copticsite.com/`
## `bible/`

This directory contains the data and logic for processing the Bible corpus.
### `stshenouda.org/`

There are several published versions of the Coptic Bible. The most recent, and most complete, is that of St. Shenouda the Archimandrite Coptic Society. It is the Coptic Bible project that is most worthy of investment at the moment.
## `flashcards/`

This directory contains the data and logic for processing dictionaries into flashcards. It is named as such because our first use case was a flashcard app, although our use of the dictionaries has since become more versatile.
When you import a package into your (personal) Anki database, Anki uses the IDs to eliminate duplicates.
Uniqueness is therefore important. But what is trickier, and perhaps more important, is persistence. If we export new versions of a certain deck regularly, we should maintain persistent IDs to ensure correct synchronization. Otherwise, identical pieces of data that have distinct IDs will result in duplicates.
There are three types of IDs in the generated package:
`genanki` suggests defining the GUID as a hash of a subset of fields that uniquely identify a note. The GUID must be unique across decks. Therefore, this subset of field values must be unique, including across decks. You can solve this by prefixing the keys with the name of the deck.
In our script, we ask the user to provide a list of keys as part of their input, along with the lists of fronts, backs, deck names, etc. The users of the package must assign the keys properly, ensuring uniqueness, and refrain from changing or reassigning them afterwards.
This is somewhat straightforward for Marcion's words. Use of Marcion's IDs for synchronization should suffice.
For the Bible, we could use the verse reference as a note ID, and ensure that the book names, chapter numbers, and verse numbers don't change in a following version.
For other data creators without programming expertise, a sequence number works as long as nobody inserts a new row in the middle of the CSV, which would mess up the keys. Discuss keying with those creators. As of today, only copticsite.com's data has this problem.
Whenever possible, we use a hardcoded deck ID. This is not possible for decks that are autogenerated, such as the Bible decks which are separated for nesting (as opposed to being grouped in a single deck). In such cases, we use a hash of the deck name, and the deck name becomes a protected field.
Model IDs are hardcoded.
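To make this concrete, here is a minimal sketch using `genanki` (the field layout is hypothetical; this is not our actual script):

```python
import hashlib

import genanki


def deck_id(deck_name: str) -> int:
    # A stable hash of the deck name. Python's built-in hash() is salted
    # per process, so we derive the ID from a digest instead.
    return int(hashlib.sha256(deck_name.encode("utf-8")).hexdigest()[:8], 16)


class KeyedNote(genanki.Note):
    @property
    def guid(self):
        # Hash only the key field (fields[0]), which the caller prefixes with
        # the deck name. The remaining fields can then change freely across
        # versions without breaking synchronization.
        return genanki.guid_for(self.fields[0])
```

Re-exporting a deck with the same keys then updates existing notes in place instead of duplicating them.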
## `morphology/`

This directory contains the data and logic for generating the morphological dictionaries (to support inflections).
## `site/`

This directory contains the data and logic for creating and publishing our website.
Code is released under GPL-3.0. Lexicon data is released under CC BY-SA 4.0.
Ⲉ̀ϣⲱⲡ ⲁⲓϣⲁⲛⲉⲣⲡⲉⲱⲃϣ Ⲓⲗ̅ⲏ̅ⲙ̅, ⲉⲓⲉ̀ⲉⲣⲡⲱⲃϣ ⲛ̀ⲧⲁⲟⲩⲓⲛⲁⲙ: Ⲡⲁⲗⲁⲥ ⲉϥⲉ̀ϫⲱⲗϫ ⲉ̀ⲧⲁϣ̀ⲃⲱⲃⲓ ⲉ̀ϣⲱⲡ ⲁⲓϣ̀ⲧⲉⲙⲉⲣⲡⲉⲙⲉⲩⲓ.