pishoyg / coptic

This is a project that aims to make the Coptic language more learnable.
https://remnqymi.com/
GNU General Public License v3.0

ⲣⲉⲙⲛ̀Ⲭⲏⲙⲓ

This is the backing repo for ⲣⲉⲙⲛ̀Ⲭⲏⲙⲓ, a project that aims to make the Coptic language more learnable.

Technical Docs

Hosting

We use:

Diagram

diagram

NOTE: You can update the diagram by uploading it to draw.io.

Getting started

  1. Running make install should take care of most of the Python installations.

    If any binaries are missing and need to be downloaded, make install will let you know.

  2. You might also want to alias python to the latest version.

  3. Our pipelines are defined in Makefile, and they correspond to blue circles in the diagram. Other pipelines in Makefile are only used during development and testing, and are not relevant for output (re)generation.

  4. Keep in mind that parameters are written with the assumption that they are being invoked from the repo's root directory, rather than from the directory where the script lives. You should do most of your development from within the root directory.

  5. This file is the only README.md in the repo (and this is enforced by a pre-commit hook). Technical documentation is intentionally centralized. Besides this file, docs can be found in:

    User-facing documentation shouldn't live in the repo, but should go on the website instead.

  6. With the exception of archive/, test/, data/, and pre-commit/, each subdirectory of the root directory represents a major pipeline, or category of pipelines, along with their associated data. You will also notice that shared code is (intentionally) minimized, and restricted to the pre-commits and some helpers and utility functions.

  7. We use pre-commit hooks extensively, and they have helped us discover a lot of bugs and issues with our code, and also keep our repo organized. They are not optional, and many of our pipelines assume that the pre-commits have done their job. Their installation should be covered by make install. They are defined in .pre-commit-config.yaml. They run automatically before a commit, but you can trigger them with Make recipes as well by typing make add, make index, or make test (the three are synonymous). Until #120 is resolved, you will need to pay some attention to when to trigger them manually. As a rule of thumb, run them once after each pipeline, and before starting another downstream pipeline.

pre-commit

data/ Subdirectories

Most of our projects have a data subdirectory. We have somewhat strict rules regarding its content. It usually (which, in our repo, means almost always) contains three subdirectories:

.env

For now, run this once at the beginning of your coding session to export environment variables, which are necessary for some pipelines:

source .env_INFO

Equivalently:

. ./.env_INFO

Later on, you might need to create your own .env file. It is ignored by a rule in .gitignore, so there is no shared version.

It is documented in .env_INFO, so this section is intentionally brief.

Planning

We use GitHub to track our plans and TODOs.

Issues

Issues need to be as specific and isolated as possible. Most of the time, they span a single component. Often they work mainly in one component and spill into others, and sometimes they're generic and span one aspect of multiple components (such as the conventions set for the whole repo). Issues mostly have exactly one How, and usually one Why (see labels below). Issues should involve a local change or set of local changes.

High-priority issues are defined in two ways:

Project

The project page offers alternative views of the issues, which can come in handy for planning purposes.

Milestones

Labels

Guidelines

  1. Minimize dependence on HTML, and implement behaviours in TypeScript when possible.

  2. Add in-code assertions and checks. This is our first line of defense, and has been the champion when it comes to ensuring correctness and catching bugs.

  3. We rely heavily on manual inspection of the output to verify correctness. The --word-diff option of git diff is helpful when the default line-oriented diff is not readable. Keep this in mind when structuring your output data.

  4. We enforce the existence of unit tests, at least one for each Python file. While these have so far been placeholders, simply importing a package sometimes catches syntax errors, and the placeholders make it convenient to write tests whenever desired. A big benefit of unit tests is that they make us confident that a change is correct, so we can speed up the development process.

  5. Do not let Python tempt you to use its built-in types instead of classes and objects. Don't forget about OOP!

  6. Document the code.

  7. We use mypy for static type checking. mypy can often infer types without hints, and it throws an error whenever an explicit annotation is needed, so hints are not strictly required. Using type hints extensively is still encouraged.

  8. Collect and print stats.

  9. Color the outputs whenever you can. It keeps your programmers entertained!

  10. Keep your code grep-able, especially when it comes to the constants used across directories.

  11. Privatize methods whenever possible. Use the name mangling feature in Python.
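A few of the guidelines above (in-code assertions, classes over bare built-in types, type hints, stats collection, and name mangling) can be sketched together. This is a minimal illustration, not code from the repo: the Entry and Stats classes and all of their fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """A hypothetical dictionary entry, modeled as a class rather than a bare tuple."""

    key: str
    word: str

    def __post_init__(self) -> None:
        # In-code assertions: fail fast on malformed data.
        assert self.key, "entry key must be non-empty"
        assert self.word == self.word.strip(), "word has surrounding whitespace"


class Stats:
    """Collects simple counters so they can be printed at the end of a run."""

    def __init__(self) -> None:
        # The double underscore triggers name mangling (stored as _Stats__counts).
        self.__counts: dict[str, int] = {}

    def __bump(self, name: str) -> None:
        # Private helper, also name-mangled.
        self.__counts[name] = self.__counts.get(name, 0) + 1

    def record(self, entry: Entry) -> None:
        self.__bump("entries")
        if " " in entry.word:
            self.__bump("multi_word")

    def summary(self) -> dict[str, int]:
        # Return a copy so callers cannot mutate the internal counters.
        return dict(self.__counts)


if __name__ == "__main__":
    stats = Stats()
    stats.record(Entry("1", "alpha"))
    stats.record(Entry("2", "two words"))
    # Green ANSI color for the stats line.
    print(f"\033[32m{stats.summary()}\033[0m")
```

Name mangling does not make an attribute truly inaccessible, but it keeps accidental access from outside the class out of the way, and mypy checks the annotated signatures.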

Languages

stats

Project-specific

dictionary/

This directory contains the data and logic for processing our dictionaries.

marcion.sourceforge.net/

Image Collection

Why?

There are many reasons we have decided to add pictures to our dictionary, and heavily invested in the image pipeline. They have become one of the integral pieces of our dictionary framework.

  1. The meaning of a word is much more strongly and concretely conveyed by an image than by a word. Learning is not about knowing vocabulary or grammar. Learning is ultimately about creating the neural pathways that enable language to flow out of you naturally. A given word needs to settle and connect with nodes in your associative memory in order for you to be able to use it. If our goal is to create or strengthen the neural pathways between a Coptic word and related nodes in your brain, then it aids the learning process to achieve as much neural activation as possible during learning. This is much better achieved by an image than by a mere translation, given the way human brains work. After all, the visual processing areas of our brains are bigger, faster, and far more ancient and primordial (even reptiles can see) compared to the language processing areas. You will often find that, when you learn a new word, the associated images pop up in your brain more readily than the translation. Thus the use of images essentially revolutionizes the language learning process.

  2. Oftentimes, the words describe an entity or concept that is unfamiliar to many users. Things like ancient crafts, plant or fish species, farmer's tools, and the like, are unfamiliar. Showing a user the English translation of a word doesn't suffice for the user to understand what it is, and they would often look up images themselves in order to find out what the word actually means. By embedding the pictures in the dictionary, we save users some time so they don't have to look it up themselves.

  3. Translations are often taken lightly by users. Pictures are not. When a dictionary author translates a given Coptic word into different English words, for example, the extra translations are often seen by users as auxiliary - tokens added there to convey a meaning that the dictionary author couldn't convey using fewer words.

    That's not the case for pictures. Pictures are taken seriously by users, and are more readily accepted as bearing a true, authentic, independent meaning of the word. Listing images (especially after we have started ascribing each image to a sense that the word conveys) is a way to recognize and legitimize those different senses and meanings that a word possesses.

    It's for this reason that images must be deeply contemplated, and a word must be digested well, before we add explanatory images for it. Collecting images is tantamount to authoring a dictionary.

Technical Guidelines

Our experience collecting images has taught us a few lessons. We tend to follow the following guidelines when we search for pictures:

  1. Each image ends up being resized to a width of 300 pixels and a height proportional to the original. We prefer images with a minimum width of 300 pixels, though down to 200 is acceptable.

  2. As for image height, short images are rarely ugly, but long images usually are. So we set a generously low lower bound of 100 pixels on the resized height, and a stricter upper bound of 500 pixels, although we prefer the height to fall between 200 and 400 pixels.

  3. Collecting sources is mandatory. We always record the URL that an image is retrieved from. Our img_helper script, which we use to process images, can be given a URL, and it will download the image, store the source, and resize the image to the final version. This simplifies the process.

  4. We make extensive use of icons. They can capture the meaning of a word in situations when it's otherwise hard to describe a word using an image (example).

  5. This guideline is tentative, but when given a choice, prefer an ancient Egyptian explanatory image, followed by an old (not necessarily Egyptian) image, followed by a modern image (example). We prefer to keep the images as close as possible to their reflections in the mind of a native speaker. We also want to stress the fact that those Coptic words can be equally used to refer to entities from other cultures, or modern entities.

    This could be revisited later.
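The size rules above reduce to a little arithmetic. The sketch below is only an illustration of those numbers (target width 300, minimum original width 200, resized height between 100 and 500). The function names are hypothetical and this is not the logic of img_helper itself.

```python
def resized_size(width: int, height: int, target_width: int = 300) -> tuple[int, int]:
    """Scale (width, height) to target_width, keeping the aspect ratio."""
    assert width > 0 and height > 0, "dimensions must be positive"
    return target_width, round(height * target_width / width)


def check_image(width: int, height: int) -> list[str]:
    """Return the guideline violations for an image (empty if acceptable)."""
    problems: list[str] = []
    if width < 200:
        problems.append("original width below 200 px")
    _, new_height = resized_size(width, height)
    if new_height < 100:
        problems.append("resized height below 100 px")
    if new_height > 500:
        problems.append("resized height above 500 px")
    return problems


if __name__ == "__main__":
    print(resized_size(600, 400))
    print(check_image(300, 900))
```

For example, a 600×400 original scales to 300×200, which falls inside the preferred 200 to 400 pixel height range.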

Undialected Entries

The following entries have no dialect specified in Crum, so they are treated as part of all dialects.

  1. https://remnqymi.com/crum/1274.html
  2. https://remnqymi.com/crum/1292.html
  3. https://remnqymi.com/crum/1367.html
  4. https://remnqymi.com/crum/1462.html
  5. https://remnqymi.com/crum/1553.html
  6. https://remnqymi.com/crum/1555.html
  7. https://remnqymi.com/crum/1557.html
  8. https://remnqymi.com/crum/1558.html
  9. https://remnqymi.com/crum/1657.html
  10. https://remnqymi.com/crum/1659.html
  11. https://remnqymi.com/crum/1712.html
  12. https://remnqymi.com/crum/1957.html
  13. https://remnqymi.com/crum/2074.html
  14. https://remnqymi.com/crum/2075.html
  15. https://remnqymi.com/crum/2076.html
  16. https://remnqymi.com/crum/2077.html
  17. https://remnqymi.com/crum/2078.html
  18. https://remnqymi.com/crum/2079.html
  19. https://remnqymi.com/crum/2081.html
  20. https://remnqymi.com/crum/2082.html
  21. https://remnqymi.com/crum/2084.html
  22. https://remnqymi.com/crum/2085.html
  23. https://remnqymi.com/crum/2086.html
  24. https://remnqymi.com/crum/2087.html
  25. https://remnqymi.com/crum/2088.html
  26. https://remnqymi.com/crum/2090.html
  27. https://remnqymi.com/crum/2091.html
  28. https://remnqymi.com/crum/2092.html
  29. https://remnqymi.com/crum/2093.html
  30. https://remnqymi.com/crum/2195.html
  31. https://remnqymi.com/crum/2205.html
  32. https://remnqymi.com/crum/2832.html
  33. https://remnqymi.com/crum/3117.html
  34. https://remnqymi.com/crum/3230.html
  35. https://remnqymi.com/crum/3231.html
  36. https://remnqymi.com/crum/3257.html
  37. https://remnqymi.com/crum/3302.html

NOTE: Some undialected entries in this list have been removed because their dialect was inferred, e.g. all the entries under Ⳉ have been labeled as Akhmimic.

We are rethinking the current handling of undialected entries. See #237.

Entries that are Absent in Crum

The following entries are absent from Crum's dictionary. They were added to our database from other sources:

  1. https://remnqymi.com/crum/3379.html
  2. https://remnqymi.com/crum/3380.html
  3. https://remnqymi.com/crum/3381.html
  4. https://remnqymi.com/crum/3382.html
  5. https://remnqymi.com/crum/3385.html

copticocc.org/

dawoud-D100/ contains scans of Moawad Dawoud's dictionary. They are obtained from the PDF using ImageMagick. (The density used is 100, hence the suffix -D100.)

The PDF / image processing scripts can be found under archive/dictionary/copticocc.org

kellia.uni-goettingen.de/

We had some plans to combine the strength of KELLIA and Crum (#53, #6), but they have been abandoned.

copticsite.com/

bible/

This directory contains the data and logic for processing the Bible corpus.

stshenouda.org/

There are several published versions of the Coptic Bible. The most recent, and most complete, is that of St. Shenouda the Archimandrite Coptic Society. It is the Coptic Bible project that is most worthy of investment at the moment.

flashcards/

This directory contains the data and logic for processing dictionaries into flashcards. It is named as such because our first use case was a flashcard app, although our use of the dictionaries has since become more versatile.

Anki Keys and Synchronization

When you import a package into your (personal) Anki database, Anki uses the IDs to eliminate duplicates.

Uniqueness is therefore important. But what is trickier, and perhaps more important, is persistence. If we export new versions of a certain deck regularly, we should maintain persistent IDs to ensure correct synchronization. Otherwise, identical pieces of data that have distinct IDs will result in duplicates.

There are three types of IDs in the generated package:

  1. Note ID

genanki suggests defining the GUID as a hash of a subset of fields that uniquely identify a note.

The GUID must be unique across decks. Therefore, this subset of field values must be unique, including across decks. You can solve this by prefixing the keys with the name of the deck.

In our script, we ask the user to provide a list of keys as part of their input, along with the lists of fronts, backs, deck names, etc. The users of the package must assign the keys properly, ensuring uniqueness, and refrain from changing or reassigning them afterwards.

This is somewhat straightforward for Marcion's words. Use of Marcion's IDs for synchronization should suffice.

For the Bible, we could use the verse reference as a note ID, and ensure that the book names, chapter numbers, and verse numbers don't change in a following version.

For other data creators without programming expertise, a sequence number works as long as nobody inserts a new row in the middle of the CSV, which would mess up the keys. Discuss keying with those creators. As of today, only copticsite.com's data has this problem.

  2. Deck ID

Whenever possible, we use a hardcoded deck ID. This is not possible for decks that are autogenerated, such as the Bible decks which are separated for nesting (as opposed to being grouped in a single deck). In such cases, we use a hash of the deck name, and the deck name becomes a protected field.

  3. Model ID

Model IDs are hardcoded.
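The keying scheme described above can be sketched with the standard library alone. This is a hedged illustration rather than the repo's implementation: the function names, the "/" separator, and the ID range are assumptions, and hashlib stands in for genanki's own GUID helper (genanki.guid_for). A cryptographic digest is used because Python's built-in hash() of a string varies between processes, which would break persistence.

```python
import hashlib


def note_key(deck_name: str, key: str) -> str:
    """Prefix the per-deck key with the deck name so it is unique across decks."""
    return f"{deck_name}/{key}"  # hypothetical separator


def note_guid(deck_name: str, key: str) -> str:
    """Derive a GUID that stays the same across exports of the same note."""
    digest = hashlib.sha256(note_key(deck_name, key).encode("utf-8"))
    return digest.hexdigest()[:16]


def deck_id(deck_name: str) -> int:
    """Derive a stable integer ID from the name of an autogenerated deck."""
    digest = hashlib.sha256(deck_name.encode("utf-8")).digest()
    # Map into [2**30, 2**31); this range is an assumption of the sketch.
    return (1 << 30) + int.from_bytes(digest[:8], "big") % (1 << 30)


if __name__ == "__main__":
    # A verse reference used as the key of a (hypothetical) Bible deck.
    print(note_guid("Bible", "Genesis 1:1"))
    print(deck_id("Bible"))
```

Because the deck ID depends only on the deck name, renaming a deck would change its ID, which is why the deck name becomes a protected field.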

See also

  1. #36
  2. #37

morphology/

This directory contains the data and logic for generating the morphological dictionaries (to support inflections).

site/

This directory contains the data and logic for creating and publishing our website.

License and Cited Works

Code is released under GPL-3.0. Lexicon data is released under CC BY-SA 4.0.

Cited works: Marcion, Saint Shenouda The Archimandrite – Coptic Society, copticsite.com, Coptic Scriptorium, Freie Universität Berlin, BBAW, TLA, KELLIA.


Ⲉ̀ϣⲱⲡ ⲁⲓϣⲁⲛⲉⲣⲡⲉⲱⲃϣ Ⲓⲗ̅ⲏ̅ⲙ̅, ⲉⲓⲉ̀ⲉⲣⲡⲱⲃϣ ⲛ̀ⲧⲁⲟⲩⲓⲛⲁⲙ: Ⲡⲁⲗⲁⲥ ⲉϥⲉ̀ϫⲱⲗϫ ⲉ̀ⲧⲁϣ̀ⲃⲱⲃⲓ ⲉ̀ϣⲱⲡ ⲁⲓϣ̀ⲧⲉⲙⲉⲣⲡⲉⲙⲉⲩⲓ.