kingsdigitallab / crossreads

Palaeographical environment for CROSSREADS project

Review Allograph vs Character in the data model #64

Open geoffroy-noel-ddh opened 4 months ago

geoffroy-noel-ddh commented 4 months ago

There's a conceptual issue we identified earlier and need to address relatively soon, because it will have an impact on the data model, workflow, UI and data storage.

Key question: how can we define structural variants of a graphical representation of a character within a script?

Sub-questions:

Use case(s)

TODO: @simonastoyanova do you have visual examples of the character you showed me last time? I can't remember which character it was. Could you also link to annotations? Do you have names for the variants?

Current model

I'll soon provide some background to the current model in the GitHub wiki (TODO: GN).

In short, the model is purposefully/pragmatically reductive and naively conceives of an allograph as a pair (Script, Character). It is also designed to leverage Unicode metadata: looking up Δ in Unicode shows us a relevant glyph, script (Greek) and character name (GREEK CAPITAL LETTER DELTA). Mapping to Unicode allows us to be more standard and interoperable & keeps the model simpler. There are some caveats (e.g. missing scripts & allographs in Unicode), but I think they can be dealt with; more on that in the upcoming wiki. More specific allographs may not have a match in Unicode, but the idea is that we should still be able to map to the closest character.
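As a small illustration of the Unicode lookup this relies on (sketched here with Python's standard unicodedata module; the annotator itself may do this differently):

```python
import unicodedata

# Looking up Δ (U+0394) in Unicode gives us the metadata the model leans on:
# the code point and the official character name (which also encodes the script).
ch = "Δ"
print(f"U+{ord(ch):04X}")    # U+0394
print(unicodedata.name(ch))  # GREEK CAPITAL LETTER DELTA
```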

Note that Unicode character is different from DigiPal character (more general?). TODO: GN to check.

How to implement variants with this model?

Below are different options, from more casual (during discovery) to more formally defined. The current data model should already allow moving from one option to the next, although we'd need a UI & data workflow in place to assist with definitional transformations without breaking references or having to recreate definitions. This would allow for a fluid & flexible evolution of the definitions, where variants, grouping and formalism are introduced when/as needed. The following options rely on a dotted notation (for scripts & allographs) and sufficient mapping to Unicode.

Option 1: bag of variants

Create a single allograph that has all possible components of the variants. Add an 'absent' feature to components that are not shared among all variants. Annotate graphs and tag with a name for each variant (e.g. type-1, type-2).

Problem: this deviates from the notion of allograph in DigiPal. It's more like a bag of allographs. It's OK temporarily but becomes awkward as it absorbs more variants. At that point we had better separate the variants in the definitions (see Option 2).

Option 2: separate allographs

Create a new allograph for each variant. E.g. Greek/E, Greek/E.type-2, Greek/E.type-3. That is one item per variant under the script, with an agreed notation (e.g. character.variant) in the name to preserve the link to the overarching character (Greek E) and among its variants. This notation restores the conceptual hierarchy between character and allograph.

If we notice that some variants of different characters tend to co-occur in different inscriptions, it might make sense to group them by using the same name: e.g. Greek/E.type-2 tends to co-occur with Greek/B.type-2, hence the shared name type-2. If many variants tend to occur together and form a pattern, we might want to move to Option 3.

Option 3: sub-scripting

Create a new script for a recognised pattern of co-occurring set of variants of different characters. (E.g. In DigiPal that would be like "Caroline", or "Square"). Move the variants there and rename them accordingly: e.g. Latin/A.caroline -> Latin.Caroline/A. In this case we can use a dotted notation to imply a hierarchy of scripts.

One issue with this method is that only a few characters differ from the parent script; Caroline is a good example. In practice, it would mean that we would annotate a text with both the Caroline subscript (e.g. a, d, f, g, h, r, s) and the Latin script (e.g. p), which is a potential discrepancy we also had in DigiPal. With that approach a graph has a single script, but technically a word or a text can use multiple scripts. I.e. should a p be Caroline because it occurs next to a Caroline a? Should a text be Caroline because it uses Caroline allographs?

An alternative approach would be to define the whole Latin alphabet in Caroline, but this would introduce unproductive duplication of definitions and possibly more conceptual confusion. It might be the approach taken by Unicode, where identical glyphs have multiple code points under different scripts (e.g. majuscule letters in Latin & Greek).

geoffroy-noel-ddh commented 4 months ago

I'm also copying you here @JonPrag Jonathan. No need to react but just to keep you in the loop if you are interested in the discussion.

JonPrag commented 4 months ago

Unless I've missed something fundamental here, we have an essential problem (which I wish we'd confronted long ago), deriving from the DigiPal focus upon scripts. We have no meaningful interest in scripts, and these have introduced what seems to me to be a mis-use of allograph in our model (not in digipal, because scripts are made up of allographs, which is what makes them scripts). Unicode maps (more or less) onto graphemes [if you want to go mad, try https://hal.science/hal-02383627/document]. I am entirely happy, for our purposes, that we should employ the Unicode code points as the 'ideal type' of individual letters in either Greek or Latin and make these the top level. But these are categorically NOT equivalent to allographs, which are variations of a particular unicode character codepoint aka grapheme. What we therefore need is a hierarchical model where:

  1. the letter 'A' is defined as Latin capital A (U+0041) [let's ignore the fact that for now that is itself an allograph of the grapheme 'A', since lower case 'a' is already another allograph of the grapheme 'A' in Latin; we can ignore it because epigraphic language is all upper case]
  2. variants of A, as identified through the component and feature coding in archetype, are allographs of A, and should all be grouped under the grapheme / unicode character 'A'.
  3. clustering of particular allographs of different graphemes/characters could result in the identification of scripts BUT in the world of epigraphy, that is entirely secondary and derivative and rarely done.

The essential interest lies in the ability to reference allographs at the level below grapheme. A 'bag' of allographs, without the top level grapheme isn't very helpful.

geoffroy-noel-ddh commented 4 months ago

Thank you for correcting my sloppy conceptualisation! That's very useful.

Here's an updated entity diagram based on your explanation. Hopefully that's closer to what it should be?

[image: updated entity diagram]

No subscripts

Now, from what you explain, Option 3 and its associated implementations are not needed for Crossreads, so I will no longer suggest that way of organising the data in the context of your project. But I'll leave it as an option to consider if the annotator is ever reused or if we wanted a data model that provides some level of compatibility with Archetype instances.

Mechanisms for hierarchies

It seems to me that Options 1 & 2 (although I didn't describe them with the right terminology) do offer the desired hierarchy between grapheme and allograph, via the tag (Option 1, as a temporary/work-in-progress measure while annotating the first instances of new variants) or via the dotted notation (Option 2, when the allograph identification is more settled, i.e. it is a "recognised" variant). Latin/A.VARIANT1 refers to an allograph of grapheme A (i.e. Variant 1 of Latin A). I expect that your goal by the end of the project would be to avoid any remaining Option 1 cases and have everything defined with Option 2 (or something structurally equivalent), so that the extra components only belong to the specific allograph definition.

Tell me if I'm wrong, but I believe the current software data model is abstract enough to be refined conceptually (i.e. better mapped to palaeographic concepts) without disrupting the proposed hierarchical mechanism. For the purpose of annotating graphs and categorising variants this should work. But we need to agree on what's what under the hood, in the interface and in the annotation files to avoid confusion.

If I'm correct then the remaining questions are not so much about workflow or hierarchy but more accurate definitions & nature of relationships.

Use cases & Samples

I have some questions about that. But what would really help me is having annotated samples from you and @simonastoyanova of actual variants. I'd like to see instances from a couple of graphemes with a few allographs for each. Each one with a link to the annotation on an inscription in the Crossreads corpus. Preferably with up to date descriptions & definitions (component-features) so the analysis is more concrete for me.

More detailed questions

Q1. do you have a reference to an authoritative & overall coherent set of definitions for most of the concepts used above? Preferably something that is quite precise, unambiguous and also explicit in terms of how the concepts relate. I've seen different introductory texts online (and also Peter's definitions & ontology) but I still find their definitions hard to interpret and translate them into a simple & usable software data model.

Q2. Although there is no need to introduce new scripts in Crossreads, the data model needs to integrate and differentiate Latin & Greek. That's why I'd like to understand the relationship between script and grapheme or allograph.

Q3. When a grapheme doesn't have multiple variants for the purpose of your study or within your corpus, would you say it has no allograph, or that it has one single implied allograph? My apologies for the very basic/naive question! I think this question is possibly at the heart of my original misconceptions above. (In Archetype it is not possible to describe a graph without a link to an allograph, even if the grapheme (i.e. character in Archetype) has no other variants in the system.)

I guess what I'm also trying to get at is: what is the exact nature of the graphical abstraction from allograph to grapheme? Is it total, such that a grapheme is formless (i.e. completely agnostic/unprescriptive about its graphical realisations)? Can we distinctly describe a particular grapheme without any reference to its representations? Or does it embed some graphical constraints on the possible variations it allows? I.e. at which point do variants evolve so far outside the parameters of their grapheme that they require the introduction of a new grapheme? Or is that the wrong way to approach it, because a grapheme serves mostly as a unit to convey distinctive meaning in writing (i.e. lexical distinctiveness)?

Q4. Can a variant have sub-variants? I.e. is it conceivable to have a hierarchy of allographs for a given grapheme? It may not be needed for Crossreads, but is it useful at all? (I guess an idiograph can be a variant of an allograph, in which case a variant of a variant is a thing. If ever needed, the dotted notation could be extended to indicate sub-variation: CHARACTER.VARIANT.SUBVARIANT.)

geoffroy-noel-ddh commented 4 months ago

After more thinking & reading, I think the conceptual impasse may not be relevant, because the current data model (despite its current terminological flaws) combined with Option 2 (dotted notation: CLOSEST_UNICODE_CHARACTER.VARIANT_NAME) should meet your functional requirements:

  1. the mapping to Unicode gives you a relatively close match to a grapheme
  2. CLOSEST_UNICODE_CHARACTER.VARIANT_NAME is functionally equivalent to an allograph
  3. that notation gives you the desired hierarchical relationship (you know what it is a variant of and all variants are linked)

Option 1 (temporary multi-variants) may be ugly, and Option 3 (sub-scripting) unnecessary but you don't have to use them.

So my conclusion so far is that the current data model where Allograph = (SCRIPT, CLOSEST_UNICODE_CHARACTER.VARIANT_NAME) may not need structural change.

If a grapheme doesn't have more than one allograph in your corpus, then you have the choice to either come up with a default VARIANT_NAME or maybe just not specify it (i.e. no dot, just CLOSEST_UNICODE_CHARACTER, as it is used now).

The main advantages of the model are its flexibility, simplicity, mapping to external standards (Script & Unicode) and compatibility with Archetype.

If there is agreement on that, then what remains in terms of development:

I'll wait to hear from you and also to see your use cases & examples of variants. Let me know if you have any questions, see issues or have better suggestions (again, real examples would be extremely useful for that conversation). Happy to meet as well to discuss, of course.

JonPrag commented 4 months ago

Thanks very much indeed Geoffroy. This makes lots of sense. I think your entity diagram is 'right', and this implementation of option 2 does work. I think as a working principle even when we don't have multiple allographs, formally it would be wrong to default this to the top-level of grapheme and it should still have a variant_name, even if that means all cases of X are simply CLOSEST_UNICODE_CHARACTER.VARIANT_X1.

It might be helpful / useful at this point to take a quick glance at the CRMtex ontology extension of CIDOC-CRM (not least because in the FAIR Epigraphy project we are mapping a lot of the core epigraphic ontology onto this, since it links into the larger CIDOC-CRM ontology), and specifically the couple of classes that touch on this:

  - http://www.cidoc-crm.org/extensions/crmtex/TX8_Grapheme is grapheme as we've been discussing it here
  - http://www.cidoc-crm.org/extensions/crmtex/TX9_Glyph is a single physical instance of a grapheme (as I understand it from their examples)

We would then be filling in the levels in between these, classifying the different instances of crmtex:glyph as variants, i.e. allographs, of crmtex:graphemes (close match to a Unicode character). Additionally they add the classes:

  - http://www.cidoc-crm.org/extensions/crmtex/TX13_Script where 'script' is NOT what I take to be the DigiPal idea of one of several variant scripts within a writing system, but the abstract level above that, without differentiation
  - http://www.cidoc-crm.org/extensions/crmtex/TX3_Writing_System a writing system (namely Latin language written in the Latin alphabet)

However, I would note that from their examples, at this point they actually seem to be slightly confused as to whether they are considering the top-level abstract idea or actual script variants [and note that the people who have designed this are primarily computer scientists with an interest in this, not practising classicists / epigraphers / papyrologists].

Simona and I should liaise and we can definitely provide use cases/examples.

simonastoyanova commented 4 months ago

Thank you both for all this, I've been compiling my examples, thoughts and cleaning/transferring pencil diagrams to Miro. Geoffroy's big graph is cleaner, so let's go with it.

Hierarchical model

My current use and questions: I agree that Option 2 makes the most sense for our purposes and have been operating on that model, even though it's not obvious in the interface, as we've used the bag-of-allographs model as a temporary thing. I've started using the tags to mark allographs and variants as 'type1', 'type2' etc. for allographs and 'var1', 'var2' etc. for variants (I've just redone them; not sure why they weren't visible, maybe wait for the indexing cycle). I've done types (read: allographs) for Greek Β, Δ, Χ and vars for Greek Α for the different crossbar styles.

I haven't yet done the hierarchical link type1.var1 and have a technical question about whether we can do that automatically through the annotation files or manually. At the moment the tags are listed by number of attestations, so you see type1 but it's not obvious which character it is a type of. A second question, more on the interface side of things, is the annotation workflow once the types and vars multiply: I'm annotating an image, I draw a box around an A, pick the relevant descriptors, then go to the tags box and think 'what was type1?'. Somewhere in the search we should have a list of the types/vars per character - we've already discussed this - but how do we use it in the annotation process without constantly flicking back and forth between the annotations tab and the search tab? Maybe I'm overthinking this because it's still theoretical and I've been using the tags without these issues. Thoughts welcome.

Model: Anyway, in terms of the hierarchy model, what I think is most useful is Character-Allograph-Variant when there are actual allographs (same character made of different bits), or keeping the Allograph as a structural node when there are no actual allographs (same character made of the same bits but with variation on some of the bits). I'd rather not flatten the model by removing the allograph node even when it's a default allograph, because I think that would only cause confusion and mess with the sorting/searching/comparing functionalities. Here are the graphs for both cases:

a) character-multiple allographs-variants

Screenshot 2024-06-06 at 11 39 30

b) character-default allograph-variants

Screenshot 2024-06-06 at 11 39 46

As Jonathan said, there is no minuscule in epigraphic writing, so characters with actual allographs are not that many and are mainly in Greek, but we want to keep the model flexible, so I can happily live with a default allograph node for the majority of characters to make sure we have a stable system for both cases.

Examples

Here are some examples from the already annotated inscriptions. The Greek examples are actual allographs, the Latin are variants of a default allograph (a and b in the same inscription).

1) Greek Χ:
a) https://kingsdigitallab.github.io/crossreads/annotator.html?obj=http://sicily.classics.ox.ac.uk/inscription/ISic030002&img=ISic030002.jpg&ann=https://crossreads.web.ox.ac.uk/annotations/296f764a-d9fd-4044-b2f6-a3884dbe587d&scr=greek-1
b) https://kingsdigitallab.github.io/crossreads/annotator.html?obj=http://sicily.classics.ox.ac.uk/inscription/ISic001483&img=ISic001483.jpg&ann=https://crossreads.web.ox.ac.uk/annotations/25158ea5-13d9-43a6-8874-b67852f5d599&scr=greek-1

2) Greek Δ:
a) https://kingsdigitallab.github.io/crossreads/annotator.html?obj=http://sicily.classics.ox.ac.uk/inscription/ISic030002&img=ISic030002.jpg&ann=https://crossreads.web.ox.ac.uk/annotations/c7644926-2da1-457e-b7c6-17f38ce70bec&scr=greek-1
b) https://kingsdigitallab.github.io/crossreads/annotator.html?obj=http://sicily.classics.ox.ac.uk/inscription/ISic001481&img=ISic001481.jpg&ann=https://crossreads.web.ox.ac.uk/annotations/0cc6a0b8-b1ce-45ac-a183-c2f572834255&scr=greek-1

3) Greek Β:
a) https://kingsdigitallab.github.io/crossreads/annotator.html?obj=http://sicily.classics.ox.ac.uk/inscription/ISic020602&img=ISic020602.jpg&ann=https://crossreads.web.ox.ac.uk/annotations/8bbe3e4c-f63c-46cb-b124-a1193f3e592f&scr=greek-1
b) https://kingsdigitallab.github.io/crossreads/annotator.html?obj=http://sicily.classics.ox.ac.uk/inscription/ISic020317&img=ISic020317.jpg&ann=https://crossreads.web.ox.ac.uk/annotations/9746de96-83d4-4a3f-980f-d79260866392&scr=greek-1

4) Latin M:
a) https://kingsdigitallab.github.io/crossreads/annotator.html?obj=http://sicily.classics.ox.ac.uk/inscription/ISic000163&img=ISic000163.jpg&ann=https://crossreads.web.ox.ac.uk/annotations/1a9b0227-1360-4585-8142-24acd73379eb&scr=latin-1
b) https://kingsdigitallab.github.io/crossreads/annotator.html?obj=http://sicily.classics.ox.ac.uk/inscription/ISic000163&img=ISic000163.jpg&ann=https://crossreads.web.ox.ac.uk/annotations/32deec89-96c4-457a-baaf-e909c0422f35&scr=latin-1
c) https://kingsdigitallab.github.io/crossreads/annotator.html?obj=http://sicily.classics.ox.ac.uk/inscription/ISic000098&img=ISic000098.jpg&ann=https://crossreads.web.ox.ac.uk/annotations/bd599159-cfad-4e27-8f8f-94f766959b6f&scr=latin-1
d) https://kingsdigitallab.github.io/crossreads/annotator.html?obj=http://sicily.classics.ox.ac.uk/inscription/ISic000093&img=ISic000093.jpg&ann=https://crossreads.web.ox.ac.uk/annotations/43a69713-5edf-4958-8031-6c8d22752adb&scr=latin-1

Search questions

Related, but we can also move this to a separate issue once we clarify the model. Sticking to the hierarchical model character.allograph.variant should also make searching for co-occurring allographs/variants easier. What I mean is: show me where A.all1.var2 appears in the same inscriptions as B.all2.var1. Having a loose bag of variants will be confusing and messy, with too many unconnected drop-downs or scrolls. I've been wondering for a while whether the hierarchy model will be too constricting in such cases, e.g. if some character hasn't been given a type/var yet but you still want to see it in the search results and assign it the type/var at that point. But I suppose you want to do this before looking for correlations, and we can always visualise the characters without a tag if we want to. All of this to say: I do prefer the hierarchy model and think it will also make searching more user-friendly and consistent, but I still have some questions about how we implement it in the search - those are more about functionality and interface than conceptual, though.
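A minimal sketch of such a co-occurrence query, assuming a hypothetical flat list of annotations with "inscription" and "allograph" fields (the real annotation files may be structured quite differently):

```python
# Sketch only: find inscriptions where two named allographs are both attested.
# The annotation shape ("inscription"/"allograph" keys) is an assumption.
def co_occurring(annotations, name_a, name_b):
    """Return the set of inscriptions where both allograph names occur."""
    by_name = {}
    for ann in annotations:
        by_name.setdefault(ann["allograph"], set()).add(ann["inscription"])
    return by_name.get(name_a, set()) & by_name.get(name_b, set())

anns = [
    {"inscription": "ISic030002", "allograph": "A.all1.var2"},
    {"inscription": "ISic030002", "allograph": "B.all2.var1"},
    {"inscription": "ISic001483", "allograph": "A.all1.var2"},
]
print(co_occurring(anns, "A.all1.var2", "B.all2.var1"))  # {'ISic030002'}
```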

Also related is the question from above about the list of allographs & variants, where does it live and how does (and would) it connect to the annotations tab as part of the annotation workflow.

Something I really like about the current search is going straight to the combination component+feature and getting results across both Greek and Latin. One of the issues with the original DigiPal model was multilingual material and the assumption that identical graphical forms represent the same character (an article of Peter's; I will find the exact citation later). We still define each character in its script, but characters from different scripts which are made of the same bits can still be compared easily, e.g. do Greek Ρ (r-sound) and Latin P (p-sound) look anything alike in the same area/time period/genre? In such cases we will sometimes be comparing allographs (Greek B.all1 and Latin B.all1), where in Greek we have actual allographs and in Latin we have a default allograph (which is why I think we should keep the default allograph). Other times we will be comparing variants regardless of how many allographs per character there are, e.g. show me all instances of curved serifs, so I can see whether curved serifs are used with similar frequency in both Greek and Latin in the same area/time period/genre. Comparing different graphemes that look the same leads me to my last point, ontologies.

Ontology

I've looked at CRMtex many times and at some point, around the discussion of what is a writing system and what is a script, I start to get confused. There is some clarity and lots of good examples in Paolo Monella's graphematics ontology (https://zenodo.org/records/10385857), but when he gets to his alphabeme (alphabetic grapheme) material, I wonder whether some of the more abstract parts are something I need to worry about for the purposes of my current work. Peter has been working (with Paolo and others) on making sense of and improving all of these bits of ontologies and how they relate to each other: https://github.com/pastokes/archetype-ontology. Still a work in progress, but it does illustrate how CIDOC, CRMtex and Archetype connect and relate. Happy to discuss more on this with you both and also Imran, but I don't want to go down a rabbit hole since I think our preferred model of character.allograph.variant seems to cover our needs.

simonastoyanova commented 4 months ago

Thinking more on the workflow for assigning allographs (types) and variants: @geoffroy-noel-ddh and I had discussed doing that on the search page rather than from the annotation page. The tags I've added so far have all been done on the annotation page. However, it does make more sense to do it from the search - you select your component+feature combination where you see the structural differences or similarities, which allows you to say this is an allograph and that is a variant. You then select the relevant annotations (multi-select needs to be possible), click on a drop-down menu and select the allograph or variant number which those annotations belong to, or create a new one. I'm not sure exactly where this menu will live; I have mocked up the functionality I'm imagining here, but it's just an illustration of my ideas for now. The blue squares represent selections.

1) selecting relevant variants of Greek A and assigning them to the var number which represents A with ascending crossbar:

Screenshot 2024-06-06 at 16 30 16


2) selecting the relevant allographs of Greek B and giving them the correct type:

Screenshot 2024-06-06 at 16 30 44


I've added thumbnails in the allographs & variants menu to illustrate what the different types and vars mean. On the one hand, it will be useful to see this, so I always know what type1 means in the context of each character; on the other hand, I wonder if it clutters the search too much and how else we might visualise the types/vars.

If we follow this workflow, it will also be useful to see which annotations have already been assigned types/vars and which are new and don't have them yet. So adding this information to the annotation thumbnail might be the way to easily see which annotations need to be assigned types/vars without having to go back to the annotation page or scroll through a very long list of null tags (which we tried and decided to hide for now). Mock-up here:

Screenshot 2024-06-06 at 16 46 49


Might be something more human-friendly and better looking, but you get the idea. This may also eliminate my question from above about adding small thumbnails to the allographs & variants menu - let's say I search for all Greek A with ascending crossbar: I will see all annotations, with or without tags, so I will immediately see the var1 tag in the older annotations and will know that this is the tag to use for the newly added annotations. This process doesn't work when I need to add a new type/var, since I still need the list of all types/vars for each letter to see what they represent. We've discussed that such a list (I'm not calling it a vocabulary quite yet, but I will try to evolve it into one) could 1) be generated on the fly from the search or 2) live on a separate page, though updating it will be trickier. Either way, the list should be visible when assigning types/vars.

For now I'm keeping these questions in this thread because they are related to the conceptual conversation and how we define/manage allographs and variants. Happy to restructure and move to a separate issue in the future.

geoffroy-noel-ddh commented 4 months ago

Thank you very much @JonPrag and @simonastoyanova for the responses and the illustrations! Sorry for the late reply; I have been out of action for a few days and then faced with other priorities when I got back.

I'm glad we have some agreement on the annotator data model and the way it is used in practice with the dotted notation.

> I think as a working principle even when we don't have multiple allographs, formally it would be wrong to default this to the top-level of grapheme and it should still have a variant_name, even if that means all cases of X are simply CLOSEST_UNICODE_CHARACTER.VARIANT_X1 - Jonathan

> I'd rather not flatten the model by removing the allograph node even when it's a default allograph because I think that would only cause confusion and mess with sorting/searching/comparing functionalities. Here are the graphs for both cases - Simona


Flexibility vs consistency

When I read the various definitions from CIDOC, Unicode, DigiPal, etc., I notice some ambiguities and misalignments. I think the flexibility of the annotator's data model can work for us, in that it doesn't have to exactly and fully match one ontology as long as your definitions and descriptions based on it can be mapped sufficiently well to the key concepts (i.e. allograph and grapheme). However, the downside of this flexibility is that unless the hierarchical notation you adopt is (eventually) clear and consistent, there's always a risk of misinterpretations and discrepancies.

Variants

For instance, I see that you are proposing a third level in the hierarchy. Is my understanding correct that, in your system, different allographs of the same character/grapheme are necessarily differentiated by their components, whereas what you call variants, of the same allograph, are differentiated only by their features?

My only reservation with the terminology is that allographs seem to be commonly defined as variants (of a grapheme), so the term variant, as distinct from allograph, may lead to confusion. I'm also wondering, in your examples above, if you consider Latin M.Allograph1.Variant1 a different allograph from Latin M.Allograph1.Variant2? Another question: if the variation at that level is about patterns of component-features rather than structural (different components), then what is your main purpose for the definitional refinement? I'm thinking that patterns of component-features for a given allograph could be dynamically displayed by the annotating environment (see #66). In other words, why do you need to systematically crystallise patterns of feature-only variations into their own definitions (below that of an allograph)?

Your questions

Q: "I haven't yet done the hierarchical link type1.var1 and have a technical question about whether we can do that automatically through the annotation files or manually."

I think this can be extracted automatically. Some of the stats can be obtained from the search facets by selecting an allograph and looking at the counts for the tags (and vice versa). If you need a more systematic breakdown or visualisation, that's certainly feasible as well with a bit of development.
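For illustration, a sketch of how such a breakdown could be extracted automatically. The annotation shape here ("allograph" and "tags" fields) is an assumption; the real annotation files may differ:

```python
from collections import Counter

# Sketch: count how often each tag is applied to each allograph.
# The "allograph"/"tags" keys are hypothetical, for illustration only.
def tag_breakdown(annotations):
    counts = Counter()
    for ann in annotations:
        for tag in ann.get("tags", []):
            counts[(ann["allograph"], tag)] += 1
    return counts

anns = [
    {"allograph": "Greek/Α", "tags": ["var1"]},
    {"allograph": "Greek/Α", "tags": ["var1"]},
    {"allograph": "Greek/Β", "tags": ["type2"]},
]
counts = tag_breakdown(anns)
print(counts[("Greek/Α", "var1")])  # 2
```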

Q: how do we utilise it in the annotation process without constantly flicking back and forth between the annotations tab and the search tab

If I understand correctly, you'd like, at the point of tagging, to see examples of how that tag has been applied to that grapheme or allograph, to remind you of your classification? That's a nice idea. I think it should be feasible to show a popup with, for the selected allograph, a list of tags applied to it in the other annotations and some sample thumbnails. Would that be useful? Does it overlap with #66?

Q: Also related is the question from above about the list of allographs & variants, where does it live and how does (and would) it connect to the annotations tab as part of the annotation workflow.

I think it would first live as a combination of the definition file & the annotations. Then, as you gradually identify the variants and settle them into their own definitions, the categorisation becomes more explicitly formalised in the definition file, with less dependency on the annotations. Some dependency remains for evidential purposes, but less for definitional/discriminating purposes.

I can try to better answer that question next time we meet. I think a good understanding of the underlying data model is very important. A key point, I think, is that, in the data model, (Greek) "A.type1.var1" is just one thing (the name of an allograph). There are no separate entities or records for Greek A, type1 and var1. But the annotating environment can see the dotted notation and make use of that hierarchy for various functions (faceting, searching, annotating, bulk editing, etc.).
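To illustrate that single-string point: the hierarchy is never stored as separate records, but it can be derived from the name on the fly. A minimal sketch (the names are illustrative):

```python
# Sketch: the stored allograph name is one string; the hierarchy levels
# are derived by splitting on the dots, not stored as separate records.
def hierarchy(allograph_name):
    """'A.type1.var1' -> ['A', 'A.type1', 'A.type1.var1']"""
    parts = allograph_name.split(".")
    return [".".join(parts[: i + 1]) for i in range(len(parts))]

print(hierarchy("A.type1.var1"))  # ['A', 'A.type1', 'A.type1.var1']
```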

Writing system, Script and Alphabet

I had also seen Peter's ontology before I made the data model and compared it with the legacy Archetype one. Relationships among entities have significantly changed over time. As you say, this can turn into a rabbit hole. The annotator only has a loose concept of script at the moment, and it is dual/redundant: you define Latin & Greek and add allographs to them. As agreed, the dotted notation for an allograph starts with the name of the grapheme, or more accurately the closest Unicode character. I believe the ISO 15924 script can be looked up from a Unicode character.

The pragmatic reason the script is still defined in the data model is that some graphemes may not have a good match in Unicode; it also allows for the definition of scripts that don't exist in Unicode or ISO 15924 (although that's not needed as part of Crossreads).
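
As a quick sketch of that Unicode mapping, the standard library already exposes the formal character name, and the first word of that name is a rough hint at the script. Note this is only a heuristic: the authoritative character-to-script mapping is the Unicode Scripts.txt / ISO 15924 data, which is not bundled with Python's stdlib:

```python
import unicodedata

def unicode_info(char):
    """Look up a character's Unicode name and a heuristic script label."""
    name = unicodedata.name(char)          # e.g. "GREEK CAPITAL LETTER DELTA"
    script_hint = name.split()[0].title()  # e.g. "Greek" (heuristic only)
    return name, script_hint

print(unicode_info("Δ"))  # ('GREEK CAPITAL LETTER DELTA', 'Greek')
```

For a proper ISO 15924 lookup, a third-party data source or library would be needed, but the heuristic above shows how much metadata comes for free once an allograph is anchored to a Unicode character.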

Bulk editing

The editing interface & workflow you suggest makes sense to me and seems feasible. Thanks for the mockups!

Let's implement it incrementally (e.g. allographs, then variants, then thumbnails, then how the illustrative thumbs are picked, etc.), so you get some of what you need most sooner and can give me feedback, while I better understand your use cases and think about the more advanced requirements & technical implications.

simonastoyanova commented 3 months ago

Thank you @geoffroy-noel-ddh! In the last couple of weeks Jonathan and I have had several meetings on this and have revised the data model. We also held a workshop with a number of epigraphers and discussed this with Peter specifically. Here is a link to our Miro board, you should have viewing access, and I'll go over the recent decisions: https://miro.com/app/board/o9J_lT7UmXg=/?moveToWidget=3458764591760620987&cot=14

In the graph you'll see the alignment to the relevant ontologies, as much as we think these concepts map to what we mean by grapheme etc., with links to the definitions. The refinements to the previous model in the above comments are:

  1. character level - keep in the model but make it optional: we map to Unicode when it makes sense, not merely when a code point for a character exists (e.g. upper- and lowercase A/a both exist in Unicode but are not always applicable as different characters in the material). It needs to be part of the model but does not necessarily need a drop-down to select a character under a grapheme. In some cases multiple characters for the same grapheme could be considered variations/allographs, depending on their shape/construction or the evolution of writing; in other cases they will be distinct shapes which need to be described as separate characters with their own allographs under them. So in terms of structure this will be grapheme.character.allograph, where character could be either 'character1', 'character2' etc. or just 'character'.
  2. suballograph - this came up in the workshop as desirable for some types of writing but is not required for all; it is very much dependent on the level of granularity when defining allographs and variants within allographs, and on what is considered enough variation to constitute an allograph. This level of distinction will never be universal between projects, so we have allowed for it in the model on a project-by-project basis. It can be completely absent, as in the first graph with the alpha examples, or present, as in the second graph with the omega examples. The structure here will be grapheme.character.allograph.suballograph or grapheme.character.allograph.
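
A small sketch of how a project could validate that agreed structure, with the suballograph level optional on a project-by-project basis (level names are illustrative, not a committed schema):

```python
# Sketch: parse a dotted name into the agreed levels,
# grapheme.character.allograph with an optional trailing suballograph.

LEVELS = ["grapheme", "character", "allograph", "suballograph"]

def parse_name(name):
    """Map e.g. "omega.character1.allo1.sub1" onto named levels."""
    parts = name.split(".")
    if not 3 <= len(parts) <= 4:
        raise ValueError(f"expected 3 or 4 levels, got {len(parts)}: {name!r}")
    return dict(zip(LEVELS, parts))

print(parse_name("omega.character1.allograph1"))
# {'grapheme': 'omega', 'character': 'character1', 'allograph': 'allograph1'}
```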


Screenshot 2024-07-08 at 10 06 51


Screenshot 2024-07-08 at 10 07 07


For my purposes in Crossreads, after the initial experiment with the var1 definitions, I've decided not to over-define the slight variations in allographs. It was useful to do a small run and see if it makes sense for this material, but it doesn't, so I'm happy to have grapheme.character.allograph as our structure in the project and to search through allograph variation simply by using the feature search on the graphs themselves. So I'll go ahead and remove those tags from the annotations.

All this also shifts my three questions from above. I won't need to assign tags related to variation at the point of annotation. I will still need the bulk selection and tagging at the point of search results, though. So when I'm comparing allographs, I can assign the types based on the clusters of similarity coming from the annotations. For example, I select a bunch of As with the same component-feature and give them all 'type1'.
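
That bulk workflow could be sketched roughly like this (the record shape, feature names and tag names are all hypothetical, just to show the shape of the operation):

```python
# Sketch: from a search result, tag every annotation that shares a
# given component-feature with the same type label.

def bulk_tag(annotations, feature, tag):
    """Add `tag` to every annotation whose features include `feature`."""
    for ann in annotations:
        if feature in ann["features"]:
            ann.setdefault("tags", []).append(tag)
    return annotations

annotations = [
    {"id": 1, "allograph": "A", "features": ["broken-bar"]},
    {"id": 2, "allograph": "A", "features": ["straight-bar"]},
    {"id": 3, "allograph": "A", "features": ["broken-bar"]},
]
bulk_tag(annotations, "broken-bar", "type1")
# annotations 1 and 3 now carry the "type1" tag; annotation 2 is untouched
```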

Agreed re the writing system, alphabet and script. All the definitions we've seen are quite vague and open, which makes sense but is not always helpful. We keep the Script level in our model as we have been using it in the annotator. Jonathan has been talking to Imran about further developing the ontology, but more on that will come later.

Hope this helps, we'll discuss further.

simonastoyanova commented 3 months ago

Here is @JonPrag's revised graph of the simplified data model, following from our discussion today:

Screenshot 2024-07-10 at 16 36 56