stenskjaer / samewords

Automatically annotate potentially ambiguous words in critical text editions made with LaTeX and reledmac.
MIT License

Compare search words as lower case? #3

Closed stenskjaer closed 6 years ago

stenskjaer commented 7 years ago

Would it be better always (or subject to customization) to compare words as lower case instances? The words in the critical apparatus may appear in lower case, creating an ambiguity that would not be caught if titlecased and lowercased words are compared case sensitively.

On the other hand, since the lemma words are used as the form of the search word, I guess the problem lies not so much in how words are transformed between their main text and lemma appearance as in comparing lemma words with other instances in the text.

Example 1 (true case lemma entries):

An example of an (un)ambiguous case. 1 an ] om. P

Lemma: an². This would not match when searching the context. But it would also not be ambiguous, as the two words differ in appearance. This could be distinguished from the alternative:

An example of an (un)ambiguous case. 1 An ] om. P

But: The first example could still confuse a reader who expects any lemma word to be lower case.

Example 2 (always lower case lemma entries):

An example of an ambiguous case. 1 an ] om. P

With the practice of always lower casing the apparatus lemma (a decision that samewords should be agnostic to), this would be ambiguous.

Both examples here may lead to confusion.

Idea 1: Lower case context words

If we lower case the context words before comparison:

  1. Matches will occur when lemma words are always lower cased, regardless of whether the context word is lower or titlecased.
  2. Matches will not occur when the lemma is not lower cased (unless, of course, the line contains the same word in titlecase more than once). But the lemma will also not be ambiguous, as its form is obviously titlecased.
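
A minimal Python sketch of Idea 1, assuming the context has already been split into words. The function name `context_matches` is hypothetical and not part of samewords. It follows a literal reading of the idea: only the context words are lowercased, and the lemma is compared as written.

```python
def context_matches(lemma: str, context: list[str]) -> list[int]:
    """Return the positions of context words that equal the lemma
    once the context words (but not the lemma) are lowercased."""
    return [i for i, word in enumerate(context) if word.lower() == lemma]


context = ["An", "example", "of", "an", "(un)ambiguous", "case"]

# Lower-cased lemma: both "An" and "an" are found.
print(context_matches("an", context))  # [0, 3]

# Titlecased lemma: nothing matches, because every context word is lowercased.
print(context_matches("An", context))  # []
```

Under this literal reading a titlecased lemma never matches anything, so catching a repeated titlecased word (the exception in point 2) would presumably need an additional exact comparison.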

Idea 2: Lower case both lemma and context before comparison

This way the annotation would be as explicit as possible. This might lead to some redundancy in annotation and disambiguation, but should not leave any room for doubt.

Unless of course:

An example of an (un)ambiguous case. 1 an¹ ] om. P

This could be interpreted as referring to the first lower case instance. That would cause confusion too, as there is only one such instance.
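
A rough Python sketch of Idea 2, again assuming pre-tokenized context. The function `occurrence_index` and its `sensitive` parameter are hypothetical names used for illustration, not the samewords implementation. It returns which occurrence a word is, and how many equal words there are, with both sides lowercased unless `sensitive` is set.

```python
def occurrence_index(
    context: list[str], position: int, sensitive: bool = False
) -> tuple[int, int]:
    """Return (ordinal, total) for the word at `position`, counted
    among all context words that compare equal to it."""

    def key(word: str) -> str:
        # Lowercase both sides of the comparison unless asked not to.
        return word if sensitive else word.lower()

    target = key(context[position])
    hits = [i for i, word in enumerate(context) if key(word) == target]
    return hits.index(position) + 1, len(hits)


context = ["An", "example", "of", "an", "(un)ambiguous", "case"]

print(occurrence_index(context, 0))                  # (1, 2)
print(occurrence_index(context, 3))                  # (2, 2)
print(occurrence_index(context, 3, sensitive=True))  # (1, 1)
```

With case-insensitive counting, the index 1 belongs to the titlecased "An", not to the first lower case "an", which is exactly the misreading described above.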

stenskjaer commented 7 years ago

I think I'll add this as a configurable feature (parallel to the option in reledmac discussed in the referenced issue).

stenskjaer commented 6 years ago

The default setting is now case-insensitive comparison. Case-sensitive comparison can be activated with the sensitive_context_match config switch.
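
As an illustration of how such a switch might be consumed, here is a small Python sketch. Only the key name sensitive_context_match comes from the comment above; the JSON format, the helper `is_sensitive`, and the way the default is applied are assumptions made for the example.

```python
import json


def is_sensitive(config_text: str) -> bool:
    """Read the (assumed JSON) configuration text and report whether
    case-sensitive context matching is switched on."""
    settings = json.loads(config_text) if config_text.strip() else {}
    # Assumed default: insensitive comparison when the key is absent.
    return bool(settings.get("sensitive_context_match", False))


print(is_sensitive(""))                                   # False
print(is_sensitive('{"sensitive_context_match": true}'))  # True
```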