proycon / analiticcl

an approximate string matching or fuzzy-matching system for spelling correction, normalisation or post-OCR correction
GNU General Public License v3.0

Question: Alphabet file, punctuations and diacritics #16

Closed: pirolen closed this issue 1 year ago

pirolen commented 1 year ago

I have a question regarding the creation of the alphabet file and punctuations.

In order for punctuation marks to be ignored when computing variants, should the (expected) punctuation marks and diacritical marks be listed in the alphabet file in a specific way?

(Or should the data be preprocessed and stripped of these?)

Or should they be listed in the same way as e.g. numerals (cf. the documentation)?

There are a lot of diacritical marks in my data; these are often separately encoded Unicode characters. A short excerpt from https://www.unicode.org/notes/tn41/tn41-1.pdf:

[screenshot: excerpt from the Unicode technical note on Church Slavonic]
proycon commented 1 year ago

There are a lot of diacritical marks in my data; these are often separately encoded Unicode characters.

If you want to ignore diacritical marks and consider a, á, à, ã and ä exactly equal, then put those symbols on the same line in the alphabet file. If you do want analiticcl to distinguish them, put each on a separate line.

Regarding punctuation, analiticcl knows (via Unicode) what is an alphabetic character and what is not. Anything that is not alphabetic (numbers, punctuation, emoji) will be used to signal a possible token boundary. So preprocessing your data and stripping it of punctuation shouldn't be necessary.

The alphabet should still contain the punctuation, and you can put punctuation marks you consider exactly equal on the same line.

Any character that's not in the alphabet will get an "unknown" label, and analiticcl treats all unknown characters as if they were one and the same unknown character.
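
For illustration, a minimal sketch of such an alphabet file, assuming you want the accented vowels folded together while keeping the punctuation marks distinct (each line is one class of characters that analiticcl treats as identical):

a á à ã ä
e é è ê ë
.
,
;
-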

There are a lot of diacritical marks in my data; these are often separately encoded Unicode characters.

Make sure the input is in Unicode Normal Form C (NFC, the C stands for Composed); this ensures that combining diacritical marks and their base characters are merged into one codepoint whenever possible. But I admit I'm not sure to what extent this holds for Old Church Slavonic and what the effects might be on analiticcl if a character is unduly split. Fortunately, alphabet entries need not be a single Unicode codepoint for analiticcl; something like this is perfectly valid:

ae æ AE Æ
oe œ
ue ü
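
As a quick way to check whether NFC actually composes a given combining sequence into a single codepoint, a minimal Python sketch using the standard unicodedata module (not analiticcl itself) can help:

import unicodedata

decomposed = "a\u0301"  # 'a' + COMBINING ACUTE ACCENT: two codepoints
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 2 1
print(composed == "\u00e1")            # True: U+00E1 LATIN SMALL LETTER A WITH ACUTE

If len(composed) stays above 1 for your Church Slavonic combinations, NFC has no composed form for them and multi-codepoint alphabet entries like the ones above are the way to go.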

Just out of curiosity I'd be interested in seeing your alphabet file when you're done with it :)

pirolen commented 1 year ago

After a bit of research, this looks like a complicated topic :-( E.g. https://www.unicode.org/reports/tr15/tr15-53.html#Concatenation

Make sure the input is in Unicode Normal Form C (NFC, the C stands for Composed); this ensures that combining diacritical marks and their base characters are merged into one codepoint whenever possible.

Did you mean NFC? But not NFKC?

But I admit I'm not sure to what extent this holds for Old Church Slavonic and what the effects might be on analiticcl if a character is unduly split.

In https://www.unicode.org/notes/tn41/tn41-1.pdf on Old Church Slavonic, I did not find (via a quick search) these acronyms (NFC, etc.), but there was text about pre-/decomposition; I am planning to read it more thoroughly.

Maybe I will just proceed with whatever data and alphabet I have right now, observe how the analiticcl output behaves, and try to understand what Unicode steps need to be taken to improve it.

pirolen commented 1 year ago

I would not mind reopening the issue, now that I have done a bit of exploration of the Unicode characters in my data.

If I understand correctly, my data involves combining characters from the Unicode Private Use Area. I am not sure how these should be represented in the alphabet file. They render quite arbitrarily here in the browser; in e.g. VS Code with a generic font (Menlo) one sees rectangles when applying ud.normalize('NFC', mychar) and printing the character names:

E.g.
Aда́мово ['A', '\ue012', ...]
Aзь 4 ['A', '\ue00e', 'з', 'ь']
Iу҃ 4 ['I', '\ue012', 'у', '҃']

Screenshots rendered with the font Bukyvede are attached.
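
For reference, the per-character inspection described above can be done with a small Python sketch like this (describe is a made-up helper name; unicodedata.name() has no entry for Private Use Area codepoints, hence the fallback string):

import unicodedata

def describe(text):
    # Print each codepoint as U+XXXX together with its Unicode name, if it has one
    for ch in text:
        name = unicodedata.name(ch, "<no name: private use or unassigned>")
        print(f"U+{ord(ch):04X} {name}")

describe(unicodedata.normalize("NFC", "A\ue00eзь"))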

Apologies for the troubles... :-/

proycon commented 1 year ago

After a bit of research, this looks like a complicated topic :-(

Indeed, one normally doesn't think about it, but there are quite a few important details to get right when normalizing scripts so that they're automatically comparable.

Did you mean NFC? But not NFKC?

Yes, I meant NFC, but for the purposes of analiticcl NFKC would be fine too, or perhaps even better (it's just a bit more distant from the original text).

The most important point is that all resources (lexicon, alphabet and your input texts) use the same form of normalization; which exact normalization it is matters less.
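
A minimal sketch of what that could look like in Python, using the standard unicodedata module; the file names are hypothetical, and the only point is that lexicon, alphabet and input all pass through the same normalize() call:

import unicodedata

FORM = "NFC"  # pick one form (NFC or NFKC) and apply it to every resource

def normalize_file(path_in, path_out, form=FORM):
    # Rewrite a text resource in the chosen Unicode normal form
    with open(path_in, encoding="utf-8") as infile:
        text = infile.read()
    with open(path_out, "w", encoding="utf-8") as outfile:
        outfile.write(unicodedata.normalize(form, text))

# hypothetical file names, for illustration only
for name in ("lexicon.tsv", "alphabet.tsv", "input.txt"):
    normalize_file(name, "normalized_" + name)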

If I understand correctly, my data involves combining characters from the Unicode Private Use Area. I am not sure how these should be represented in the alphabet file.

The fact that they're in the Private Use Area is not an issue, although it probably means there are no composed forms.

Aзь 4 ['A', '\ue00e', 'з', 'ь']

A + that diacritic in \ue00e probably does not have a composed form in Unicode itself. So you can just put the combination (the A followed by the \ue00e mark, as a single alphabet entry) in the alphabet file. Just make sure to add it before A itself (long patterns come before short patterns, because analiticcl matches greedily). I think that's an important point I may not have stressed yet.
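
To make that concrete, a sketch of the relevant alphabet lines (here <U+E00E> stands for the literal private use character written directly in the file, not for that ASCII string):

A<U+E00E>
A
з
ь

Because of the greedy matching, the two-codepoint entry on the first line is tried before the bare A on the second.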

I hope this gets you a bit further?