How to treat homographs in orthography.tsv?

lexibank / pylexibank

The python curation library for lexibank

Apache License 2.0


How to treat homographs in orthography.tsv? #263

Open martino-vic opened 2 years ago

martino-vic commented 2 years ago

In my orthography.tsv there are now 13 words that are spelled the same way but should be transcribed differently, e.g.

Language_ID	Grapheme	IPA
H	alma	ɒ l m ɒ
EAH	alma	a l m a

which leads to `WARNING:segments.profile:line 21:duplicate grapheme in profile: alma` when I run the lexibank script.

xrotwang commented 2 years ago

I think in this case, you'd need to switch to using per-language orthography profiles. See https://github.com/lexibank/pylexibank/blob/8ae170cecb67f450b7a8cbaa56ded94281944b0f/src/pylexibank/dataset.py#L138-L148 for details.
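A minimal sketch of how the combined profile could be split into one profile per language. It assumes the column names from the excerpt above (`Language_ID`, `Grapheme`, `IPA`) and that each per-language profile only needs the `Grapheme` and `IPA` columns. The helper names `split_profile` and `render_profile` are hypothetical, and where the resulting files have to be placed and how they must be named should be checked against the linked code.

```python
import csv
import io
from collections import defaultdict


def split_profile(tsv_text):
    """Group the rows of a combined profile by their Language_ID value."""
    per_language = defaultdict(list)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        per_language[row["Language_ID"]].append((row["Grapheme"], row["IPA"]))
    return dict(per_language)


def render_profile(rows):
    """Render (grapheme, ipa) pairs as the text of a per-language TSV profile."""
    lines = ["Grapheme\tIPA"]
    lines.extend(f"{grapheme}\t{ipa}" for grapheme, ipa in rows)
    return "\n".join(lines) + "\n"


# The two homographs from the question, as they appear in the combined file.
combined = (
    "Language_ID\tGrapheme\tIPA\n"
    "H\talma\tɒ l m ɒ\n"
    "EAH\talma\ta l m a\n"
)
profiles = split_profile(combined)
```

Each value of `profiles` can then be passed to `render_profile` and written to its own file, so that no single profile contains `alma` twice.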

LinguList commented 2 years ago

Yes, @martino-vic, we have enough example cases for these profiles. As a workaround, you can also list the 13 cases in your code and segment them directly. Just add a dictionary keyed by language and value, and I will check later how the segmentation could be done on that basis.
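A rough sketch of what such a dictionary could look like, using the `alma` example from the question. The names `HOMOGRAPHS` and `segment` are hypothetical and not part of pylexibank; `fallback` stands in for whatever segmentation the dataset normally applies, and how this would be hooked into the lexibank script is left open here.

```python
# Hypothetical override table: (Language_ID, Value) -> space-separated segments.
HOMOGRAPHS = {
    ("H", "alma"): "ɒ l m ɒ",
    ("EAH", "alma"): "a l m a",
}


def segment(language_id, value, fallback):
    """Return the listed segments for a homograph, else defer to `fallback`.

    `fallback` is any callable that takes the value and returns a list of
    segments, for example a wrapper around the regular orthography profile.
    """
    override = HOMOGRAPHS.get((language_id, value))
    if override is not None:
        return override.split()
    return fallback(value)
```

With this, `segment("H", "alma", fallback)` and `segment("EAH", "alma", fallback)` return different transcriptions, while every form not listed in the dictionary is still handled by the regular segmentation.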