Sound corresp - Githubissues

XachaB commented 3 years ago

Here is a big PR on the sound correspondence code. These are all the changes and improvements made since start of November, while working with @erichround (does he have the rights to view this repository ? If not it would be good if he did) and which led to the presentation right before the holidays.

I cleaned up the code and added some more features while preparing the PR, such as the export of a few examples with the counts.

Among important changes:

We use a custom coarsening to ignore some of the noise in the data.
ASJP is now usable only with the ASJP dataset, for comparison purposes.
Forbids the alignment of tones with other types of sounds
The similarity threshold for cognacy is a proportion of the number of syllables in words
The similarity threshold for cognacy is computed on SCA strings (but the actual alignment isn't)
Addition of contexts
The cutoff is proportional to the size of the list (with a minimum of 2)
We export metadata in order to track the parameters which lead to results, including an account of the coarsening classes, with the coarse features.
Refactoring and docstrings for documentation and legibility.

This can be run on a single dataset as:

 cldfbench lexitools.correspondences --clts-version v1.4.1 --model Coarse --cutoff 0.05 --threshold 1 --dataset lexibank/kraftchadic

Or on the hard-coded lexicore list as:

 cldfbench lexitools.correspondences --clts-version v1.4.1 --model Coarse --cutoff 0.05 --threshold 1 --dataset lexibank/kraftchadic

SimonGreenhill commented 3 years ago

(just sent an invite to Erich)

XachaB commented 3 years ago

Thanks for the reviews ! I did my best to document things enough so that a few weeks (or more) without touching the code wouldn't hurt later progress.

As to the feature reduction system: This is essentially some kind of a sound class system, and I'd love to include it on the longer run in CLTS. I wonder only if it can be done in a slightly more transparent manner, but this is something that one can easily test later. Having reduction systems inside the pyclts code is generally important to make sure they are in line with the most recent pyclts version, I think, although we hope that the most recent changes were the last changes we did so far.

I agree that it would be better somewhere in CLTS. The compatibility issue is already pertinent: right now it depends on v1.4.1, and I have not investigated yet how to update it for the new version. Erich and I are not completely done refining the coarsening scheme itself, so maybe it should all better wait until this is done.

I am aware that the coarsening might be a bit hacky, as I don't know the inner workings of CLTS well enough to formulate something that would mesh seamlessly with existing systems. If I understand properly, current sound class systems are given as direct mappings of sounds, but I think there is value (and a form of transparency) in being able to specify reductions or transformations on the features instead -- this makes it easier not to forget some specific diacritic combination.

That said, I unfortunately doubt I will have much dev time for this project in my new job, and when I do, the priority will be towards steps that take us closer to a written paper.

One point to discuss and test is: if the argument is that the SCA-cognate detection can be used by the alignments are done with edit distance, one could also use another cognate detection method and see what this changes.

I'm not sure I follow your summary of the argument, but:

I agree that we could add a mechanism to swap out cognate detection methods.
- I saw that lexstat in lingpy is intended for this, but I think I understand that you have several possible algorithms to do cognate detection. Do you have a suggestion of what we should try ? A pointer to some class in lingpy, some code which uses it, or examples, would help me.
Moreover, we have some gold cognacy data in Lexibank, and I think we probably should use it.
- that is... unless we are afraid it again would introduce too much bias ?
- We could use the COGID when the dataset has it
- For ST languages, we sometimes have COGIDS and segmented words, and what we are currently doing is probably catastrophic, we should instead be comparing parts of words. I know @MacyL has been trying to figure out the best way to decide how to match partial words, but I am not sure what exactly I should do for this.
- Note that we would still need detection for when comparing across datasets, and when the datasets does not have gold cognate sets.

For now, I'll add a few "TODO" indications in the code for these questions, but we should discuss them further.

I'd furthermore also argue to disallow vowel-consonant matches, since the whole discussion in phonology about glides becoming vowels and the like is barely met in real-world alignments, but rather in morphology, and we explicitly want to deal with sound change. But this is for later discussion.

While this may be true, the issue is that the fact that we do find neat C/V clusters with transitions via glides is currently one of the results of our work (and while the hypothesis is trivial, it is cool to have it quantified). If we specify it in input, then we can not claim anymore that it is a result. If we don't mind that, then we surely can do this C/V restriction.

I'll re-run everything on the full lexibank today, try to spot any leftover problems, and merge after fixing them.

XachaB commented 3 years ago

It turns out that my refactors while improving readability made the code much slower and more memory hungry, to the point that it quickly fills my 15G of RAM. I will spend a little while trying to optimize both.

lexibank / lexitools

Sound corresp #14