sillsdev / cog

Cog is a tool for comparing languages using lexicostatistics and comparative linguistics techniques.
http://sillsdev.github.io/cog/
MIT License

Recommended Blair Method's Setting #59

Closed paschawu closed 7 years ago

paschawu commented 8 years ago

Can you recommend the best setting for a total of 1000 wordlists (10 languages, each 100 wordlists)?

ddaspit commented 8 years ago

Cog follows the method exactly as it is outlined in "Survey on a Shoestring" by Frank Blair. You can find out more about the Blair method at this site. I would recommend reading it.

Cog does support some variations on the Blair method that might be appropriate for your language family. For example, in Mainland Southeast Asia, we ignore insertions/deletions that occur regularly. You can also configure Cog to ignore correspondences that occur in a specific environment. For example, you could configure Cog to ignore word-initial nasals.

I would use the "Automatically determine regular correspondence threshold if possible" setting. This tells Cog to try to figure out the best threshold for determining whether a correspondence is regular. It does this by looking at the size of each word list and the number of occurrences of each segment. This should provide a better threshold than the default of 3, which was really intended for the word list size of 210 that Blair used.

One of the key things is to specify which correspondences are similar for your language family. By default, Cog uses a phonetic threshold, but the similar segments are normally customized for each language family.

You can find out more about each setting on this page. Let me know if you have any questions about specific settings.

paschawu commented 8 years ago

The language family that I am working with is Western Malayo-Polynesian (Austronesian), and it seems that checking "Ignore regular insertions/deletions" and "Regular consonant correspondences are treated as Category 1" does the job. But I am still confused about the "Likely cognate identification" threshold setting. Is there a theoretical basis for the default value of 30%? In my case, I set it to 70% because below that value the resulting tree doesn't match the language family very well. Thanks in advance!

ddaspit commented 8 years ago

I am glad that the "Ignore regular insertions/deletions" and "Regular consonant correspondences are treated as Category 1" options are working for you. We use those settings as well when comparing our word lists here in Southeast Asia.

Cog uses the "Initial cognate threshold" setting in the "Likely cognate identification" section to deal with a chicken-and-egg problem when determining cognates. Determining a cognate is normally dependent on first determining the regular sound correspondences for a language variety. Determining regular sound correspondences is dependent on first determining the cognates. As you can see, this poses a problem for Cog.

To overcome this, Cog first makes an educated guess as to which word pairs are cognate. It can use these guesses to find regular correspondences, which can in turn be used to figure out which word pairs are cognate. It continues in this iterative fashion until things stop changing.

Cog makes the initial guess at which word pairs are cognate by looking at all word pairs that are phonetically similar, using the "Initial cognate threshold" setting. I set the default threshold to a low value so that Cog wouldn't miss possible cognates for most word lists. If a cognate word pair isn't included in the initial guess of cognates, then it most likely wouldn't be found to be cognate. With a higher threshold like 70%, Cog will be pickier about which word pairs are considered cognates initially. It sounds like a higher value gives better results for your data, which is fine. You can get more details about how Cog uses the threshold setting here.
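The iterative loop described above can be sketched roughly as follows. This is a toy illustration under loud assumptions, not Cog's actual implementation: `phonetic_similarity` uses plain string similarity from Python's `difflib` as a stand-in for a phonetic measure, segments are aligned naively position by position, and the names `identify_cognates`, `min_count`, and `min_share` are hypothetical. The word pairs are toy data chosen for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def phonetic_similarity(a: str, b: str) -> float:
    """Stand-in score in [0, 1]; plain string similarity, not a phonetic measure."""
    return SequenceMatcher(None, a, b).ratio()


def regular_correspondences(pairs, min_count):
    """Segment pairs that occur at least min_count times in the current guesses."""
    counts = Counter()
    for a, b in pairs:
        counts.update(zip(a, b))  # naive alignment: position by position
    return {seg for seg, n in counts.items() if n >= min_count}


def is_cognate(a, b, regular, min_share=0.5):
    """Hypothetical rule: enough aligned segments are identical or regular."""
    aligned = list(zip(a, b))
    supported = sum(1 for x, y in aligned if x == y or (x, y) in regular)
    return bool(aligned) and supported / len(aligned) >= min_share


def identify_cognates(word_pairs, initial_threshold=0.3, min_count=2, max_rounds=20):
    # Step 1: educated guess based only on surface similarity.
    cognates = {p for p in word_pairs if phonetic_similarity(*p) >= initial_threshold}
    for _ in range(max_rounds):
        # Step 2: derive regular correspondences from the current guesses.
        regular = regular_correspondences(cognates, min_count)
        # Step 3: re-decide cognacy using those correspondences.
        updated = {p for p in word_pairs if is_cognate(*p, regular)}
        if updated == cognates:  # nothing changed, so stop iterating
            break
        cognates = updated
    return cognates


# Toy data for illustration only.
pairs = [("mata", "mata"), ("batu", "watu"), ("bulu", "wulu"), ("kinaq", "lihat")]
result = identify_cognates(pairs, initial_threshold=0.3)
```

In this sketch the initial threshold only decides which pairs seed the first round of correspondence counting; a pair that slips in at a low threshold can still be dropped later if its segment pairs are not supported by the rest of the data.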

paschawu commented 8 years ago

Thanks for your fast response. The problem with 30% in my case is that sometimes two words with the same structure, for example "kinaq" and "lihat", are classified as cognate, although they clearly are not, judging from the already-reconstructed proto-language. I also disabled the "Reward segment pairs proportional to their frequency of correspondence" and "Penalize segment pairs in different syllable positions" options. Will that affect the Blair method that I am using? Sorry for bothering you, and for my English :+1:

ddaspit commented 8 years ago

Your English is really good. I didn't even realize that you weren't a native English speaker until you mentioned it.

Did you know that you can see why Cog decided that a particular word pair was a cognate? First, navigate to the Variety Pairs view, then select the two varieties for the word pair that you are interested in. On the left, all of the word pairs are split up into two sections: cognates and non-cognates. Find the word pair that you are interested in. You can search by gloss or by form. You will be able to see how the words are aligned and how each segment pair was categorized using the Blair method. The numbers under each segment pair indicate the Blair category. This should tell you why Cog thought that "kinaq" and "lihat" were cognate.

The two options you mentioned have to do with how word pairs are aligned. The "Reward segment pairs proportional to their frequency of correspondence" setting means that segments that often correspond will be more likely to be aligned with each other. The "Penalize segment pairs in different syllable positions" setting means that Cog will encourage words to align along syllable boundaries. These settings do have an effect on how segments in words are aligned. The Blair method uses these aligned segment pairs to determine whether a word pair is cognate or not.

paschawu commented 8 years ago

Brilliant answer! Thank you very much ddaspit! Terima kasih.