Extended wordlists in borrowing detection and cognate identification

lexibank / sabor

CLDF datasets accompanying investigations on automated borrowing detection in SA languages

Creative Commons Attribution 4.0 International

0 stars 0 forks source link

Extended wordlists in borrowing detection and cognate identification #34

Open fractaldragonflies opened 2 years ago

fractaldragonflies commented 2 years ago

This began as a path forward priority from analysis of non-detect errors in donor focused borrowing detection. i.e., Many errors were due to lack of the known donor word from WOLD in the donor source wordlist, i.e., ids-Spanish wordlist. I've captured our discussion in email and have moved into this issue.

from @LinguList:

The major problem is that we have borrowings with shifted meanings in our wordlist, for some borrowings we even have the source word. The obvious way to deal with this is to use a Spanish dictionary with > 3000 entries (maybe extended whatever wordlist from, say, a Wordnet project) and then run a all-to-all comparison here. So each word in a target language is compared to all 3000 words in our dictionary.

However, initial tests I made on cognate detection showed that this increases the noise quite drastically, so it is not feasible to compare all against all, since there will be many more chance resemblances. So we need to control for meaning in some way.

This is where it is getting interesting, and we have to discuss how we can deal with that (I have some ideas based on colexifications). Please check this paper by Arnaud et al: "Identifying Cognate Sets Across Dictionaries of Related Languages" (it is freely available via the ACL website). It deals with comparing whole dictionaries. We don't do that, but it'll help you to read their work.

So our next steps as I see them is:

work on a larger list of > 3000 concepts. We can use maybe concepticon, as we have a merger of several concept lists there, your idea of expanding the original Spanish list is also possible, or we add words from https://1000mostcommonwords.com/1000-most-common-spanish-words/ something along this line, arguing: we compare against the most frequent words now.

make initial tests of the methods by checking what happens if we compare ALL words against each word in the target languages to confirm my intuition based on previous studies.

discuss next steps regarding semantics

I took the path of using concepticon via pyconcepticon to construct a list of Spanish words. Just shy of 3000 entries. Only using the forms returned by concepticon with some cleanup to replace use of pattern “/” with the intended alternatives as separate forms. Only suitable for Spanish and Portuguese as I did not try to generalize. NEXT I will create an orthography file for converting the vocabulary given as Spanish orthography to segmented IPA, as I did previously when we accessed the Spanish wordlist from IDS.

from @LinguList

Well, what we have here is an extended wordlist that we compare against an existing dataset. So we have two datasets, on the one hand, but on the other hand, we can also say we have one dataset with one larger wordlist.

If we find that this Spanish list is useful, we might want to propagate using it on other datasets. So maybe, we better assume it is ONE dataset in its own rights, and we adjust the code to work with two CLDF datasets? That is definitely trivial, and we can announce the Spanish larger list with a blogpost. That does not bring you fame, but if you write some 2 pages on this list, we put it in a CLDF dataset, etc., we can quote it, and you get quotes, and we can use it when doing follow-up studies (I am talking of blog posts in https://calc.hypotheses.org/). What do you think?

I've attached the Spanish wordlist, with IPA and segmented IPA also added. Largely cleaned up with 2918 entries. Subsequent comments document the progress on this dataset candidate and related sub-issues of extended wordlists. spanish_forms_ipa_tokens.txt

fractaldragonflies commented 2 years ago

I performed an evaluation of the newly derived spanish_forms dataset versus my previous error analysis over concepts A-F of the Sabor dataset. I observed that very few of the missing donor source words were resolved with the new dataset.

The following concept-forms had been added (good): BELOWORUNDER - abajo, BLANKET - manta, BOAT - barca, CLOTH paño, CHOKE - ahogarse.
But many more are still lacking: e.g., BILL - boleta, BLANKET - chamarra, BOAR (MALE) - varraco, BOAT - barco, BRAVE - guapo, BRICK - adobe, CANDLE - candela, CENTIPEDE - cientopies, CHISEL - formón ...

@LinguList

... does it mean, there are still many source forms in Spanish in WOLD for borrowings which we do not capture? If this is the case, we should make an empirical investigation here, listing all source forms in one file and checking the overlap. Because this would help us to assess from WHICH part of the lexicon these concepts are!

Note that: if borrowing happened 200 years ago, we are dealing with different frequency distributions, so the top scoring most frequent words may have shifted by now. Note also: a list of borrowing sources from Spanish (be it for the extended Pano data or the SABOR data) would allow us to do some additional analyses to check:

overlap with concepticon

age of the word in Spanish

frequency

semantic shift

overlap with Spanish IDS

Not that we need to do all of it, but this helps us to assess what lists are USEFUL for borrowing detection in a Latinoamerican setting. And THIS is a publication / discussion in its own rights I think. One could even check embeddings in corpora to see if these concepts have shifted there meanings over the last 200 years, but I have never done these kinds of analyses.

But starting with the list and then checking for frequency (using some available frequency dictionaries of Spanish (?)) would already give us some interesting ideas, as well as comparing overlap with IDS Spanish and Concepticon-Spanish-Glosses.

Our method to populate the donor lexicon has to be based something like frequency (as you mentioned), or maybe from dictionary entries for the English word concepts. Clearly, based on the Sabor analysis of errors, word borrowing does not coincide with prototypical concept glosses. Actual language use is more varied and less determinate.

@LinguList

... Instead of checking which words are there in Concepticon, etc., we just pull out all donor words in SABOR and then use the automated matching algorithm from Concepticon to link them and then we check the overlap. That can also be manually refined, but maybe one can even do without that (or do it quickly). In any case, the question, what kind of overlap we'd expect etc., can in this way be solved rather easily.

We could then also take some simple dictionary, no idea which, but there must be some out there on Spanish, which, maybe also provide some vocabulary with information on frequency, and we run the "borrowed words list" from sabor against that, to see what words are really rare. And then we need to keep one thing in mind: if the dictionary shows some 90% coverage, but our Concepticon only 50% of these words, we will want to know what impact the usage of a full dictionary has on our analysis! If you don't control for meaning here, it may be well disastrous, with loads of chance resemblances.

So the tradeoff we need for any study is: sacrifice a couple of words we may not find, or may need to find manually, to avoid that we find patterns EVERYWHERE, especially where they don't belong. This is something I'd like to show in a study for once, since people often do not believe me when I tell them that more is not necessarily also "better".

So concrete steps to think about:

our concrete donor words from sabor as a distinct word list

map the list to concepticon using for example pysem (there's a blog post on how to use it in https://calc.hypotheses.org/3193)

try to find a good digital dictionary that we could use (if not, I can crawl something like the PONS)

I'm working on steps 1 & 2. Progress is reported in the subsequent comment.

Having played with matching words by pairwise or multiple alignment methods, I appreciate the risks of chance alignments if not restricting in some fashion. We restrict based on concept or semantics with our multilingual methods, and by pairwise differences between competing models for our competing cross-entropies methods.

You’re right, while the risks will grow the more words we throw into the mix, we really haven’t quantified the risk. Something we could consider for a full fledged investigation of this:

Add ASJP, Swadesh or similar wordlists ~200-300 entries to nail down the low end of risk,
Our current Sabor wordlists, we have a lower limit of ~1,000-2,000 entries, and
As you suggest Dictionaries ~10,000 (bigger dictionaries would make problems even more obvious but maybe not necessary).

fractaldragonflies commented 2 years ago

Finally I have pretty complete results for overlap studies between Spanish known donor forms with Concepticon forms (2,900 forms) and with IDS Spanish source forms (1,500). I report 2 ways: 1) unique donor form basis, and 2) target language form basis (which represents frequency of use).

The target language basis is more definitive as it uses known donor forms and actual IDS Spanish source forms used in the borrowing detection study. Tendencies are the same across analyses.

Bottom line, the lack of forms to even match form the donor source makes it impossible to improve performance much. Both the concepticon based database I created (2,900 entries) and the current IDS source lack ~25% of known donor forms (389 not matched concepticon base, 368 current IDS Spanish source). The problem of matching on concept, given that the form is present, is less severe (~10%), but we don't know for the cases where the form is lacking altogether.

% python src/words_to_concepticon.py
Analysis based on concepticon concept table.
Concepticon items (unique forms) basis:
747 glosses
281 no matched forms
466 matched forms in concept lists

Spanish donor wordlist basis:
1480 glosses
389 not matched forms
1091 matched forms in concept lists
  925 same concept match
  166 different concept match
26.3% no-coverage (words without concepts)
73.7% coverage (words with concepts)
  62.5% same concept coverage
  11.2% different concept coverage
Wrote 389 forms to file not_matched_spanish_donor_forms_NR.tsv.
Wrote 925 concept-forms to file same_concept_spanish_donor_forms_NR.tsv.
Wrote 166 concept-forms to file different_concept_spanish_donor_forms_NR.tsv.

Analysis based on IDS Spanish donor table.
Read 1683 form_concepts to file spanish-ids-form-concepts.tsv.
Concepticon items (unique forms) basis:
747 glosses
260 no matched forms
487 matched forms in concept lists

Spanish donor wordlist basis:
1480 glosses
368 not matched forms
1112 matched forms in concept lists
  1000 same concept match
  112 different concept match
24.9% no-coverage (words without concepts)
75.1% coverage (words with concepts)
  67.6% same concept coverage
  7.6% different concept coverage
Wrote 368 forms to file ids_not_matched_spanish_donor_forms_NR.tsv.
Wrote 1000 concept-forms to file ids_same_concept_spanish_donor_forms_NR.tsv.
Wrote 112 concept-forms to file ids_different_concept_spanish_donor_forms_NR.tsv.

These results are entirely consistent with our no-detect (FN) results on borrowing detection using sequence alignment based methods (which require that the source form be present too).

Results from evaluation of Sabor training over entire dataset.

Language             tp    tn    fp    fn    precision    recall    F1 score    accuracy
-----------------  ----  ----  ----  ----  -----------  --------  ----------  ----------
Overall            1055  8779    74   422        0.934     0.714       0.810       0.952

Here are the output files that correspond to the cases of concept-form matches, form-only matches, not-matched with the IDS-Spanish source forms ids_different_concept_spanish_donor_forms_NR.txt ids_not_matched_spanish_donor_forms_NR.txt ids_same_concept_spanish_donor_forms_NR.txt .

fractaldragonflies commented 2 years ago

So, this is where we are at...

We've shown definitively that lack of corresponding lexical forms in the donor source language, is largely responsible for non-detect with our sequence alignment methods.
- We've seen this is both the extended dataset based on Concepticon Spanish and in the same IDS Spanish source that we used in our analysis.
- [Btw, the competing entropies method is less problematic here since it doesn't require a competing form, competing models are trained on available forms. Of course, more training data from the missing forms would help in training it to perform better too.]
For cases where the form was present in the IDS Spanish source, ~10% were from different concepts and so not matched by alignment methods either.
- For cases where the form was absent, we don't know what % would be from different concepts. Based on my analysis of Sabor results (sample of concepts in A-F), a much larger %.
Use of the extended dataset from Concepticon Spanish, is at best incrementally better than using the original IDS Spanish wordlist.

So the next steps would be:

Develop dictionary of 5,000 or 10,000 forms and evaluate to see if this would offer ample coverage of forms.
Perform the experiments of borrowing detection without the restriction to same concept, on both the current IDS Spanish source, and the new dictionary, assuming we achieve adequate coverage.
- Since only 10% of forms are from different concepts for the IDS Spanish wordlist, it's really with just the larger dictionary that we might see an impact of these experiments.
Develop 'oracle' dictionary [new idea] which miraculously has all Spanish donor words in its wordlist. Use this in experiments above, and to show the potential effectiveness of our detection methods when the dictionary is adequate.
With this in hand we can finally experiment with a partial relaxation of the same concept restriction, assuming that our results above say no restriction results in way too many false-positives.

I'll start looking to extend to dictionary to the 5,000 to 10,000 level. Taking into account your suggestions of 1000 most common words and digital dictionaries. Other suggestions?

fractaldragonflies commented 2 years ago

I downloaded a bigger Spanish dictionary of 8,600 entries (with each form unique) and checked for coverage using this dictionary. Initial results were surprising in that it performed poorer than the IDS Spanish wordlist (75% coverage IDS, 67% coverage bigger dictionary). Quantity is not the same as quality. The Spanish IDS wordlist of 1,680 entries, with most of the same concepts as the WOLD languages from Latin America with dominant Spanish donor, was better.

But the reason for the surprising difference wasn't sufficiently clear, so I checked by using stemmed Spanish wordlists, whether the results might change. After that, I also took the union of the Big and IDS wordlists and looked at their coverage.

Source of Big wordlist: http://frequencylists.blogspot.com/2016/05/the-8600-most-frequently-used-spanish.html Stemmer was from: Natural Language Toolkit (NLTK) SnowballStemmer, based on original Porter Snowball Stemmer.

Results:

Stemming reduces the Big wordlist by half - many inflectional variants, some accent variants as well, few false truncations.
Stemming seems to increase potential coverage - whether we obtain this in practice will depend on whether our distance measurements are robust to inflectional differences. [Maybe possibility to tune such measures.]
Combining IDS and Big wordlists increases coverage substantially; with stemming too, we could reach 90%.

Dictionary source             Size    Matched    No match    %coverage
--------------------------  ------  ---------  ----------  -----------
Big wordlist                  8645        990         490         0.67
IDS wordlist                  1683       1112         368         0.75
Big stemmed wordlist          4114       1094         386         0.74
IDS stemmed wordlist          1571       1193         287         0.81
Combined wordlists            9429       1275         205         0.86
Combined stemmed wordlists    4762       1329         151         0.90

Using stemming complicates matters in some respects, so other than establishing the potential coverage, I will not incorporate it into glossaries for now.

Instead I will pursue the idea of a combined wordlist of IDS and the Big wordlist with potential of 86% match with donor words.

Next step then is to see if I can't make this combined simple wordlist of forms into a true glossary or dictionary wordlist of concepts and forms and segmented IPA.

fractaldragonflies commented 2 years ago

Path forward idea is to create a donor wordlist that includes both current Sabor Spanish wordlist and the Big wordlist (based on 8,600 entries). Then substitute this for the Sabor Spanish wordlist in current methods... which includes multilingual and recently demonstrated hybrid method with monolingual and multilingual functions. All of this would be a level of prototype which may not fit very well into current Sabor framework.

Some steps could be:

Reconcile big dictionary with concepticon concepts. 1.1 Big dictionary comes with English glosses, so this is issue of approximately matching dictionary with concepticon glosses.
Add tokens in segmented BIPA to dictionary forms.
Merge with existing Spanish wordlist.
Use this in models to see what performance actually achieved.

fractaldragonflies commented 2 years ago

Results using Concepticon Spanish dictionary

Work from Aug 19, 2022

Setup a repository at https://github.com/fractaldragonflies/expand-wordlists.git Test coverage of wordlists and determine how to expand them if necessary

To begin, a collection of a few related functions with intermeiate files and reports. No arguments on the functions. Just run with python. Change code to change filenames and options.

ids-wordlist - creates wordlist from IDS Spanish table.
spanish_ids_form_concepts.tsv
langauge-wordlist - creates wordlist from Concepticon depending on language specified (‘es’ is default).
concepticon_spanish_forms.tsv
sabor-donor-wordlist - created wordslist of all Spanish donor words to langauges in Sabor dataset.
sabor_donor_wordlist.tsv [each form only once]
sabor_donor_wordlist_NR.tsv [duplicates of forms not restricted]
util.py - some utility functions used by other functions in this module.
words-to-concepticon.py - reports number of sabor forms found, by target concept or other concept.
check-form-coverage.py - reports coverage of sabor forms from IDS Spanish, Concepticon, and Big dictionary wordlists.

Earlier Analysis of Coverage:

% python src/words_to_concepticon.py
Analysis based on concepticon concept table.
Concepticon items basis:
747 glosses
281 no matched forms
466 matched forms in concept lists
Spanish donor wordlist basis:
1480 glosses
389 not matched forms
1091 matched forms in concept lists
  925 same concept match
  166 different concept match
26.3% no-coverage (words without concepts)
73.7% coverage (words with concepts)
  62.5% same concept coverage
  11.2% different concept coverage
Wrote 389 forms to file not_matched_sabor_donor_forms_NR.tsv.
Wrote 925 concept-forms to file same_concept_sabor_donor_forms_NR.tsv.
Wrote 166 concept-forms to file different_concept_sabor_donor_forms_NR.tsv.

Analysis based on ids Spanish donor table.
Read 1683 form_concepts from file spanish_ids_form_concepts.tsv.
Concepticon items basis:
747 glosses
260 no matched forms
487 matched forms in concept lists
Spanish donor wordlist basis:
1480 glosses
368 not matched forms
1112 matched forms in concept lists
  1000 same concept match
  112 different concept match
24.9% no-coverage (words without concepts)
75.1% coverage (words with concepts)
  67.6% same concept coverage
  7.6% different concept coverage
Wrote 368 forms to file ids_not_matched_sabor_donor_forms_NR.tsv.
Wrote 1000 concept-forms to file ids_same_concept_sabor_donor_forms_NR.tsv.
Wrote 112 concept-forms to file ids_different_concept_sabor_donor_forms_NR.tsv.

Recently Added Coverage with Concepticon Spanish Forms

Substantial improvement for use of Concepticon

Use of Big wordlist, adds only 4% more (unstemmed) or 3% more (stemmed).

% python src/check-form-coverage.py 
Dictionary source                    Size    Matched    No match    %coverage
---------------------------------  ------  ---------  ----------  -----------
Big wordlist                         8645        990         490         0.67
IDS wordlist                         1683       1112         368         0.75
Concepticon wordlist                 2620       1244         236         0.84
Big stemmed wordlist                 4114       1094         386         0.74
IDS stemmed wordlist                 1571       1193         287         0.81
Concepticon stemmed wordlist         2335       1317         163         0.89
IDS+Concepticon wordlists            2832       1253         227         0.85
IDS+Concepticon stemmed wordlists    2529       1327         153         0.90
All wordlists                       10108       1313         167         0.89
ALL stemmed wordlists                5319       1372         108         0.93

LinguList commented 2 years ago

Bueno, this means we could argue that we get good / okay coverage if we just go for some Concepticon wordlist with a targeted wordlist like IDS, which we have as resources anyway?

LinguList commented 2 years ago

It is interesting, and we can add this to the study as part of an error analysis (or in another study), since it is not clear how many words are specialized which are borrowed, etc., and how one can retrieve some good percentage. If we can argue we hit > 80% with the current concepticon and even more in combination with IDS as seed list, this would be a good argument to say: for Spanish, use THIS list to identify borrowings in South America.

LinguList commented 2 years ago

Important: Scholars now use often word lists for tasks like sentiment detection, thread detection, etc., so they lump some words in English and use them to search for potential tweets. I have a reference for this and can look it up later. What we do here is another work, we design wordlists for borrowing detection of donor languages. This is similar to the sentiment task and very cool, since we need our wordlists in IPA, etc., so we can "sell" these as a resource of their own right!

This can in fact be discussed in an extra paper on this topic.

fractaldragonflies commented 2 years ago

Bueno, this means we could argue that we get good / okay coverage if we just go for some Concepticon wordlist with a targeted wordlist like IDS, which we have as resources anyway?

The IDS result is based on encountering known borrowed words in the IDS Spanish wordlist, and
The Concepticon result is from language_map = Concepticon._get_map_for_language(language) based wordlist.
So Concepticon includes Spanish words from more than IDS - maybe a partial union of Concepticon lists?
The IDS + Concepticon includes more than just Concepticon, so not all of IDS was used. However, there is only a 1% increase in coverage for IDS isn't much though.

The stemmed results represent maybe an ideal case of matching or detection with our various methods. Stemming in this case was both of the known donor words and the various possible donor source wordlists.

So to take advantage of this with our methods, we'd need to provide the complete Concepticon Spanish wordlist (optionally union with IDS for the extra 1%) as segmented BIPA in an alternative donor wordlist.

LinguList commented 2 years ago

I am now starting to wonder how to go on with this. We could think of some magic future project that stores donor wordlists for several of the worlds' dominant languages and would allow scholars to make quick checks on the influence of a particular language on their particular language.

I'd like to discuss this idea more closely with colleagues, since it feels, that the easiest way to publish these wordlists would be to make CLDF datasets for individual dominant languages. These could then be tested and presented on some occasion. But our workflow would essentially consist in

loading a lexibank dataset, like the SA languages from WOLD or any other dataset, and
loading a recognized "seed" language (a donor language) from our dedicated repositories (in this case, it would be our master Spanish list)
make the inference and check

This changes the workflow slightly (but the functions are in fact written in an abstract way, as we discussed, so they can be used!), but it also modifies the perspective. And I really like this, also the parallel with wordlists for other topics in psychology and the like (see here: https://arxiv.org/abs/2205.15850).

fractaldragonflies commented 2 years ago

I see from the Concepticon defined inside pysem that its keys are words or phrases in the language selected, maybe even multiple words in the same key separated by commas. The total number of glosses for 'es' in this Concepticon is 3,449 - quite a bit greater than the 2,620 that I obtain using function call _get_map_for_language(’es').

Should I be using a different method/function to get individual language results out of Concepticon? Since this Concepticon is a dictionary, the 3,449 keys are unique, but I note that some keys are just differences of UPPER case versus lower case, and maybe also use of determiner too. Thanks!

LinguList commented 2 years ago

The concepticon mapping function is specific in trying to use hash-tables to avoid extensive string operationns, which are often intransparent. As a result, we have this artificially bloated dataset with upper and lower case unifying all glosses that we find in all mapped concept lists.

LinguList commented 2 years ago

Since we discuss "wordlists" here, no "concept lists" of Spanish, I'd be inclined to suggest that we do not necessarily take the Concepticon as much more published "word lists" of Spanish, like IDS, etc., which is also different from the "concept list" in IDS, which is also translated in Spanish!

LinguList commented 2 years ago

The distinction is in fact very useful, since it also means that our wordlists will be linked to Concepticon (or maybe not !).

LinguList commented 2 years ago

A possible starting list for the wordlist for Spanish would be the combination of the IDS word list (linked to concepticon) with the words (not yet linked to concepticon) which are provided as source words for borrowings. If we link these words to concepticon, we'd have a combined wordlist that we can even model in the form of a single CLDF dataset. An alternative would be to single out the Portugues and the Spanish IDS wordlists and just say: these are our seed lists to get started. So instead of SABOR, we would make a romancedonor repository with Portuguese and Spanish alone, where we also provide the information on the network, all coded in CLDF for now (for the network, I could provide some ideas).