lexibank / sabor

CLDF datasets accompanying investigations on automated borrowing detection in SA languages
Creative Commons Attribution 4.0 International

Make a map with the number of borrowings from Spanish per Language in our sample #27

Closed LinguList closed 1 year ago

LinguList commented 2 years ago

In order to do that, we need to do the following:

  1. compute number of borrowings and add the information to etc/languages.tsv or on the fly to cldf/languages.csv
  2. plot the language data with cldfviz

The only question is what to do with Spanish in our sample. We want a map that shows only a specific region. If one makes an HTML map, one can take a screenshot, which may be good enough. If not, one needs to define the borders.

fractaldragonflies commented 2 years ago

I have the cartopy example that we used to display maps of Saphon. It used min and max long and lat to define borders, but this could be set instead. It also reads the languageTable as cldf object for lat and long... suppose if we add # borrowed it would be accessible as well. Might that be sufficient for the task?
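Deriving the map borders from the minimum and maximum longitude and latitude could look roughly like the sketch below. This is not the Saphon script itself: the function name `bounding_box`, the `padding` parameter, and the sample coordinates are assumptions made for illustration. The result is ordered west, east, south, north, which is the ordering cartopy's `set_extent` expects.

```python
from typing import Iterable, Tuple


def bounding_box(
    coords: Iterable[Tuple[float, float]], padding: float = 2.0
) -> Tuple[float, float, float, float]:
    """Return (min_lon, max_lon, min_lat, max_lat) around (lon, lat) pairs.

    `padding` (in degrees) widens the box on every side; the result is
    clamped to valid longitude and latitude ranges.
    """
    coords = list(coords)
    if not coords:
        raise ValueError("need at least one (lon, lat) pair")
    lons = [lon for lon, _ in coords]
    lats = [lat for _, lat in coords]
    return (
        max(min(lons) - padding, -180.0),
        min(max(lons) + padding, 180.0),
        max(min(lats) - padding, -90.0),
        min(max(lats) + padding, 90.0),
    )


# Illustrative coordinates only, not taken from the dataset.
extent = bounding_box([(-70.0, -10.0), (-60.0, 5.0)], padding=1.0)
```

To fix the borders by hand instead, one would skip the computation and pass a hard-coded tuple in the same order.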

LinguList commented 2 years ago

Yes, but cldfviz is like this:

cldfbench cldfviz.map cldf/cldf-metadata.json --language-labels --language-properties=Spanish_Borrowings

LinguList commented 2 years ago

map.html.zip Bildschirmfoto_2022-06-02_20-14-18

LinguList commented 2 years ago

The problem now is that I added Spanish_Borrowings as a float, but CLDF defines it as a string (I did not change this). So we had best convert Spanish_Borrowings into categories, like < 10%, < 15%, < 20%, < 25%, < 30%. Then we have a decent color map as well.
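A minimal sketch of such a binning, assuming the value is a proportion between 0 and 1. The function name `borrowing_class` and the catch-all label for values at or above the last bound are assumptions, not the code in lexibank_sabor.py:

```python
from typing import Sequence


def borrowing_class(
    proportion: float, upper_bounds: Sequence[int] = (10, 15, 20, 25, 30)
) -> str:
    """Map a borrowing proportion (0..1) to a string label such as '< 15%'."""
    percent = proportion * 100
    for bound in upper_bounds:
        if percent < bound:
            return f"< {bound}%"
    # Anything at or above the last bound gets its own (assumed) label.
    return f">= {upper_bounds[-1]}%"
```

Because the labels are strings, they fit a string-typed column and give the map a small, discrete set of colors.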

LinguList commented 2 years ago

So my assumption is that this will be faster, and you learn another nice tool ;)

LinguList commented 2 years ago

I modified the code to add the information on the percentage of borrowings from Spanish as a category. This means then, I hope, we can have some nicer classification.

LinguList commented 2 years ago

But if you want a more customized map, we can go with our old example as well! The info on borrowings is now anyway in the languages.csv file, so it can be readily accessed. If you want to go ahead (but if it takes too long, leave it please) I'd later double check.

LinguList commented 2 years ago

map-1 legend

LinguList commented 2 years ago

cldfbench cldfviz.map cldf/cldf-metadata.json --language-labels --language-properties=Borrowing_Class --markersize=40 --base-layer Esri_WorldPhysical

fractaldragonflies commented 2 years ago

Great!! Very pretty too.

fractaldragonflies commented 2 years ago

To get the labels to show, should I edit the .png and overlay the labels on the South Pacific? Or is there a way to place this directly on the map... similar to what MatLab allows for figure placement?

The borrowing count computed in lexibank_sabor.py doesn't take borrowed_score into account, so the numbers are higher than what we get from get_our_data. I think the borrowing class stays within the indicated range, so there is no real impact on the graphic. I'll add a condition on borrowed_score too ... since form is accessed in the same expression, it should be straightforward.

[1 for form in language.forms_with_sounds if borrowings.get(
                        form.id[5:], [""])[0] == "Spanish"]

LinguList commented 2 years ago

Can you modify the lexibank script code in this regard? I know I did not follow your borrowing definition; I did not have time to look it up.

fractaldragonflies commented 2 years ago

Will do. Working from Cusco for a few days, so a bit less effectively for multiple reasons!

fractaldragonflies commented 2 years ago

Changed accounting for borrowing in make_cldf command.

This was more 'interesting' than I thought it would be, since forms overall are filtered by whether they have a concepticon_gloss. So I counted only forms with a Concepticon gloss and used this count as the base for calculating the borrowed proportion. Now the calculation of the proportion borrowed is consistent with the results returned for the same calculation over the wordlist, by language and overall.

I can of course reverse this if this consistency is less important than the gross measure of borrowing. I did want to at least examine the discrepancy in form counts, which I report here. More of the difference comes from dropping forms without a Concepticon gloss than from our borrowing score definition.

I printed out the counts of all forms versus forms with Concepticon glosses during make_cldf:

cldfbench lexibank.makecldf lexibank_sabor.py
INFO    running _cmd_makecldf on sabor ...
INFO    loaded borrowings
INFO:lingpy:loading wold
loading forms for wold: 100%|█████████████████████████████████████████████████████████████████| 64289/64289 [00:04<00:00, 12990.95it/s]
INFO:lingpy:loading ids
loading forms for ids: 100%|████████████████████████████████████████████████████████████████| 454145/454145 [00:08<00:00, 51723.55it/s]
INFO:lingpy:loaded wordlist with 1489 concepts and 370 languages
INFO    added ['ids-Spanish']

======

Added: name Yaqui, language wold-Yaqui
INFO    Yaqui all forms 1615, forms with concepts 1433, borrowed 311, prop 0.21702721563154223; forms no concepts 7
Added: name Zinacantán Tzotzil, language wold-ZinacantanTzotzil
INFO    Zinacantán Tzotzil all forms 1413, forms with concepts 1266, borrowed 165, prop 0.13033175355450238; forms no concepts 1
Added: name Q'eqchi', language wold-Qeqchi
INFO    Q'eqchi' all forms 1995, forms with concepts 1773, borrowed 161, prop 0.09080654258319233; forms no concepts 2
Added: name Otomi, language wold-Otomi
INFO    Otomi all forms 2558, forms with concepts 2241, borrowed 198, prop 0.08835341365461848; forms no concepts 2
Added: name Imbabura Quechua, language wold-ImbaburaQuechua
INFO    Imbabura Quechua all forms 1319, forms with concepts 1156, borrowed 300, prop 0.25951557093425603; forms no concepts 23
Added: name Wichí, language wold-Wichi
INFO    Wichí all forms 1361, forms with concepts 1219, borrowed 152, prop 0.12469237079573421; forms no concepts 1
Added: name Mapudungun, language wold-Mapudungun
INFO    Mapudungun all forms 1412, forms with concepts 1242, borrowed 190, prop 0.1529790660225443; forms no concepts 48

=====

INFO    file written: /Users/johnmiller/ling/sabor-installs/sabor/cldf/.transcription-report.json
INFO    Summary for dataset /Users/johnmiller/ling/sabor-installs/sabor/cldf/cldf-metadata.json
- **Varieties:** 8
- **Concepts:** 1,308
- **Lexemes:** 12,100
- **Sources:** 0
- **Synonymy:** 1.30
- **Invalid lexemes:** 0
- **Tokens:** 72,550
- **Segments:** 112 (0 BIPA errors, 0 CTLS sound class errors, 112 CLTS modified)
- **Inventory size (avg):** 39.38

Here is the snippet of code from lexibank_sabor.py that counts the number of forms:

            borrowed = sum(
                    [1 for form in language.forms_with_sounds
                        if borrowings.get(form.id[5:], [""])[0] == "Spanish" and
                        float(form.data["Borrowed_score"]) > BOR_CRITICAL_VALUE and
                        form.concept and form.concept.concepticon_gloss in concepts])

Based on the original code creating forms.
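As a self-contained illustration of the same accounting, here is a sketch that counts only forms with a known Concepticon gloss and uses that count as the denominator. The `Form` dataclass, the function name `borrowed_proportion`, and the threshold 0.67 are hypothetical stand-ins; the actual value of BOR_CRITICAL_VALUE is not shown in this thread.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set, Tuple


@dataclass
class Form:
    """Hypothetical stand-in for a lexibank form record."""
    donor: str                       # donor language, "" if none recorded
    borrowed_score: float
    concepticon_gloss: Optional[str]


def borrowed_proportion(
    forms: Iterable[Form],
    concepts: Set[str],
    critical_value: float,
    donor: str = "Spanish",
) -> Tuple[int, int, float]:
    """Return (borrowed, forms_with_concepts, proportion)."""
    # Denominator: only forms whose gloss is among the selected concepts.
    with_concepts = [f for f in forms if f.concepticon_gloss in concepts]
    # Numerator: forms from the donor whose score exceeds the threshold.
    borrowed = sum(
        1 for f in with_concepts
        if f.donor == donor and f.borrowed_score > critical_value
    )
    proportion = borrowed / len(with_concepts) if with_concepts else 0.0
    return borrowed, len(with_concepts), proportion


# Toy data; 0.67 is an arbitrary threshold chosen for the example.
toy = [
    Form("Spanish", 1.0, "HAND"),
    Form("Spanish", 0.2, "FOOT"),   # below threshold, not counted
    Form("", 0.0, "EYE"),
    Form("Spanish", 1.0, None),     # no gloss, excluded entirely
]
result = borrowed_proportion(toy, {"HAND", "FOOT", "EYE"}, 0.67)
```

The Yaqui line in the log above follows the same arithmetic: 311 borrowed forms out of 1433 forms with concepts gives roughly 0.217.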

LinguList commented 2 years ago

Nice. I can redo the map in HTML and we can use that in some form in the paper, adding larger labels manually, maybe.