lexibank / chacolanguages

CLDF datasets accompanying Brid et al.'s "Lexical Database of Chaco Languages" from 2022
Creative Commons Attribution 4.0 International

Extremely low coverage for a few languages #6

Closed LinguList closed 2 years ago

LinguList commented 2 years ago

I should've run this earlier, but I have done it now, and we need to revise our account of the data:

The last column in the following table is the proportion of concepts attested out of the overall number of 324 concepts:

Variety                Forms    Concepts    Coverage
-------------------  -------  ----------  ----------
Abipón                   215         154        0.48
Ava Guaraní              262         214        0.66
Ayoreo                   375         227        0.70
Chamacoco                250         162        0.50
Enlhet                   437         251        0.77
Enxet Sur                333         208        0.64
Guaraní Izoceño           24          24        0.07
Guaraní Paraguayo        323         237        0.73
Iyojwa'ja Chorote        358         272        0.84
Iyoʼwujwa Chorote        254         190        0.59
Iyówuj'wa                 45          37        0.11
Kadiweo                  224         157        0.48
Lule                     294         173        0.53
Maká                     281         242        0.75
Mapudungun               254         206        0.64
Mbya                     222         167        0.52
Mocoví                   297         215        0.66
Nivaclé                  373         248        0.77
Pilagá                   286         248        0.77
Quichua Santiagueño      235         176        0.54
Tapiete                  271         202        0.62
Toba                     470         273        0.84
Toba de Cerrito          179         153        0.47
Toba-pilagá              367         254        0.78
Vilela                    73          60        0.19
Wichí                    387         241        0.74

LinguList commented 2 years ago

We can probably only keep those languages with at least 50 percent coverage.

This leaves the following:

-------------------  ---  ---  ----
Ava Guaraní          262  214  0.66
Ayoreo               375  227  0.70
Chamacoco            250  162  0.50
Enlhet               437  251  0.77
Enxet Sur            333  208  0.64
Guaraní Paraguayo    323  237  0.73
Iyojwa'ja Chorote    358  272  0.84
Iyoʼwujwa Chorote    254  190  0.59
Lule                 294  173  0.53
Maká                 281  242  0.75
Mapudungun           254  206  0.64
Mbya                 222  167  0.52
Mocoví               297  215  0.66
Nivaclé              373  248  0.77
Pilagá               286  248  0.77
Quichua Santiagueño  235  176  0.54
Tapiete              271  202  0.62
Toba                 470  273  0.84
Toba-pilagá          367  254  0.78
Wichí                387  241  0.74
-------------------  ---  ---  ----

LinguList commented 2 years ago

We would then ignore:

---------------  ---  ---  ----
Abipón           215  154  0.48
Guaraní Izoceño   24   24  0.07
Iyówuj'wa         45   37  0.11
Kadiweo          224  157  0.48
Toba de Cerrito  179  153  0.47
Vilela            73   60  0.19
---------------  ---  ---  ----

LinguList commented 2 years ago

@Bridnicolas, alternatively, we could require at least 150 concepts. In that case, we would ignore only Guaraní Izoceño, Iyówuj'wa, and Vilela.
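
The two selection criteria discussed so far can be compared with a short sketch. It is not the script used for the dataset; the concept counts are copied from the first table in this thread, and the helper names (`coverage`, `dropped`) are made up for illustration.

```python
# Total number of concepts in the dataset, as stated in the thread.
TOTAL_CONCEPTS = 324

# Concepts attested per variety, copied from the first table.
concepts = {
    "Abipón": 154, "Ava Guaraní": 214, "Ayoreo": 227, "Chamacoco": 162,
    "Enlhet": 251, "Enxet Sur": 208, "Guaraní Izoceño": 24,
    "Guaraní Paraguayo": 237, "Iyojwa'ja Chorote": 272,
    "Iyoʼwujwa Chorote": 190, "Iyówuj'wa": 37, "Kadiweo": 157,
    "Lule": 173, "Maká": 242, "Mapudungun": 206, "Mbya": 167,
    "Mocoví": 215, "Nivaclé": 248, "Pilagá": 248,
    "Quichua Santiagueño": 176, "Tapiete": 202, "Toba": 273,
    "Toba de Cerrito": 153, "Toba-pilagá": 254, "Vilela": 60,
    "Wichí": 241,
}


def coverage(n_concepts, total=TOTAL_CONCEPTS):
    """Proportion of the concept list attested for one variety."""
    return n_concepts / total


def dropped(min_coverage=None, min_concepts=None):
    """Return the varieties that fail the given threshold, sorted by name."""
    out = []
    for variety, n in concepts.items():
        if min_coverage is not None and coverage(n) < min_coverage:
            out.append(variety)
        elif min_concepts is not None and n < min_concepts:
            out.append(variety)
    return sorted(out)


by_coverage = dropped(min_coverage=0.5)  # "at least 50 percent coverage"
by_count = dropped(min_concepts=150)     # "at least 150 concepts"
print(by_coverage)
print(by_count)
```

With the coverage threshold, six varieties drop out; with the count threshold, only three do, which matches the lists given in the comments above.
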

LinguList commented 2 years ago

@Bridnicolas, can I ask you to check now with this automated list?

   Number  Variety                Forms    Concepts    Base    BIO    Coverage
--------  -------------------  -------  ----------  ------  -----  ----------
       1  Abipón                   215         154     154      0        0.48
       2  Ava Guaraní              262         214     214      0        0.66
       3  Ayoreo                   375         227     211     16        0.70
       4  Chamacoco                250         162     161      1        0.50
       5  Enlhet                   437         251     216     36        0.77
       6  Enxet Sur                333         208     188     20        0.64
       7  Guaraní Paraguayo        323         237     213     24        0.73
       8  Iyojwa'ja Chorote        358         272     214     58        0.84
       9  Iyoʼwujwa Chorote        254         190     176     14        0.59
      10  Kadiweo                  224         157     156      1        0.48
      11  Lule                     294         173     173      0        0.53
      12  Maká                     281         242     199     44        0.75
      13  Mapudungun               254         206     206      0        0.64
      14  Mbya                     222         167     167      0        0.52
      15  Mocoví                   297         215     212      3        0.66
      16  Nivaclé                  373         248     215     33        0.77
      17  Pilagá                   286         248     211     38        0.77
      18  Quichua Santiagueño      235         176     162     14        0.54
      19  Tapiete                  271         202     193      9        0.62
      20  Toba                     470         273     216     58        0.84
      21  Toba de Cerrito          179         153     153      0        0.47
      22  Toba-pilagá              367         254     192     63        0.78
      23  Wichí                    387         241     208     33        0.74

LinguList commented 2 years ago

My suggestion is to transfer this list to the table in our paper once issues #8 and #7 have been solved. The good thing is: we can automate all numbers. The bad thing is: it does not look very exhaustive for the basic concepts. But well, we can probably live with this.

Bridnicolas commented 2 years ago

Yes. BIO means "ethnobiological terms"? In that case, it may be incorrect in some parts. We cannot have 154 Abipón ethnobiological terms, because we actually don't have any (zero!)

Besides, I'm not sure ignoring Iyowujwa and Guarani Izoceño would be accurate. We only have ethnobiological terms for those, not base terms (because they are generally considered dialects of Iyojwaja and Ava Guaraní, respectively), but the thing is that the ethnobio terms are many, and make a good part of our analysis.

LinguList commented 2 years ago

See my updated table. I swapped them :(

Bridnicolas commented 2 years ago

Yes, I just have. Sorry

LinguList commented 2 years ago

So it is all nice as is: we have some ethnobiological items and basic vocab. Fine. They can later be expanded.

LinguList commented 2 years ago

I am sorry for the error.

Bridnicolas commented 2 years ago

Below 0.5 there's Abipón, Kadiwéu and Toba de Cerrito. Are we still planning to ignore them? So, we transfer the list to the article just as it is?

LinguList commented 2 years ago

No, we first need to address the issues on concepts I just filed. Then we select all languages with > 150 concepts, so we keep Abipon, etc.

Bridnicolas commented 2 years ago

No, wait. I'm looking at the other issues.

LinguList commented 2 years ago

Would be a pity. But with 17 concepts, like one of the languages, we cannot really work.

LinguList commented 2 years ago

I'll have to go to bed now, but I'll continue tomorrow. The good thing is: we now have one more automatic checking step :)

Bridnicolas commented 2 years ago

Ok, till tomorrow. I can't find the GBIF ID for issue #7, but I'll keep looking.

LinguList commented 2 years ago

Isn't it this one?

Bridnicolas commented 2 years ago

Yes, but there is no ID. At least none is visible on the webpage, as there is with the other plants.

LinguList commented 2 years ago

Isn't the ID the number in the URL? 7291664
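
A minimal sketch of reading such an ID from a URL, assuming the page address ends in the numeric ID. The URL below is an assumed example of that shape, not the link from this thread, and `gbif_id_from_url` is a hypothetical helper name.

```python
from urllib.parse import urlparse


def gbif_id_from_url(url):
    """Return the trailing numeric path segment of a URL, or None."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if segments and segments[-1].isdecimal():
        return int(segments[-1])
    return None


# Assumed URL shape; only the number itself comes from the thread.
print(gbif_id_from_url("https://www.gbif.org/species/7291664"))
```
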

Bridnicolas commented 2 years ago

Oops. I didn't pay attention to that.

LinguList commented 2 years ago

I already added that number to our file, so no worries :)

Bridnicolas commented 2 years ago

Now all forms for younger brother are added in Tokens. I have to complete the other columns (form and value), but I have to enter a (hopefully short) meeting. And then I'll continue

LinguList commented 2 years ago

Nice, let me know once you are done, and I'll re-run the analysis. That way we can advance the study today already, as I am now almost done with the paper draft!

Bridnicolas commented 2 years ago

'younger brother' is now complete. Checked, segmented and all. I'll proceed now to correct the sources on the document.

LinguList commented 2 years ago

Running the code now!

LinguList commented 2 years ago

@Bridnicolas, this looks fine now, and we have 324 concepts now. I will need to make the language statistics again later, as I have them at home, and forgot to synchronize before going to the office. But we may be able to submit this study on Monday then.

Bridnicolas commented 2 years ago

That's great.

Bridnicolas commented 2 years ago

No, wait. What is "iyówuj'wa" on the list? It's repeated: it is Chorote Iyowujwa.

LinguList commented 2 years ago

What list?

LinguList commented 2 years ago

Oh dear, so can you please check the file etc/languages.tsv, what is happening there? We must have two entries then!

LinguList commented 2 years ago

Ah, yes, check that list: the ID says "Manjuy", so the name is probably different, right? Can you fix that? Then I'll re-run. Correcting is no problem; we'll make a release version 0.2 then!
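
A check for this kind of near-duplicate could look roughly like the sketch below. It is only a heuristic and not part of the dataset code: names are reduced to plain ASCII letters, and two names are flagged when the words of one are contained in the words of the other. In practice the names would come from the Name column of etc/languages.tsv; the short list here is an illustrative subset, and the function names are made up.

```python
import unicodedata
from itertools import combinations


def normalize(name):
    """Reduce a name to a set of lowercase ASCII words (no accents, no apostrophes)."""
    decomposed = unicodedata.normalize("NFD", name).lower()
    kept = []
    for char in decomposed:
        if "a" <= char <= "z":
            kept.append(char)
        elif char in " -":
            kept.append(" ")
        # everything else (accents, apostrophes, modifier letters) is dropped
    return frozenset("".join(kept).split())


def suspicious_pairs(names):
    """Return pairs of names where one word set is contained in the other."""
    pairs = []
    for first, second in combinations(names, 2):
        a, b = normalize(first), normalize(second)
        if a and b and (a <= b or b <= a):
            pairs.append((first, second))
    return pairs


names = ["Iyojwa'ja Chorote", "Iyoʼwujwa Chorote", "Iyówuj'wa", "Maká", "Wichí"]
print(suspicious_pairs(names))
```

The check will also flag legitimate pairs such as "Toba" and "Toba de Cerrito", so its output is a list of candidates for manual review rather than a list of errors.
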

Bridnicolas commented 2 years ago

Yes. Done. It's correct now.

LinguList commented 2 years ago

Oh. You modified the ID as well, but the ID should not be touched!

Bridnicolas commented 2 years ago

Are you sure? Not right now. I only changed Manjui, and Iyówuj'wa for Iyo'wujwa' Chorote, both under the Name column.

LinguList commented 2 years ago

Okay, I'll check again.

LinguList commented 2 years ago

You modified the geo-coordinates. They are now wrong for Cerriteno and TobaPilaga, which is why compilation fails. Can you please check?
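
Problems like this can be caught before compilation with a small sanity check. The following is a sketch under assumptions: the column names `Latitude` and `Longitude` are assumed (only `ID` and `Name` are mentioned in this thread), the coordinate values are invented placeholders, and `invalid_coordinates` is a hypothetical helper.

```python
import csv
import io


def invalid_coordinates(rows):
    """Return the IDs of rows whose coordinates are missing or out of range."""
    bad = []
    for row in rows:
        try:
            lat = float(row["Latitude"])
            lon = float(row["Longitude"])
        except (KeyError, TypeError, ValueError):
            bad.append(row.get("ID", "?"))
            continue
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            bad.append(row.get("ID", "?"))
    return bad


# Placeholder content standing in for etc/languages.tsv (values are invented).
sample = "\n".join([
    "\t".join(["ID", "Name", "Latitude", "Longitude"]),
    "\t".join(["Manjuy", "Iyoʼwujwa Chorote", "-22.0", "-62.0"]),
    "\t".join(["Cerriteno", "Toba de Cerrito", "-250.5", "-57.5"]),
    "\t".join(["TobaPilaga", "Toba-pilagá", "", "-60.0"]),
])

rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
print(invalid_coordinates(rows))
```
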

Bridnicolas commented 2 years ago

Yes, now I see. Can't imagine how that happened. I'll fix it now

Bridnicolas commented 2 years ago

Done

LinguList commented 2 years ago

Works now