intercontinental-dictionary-series / keypano

IDS data on Panoan languages coded by Key
Creative Commons Attribution 4.0 International

Update online data for edictor to start with borrowing coding #20

Closed LinguList closed 1 month ago

LinguList commented 2 years ago

Once ready, we can start to code individual languages or certain concept ranges.

fractaldragonflies commented 2 years ago

Since @LinguList has to link the database to Edictor, and per our offline conversation, I ask @LinguList to recreate the SQLITE DB and link it to Edictor. I've changed the raw/preprocessing.py file to include the 'loan' field and verified that the command: edictor wordlist --preprocessing=raw/preprocessing.py --addon=language_family:family,loan:loan --sqlite --name=keypano correctly creates the SQLITE DB.
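
One way to double-check the generated database without assuming anything about its schema is to list its tables and columns. The following is only a minimal sketch using Python's standard sqlite3 module; the function name describe_database is made up for illustration, and the file name you pass in is whatever the command above actually wrote.

```python
import sqlite3


def describe_database(path):
    """Return {table_name: [column names]} for every table in a SQLite file."""
    connection = sqlite3.connect(path)
    try:
        tables = [
            row[0]
            for row in connection.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        schema = {}
        for table in tables:
            # Quote the table name so unusual names do not break the PRAGMA.
            quoted = '"' + table.replace('"', '""') + '"'
            # PRAGMA table_info yields one row per column; index 1 is the name.
            columns = connection.execute(f"PRAGMA table_info({quoted})").fetchall()
            schema[table] = [column[1] for column in columns]
        return schema
    finally:
        connection.close()
```

Calling describe_database on the generated file (for example a hypothetical keypano.sqlite3) and looking for a column derived from the 'loan' addon would confirm that the field made it into the database.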

Subsequent to this, @LinguList will explain to me how we can use Edictor to annotate borrowing. I will work with @rzariquiey and possibly the student Ximena and @LinguList to annotate the KeyPano languages.

There will likely be unanticipated issues in how to annotate specific borrowing cases or how to use Edictor. We can address these under this issue in GitHub or open new issues if appropriate.

Hopefully I got this right! Discussion is welcome.

LinguList commented 2 years ago

The code is online now, and accessible via https://lingulist.de/edictor/links/keypano.html

For details on how to edit loans, I suggest we discuss how to do it for one specific language.

LinguList commented 2 years ago

We can exchange the details via email, as long as we limit the email content to a single thread, as you'll also need access data.

fractaldragonflies commented 2 years ago

Just took a quick look at the link. A few observations and questions, which we can take offline to an email thread as we get into the gory details.

Observations:

  1. For the concept 'plain, field', the word 'k a m p o' is present with exactly the same IPA in both SpanishLA and Catuquina, but it gets no Borid, even though Portuguese, with a less similar phonology, is identified with the same Borid.
  2. I also just realized that we use the same Borid for all matching words with the same concept, so that 'p a m p a' which matches a word in Aymara also receives the same Borid as used for 'k a m p o'. Is that what we want?
  3. I showed the Loan field which has True values from the previous IDS annotation.

Questions:

  1. So we annotate by language. I had wondered about annotating by concept? The minus is that I don't really develop specific language expertise. The plus is that I see them all together for the same concept. You and Roberto have the experience, so I'll follow your lead... just thought to raise the question.
  2. In annotating by language, must I still process through the concepts? Do I reduce the view of the other languages to simplify the screen? Or keep it all visible, since the comparison is also vital?
  3. Can multiple persons annotate at the same time? I guess that is why we use the SQLITE DB, so as to permit multiple updates by different annotators?
  4. Do we update the Loan field to indicate other 'borrowed' words, and also delete the previous True value if we think it an error?
  5. Do we also expose the Source field... to use to annotate the language donor source?

Those are the work process questions that occur to me.

Probably more important is the question of what evidence I should look at to determine whether words are borrowed or not! So, the thinking process questions:

  1. If they sound alike, which may already be picked up by the Borid, then they are likely borrowed. But it is for us to confirm or add those 'sound alike' not captured by the current Borid. Does this mean we should also edit the Borid?
  2. Besides just sounding alike, we would look to see how many other languages might also have the same borrowing, which again might be indicated by the Borid.
  3. Do we also consider words that sound like they are borrowed from another concept? e.g., 'plata' for money or coins even though not shown as such for Spanish?
  4. Or even a borrowed part of a word or phrase that gets used in an indigenous language? e.g., 'carne' for part of a name of some animals in indigenous languages.

OK. Enough already! [maybe too much!]

Thanks @LinguList. Also @rzariquiey, please feel free to comment and offer advice, at least with respect to the 'thinking process' for determining a borrowed word.

LinguList commented 2 years ago

> For the concept 'plain, field', the word 'k a m p o' is present with exactly the same IPA in both SpanishLA and Catuquina, but it gets no Borid, even though Portuguese, with a less similar phonology, is identified with the same Borid.

This depends on additional dynamics: the first pass is the family-internal detection of cognates, and here "k a m p o" is unluckily merged with another word in a Pano language. Since the second pass compares clusters of words, this prevents the detection of "k a m p o". But it is not important: the algorithm works as it works, and may be modified, but what we need to do is to refine these cases manually.
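
To illustrate why an unlucky family-internal merge can hide a borrowing in the second pass, here is a toy sketch. It is not the algorithm used for keypano: the distance measure (normalized edit distance), the averaging over cluster members, the 0.3 threshold, and the form "t i r i" are all invented for illustration, and the first pass is represented only by its assumed output (the clusters).

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + (x != y))
            )
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    """Edit distance scaled to the range 0..1 by the longer sequence."""
    return edit_distance(a, b) / max(len(a), len(b))


def cluster_distance(cluster_a, cluster_b):
    """Average pairwise distance between the members of two clusters."""
    pairs = [(a, b) for a in cluster_a for b in cluster_b]
    return sum(normalized_distance(a, b) for a, b in pairs) / len(pairs)


def is_borrowing_candidate(cluster_a, cluster_b, threshold=0.3):
    """Second pass (toy version): compare whole clusters, not single words."""
    return cluster_distance(cluster_a, cluster_b) <= threshold


spanish_cluster = ["k a m p o".split()]
# Assumed first-pass output when "k a m p o" stays in its own Pano cluster.
clean_pano_cluster = ["k a m p o".split()]
# Assumed first-pass output after an unlucky merge with an unrelated
# (invented) form: the cluster average now hides the exact match.
merged_pano_cluster = ["k a m p o".split(), "t i r i".split()]

print(is_borrowing_candidate(spanish_cluster, clean_pano_cluster))   # True
print(is_borrowing_candidate(spanish_cluster, merged_pano_cluster))  # False
```

In the merged case the average distance is 0.5 (an exact match at 0.0 and a complete mismatch at 1.0), which is above the toy threshold, so the identical form goes undetected and has to be fixed by hand.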

> I also just realized that we use the same Borid for all matching words with the same concept, so that 'p a m p a' which matches a word in Aymara also receives the same Borid as used for 'k a m p o'. Is that what we want?

No, this is again the algorithm, whose results need to be refined.

> I showed the Loan field which has True values from the previous IDS annotation.

Yes, you can also switch, from the command line, which values will be shown directly when editing. I have my own set of columns I typically work with.

LinguList commented 2 years ago

As to the annotation: we do not have to annotate by language. By "by language" I meant that we proceed language by language, or could do so, but I now actually also prefer to proceed by concept, as it is much faster. It was just an offer to work with one language alone, since @rzariquiey suggested looking at three languages first. But I'd say: let's work concept-wise. There are strikingly fewer borrowings in this dataset than you'd find in SEA languages, so this is already interesting.

LinguList commented 2 years ago

The procedure of working on one concept consists of:

  1. refining borids (as pointed out by @fractaldragonflies, e.g., "kampo" should be given the same borid as Spanish and Portuguese, etc.)
  2. add the source language where known into the SOURCE field (e.g., Spanish in our case of kampo)
  3. ideally, one should also refine cognate set identifiers (but these are best done by using partial cognate sets, so one could also leave it for now, but finally, all data should be annotated in this way at some point)
  4. where a word is missing in Spanish (compare livestock as a concept, where we have "vaka" in most Pano languages), it should be added by us via pressing on the ID field and then selecting language and concept ("add row") and filling in the TOKENS form manually.

LinguList commented 2 years ago

In this way, one can, e.g., make a list of all concepts edited in one pass and then post them here in a thread on GitHub. If the concepts are posted as a list separated by |, one can add them to the EDICTOR URL via &concepts=CONCEPTLIST_WITH_PIPES_AS_SEPARATOR and thus see only those cases which have been annotated. This may be convenient for double-checking, e.g., @fractaldragonflies starts with some 10 concepts, posts them here, and I double-check.
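
Building such a URL by hand is error-prone when concept names contain spaces or commas (e.g. "plain, field"), so a small helper can do it. This is only a sketch: the function name add_concepts is made up, and it assumes that EDICTOR decodes percent-encoded characters in the concepts parameter while expecting literal | separators, as in the URLs posted in this thread.

```python
from urllib.parse import quote


def add_concepts(url, concepts):
    """Append a pipe-separated &concepts=... parameter to an EDICTOR URL."""
    # Percent-encode each concept on its own, so that the pipes between
    # concepts stay literal, as in the doculects/columns parameters.
    encoded = "|".join(quote(concept, safe="") for concept in concepts)
    separator = "&" if "?" in url else "?"
    return f"{url}{separator}concepts={encoded}"


# Placeholder base URL; in practice, paste the full URL the keypano link resolves to.
base_url = "http://example.org/index.html?file=keypano"
print(add_concepts(base_url, ["plain, field", "livestock"]))
# http://example.org/index.html?file=keypano&concepts=plain%2C%20field|livestock
```

The annotator can then post both the plain concept list and the resulting URL, so that the person double-checking sees exactly the same subset.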

LinguList commented 2 years ago

Note that by "URL" I mean the URL that the link above resolves to; this is a longer URL in EDICTOR, which has all parameters:

http://lingulist.de/edev/index.html?file=keypano&remote_dbase=keypano&root_formatter=BORID&doculects=MashcoPiro|Aymara|Cayuvava|PacaasNovos|Aguaruna|PortugueseBR|SpanishLA|Itonama|Moseten|Movima|Araona|Cashibo|Catuquina|Cavinena|Chacobo|EseEjja|HuarayoEseEjja|Pacahuara|ShipiboConibo|Tacana|Yaminahua|Yagua&columns=DOCULECT|CONCEPT|FAMILY|SUBGROUP|VALUE|FORM|TOKENS|ALIGNMENT|AUTOCOGID|AUTOCOGIDS|COGID|COGIDS|AUTOBORID|BORID|BORIDS|MORPHEMES|SOURCE|LOAN|NOTE&basics=DOCULECT|CONCEPT|VALUE|FORM|TOKENS|COGID|COGIDS|BORID|MORPHEMES|NOTE&preview=100&async=true

LinguList commented 2 years ago

The "LOAN" field can be derived automatically from the cognate / borid field, so it is not needed, especially if we have the SOURCE annotated (if it is not clear whether Spanish or Portuguese is the primary donor, I'd add Latin or Romance or similar to the SOURCE).
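
A possible derivation rule, sketched under stated assumptions: a BORID of 0 (or an empty BORID) means "not in any borrowing cluster", and a word counts as a loan only if it is in a borrowing cluster and has a donor named in SOURCE, so that the donor-language word itself is not flagged. The function name, the BORID value 1, and the empty form in the last row are hypothetical; the real rule may well differ.

```python
def derive_loan(row):
    """Assumed rule: loan = non-zero BORID plus a non-empty SOURCE."""
    borid = int(row.get("BORID") or 0)
    source = (row.get("SOURCE") or "").strip()
    return borid != 0 and bool(source)


rows = [
    # Borrowed form: in a borrowing cluster, donor annotated.
    {"DOCULECT": "Catuquina", "TOKENS": "k a m p o", "BORID": 1, "SOURCE": "Spanish"},
    # Donor form: same cluster, but no SOURCE, so not a loan itself.
    {"DOCULECT": "SpanishLA", "TOKENS": "k a m p o", "BORID": 1, "SOURCE": ""},
    # Placeholder row that is not in any borrowing cluster.
    {"DOCULECT": "Yagua", "TOKENS": "", "BORID": 0, "SOURCE": ""},
]

for row in rows:
    print(row["DOCULECT"], derive_loan(row))
```

With a rule of this kind, the hand-maintained LOAN column becomes redundant once BORID and SOURCE are annotated consistently.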

LinguList commented 2 years ago

In the case of plata, it should be added for Spanish, as it is just missing in our IDS source, due to an error.

LinguList commented 2 years ago

Borrowed parts in phrases need cross-semantic cognate coding and, ideally, morphological annotation, and the words also need to be added to the Spanish list or similar. So I'd say let us start simple for now, but @fractaldragonflies, you can collect those cases, show them to me, and I can make an example of how I'd annotate them.

fractaldragonflies commented 2 years ago

> The procedure of working on one concept consists of:
>
>   1. refining borids (as pointed out by @fractaldragonflies, e.g., "kampo" should be given the same borid as Spanish and Portuguese, etc.)
>   2. add the source language where known into the SOURCE field (e.g., Spanish in our case of kampo)
>   3. ideally, one should also refine cognate set identifiers (but these are best done by using partial cognate sets, so one could also leave it for now, but finally, all data should be annotated in this way at some point)
>   4. where a word is missing in Spanish (compare livestock as a concept, where we have "vaka" in most Pano languages), it should be added by us via pressing on the ID field and then selecting language and concept ("add row") and filling in the TOKENS form manually.

Chévere! We each should have a range of Borids that we can assign then, so as not to tromp on those other annotators have assigned.
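
A simple way to hand out non-overlapping Borid ranges is to give each annotator a fixed-size block above the highest Borid already in use. This is a sketch only: the function name, the block size, and the starting ID are illustrative and would have to be checked against the Borids actually present in the database.

```python
def allocate_borid_ranges(annotators, first_free_id, block_size=1000):
    """Give each annotator a disjoint, consecutive block of Borid values."""
    ranges = {}
    start = first_free_id
    for name in annotators:
        ranges[name] = range(start, start + block_size)
        start += block_size
    return ranges


# first_free_id=5000 is a made-up value, not taken from the keypano data.
blocks = allocate_borid_ranges(
    ["fractaldragonflies", "rzariquiey", "LinguList"], first_free_id=5000
)
for name, block in blocks.items():
    print(name, block[0], block[-1])
```

Each annotator then only assigns new Borids from their own block, so a value never has to be coordinated at the moment of annotation.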

LinguList commented 2 years ago

This is also possible. One can easily add them to the database, but experience shows that in the end we won't go for inter-annotator agreement anyway (yet), so it would only make things more complex, but if you guys want to, we can start with it.

fractaldragonflies commented 2 years ago

> Note that by "URL" I mean the URL that the link above resolves to; this is a longer URL in EDICTOR, which has all parameters:
>
> http://lingulist.de/edev/index.html?file=keypano&remote_dbase=keypano&root_formatter=BORID&doculects=MashcoPiro|Aymara|Cayuvava|PacaasNovos|Aguaruna|PortugueseBR|SpanishLA|Itonama|Moseten|Movima|Araona|Cashibo|Catuquina|Cavinena|Chacobo|EseEjja|HuarayoEseEjja|Pacahuara|ShipiboConibo|Tacana|Yaminahua|Yagua&columns=DOCULECT|CONCEPT|FAMILY|SUBGROUP|VALUE|FORM|TOKENS|ALIGNMENT|AUTOCOGID|AUTOCOGIDS|COGID|COGIDS|AUTOBORID|BORID|BORIDS|MORPHEMES|SOURCE|LOAN|NOTE&basics=DOCULECT|CONCEPT|VALUE|FORM|TOKENS|COGID|COGIDS|BORID|MORPHEMES|NOTE&preview=100&async=true

This link causes my system to become unstable, using up all available memory, and asking me to force quit.

fractaldragonflies commented 2 years ago

Tried to edit the concept: plain, field.

LinguList commented 2 years ago

You need your password and username, which you receive by email. Please remind me, I'll also send them to @rzariquiey etc.

LinguList commented 2 years ago

But permanent changes are needed now, if we decide to NOT modify anything any more on the CLDF form of the data (we can still do so, but modifying the SQLITE is then tedious for me).

LinguList commented 2 years ago

The idea is really to get to know the tool by modifying. Furthermore: there is a history in the sqlite which stores all changes made to the cells!