Since @LinguList would have to link the database to Edictor, and per our off-line conversation, I'd ask @LinguList to recreate the SQLITE DB and link it to Edictor. I've changed the raw/preprocessing.py file to include the 'loan' field and verified that the command:
edictor wordlist --preprocessing=raw/preprocessing.py --addon=language_family:family,loan:loan --sqlite --name=keypano
correctly creates the SQLITE DB.
Subsequent to this, @LinguList will explain to me how we can use Edictor to annotate borrowing. I will work with @rzariquiey and possibly the student Ximena and @LinguList to annotate the KeyPano languages.
There will likely be unanticipated issues in how to annotate specific borrowing cases or how to use Edictor. We can address these under this issue in GitHub or open new issues if appropriate.
Hopefully I got this right! Discussion is welcome.
The code is online now, and accessible via https://lingulist.de/edictor/links/keypano.html
For details on how to edit loans, I suggest we discuss how to do it for one specific language.
We can exchange the details via email, as long as we limit the email content to a single thread, as you'll also need access data.
Just took a quick look at the link. A few observations and questions, which we can take off-line to an email thread as we get into the gory details.
Observations:
- For the concept 'plain, field', the word 'k a m p o' is present with exactly the same IPA in both SpanishLA and Catuquina, but it gets no Borid, even though Portuguese, with a less similar phonology, is identified with the same Borid.
- I also just realized that we use the same Borid for all matching words with the same concept, so that 'p a m p a', which matches a word in Aymara, also receives the same Borid as used for 'k a m p o'. Is that what we want?
- I showed the Loan field, which has True values from the previous IDS annotation.
Questions:
- [...] Loan field to indicate other 'borrowed' words, and also delete the previous True value if we think it an error.
- [...] Source field... to use to annotate the language donor source?
Those are the work process questions that occur to me.
Probably more important is the issue of what evidence I look at to determine whether words are borrowed or not! So the thinking process questions:
- [...] Borid, then they are likely borrowed. But it is for us to confirm or add those 'sound alike' not captured by the current Borid. Does this mean we should also edit the Borid?
- [...] Borid.
OK. Enough already! [maybe too much!]
Thanks @LinguList. Also @rzariquiey, please feel free to comment or offer advice, at least with respect to the 'thinking process' for determining whether a word is borrowed.
> For the concept 'plain, field', the word 'k a m p o' is present with exactly the same IPA in both SpanishLA and Catuquina, but it gets no Borid. Even though Portuguese with a less similar phonology is identified with the same Borid.
This depends on additional dynamics: the first pass is the family-internal detection of cognates. Here, k a m p o is unluckily merged with another word in a Pano language. Since the second pass compares clusters of words, this prevents the detection of "k a m p o". But it is not important: the algorithm works as it works, and may be modified, but what we need to do is to refine these cases manually.
> I also just realized that we use the same Borid for all matching words with the same concept, so that 'p a m p a' which matches a word in Aymara also receives the same Borid as used for 'k a m p o'. Is that what we want?
No, this is again the algorithm, whose results need to be refined.
> I showed the Loan field which has True values from the previous IDS annotation.
Yes, you can also set from the command line which values will be shown directly when editing. I have my own set of columns I typically work with.
As to the annotation: we do not have to annotate by language. By "by language" I meant that we could proceed language by language, but I now actually prefer to proceed by concept, as it is much faster. Working with one language alone was just an offer, since @rzariquiey suggested looking at three languages first. But I'd say: let's work concept-wise. There are strikingly fewer borrowings in this dataset than you'd find in SEA languages, so this is already interesting.
In this way, one can, e.g., make a list of all concepts one has edited in one pass, and then post them here in a thread on GitHub. If the concepts are posted as a list separated by |, one can add them to the EDICTOR URL via &concepts=CONCEPTLIST_WITH_PIPES_AS_SEPARATOR and thus only see those cases which have been annotated. This may be convenient for double-checking, e.g., @fractaldragonflies starts with some 10 concepts, posts them here, and I double-check.
Note that by "URL" I mean the URL to what the link above resolves to, as this is a longer URL in edictor, which has all parameters:
http://lingulist.de/edev/index.html?file=keypano&remote_dbase=keypano&root_formatter=BORID&doculects=MashcoPiro|Aymara|Cayuvava|PacaasNovos|Aguaruna|PortugueseBR|SpanishLA|Itonama|Moseten|Movima|Araona|Cashibo|Catuquina|Cavinena|Chacobo|EseEjja|HuarayoEseEjja|Pacahuara|ShipiboConibo|Tacana|Yaminahua|Yagua&columns=DOCULECT|CONCEPT|FAMILY|SUBGROUP|VALUE|FORM|TOKENS|ALIGNMENT|AUTOCOGID|AUTOCOGIDS|COGID|COGIDS|AUTOBORID|BORID|BORIDS|MORPHEMES|SOURCE|LOAN|NOTE&basics=DOCULECT|CONCEPT|VALUE|FORM|TOKENS|COGID|COGIDS|BORID|MORPHEMES|NOTE&preview=100&async=true
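To make the `&concepts=` idea concrete, here is a minimal Python sketch of building such a link from a list of edited concepts. The helper name `concept_url` and the shortened base URL are only illustrative, and whether EDICTOR wants spaces in concept names percent-encoded is an assumption; in practice one would paste the full URL quoted above.

```python
from urllib.parse import quote

# Shortened base URL for illustration only; use the full EDICTOR URL in practice.
BASE_URL = "http://lingulist.de/edev/index.html?file=keypano&remote_dbase=keypano"


def concept_url(base_url, concepts):
    """Append a pipe-separated concept list as the `concepts` parameter."""
    # Spaces are percent-encoded, commas and the pipe separators stay literal.
    # Assumption: EDICTOR decodes percent-encoded concept names.
    concept_list = "|".join(quote(concept, safe=",") for concept in concepts)
    return f"{base_url}&concepts={concept_list}"


# Concepts mentioned in this thread, used here as an example list.
edited = ["plain, field", "livestock"]
print(concept_url(BASE_URL, edited))
```

The printed link can then be posted in the thread together with the concept list, so that the person double-checking sees only the annotated concepts.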
The "LOAN" field can be derived automatically from the cognate / borid field, so it is not needed, especially if we have the SOURCE annotated (if it is not clear whether Spanish or Portuguese is the primary donor, I'd add Latin or Romance or similar to the SOURCE).
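As a rough illustration of what "derived automatically" could look like, here is a small Python sketch. It rests on assumptions not confirmed in this thread: that a BORID of 0 (or an empty value) means "not part of any borrowing cluster", and that forms in the donor doculects count as sources rather than loans. The function name `derive_loan` and the BORID value 1 are hypothetical.

```python
def derive_loan(row, donors):
    """Return True if the row looks like a borrowed form.

    Assumptions (not confirmed by the tool): BORID 0 or empty means the word
    is in no borrowing cluster; words in donor doculects are not loans.
    """
    borid = int(row.get("BORID") or 0)
    return borid != 0 and row["DOCULECT"] not in donors


# Hypothetical donor set and rows; the BORID value 1 is made up for illustration.
DONORS = {"SpanishLA", "PortugueseBR"}
rows = [
    {"DOCULECT": "SpanishLA", "CONCEPT": "plain, field", "TOKENS": "k a m p o", "BORID": 1},
    {"DOCULECT": "Catuquina", "CONCEPT": "plain, field", "TOKENS": "k a m p o", "BORID": 1},
]

for row in rows:
    row["LOAN"] = derive_loan(row, DONORS)
    print(row["DOCULECT"], row["TOKENS"], row["LOAN"])
```

With a rule like this, only the BORID and SOURCE columns need manual care, and LOAN can be recomputed whenever the annotation changes.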
In the case of plata, it should be added for Spanish, as it is just missing in our IDS source, due to an error.
Borrowed parts in phrases need cross-semantic cognate coding and, ideally, morphological annotation; the words would also need to be added to the Spanish list, or similar. So I'd say, let us start simple for now, but @fractaldragonflies, you can collect those cases and show them to me, and I can make an example of how I'd annotate them.
The procedure of working on one concept consists of:
- refining borids (as pointed out by @fractaldragonflies, e.g., "kampo" should be given the same borid as Spanish and Portuguese, etc.)
- adding the source language, where known, to the SOURCE field (e.g., Spanish in our case of kampo)
- ideally, also refining cognate set identifiers (these are best done using partial cognate sets, so one could leave it for now, but all data should be annotated in this way at some point)
- where a word is missing in Spanish (compare livestock as a concept, where we have "vaka" in most Pano languages), adding it ourselves by pressing on the ID field, then selecting language and concept ("add row"), and filling in the TOKENS form manually.
Chévere! We each should have a range of Borids that we can assign then, so as not to tromp on those other annotators have assigned.
This is also possible. One can easily add them to the database, but experience shows that in the end we won't go for inter-annotator agreement anyway (yet), so it would only make things more complex, but if you guys want to, we can start with it.
> Note that by "URL" I mean the URL to what the link above resolves to, as this is a longer URL in edictor, which has all parameters:
> http://lingulist.de/edev/index.html?file=keypano&remote_dbase=keypano&root_formatter=BORID&doculects=MashcoPiro|Aymara|Cayuvava|PacaasNovos|Aguaruna|PortugueseBR|SpanishLA|Itonama|Moseten|Movima|Araona|Cashibo|Catuquina|Cavinena|Chacobo|EseEjja|HuarayoEseEjja|Pacahuara|ShipiboConibo|Tacana|Yaminahua|Yagua&columns=DOCULECT|CONCEPT|FAMILY|SUBGROUP|VALUE|FORM|TOKENS|ALIGNMENT|AUTOCOGID|AUTOCOGIDS|COGID|COGIDS|AUTOBORID|BORID|BORIDS|MORPHEMES|SOURCE|LOAN|NOTE&basics=DOCULECT|CONCEPT|VALUE|FORM|TOKENS|COGID|COGIDS|BORID|MORPHEMES|NOTE&preview=100&async=true
This link causes my system to become unstable, using up all available memory, and asking me to force quit.
Tried to edit the concept: plain, field.
You need your password and username, which you receive by email. Please remind me, I'll also send them to @rzariquiey etc.
But permanent changes are needed now, if we decide to NOT modify anything any more on the CLDF form of the data (we can still do so, but modifying the SQLITE is then tedious for me).
The idea is really to get to know the tool by modifying. Furthermore: there is a history in the sqlite which stores all changes made to the cells!
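For anyone who wants to look at that history in a local copy of the database, a first step could be to see which tables the SQLite file contains. The sketch below uses only Python's standard `sqlite3` module; the file name `keypano.sqlite3` is a guess, and the name and layout of the history table are not stated in this thread, so the code only lists what is there rather than assuming a schema.

```python
import os
import sqlite3
from contextlib import closing


def list_tables(db_path):
    """Return the names of all tables in the SQLite file, sorted by name."""
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


# Hypothetical file name; adjust to wherever the database copy actually lives.
DB_PATH = "keypano.sqlite3"
if os.path.exists(DB_PATH):
    print(list_tables(DB_PATH))
```

Once the table holding the change history is identified, its rows can be inspected with an ordinary SELECT.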
Once ready, we can start to code individual languages or certain concept ranges.