intercontinental-dictionary-series / keypano

IDS data on Panoan languages coded by Key
Creative Commons Attribution 4.0 International

Finishing up Key-Pano database... #23

Closed: fractaldragonflies closed this issue 3 months ago

fractaldragonflies commented 2 years ago

Update to this issue added Mar 13, 2023 in reply to my earlier note and subsequent reply from Mattis. See further below.

The current form was based on IDS, with our orthographic profiles used to calculate segmented BIPA and LingPy cognate detection methods used to calculate CogId, CogIds, and BorId. We've since edited the database using EDICTOR, revising BorIds, adding source information (known donor language and form), and adding a '!PB!' annotation to the Note field.

What we need now is to put the database into its final form for use and distribution. In some way we need to replace the original forms.csv calculated from IDS with a new forms.csv that includes the fields we added and revised, as well as some entirely new entries. We should also remove the original creation step so that the file cannot be mistakenly overwritten with just the IDS entries.
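The replacement step could be sketched roughly as follows. This is only a minimal illustration, assuming the edited data is exported from EDICTOR as a tab-separated file with a header row, in which lines starting with `#` are comments. The column names (`ID`, `DOCULECT`, `CONCEPT`, `TOKENS`, `BORID`, `NOTE`) and both function names are hypothetical, not the actual Key-Pano schema or workflow.

```python
import csv
import io


def read_edictor_rows(handle):
    """Yield one dict per data row, skipping blank and '#' comment lines."""
    lines = (ln for ln in handle if ln.strip() and not ln.startswith("#"))
    yield from csv.DictReader(lines, delimiter="\t")


def write_forms_csv(rows, handle, columns):
    """Write only the chosen columns, in the given order, as comma-separated text."""
    writer = csv.DictWriter(
        handle, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# In-memory stand-in for an export; all values are placeholders, not real data.
sample = io.StringIO(
    "# hypothetical export\n"
    "ID\tDOCULECT\tCONCEPT\tTOKENS\tBORID\tNOTE\n"
    "1\tLangA\tconcept-1\tt a t a\t0\t\n"
    "2\tLangB\tconcept-1\tp a p a\t3\t!PB!\n"
)
out = io.StringIO()
write_forms_csv(
    read_edictor_rows(sample),
    out,
    ["ID", "DOCULECT", "CONCEPT", "TOKENS", "BORID", "NOTE"],
)
forms_csv_text = out.getvalue()
```

Writing from the edited export, rather than regenerating from IDS, is what would keep the manual revisions from being overwritten.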

So this would seem to be a new generation of Key-Pano, with a new name too, to avoid confusion between the databases?

=== Some issues to resolve ===

LinguList commented 2 years ago

Luckily, this is not the first time we have done this, and we have examples from the SEABOR workflow. I suggest I take care of feeding the data back in, and I will keep @fractaldragonflies in the loop on all steps. The good thing is that we can now "clean" what is not that clean in EDICTOR, and also check for consistency.

fractaldragonflies commented 1 year ago

Returning to the issue of finishing up our KeyPano database.

Here I repeat some of the commentary from my Mar 6 email, with some of the response from @LinguList and a bit of further processing on my part, possibly with references to other issues as needed.

(1) It seems the next steps would be to a) clean up the database where that is still needed, and then b) model it again to the CLDF standard (as our starting keypano database).

(2) @LinguList has agreed to look at modeling the annotations in CLDF, and will work with me to teach me how to do it.

Observations

I did a brief review of the current database, making a spreadsheet from the EDICTOR format of the 'forms' relation:

Cleanup and cldf modeling

ISSUES

I'm less sure what to do in these cases:

fractaldragonflies commented 1 year ago

@fractaldragonflies recalled that I had added several entries to the Spanish donor language in the key-pano database. Specifically, I had added 34 entries to the IDS dictionary. The .pdf attached to this note documents those new entries. I've since marked the NOTE field with !NEW! for these new entries. Just as !PB! can be coded as a boolean indicator of partial borrowing, !NEW! can be coded as a boolean indicator of an addition to the original IDS Spanish data.
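To illustrate the boolean coding, here is a minimal sketch that splits a NOTE value into two flags and the remaining free text. It assumes the markers appear literally as `!PB!` and `!NEW!` somewhere in the field. The function name and the output keys (`partial_borrowing`, `new_entry`, `note`) are hypothetical, not an agreed CLDF column layout.

```python
import re

# Matches markers such as !PB! or !NEW! anywhere in the note text.
FLAG_PATTERN = re.compile(r"!([A-Z]+)!")


def note_flags(note):
    """Return booleans for the !PB! and !NEW! markers plus the leftover note text."""
    text = note or ""
    found = set(FLAG_PATTERN.findall(text))
    remainder = FLAG_PATTERN.sub("", text).strip()
    return {
        "partial_borrowing": "PB" in found,
        "new_entry": "NEW" in found,
        "note": remainder,
    }
```

For example, `note_flags("!PB! checked")` sets only the partial-borrowing flag and keeps `checked` as the note text.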

key-pano-db-sp-adds.pdf

fractaldragonflies commented 1 year ago

Quechua is the largest donor after Spanish. Including a Quechua wordlist would give the multilingual methods more opportunity to detect borrowings from Quechua.

WOLD includes Imbabura Quechua (Ecuador region). While not from the Quechua subgroup closest to the Pano-Tacanan languages, it might still be an adequate representative of the entire Quechua family for multilingual sequence comparison methods.

Should we add Imbabura Quechua to the Key-Pano database?

LinguList commented 1 year ago

Yes, this sounds like a good idea. It will delay our process of annotation, but I could try to give it some priority. It would also require me to fiddle around with the EDICTOR settings. But it may really be worth it.

The easiest way to proceed is to create a Quechua wordlist manually, with the same concept glosses that we use in EDICTOR / Key-Pano (thus WOLD glosses, I assume), and perhaps to segment it manually already. I can then include this wordlist, sent to me as a TSV file, as part of EDICTOR.
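A wordlist of this kind could be written out as sketched below. This is an assumption-laden illustration: the column names (`ID`, `DOCULECT`, `CONCEPT`, `FORM`, `TOKENS`) follow a common wordlist layout but are not confirmed as what EDICTOR / Key-Pano expects here, the function name is hypothetical, and the entries are placeholders rather than real Quechua forms.

```python
import csv
import io

COLUMNS = ["ID", "DOCULECT", "CONCEPT", "FORM", "TOKENS"]


def write_wordlist_tsv(entries, handle, doculect):
    """Write (concept, form, segments) triples as a tab-separated wordlist."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for idx, (concept, form, segments) in enumerate(entries, start=1):
        # Segments are joined with spaces, one segment per sound.
        writer.writerow([idx, doculect, concept, form, " ".join(segments)])


# Placeholder entries only; these are NOT real Quechua data.
entries = [
    ("concept-1", "tata", ["t", "a", "t", "a"]),
    ("concept-2", "pana", ["p", "a", "n", "a"]),
]
buffer = io.StringIO()
write_wordlist_tsv(entries, buffer, "ImbaburaQuechua")
wordlist_text = buffer.getvalue()
```

Keeping the concept glosses identical to those already used in the database is what would let the new rows line up with existing concepts.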

LinguList commented 1 year ago

Annotation would probably be faster this time, since one can filter EDICTOR entries for Quechua words and then compare them.

fractaldragonflies commented 1 year ago

> Annotation would probably be faster this time, since one can filter EDICTOR entries for Quechua words and then compare them.

I will work on this today to get the Quechua wordlist as a TSV file.

LinguList commented 1 year ago

Nice!