kinbank / kinura

Data for Kinura project

Investigate Estonian #5

Open SamPassmore opened 1 year ago

SamPassmore commented 1 year ago

There are three Estonian files. Reduce these down to a single file.

The easiest solution will be to take the data from Pajusalu. Terhi has sent Sam an Excel file with these results.

SamPassmore commented 1 year ago

Note: the Pajusalu file only has male-speaking terms, but there is no gender distinction, so they can be copied across.
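A minimal sketch of that copy step, assuming each row is a dict with a `parameter` code whose first letter marks the speaker's gender (`m` for male, `f` for female, e.g. `mF` and `fF`). The column names, the code scheme, and the placeholder terms are assumptions for illustration and are not taken from the Pajusalu file.

```python
def copy_to_female_speaker(rows):
    """Return the rows plus a female-speaker copy of every male-speaker row.

    Assumption (hypothetical schema): each row is a dict with a "parameter"
    key such as "mF", where the leading "m" means male speaker and the
    matching female-speaker code is the same string with a leading "f".
    """
    existing = {row["parameter"] for row in rows}
    out = list(rows)
    for row in rows:
        code = row["parameter"]
        if code.startswith("m"):
            female_code = "f" + code[1:]
            # Skip codes that already have a female-speaker entry.
            if female_code not in existing:
                out.append({**row, "parameter": female_code})
    return out


# Placeholder terms, not real data.
rows = [
    {"parameter": "mF", "word": "term_a"},
    {"parameter": "mM", "word": "term_b"},
]
for row in copy_to_female_speaker(rows):
    print(row["parameter"], row["word"])
```

Existing female-speaker rows are left alone, so running the function twice does not create duplicates.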

tekahon commented 1 year ago

A few comments and clarifications on this: in total there are three versions of Estonian within two files. File 1 (Estonian_esto1258.csv) has a) the one collected from IDS (this is the Kunnap one) and b) the one by Pajusalu. File 2 is the Kinura file collected by Niklas (which should not be touched).

But yeah, it would be clearest and easiest if the Varikin version of Estonian were created from scratch by taking the data from Pajusalu. This way the errors coming from IDS would be excluded automatically.

SamPassmore commented 1 year ago

@tekahon to clarify as I look through the data: there are now only two Estonian files, since the IDS files were identical in kinura and in varikin. The varikin file contains some data from Pajusalu.

Is there any reason why we can't just put these terms into the Kinura file? You say it should not be touched in your comment, but I don't know why that is. It would make sense to me to keep all these terms together (or add support to Niklas' data, since the two collections are mostly the same).

tekahon commented 1 year ago

The point of having Pajusalu's data separate from Kinura is to emphasize and give credit to the expert and native speaker (in this case Pajusalu) who took the time to collect the data. This is also the way we decided to handle e.g. the Erzya files. In addition, I checked, and these two files have quite a few differences, so it is interesting to have the dictionary version and the native speaker's version side by side.

Thus, I still suggest keeping Pajusalu's data as the only Estonian data in the Varikin repository and the Kinura Estonian file in the Kinura repository.

Sorry it took me so long to reply to these questions. I was on holiday for three weeks but am now back in business.

SamPassmore commented 1 year ago

Ok. If they are very different, then it is fine. I wasn't suggesting removing either author, just that if two sources agree on a kin term then we could put two references for that term, rather than maintaining two different files. I'll look at this again shortly.
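To make the two-reference idea concrete, here is a rough sketch under assumed names: rows are dicts with hypothetical `parameter`, `word`, and `source` keys, and the terms are placeholders rather than real data from either file. Terms on which the sources agree collapse into one row listing both references, while terms that differ stay as separate rows.

```python
from collections import defaultdict


def merge_sources(*collections):
    """Merge rows from several collections, combining references on agreement.

    Assumption (hypothetical schema): each row is a dict with "parameter",
    "word", and "source" keys. Two rows agree when both the parameter and
    the word are identical.
    """
    merged = defaultdict(list)
    for rows in collections:
        for row in rows:
            key = (row["parameter"], row["word"])
            if row["source"] not in merged[key]:
                merged[key].append(row["source"])
    return [
        {"parameter": parameter, "word": word, "source": "; ".join(sources)}
        for (parameter, word), sources in merged.items()
    ]


# Placeholder rows, not real data.
kinura = [
    {"parameter": "mF", "word": "term_a", "source": "kinura"},
    {"parameter": "mM", "word": "term_b", "source": "kinura"},
]
pajusalu = [
    {"parameter": "mF", "word": "term_a", "source": "pajusalu"},
    {"parameter": "mM", "word": "term_c", "source": "pajusalu"},
]
for row in merge_sources(kinura, pajusalu):
    print(row)
```

Credit is preserved either way, because every merged row still names each source that attests the term.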

I hope you had a good holiday!

tekahon commented 1 year ago

Ok, cool. Let me know if you need any more info from me on this :)