lexibank / walworthpolynesian

CLDF dataset derived from Walworth's "Polynesian Segmented Data" from 2019
Creative Commons Attribution 4.0 International

refine orthography by manually correcting the problematic items #3

Closed. LinguList closed this issue 4 years ago.

LinguList commented 5 years ago

If you check TRANSCRIPTION.md, you'll find there are a couple of problems due to manual work, which can easily be refined. Just place a list called `lexemes.tsv` into `etc/` and show how the value should be modified. Then you can use that to refine the segmentation. But as a first task, identify only the four outliers, i.e., their identifiers, so we can list them here.
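A minimal sketch of what such an `etc/lexemes.tsv` could look like, assuming the two-column LEXEME/REPLACEMENT layout commonly used in lexibank datasets (the values below are hypothetical placeholders, not the actual corrections):

```tsv
LEXEME	REPLACEMENT
pu+ʔa	pu + ʔa
te+a	te + a
```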

Schweikhard commented 5 years ago

Ah, I see, some spaces are missing: +s, +ʔ, e+, u+. Identifiers? Their IDs? emae1030-1086-1, mangareva239-1432-1, rarotongan58-1243-2, tuvalu753-1424-2

LinguList commented 5 years ago

Their ID at the beginning of the text.

Schweikhard commented 5 years ago

1510, 2010, 1224, 7247 would be the IDs in the original file.

LinguList commented 5 years ago

The easiest way to fix this is to make a Python dictionary mapping each ID to a recommended better form. Then you can use the dictionary's get() method to insert the segments:

```python
Segments={1510: "b l a + b l u".split()}.get(idx) or wl[idx, 'segments'],
```
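Spelled out as a runnable sketch of the same fallback pattern (names hypothetical; `wl` stands in for the LingPy-style wordlist used in the lexibank script, replaced here by a plain dict for illustration):

```python
# Corrections keyed by the numeric ID in the original file.
corrections = {
    1510: "b l a + b l u".split(),  # placeholder form from the comment above
}

# Stand-in for the wordlist: maps (ID, column) to a list of segments.
wl = {
    (1510, 'segments'): "b l a b l u".split(),
    (2010, 'segments'): "t o k i".split(),
}

for idx in (1510, 2010):
    # Use the corrected form if one exists, otherwise fall back to the original.
    segments = corrections.get(idx) or wl[idx, 'segments']
    print(idx, ' '.join(segments))
```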

This is not nice; alternatively, you could make a CORRECTED version of the data, place it in the repository as a copy of the original file, and we can ask Mary to correct it in a new version (which is trivial). But please try it once, so you learn a bit about what happens in the CLDF creation procedure.
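If one goes the corrected-copy route, the loading step might prefer the corrected file whenever it exists. A sketch under assumed file names (neither path is taken from the repository):

```python
from pathlib import Path

# Hypothetical file names: prefer a manually corrected copy of the raw data.
raw_dir = Path('raw')
corrected = raw_dir / 'polynesian-corrected.tsv'
original = raw_dir / 'polynesian.tsv'

source = corrected if corrected.exists() else original
```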

SimonGreenhill commented 4 years ago

These are fixed in the corrected version of the data.