unicode-org / unilex

Lexical data at Unicode

Tok Pisin #7

Open evali1 opened 5 years ago

evali1 commented 5 years ago

Worthwhile project, but the corpus contains a lot of plain English (both American and British/Australian). The source materials probably include texts in English as well as Tok Pisin, and the English ones need to be excluded before data is collected. Also consider purging items with a full stop in the middle, which appear to be omitted spaces rather than bona fide forms. Further, there are a great many proper names, which are presumably not very interesting for the purposes of the endeavor.
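The three checks above could be sketched roughly as follows. This is only an illustration under stated assumptions: `ENGLISH_WORDS` is a tiny hypothetical stand-in for a real English word list, `flag_entry` is a made-up helper name, the capital-letter test only works if the corpus preserves case, and the sample forms are invented for the example rather than taken from the corpus.

```python
import re
from typing import Optional

# Hypothetical stand-in for a real English word list (assumption, not project data).
ENGLISH_WORDS = {"the", "and", "government", "because"}

# A full stop with word characters on both sides, e.g. two words with a missing space.
INTERNAL_STOP = re.compile(r"\w\.\w")


def flag_entry(form: str) -> Optional[str]:
    """Return a reason to review or exclude the form, or None if nothing is flagged."""
    if INTERNAL_STOP.search(form):
        return "internal stop"
    if form[:1].isupper():
        # Only meaningful if the corpus keeps original capitalization (assumption).
        return "possible proper name"
    if form.lower() in ENGLISH_WORDS:
        return "possible English"
    return None


# Invented sample forms, for illustration only.
for sample in ["long.ol", "Rabaul", "government", "bek"]:
    print(sample, "->", flag_entry(sample))
```

Flags like these are better treated as candidates for human review than as automatic deletions, since some Tok Pisin forms are spelled the same as English words.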

The trickier bit is spelling variation in words that are actually the same, which depends on regionally varying pronunciation as well as on varying degrees of influence from English writing. For example, avris and abrus are the same thing ('avoid'; at least one more variant is in there), the forms ending in -im are the transitive versions of the same words, and I expect that 'bek' and 'beck' are the same too.
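As a rough sketch of how such variants might be grouped for review, the snippet below builds a crude key per form. The two rules (collapsing "ck" to "k" and stripping a final "-im") are assumptions drawn only from the examples in this comment, not a validated Tok Pisin orthography; `variant_key` and `group_variants` are hypothetical names, and "avrisim" is an assumed form used to illustrate the -im remark.

```python
from collections import defaultdict


def variant_key(form: str) -> str:
    """Crude grouping key; both rules are illustrative assumptions."""
    key = form.lower().replace("ck", "k")  # treat 'beck' like 'bek'
    if key.endswith("im") and len(key) > 4:
        key = key[:-2]  # pair a transitive -im form with its base
    return key


def group_variants(forms):
    """Group forms by key and keep only groups with more than one member."""
    groups = defaultdict(list)
    for form in forms:
        groups[variant_key(form)].append(form)
    return {key: members for key, members in groups.items() if len(members) > 1}


print(group_variants(["bek", "beck", "avris", "abrus", "avrisim"]))
```

These rules do not merge avris with abrus, and they would wrongly strip any longer word that merely happens to end in -im. Regional alternations like that still need a speaker's judgment, so the groups are only candidates for review.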

I am not familiar enough with the styles of all the regions to clean up the corpus myself, but I wanted to point out the problems that I was able to spot.

Regards,

Eva (Swedish but fluent speaker of the New Ireland dialect after some three years in a village there)

hugolpz commented 3 years ago

Hello @evali1,

Following your ticket, I created a curation process that should allow a volunteer to review the list and exclude words quite quickly. I expect 1000 to 2000 words can be reviewed per hour, so the non-native words among these items can be excluded.

I'm looking for a first user to try the process. Please follow this link. This project is derived from and partnering with UNILEX, which provides the raw data.