studioego / cjklib

Automatically exported from code.google.com/p/cjklib

Frequency data for Characters and Readings #9

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Frequency data is important in various applications of linguistic data,
e.g. sorting or searching. For CJK, several sources of frequency data
built from large corpora exist. Because the choice of corpus strongly
influences the calculated frequencies, cjklib should not focus on a
single corpus but provide a general scheme that lets the user select an
appropriate source.

Possible sources:
  - Unihan for reading frequencies
  - GPL Pinyin frequencies, http://technology.chtsai.org/syllable/
  - Jun Da's lists (http://lingua.mtsu.edu/chinese-
  - Frequencies for Chinese, http://technology.chtsai.org/charfreq/,
    unclear license

cjklib is LGPL and should stay this way. Mixing in data under
non-commercial licenses is not possible, and even GPL sources should be
considered carefully. The data doesn't necessarily need to be shipped,
though: a TableBuilder can be created that allows the user to add the
data later, if requested.
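
The opt-in idea could look roughly like the sketch below. It does not use
cjklib's actual TableBuilder interface: `build_frequency_table`,
`has_frequency_data`, the `CharacterFrequency` table and its columns are
hypothetical names chosen for illustration, and the counts are made up. It
only shows that the table exists once the user has supplied the data.

```python
import sqlite3

# Hypothetical table name, not part of cjklib's schema.
TABLE = "CharacterFrequency"


def has_frequency_data(conn):
    """Return True if the optional frequency table has been built."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (TABLE,),
    ).fetchone()
    return row is not None


def build_frequency_table(conn, rows):
    """Create and fill the frequency table from user-supplied rows.

    rows is an iterable of (character, count) tuples that the user
    obtained separately, e.g. from a source that cannot be shipped.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS CharacterFrequency ("
        "ChineseCharacter TEXT PRIMARY KEY, Frequency INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO CharacterFrequency VALUES (?, ?)", rows
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
before = has_frequency_data(conn)   # nothing shipped by default
build_frequency_table(conn, [("的", 70), ("一", 20)])  # made-up counts
after = has_frequency_data(conn)    # data added on request
row_count = conn.execute(
    "SELECT COUNT(*) FROM CharacterFrequency"
).fetchone()[0]
```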

CharacterDomains, which are already implemented, could be considered a
similar feature. They depend on defining sources and are offered through a
consistent abstraction. Frequency data could thus be implemented in a
similar fashion.
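
One possible shape for such an abstraction is sketched here. Everything in
it is an assumption for illustration: `register_source`, `get_frequencies`
and `sort_by_frequency` are hypothetical names, not cjklib functions, and
the two "corpora" are toy counts invented to show that the ordering
changes with the selected source.

```python
# Hypothetical registry mapping a source name to a loader that returns
# {character: relative frequency}. None of these names exist in cjklib.
_SOURCES = {}


def register_source(name):
    """Decorator registering a loader function under a source name."""
    def decorator(loader):
        _SOURCES[name] = loader
        return loader
    return decorator


def _normalize(counts):
    """Turn raw counts into relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


@register_source("toy_corpus_a")
def _load_a():
    return _normalize({"的": 70, "一": 20, "是": 10})  # made-up counts


@register_source("toy_corpus_b")
def _load_b():
    return _normalize({"的": 30, "一": 50, "是": 20})  # made-up counts


def get_frequencies(source):
    """Return the frequency mapping for the source chosen by the user."""
    if source not in _SOURCES:
        raise ValueError("unknown frequency source: %r" % source)
    return _SOURCES[source]()


def sort_by_frequency(chars, source):
    """Sort characters by descending frequency in the given source."""
    freq = get_frequencies(source)
    return sorted(chars, key=lambda ch: freq.get(ch, 0.0), reverse=True)


order_a = sort_by_frequency("一的是", "toy_corpus_a")
order_b = sort_by_frequency("一的是", "toy_corpus_b")
```

Callers only name a source, so adding another corpus would mean registering
one more loader rather than changing the lookup code.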

Original issue reported on code.google.com by christop...@gmail.com on 13 Aug 2009 at 9:36

GoogleCodeExporter commented 9 years ago
With r168 cjklib now includes a Frequency column for Table
CharacterXHPCPinyin which holds the Xiandai Hanyu Pinlu Cidian
pronunciation and frequency data (see
http://www.unicode.org/reports/tr38/tr38-6.html#kHanyuPinlu).

This frequency is given for the character/reading pair, while other
frequency data might be given for characters or readings only.
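
To illustrate the difference in granularity, here is a small sketch using an
in-memory SQLite table. Only the table name and the Frequency column come
from the comment above; the `ChineseCharacter` and `Reading` column names
are assumptions, the rows are made-up values, and the helper functions are
hypothetical. Whether summing pair frequencies gives a meaningful
per-character or per-reading figure depends on the source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column names other than Frequency are assumed for this sketch.
conn.execute(
    "CREATE TABLE CharacterXHPCPinyin ("
    "ChineseCharacter TEXT, Reading TEXT, Frequency INTEGER)"
)
conn.executemany(
    "INSERT INTO CharacterXHPCPinyin VALUES (?, ?, ?)",
    [  # made-up values, not real corpus data
        ("的", "de", 90),
        ("的", "dì", 10),
        ("地", "de", 30),
        ("地", "dì", 20),
    ],
)


def pair_frequency(conn, char, reading):
    """Frequency stored for one character/reading pair (0 if absent)."""
    row = conn.execute(
        "SELECT COALESCE(SUM(Frequency), 0) FROM CharacterXHPCPinyin "
        "WHERE ChineseCharacter = ? AND Reading = ?",
        (char, reading),
    ).fetchone()
    return row[0]


def character_frequency(conn, char):
    """Frequency of a character, summed over all of its readings."""
    row = conn.execute(
        "SELECT COALESCE(SUM(Frequency), 0) FROM CharacterXHPCPinyin "
        "WHERE ChineseCharacter = ?",
        (char,),
    ).fetchone()
    return row[0]


def reading_frequency(conn, reading):
    """Frequency of a reading, summed over all characters carrying it."""
    row = conn.execute(
        "SELECT COALESCE(SUM(Frequency), 0) FROM CharacterXHPCPinyin "
        "WHERE Reading = ?",
        (reading,),
    ).fetchone()
    return row[0]
```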

Original comment by christop...@gmail.com on 16 Aug 2009 at 11:58