Reduce the uncommon characters in Chinese and Korean

ogallagher / wordsearch_generator

Multilingual wordsearch (word search) generator

https://wordsearch.dreamhosters.com

MIT License

6 stars, 2 forks

Reduce the uncommon characters in Chinese and Korean #32

Closed. GrimPixel closed this issue 2 years ago

GrimPixel commented 2 years ago

In the demo, most of the Chinese and Korean characters are obsolete, so puzzles filled with them are no fun at all.

ogallagher commented 2 years ago

Convenient Korean syllable relative-frequency data; the results can be downloaded in 가나다순 (Korean alphabetical) order.

GrimPixel commented 2 years ago

For Chinese, there are lists of frequent characters.
For the People's Republic of China: https://zh.wikisource.org/wiki/%E9%80%9A%E7%94%A8%E8%A7%84%E8%8C%83%E6%B1%89%E5%AD%97%E8%A1%A8
For the Republic of China: https://zh.wikisource.org/wiki/%E5%B8%B8%E7%94%A8%E5%9C%8B%E5%AD%97%E6%A8%99%E6%BA%96%E5%AD%97%E9%AB%94%E8%A1%A8

ogallagher commented 2 years ago

Adding an alphabet_char_sets/ directory of \n-delimited character list files, referenced from a new charsets member on each alphabet.
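
As a rough illustration of the file layout described here, the following Python sketch reads a newline-delimited character list. It is not the project's code: the function name load_charset, the shape of the alphabet entry, and the file path are all assumptions made for the example.

```python
from pathlib import Path


def load_charset(path):
    """Read a newline-delimited character list.

    Blank lines are skipped and repeated characters are kept only once,
    preserving the order in which they first appear in the file.
    """
    seen = set()
    chars = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        ch = line.strip()
        if ch and ch not in seen:
            seen.add(ch)
            chars.append(ch)
    return chars


# Hypothetical alphabet entry: "charsets" maps a charset name to its file.
# The structure and the path are assumptions for this sketch only.
alphabet = {
    "name": "zh",
    "charsets": {"zh-4808": "alphabet_char_sets/zh-4808.txt"},
}
```

One line per character keeps the files easy to generate from published character lists and easy to diff when a list is trimmed.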

ogallagher commented 2 years ago

For Chinese, there are lists of frequent characters.
For the People's Republic of China: https://zh.wikisource.org/wiki/%E9%80%9A%E7%94%A8%E8%A7%84%E8%8C%83%E6%B1%89%E5%AD%97%E8%A1%A8
For the Republic of China: https://zh.wikisource.org/wiki/%E5%B8%B8%E7%94%A8%E5%9C%8B%E5%AD%97%E6%A8%99%E6%BA%96%E5%AD%97%E9%AB%94%E8%A1%A8

Corresponding character set files:

ogallagher commented 2 years ago

[x] add charset control to webpage component

Use a widget similar to the controls I made for the alphabet and the example config file.

ogallagher commented 2 years ago

[x] example config file with a preselected charset (e.g. zh-4808) loads properly

Modify the WordsearchGenerator constructor to use the selected charset if it is populated.
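
The fallback behaviour could look roughly like the Python sketch below. The class is only a stand-in that borrows the WordsearchGenerator name; the parameter names (alphabet, charset, seed) and the random_cell method are assumptions for illustration, not the project's actual signature.

```python
import random


class WordsearchGenerator:
    """Illustrative stand-in, not the project's real class."""

    def __init__(self, alphabet, charset=None, seed=None):
        # Use the selected charset only if it is populated;
        # otherwise fall back to the full alphabet.
        self.fill_chars = list(charset) if charset else list(alphabet)
        if not self.fill_chars:
            raise ValueError("no characters available to fill the grid")
        self.rng = random.Random(seed)

    def random_cell(self):
        """Pick one filler character uniformly from the active set."""
        return self.rng.choice(self.fill_chars)


# A populated charset wins; an empty one falls back to the alphabet.
with_charset = WordsearchGenerator("abc", charset=["的", "一"], seed=0)
without_charset = WordsearchGenerator("abc", charset=[])
```

Treating an empty charset the same as a missing one means a config file that omits the charset keeps working unchanged.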

ogallagher commented 2 years ago

@GrimPixel as of now, the functionality should exist to handle better character sets for Chinese and Korean. Both cases still err on the side of using too many characters, but further improvement should just be tuning/customization of what's already been done.

GrimPixel commented 2 years ago

Thanks a lot for your work.

I guess I had some misunderstanding. These lists are “common character lists” instead of “character frequency lists”. So their rankings are not based on frequencies.

I have found lists that are ranked by frequency.

For Mandarin Chinese in the PRC: https://lingua.mtsu.edu/chinese-computing/ https://lingua.mtsu.edu/chinese-computing/statistics/char/list.php?Which=MO
For Mandarin Chinese in the ROC: http://technology.chtsai.org/charfreq/ http://technology.chtsai.org/charfreq/94charfreq.html
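
One way such frequency-ranked data could be turned into a smaller charset is to keep only the most frequent characters until they cover a chosen share of all occurrences. The Python sketch below assumes the data has already been parsed into a character-to-count mapping; the function name top_by_coverage and the toy counts are made up for illustration and do not come from the linked lists.

```python
def top_by_coverage(freqs, coverage):
    """Return the most frequent characters whose counts together reach
    at least `coverage` (a fraction between 0 and 1) of the total count."""
    total = sum(freqs.values())
    ranked = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)
    kept = []
    running = 0
    for ch, count in ranked:
        if running >= coverage * total:
            break
        kept.append(ch)
        running += count
    return kept


# Toy counts, invented for this example only.
toy_counts = {"的": 50, "一": 30, "是": 15, "龘": 5}
common = top_by_coverage(toy_counts, 0.9)
```

A coverage threshold adapts to each list, whereas a fixed cutoff such as "top 3000" would need to be chosen separately per language.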

GrimPixel commented 2 years ago

I wonder whether letter frequency could also be used for other languages when generating the game.
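
For any language with a letter-frequency table, filler cells could be drawn in proportion to those frequencies instead of uniformly. The sketch below is a minimal Python illustration using the standard library's random.choices; the function name sample_fill and the weights are placeholders, not real frequency data or project code.

```python
import random


def sample_fill(freqs, k, seed=None):
    """Draw k filler letters, each chosen with probability proportional
    to its weight in `freqs` (a letter-to-weight mapping)."""
    rng = random.Random(seed)
    letters = list(freqs)
    weights = [freqs[c] for c in letters]
    return rng.choices(letters, weights=weights, k=k)


# Placeholder weights for illustration; a zero weight excludes a letter.
fill = sample_fill({"e": 12, "t": 9, "z": 0}, k=50, seed=1)
```

Weighted sampling keeps rare letters possible but uncommon, which may suit alphabetic languages better than cutting them out entirely.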