parlr / ruby-font-creator

Generate rich Unicode open fonts with custom annotations, transliterations, pronunciations.

Data gathering #19

Closed hugolpz closed 7 years ago

hugolpz commented 7 years ago

We are currently looking for a database with entries such as { "glyph": "西", "phonetic": "xī" } (or xi1, or other alternatives).

Possible sources, with information still to complete:

Moedict

[ ] link (to complete)
[ ] json format
[ ] range: most common characters, traditional only?

Unicode

[ ] link (to complete)
[x] xml
[x] range: traditional/modern; more complete for a font
[ ] which phonetic format is also provided ("glyph": "西", "phonetic": "xī", or xi1?)

CJKlib

[x] link

edouard-lopez commented 7 years ago

What about Unihan?

With the hexadecimal codepoint we can get the glyph like this in Python:

>>> print(chr(int('0x897F', 16)))
西

A JS solution would be better, but this is outside the scope of the project, so we can do it any way we see fit.
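To make the idea concrete, here is a minimal sketch that turns Unihan-style lines into the { "glyph", "phonetic" } records the issue asks for. It assumes the tab-separated `U+XXXX<TAB>field<TAB>value` layout and the `kMandarin` field name of the Unihan readings file; the function name `parse_unihan_line` and the two sample lines are made up for illustration, so check them against the real file before relying on this.

```python
import json

def parse_unihan_line(line):
    """Return a {"glyph", "phonetic"} record for a kMandarin line, else None.

    Assumption: each data line is 'U+XXXX<TAB>field<TAB>value' and
    comment lines start with '#'.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    codepoint, field, value = line.split("\t", 2)
    if field != "kMandarin":
        return None
    # Drop the 'U+' prefix and convert the hexadecimal codepoint to a glyph.
    glyph = chr(int(codepoint[2:], 16))
    # A value may hold several space-separated readings; keep the first one.
    return {"glyph": glyph, "phonetic": value.split()[0]}

# Illustrative sample lines, not copied from the real database.
sample = "U+897F\tkMandarin\txī\nU+897F\tkCantonese\tsai1\n"
records = [r for r in map(parse_unihan_line, sample.splitlines()) if r]
print(json.dumps(records, ensure_ascii=False))
```

Only the `kMandarin` line produces a record here; other fields are skipped, which keeps the output close to the dictionary shape discussed at the top of the issue.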

hugolpz commented 7 years ago

Please check out:

unihan results -- npm unihan -- npm convertPinyin -- npm unihan-cjk

screenshot from 2017-03-16 17-59-36

PinNum2PinTones -- npm pinyinize -- pinyin-string -- best one !
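For reference, the numbered-to-tone-mark conversion these packages perform (xi1 to xī) can be sketched in a few lines. This is not the API of any of the linked libraries; `numbered_to_marked` and `TONE_MARKS` are hypothetical names, and the sketch assumes one syllable at a time, with `v` or `u:` standing in for ü and tone 5 or 0 meaning the neutral tone.

```python
# Each vowel maps to its forms for tones 1-4, in order.
TONE_MARKS = {
    "a": "āáǎà",
    "e": "ēéěè",
    "i": "īíǐì",
    "o": "ōóǒò",
    "u": "ūúǔù",
    "ü": "ǖǘǚǜ",
}

def numbered_to_marked(syllable):
    """Convert one numbered pinyin syllable (e.g. 'xi1') to tone marks."""
    syllable = syllable.lower().replace("u:", "ü").replace("v", "ü")
    if not syllable or not syllable[-1].isdigit():
        return syllable  # no tone digit: return unchanged
    base, tone = syllable[:-1], int(syllable[-1])
    if tone not in (1, 2, 3, 4):
        return base  # neutral tone carries no mark
    # Placement rule: 'a' or 'e' takes the mark, then the 'o' of 'ou',
    # otherwise the last vowel of the syllable.
    if "a" in base:
        idx = base.index("a")
    elif "e" in base:
        idx = base.index("e")
    elif "ou" in base:
        idx = base.index("o")
    else:
        vowels = [i for i, ch in enumerate(base) if ch in TONE_MARKS]
        if not vowels:
            return base
        idx = vowels[-1]
    return base[:idx] + TONE_MARKS[base[idx]][tone - 1] + base[idx + 1:]

print(numbered_to_marked("xi1"))
```

Storing the numbered form in the dictionary and converting at build time would let one data file serve both phonetic formats mentioned in the issue.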

edouard-lopez commented 7 years ago

Thanks for the link; cjk-unihan might be useful for other projects.

I think it's better to limit the project to generating fonts and outsource data gathering/validation to another project. This way we stay focused and efficient.

I'm closing this, as different users might have different needs and hence handcraft their own dictionaries.

edouard-lopez commented 7 years ago

I reckon the JS solution is in the tobei/unihan code:

const character = String.fromCodePoint(parseInt(code.substring(2), 16));

hugolpz commented 7 years ago

Did you gather the data?

edouard-lopez commented 7 years ago

Not yet, could you work on a project to do so?

hugolpz commented 7 years ago

screenshot from 2017-03-17 11-16-25

edouard-lopez commented 7 years ago

@hugolpz I think you have a typo in your comment: there is a ratio of 1:10 between node-pinyin and unihan character/phonetic pairs. Can you confirm or correct this number?

hugolpz commented 6 years ago

https://github.com/superbiger/pinyin4js/blob/master/src/dict/pinyin.dict.js

edouard-lopez commented 6 years ago

We can get the codepoint using punycode
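For comparison, the same glyph-to-codepoint lookup in Python needs no extra library, since `ord` works on full code points, including characters outside the Basic Multilingual Plane. This is only a sketch of the reverse direction of the earlier `chr(int(...))` example; `to_codepoint` and `from_codepoint` are hypothetical helper names, and the `U+XXXX` output format is an assumption about what the dictionary would store.

```python
def to_codepoint(glyph):
    """Return the 'U+XXXX' notation for a single character."""
    return "U+{:04X}".format(ord(glyph))

def from_codepoint(code):
    """Inverse helper: 'U+897F' back to the character."""
    return chr(int(code[2:], 16))

print(to_codepoint("西"))
```

A round trip such as `from_codepoint(to_codepoint(glyph)) == glyph` is a cheap sanity check when building a dictionary from mixed sources.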
