skishore / makemeahanzi

Free, open-source Chinese character data
https://www.skishore.me/makemeahanzi/

Stroke order for traditional and simplified Chinese are sometimes different #1

Closed DanielChu closed 6 years ago

DanielChu commented 8 years ago

Stroke orders for characters such as 問 are different for simplified and traditional Chinese. See: https://en.wikipedia.org/wiki/Stroke_order

We probably need to adjust the format of graphics.txt to take this into account, or create a separate graphics.txt. It might be pretty easy to fix this if the traditional Chinese font does have a different order than the simplified Chinese font for the same character.

skishore commented 8 years ago

Thanks for the report. Yes, all the stroke orders that I've computed here are based on PRC stroke order. When I started this project I was unaware of these differences, although I learned some about them through the work.

Do you know of a reference that has a comprehensive list of differences between the various stroke orders? That Wikipedia page only gives the 問 example. I know of the grass radical as well, but I wasn't able to find a complete list by browsing around.

I added a Future Work section to the README with a list of potential improvements. Is this data that you would use if it were available?

DanielChu commented 8 years ago

I don't know where there is a complete list of the differences. I think the best way forward might just be to adjust the graphics.txt format so that, in the future, people can submit alternate stroke orders for the same character, marking which type of order each one is (PRC, HK, TW, JP). I think it can definitely be useful, since Hong Kong and Taiwan still teach the traditional stroke orders.
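
To make the proposal concrete, here is a minimal sketch of how a variant-tagged graphics.txt could be read. It assumes one JSON object per line with "character" and "strokes" keys, and adds a hypothetical "variant" field that the project does not define. The stroke values are placeholders, not real stroke data or real orders for 問.

```python
import json

# Hypothetical extension of the line format: each record carries a "variant"
# tag (an assumed field name). The stroke values are placeholder labels only.
SAMPLE_LINES = [
    '{"character": "問", "variant": "PRC", "strokes": ["a", "b", "c"]}',
    '{"character": "問", "variant": "TW", "strokes": ["b", "a", "c"]}',
    '{"character": "一", "variant": "PRC", "strokes": ["h"]}',
]


def load_variants(lines):
    """Index records by (character, variant); untagged records count as PRC."""
    table = {}
    for line in lines:
        record = json.loads(line)
        variant = record.get("variant", "PRC")
        table[(record["character"], variant)] = record["strokes"]
    return table


def strokes_for(table, character, variant, default="PRC"):
    """Return the requested variant, falling back to the default order."""
    key = (character, variant)
    if key in table:
        return table[key]
    return table[(character, default)]


table = load_variants(SAMPLE_LINES)
print(strokes_for(table, "問", "TW"))  # variant-specific order
print(strokes_for(table, "一", "TW"))  # no TW entry, falls back to PRC
```

Treating untagged records as PRC would keep existing data valid, so alternate orders could be added one character at a time.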

skishore commented 6 years ago

You may want to check out @parsimonhi's work on Japanese stroke order data. They used this dataset as a starting point but did a bunch of extra work to add Japanese characters that don't have corresponding hanzi and to deal with stroke order differences: https://github.com/parsimonhi/animCJK

I don't have the time to do similar work for all the various stroke orders myself, but I hope that others can fill in those gaps. In particular, if I can provide rigorously curated PRC stroke order data, it should be possible to automate producing other types of stroke order data.

hugolpz commented 6 years ago

Hey @skishore, I have some expertise on this matter. I created and co-wrote most of the section on stroke order per polity, and I knew the sources well back in 2008-10. Also, as of 2018:

  1. Can your software handle multiple stroke orders? {PRC=default|t|j|k|h}?
  2. Do you still need a list of radicals with their official stroke order per polity? (I could compile one within the year.)
  3. What progress have you made on this front?

EDIT: Moved to /parsimonhi/animCJK/issues/1

skishore commented 6 years ago

This project only includes PRC stroke order, and I don't have plans to make data for other orders. The animCJK project has t and j orderings for 1k-2k of these characters, though.

There's enough data in this project's output that it should be possible to automate the generation of orderings for other characters. For example, for all characters I have here, I have both the stroke order graphics and a "matches" field that shows how the strokes in a given character map to strokes in its components. If you were to change the stroke order of a component (not necessarily a radical), you could use those "matches" decompositions to automatically infer stroke order changes for all characters using that component.

I used a similar process to produce candidate stroke orders in this project itself, and it sped things up by a lot - for the most part, I just had to go through and do a quick verification of the resulting order.
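
The propagation idea described above can be sketched roughly as follows. This is a simplified model, not the project's actual code: it assumes each stroke of a character is matched to a flat (component, stroke index) pair or to None, whereas the real "matches" data may be structured differently. The function name and the sample data are hypothetical.

```python
def propagate_reorder(strokes, matches, component, new_order):
    """Apply a component's new stroke order to a character that uses it.

    strokes:   the character's strokes in their current writing order
    matches:   one entry per stroke, either None or (component, index), where
               index is the stroke's position in the component's old order
    new_order: old component indices listed in the new writing order
    The component's strokes keep the same slots in the character; only the
    strokes occupying those slots are rearranged.
    """
    slots = [i for i, m in enumerate(matches)
             if m is not None and m[0] == component]
    if not slots:
        return list(strokes)  # the character does not use this component
    position_of = {matches[i][1]: i for i in slots}
    if sorted(position_of) != sorted(new_order):
        raise ValueError("new_order does not cover the matched strokes")
    result = list(strokes)
    for slot, old_index in zip(slots, new_order):
        result[slot] = strokes[position_of[old_index]]
    return result


# Placeholder data: component "P" has three strokes, component "Q" has one.
strokes = ["p0", "p1", "p2", "q0"]
matches = [("P", 0), ("P", 1), ("P", 2), ("Q", 0)]
print(propagate_reorder(strokes, matches, "P", [1, 0, 2]))
# -> ['p1', 'p0', 'p2', 'q0']
```

Running this over every character whose matches mention the changed component would give candidate orders, which would still need the kind of manual verification mentioned above.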

hugolpz commented 6 years ago

Hahahaha. Witty. Do you know of the CCDL's CDL? They also use inheritance/cascading, and they are the source project for Unicode's Unihan shapes. They are cool crazies with 80k characters designed with cascading in mind, from <50 strokes to ~1000 graphic elements to 80k characters. Their description paper is short, 6 pages, and quite cool to read.

OK, as for the per-polity orders, I will check with animCJK whether they need a cross-check or support from me and our other CJK nerds ^^. cc: @parsimonhi

PS: I'm catching up with your projects and efforts. My apologies for my many questions, but it's for the greater good :+1:

skishore commented 6 years ago

Yes, all of the "matches" decompositions are written in CDL! I started with the Unihan CDL codes but curated them by hand, though, so I think they're more accurate than other sources.

parsimonhi commented 6 years ago

AnimCJK takes into account that the same Unicode code point can have several glyphs, stroke counts, or stroke orders. The solution is simple: character files are duplicated and modified as necessary. For instance, the character "王" (U+738B) has two different stroke orders. The corresponding file, 29579.svg, in the svgsJa repository (i.e. for Japanese) is not the same as the one in svgsZhHans (i.e. for simplified Chinese): the second and third strokes are swapped.
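
As a small illustration of the naming scheme in this example, the file name is the character's code point written in decimal. The helper names below are hypothetical, and the stroke labels are placeholders rather than real stroke data.

```python
def svg_filename(character):
    """File name built from the decimal Unicode code point, as in 29579.svg."""
    return f"{ord(character)}.svg"


def variant_path(character, folder):
    """Path of the per-language copy, e.g. inside svgsJa or svgsZhHans."""
    return f"{folder}/{svg_filename(character)}"


def swap_strokes(strokes, i, j):
    """Return a copy of the stroke list with positions i and j exchanged."""
    result = list(strokes)
    result[i], result[j] = result[j], result[i]
    return result


print(svg_filename("王"))                # -> 29579.svg
print(variant_path("王", "svgsJa"))      # -> svgsJa/29579.svg
print(variant_path("王", "svgsZhHans"))  # -> svgsZhHans/29579.svg

# Placeholder labels for four strokes; swapping the second and third:
print(swap_strokes(["s1", "s2", "s3", "s4"], 1, 2))
# -> ['s1', 's3', 's2', 's4']
```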

At the moment, 2998 Japanese characters are in the svgsJa repository, and 3538 characters are in the svgsZhHans repository. I am working on the Taiwanese version of a set of 4808 frequently used characters, but I am far from completing the task. I haven't considered other character sets so far.

I did my best to check the data against various sources. However, I cannot guarantee that animCJK is error-free.

About the CCDL's CDL: is it multilingual, or focused only on Chinese? I couldn't find that information.

Anyway, stroke order is a difficult issue. There are inconsistencies (or errors?) everywhere. Even the number of strokes is not always well defined. For instance, does the radical 阝 have 2 strokes or 3, and in which character set? Moreover, you cannot always rely on component decomposition to automatically derive stroke count and stroke order. For instance, in Japanese, the component 牙 sometimes has 5 strokes, as in 芽, and sometimes has 4 strokes, as in 穿 (and as in simplified Chinese).

There are also changes made from time to time. For instance, the Japanese standard JIS X 0213 was revised in 2004 (see https://kakijun.jp/main/jis2004.html), which changed a significant number of characters (as a result, KanjiVG, which is one of the best references for Japanese, is not up to date at the moment). So I don't think there is a single up-to-date source for all characters and all languages.

hugolpz commented 6 years ago

parsimonhi commented 6 years ago

The folder for ROC will be named svgsZhTw. A zhHant character is not always the same as a zhTw character. It would be too simple otherwise, don't you think? :-)

hugolpz commented 6 years ago

(Ping: edit made above.) Note: 張 & 張 (2013: pp. 22-25) cites 32 CN vs TW stroke order variations and lists their cascading impacts.

P. 20 is interesting (can you read Chinese?).

[Image: screenshot from 2018-01-22 20-21-58]

| | Same shape | Different shape | Sum |
|---|---|---|---|
| Same order | 2,407 | 709 | 3,116 |
| Different order | 383 | 1,309 | 1,692 |
| Sum | 2,790 | 2,018 | 4,808 |

parsimonhi commented 6 years ago

Thanks for the link: it looks very interesting.