veekun / pokedex

more than you ever wanted to know about Pokémon
MIT License

Use of language tag ja-Hrkt is inconsistent #256

Open SethETaron opened 5 years ago

SethETaron commented 5 years ago

As the name suggests, and according to ISO 15924 (where the script code Hrkt denotes hiragana plus katakana), records associated with the language tag ja-Hrkt should contain only hiragana and katakana. All records consisting of kanji, or of a mix of kanji and hiragana and/or katakana, should instead be associated with the language tag ja (internal language ID 11). We are instead associating many records with internal language ID 1 (ja-Hrkt), even when they contain kanji. Examples of this can be found in version_names.csv, region_names.csv, pokemon_color_names.csv, and several other files.

It would be good to refactor this by adding records associated with language ID 11 to all tables that have language ID 1 records but no language ID 11 records. I can't think of a scenario where it makes sense to have a ja-Hrkt record but no ja record. The ja record should contain the most standard way to write the data, and the ja-Hrkt record should contain the all-hiragana/katakana form of the same text. This would make the Japanese data much more standardized and easier to use.
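A minimal sketch of how the gap described above could be found in a single names file, assuming the column layout used in pokemon_color_names.csv (a key column, `local_language_id`, `name`) and the language IDs given in this issue (1 for ja-Hrkt, 11 for ja). The helper name `ids_missing_ja` and the sample data are hypothetical, for illustration only.

```python
import csv
import io

JA_HRKT_ID = 1   # ja-Hrkt, per this issue
JA_ID = 11       # ja, per this issue


def ids_missing_ja(csv_text, key_column):
    """Return keys that have a ja-Hrkt row but no ja row (hypothetical helper)."""
    has_hrkt, has_ja = set(), set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        lang = int(row["local_language_id"])
        if lang == JA_HRKT_ID:
            has_hrkt.add(row[key_column])
        elif lang == JA_ID:
            has_ja.add(row[key_column])
    # Keys are kept as strings; sort them numerically for readability.
    return sorted(has_hrkt - has_ja, key=int)


# Tiny made-up sample: color 1 has both records, color 2 lacks the ja record.
sample = """pokemon_color_id,local_language_id,name
1,1,くろい
1,9,Black
1,11,黒い
2,1,あおい
2,9,Blue
"""

missing = ids_missing_ja(sample, "pokemon_color_id")
print(missing)  # ['2']
```

Running the same check over each `*_names.csv` file would give a list of tables and keys that still need a language ID 11 record.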

Here's an example of how a refactored version of one of these files would look:

pokemon_color_id,local_language_id,name
1,1,くろい
1,5,Noir
1,6,Schwarz
1,8,Nero
1,9,Black
1,11,黒い
2,1,あおい
2,5,Bleu
2,6,Blau
2,8,Blu
2,9,Blue
2,11,青い
3,1,ちゃいろ
3,5,Brun
3,6,Braun
3,8,Marrone
3,9,Brown
3,11,茶色
4,1,はいいろ
4,5,Gris
4,6,Grau
4,8,Grigio
4,9,Gray
4,11,灰色
5,1,みどり
5,5,Vert
5,6,Grün
5,8,Verde
5,9,Green
5,11,緑
6,1,ピンク
6,5,Rose
6,6,Rosa
6,8,Rosa
6,9,Pink
6,11,ピンク
7,1,パーパル
7,5,Violet
7,6,Violett
7,8,Viola
7,9,Purple
7,11,パーパル
8,1,あかい
8,5,Rouge
8,6,Rot
8,8,Rosso
8,9,Red
8,11,赤い
9,1,しろい
9,5,Blanc
9,6,Weiß
9,8,Bianco
9,9,White
9,11,白い
10,1,きいろ
10,5,Jaune
10,6,Gelb
10,8,Giallo
10,9,Yellow
10,11,黄色

As you can see, there's some redundancy between some of the records, and I'm not sure if that's acceptable, but especially in a small table like this, it seems a small price to pay for consistency.

sdcinglis commented 5 years ago

Don't quote me on this, but I believe the language options are parsed from the actual game data, meaning this is how Game Freak uses these language options. The ISO codes may have been added for convenience's sake. I'm not sure why some fields don't have every language, or even every permutation of the same language, though.

magical commented 3 years ago

There was originally no distinction between ja and ja-Hrkt. Language ID 1 was ja. When the games added the option to switch between kana and kanji text (Gen 5?), we added a new language ID for kanji and changed the old one to ja-Hrkt, on the grounds that the games had mostly stuck to kana up until that point. I guess some tables with translations not ripped from the games (like the colors) were using kanji though. They should be updated with the new language ID.

Here's a complete list of ja-Hrkt text containing kanji characters: https://gist.github.com/magical/2c47e9605de7f1b22981cefd2b812ea8. I'm not sure what's going on with location_names, but other than that they should be easy to fix up.
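For anyone who wants to re-check a file after fixing it, here is a rough sketch of one way to flag ja-Hrkt rows that contain kanji. It is not necessarily how the list in the gist was produced. It assumes that kanji can be approximated by the CJK Unified Ideographs block (U+4E00 to U+9FFF) plus Extension A (U+3400 to U+4DBF), so marks such as 々 (U+3005) are not caught, and the names `contains_kanji` and `kanji_rows` are hypothetical.

```python
import re

# Assumption: kanji are approximated by CJK Unified Ideographs and Extension A.
KANJI_RE = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]")


def contains_kanji(text):
    """True if the text has at least one character in the assumed kanji ranges."""
    return KANJI_RE.search(text) is not None


def kanji_rows(rows, ja_hrkt_id=1):
    """Filter (key, local_language_id, name) tuples down to ja-Hrkt rows with kanji."""
    return [r for r in rows if r[1] == ja_hrkt_id and contains_kanji(r[2])]


# Made-up rows for illustration: only the second one should be flagged.
rows = [
    (1, 1, "くろい"),   # kana only, fine as ja-Hrkt
    (2, 1, "青い"),     # kanji under ja-Hrkt, should move to ja
    (2, 11, "青い"),    # kanji under ja, fine
    (6, 1, "ピンク"),   # katakana only, fine as ja-Hrkt
]

for row in kanji_rows(rows):
    print(row)
```

Hiragana, katakana, and the long vowel mark ー fall outside the assumed ranges, so all-kana names pass the check unchanged.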