matthew-e-brown / AC-Dataparse

Scripts for datamining dialogue files from Animal Crossing: New Horizons
GNU General Public License v3.0

Figure out parsing escape sequences #1

Open matthew-e-brown opened 4 years ago

matthew-e-brown commented 4 years ago

Most, if not all, of the messages contain what I believe are escape sequences. Most of them start with 00 0E bytes; \u000e when decoded to UTF-16-LE. While many of them seem to be the same length, they don't all share an obvious common structure.

For example, here is how Blathers's comments on the Goldfish are stored, once exported to JSON:

"Fish_00329": {
        "attr": "\u0000\u0000",
        "msgs": [
          "\u000e(\u0006\u0004촀",
          "\u000e",
          "\u0003\u0002\u0001\u000e2",
          "\u0004Ā촃Goldfish \u000e",
          "\u0003\u0002￿are so cute and delicate\u000en",
          "",
          "\u000e\n\u0001",
          "\n\u000e(&\u0004촀",
          "but do you know how big they\ncan get?\n\u000e(9\u0004촀",
          "Why, they can grow up to\u000en",
          "",
          "\u000e\n\u0002",
          "\n\u000e\n\n",
          "\u000e(\u001b\u0004촀",
          "\u000e",
          "\u0002\u0002–a foot in length!\u000e",
          "\u0002\u0002d\u000e\n\t",
          "\u000e\n\u0002",
          "\u000e(\u0006\u0004촀",
          "\u000e",
          "\u0004",
          "\u000e()\u0004촀",
          "Well, sometimes. \u000e\n",
          "\u0002(The size of the\ntank they're kept in tends to\nrestrict their growth.\n\u000e(\r\u0004촀",
          "And just how big will this \u000e",
          "\u0003\u0002\u0001\u000e2",
          "\u0004Ā촃goldfish\n\u000e",
          "\u0003\u0002￿get in our large museum tank?\u000e\n",
          "\u0002(\u000e(6\u0004촀",
          "\nI look forward to finding out!"
        ]
      }

These escape sequences are likely triggers for

  1. Highlighting text with different colours. See \u0004Ā촃 (bytes 00 04 01 00 CD 03) before Goldfish? That's probably to highlight "Goldfish" with green or blue.
  2. Triggering different text sizes. In many speech bubbles, some text is smaller than others.
  3. Triggering different emotions, like a "scared" emote when talking about insects.
  4. Maybe spacing? Something along the lines of "tabbing-in?" Perhaps things are centered in their text boxes manually.
  5. Other effects, like slowly typing ellipses after things like "I wonder..." in the Crucian Carp's fact blurb.

I am considering decompiling the game's binaries to find where the game reads these files. Perhaps that will give some insight into what each of these sequences is doing? Once I figure out how to parse these escape sequences, it will become possible to automatically reformat the text in all languages, instead of manually going through the languages I know and fixing them.

It isn't as simple as just cutting out all the characters that need to be escaped in JSON, since a lot of the escape sequences have regular characters like ( and Ā촃 in them.

matthew-e-brown commented 4 years ago

Was watching some Breath of the Wild speedruns and remembered randomly that apparently Nintendo has used .msbt files for their dialogue for a long time. So, with a little bit more specific Googling than I had been doing before...

I found this page! It mentions "Text Commands" with some details about a few of them. This should give me enough starter information to start using footage of Blathers talking to piece together what each 00 0E command does.

Thanks, smallant1!

matthew-e-brown commented 4 years ago

Lots of development today. All that's left to do now is figure out what each one of these commands does. These discoveries will be documented in commits to Notes.md.

edcrfv0 commented 4 years ago

Here are some more things I've found while researching Pokémon HOME's msbt files (they might differ from AC's, but I guess most of it is related).

The main structure is 4 shorts: 0E 00, Command type, Command variant, number of subsequent bytes.
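
To make that layout concrete, here is a minimal Python sketch of reading one tag header. It assumes the four values are little-endian 16-bit integers and that the byte count covers only the payload that follows the header; `parse_tag` is a hypothetical helper name, not something from this repository.

```python
import struct

TAG_MARKER = 0x000E  # appears in the file as the bytes 0E 00


def parse_tag(data: bytes, offset: int = 0):
    """Read one tag at `offset`.

    Returns (tag_type, variant, payload, next_offset).
    Assumption: four little-endian 16-bit header fields.
    """
    marker, tag_type, variant, size = struct.unpack_from("<4H", data, offset)
    if marker != TAG_MARKER:
        raise ValueError(f"no 0E 00 marker at offset {offset}")
    start = offset + 8  # the header is 4 shorts = 8 bytes
    payload = data[start:start + size]
    if len(payload) != size:
        raise ValueError("payload is shorter than the declared byte count")
    return tag_type, variant, payload, start + size


if __name__ == "__main__":
    # Start of the first ruby example quoted later in this thread.
    sample = bytes.fromhex("0e0000000000080002000400613044300f5c")
    tag_type, variant, payload, next_offset = parse_tag(sample)
    print(tag_type, variant, payload.hex(), next_offset)
```

Whatever follows `next_offset` is ordinary text again (or the next tag), so a caller can loop on this function to walk a whole message.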

As you wrote, Command type 00 00 are text modifiers. Command variant 03 00 is color change, followed by 04 00 and 4 bytes (RGBA). Command variant 02 00 seems to be a font change. I have one occurrence of this in a Chinese text just before the word "Nintendo" (in Latin characters).
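
As a small illustration of the color payload, the Python sketch below unpacks four bytes as R, G, B, A. This assumes the RGBA layout holds as described for Pokémon HOME; other games may use a different payload. `decode_rgba` and `to_css_hex` are hypothetical helper names, and the sample bytes are made up.

```python
def decode_rgba(payload: bytes) -> tuple:
    """Split a 4-byte color payload into (r, g, b, a) integers."""
    if len(payload) != 4:
        raise ValueError(f"expected 4 payload bytes, got {len(payload)}")
    r, g, b, a = payload  # iterating over bytes yields ints in 0..255
    return r, g, b, a


def to_css_hex(rgba: tuple) -> str:
    """Format (r, g, b, a) as a #rrggbbaa string for easy eyeballing."""
    return "#" + "".join(f"{channel:02x}" for channel in rgba)


if __name__ == "__main__":
    # Made-up payload: opaque orange.
    rgba = decode_rgba(bytes.fromhex("ff8000ff"))
    print(rgba, to_css_hex(rgba))
```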

Command types 01 00 and 02 00 are variables. I guess the variant tells which one. It's followed by 02 00 and 2 bytes. In Pokémon HOME, for command type 01 00, the first byte is 00 or 01 and the second one CD. For command type 02 00, first byte is in 00, 01, 02, 03, 04 while the second one is in 00, 01, 02, 03, 05, CD.

Command types 13 00 to 19 00 seem to be language dependent: 13 00 for English, 14 00 for French, 15 00 for Italian, 16 00 for German, 17 00 for Spanish and 19 00 for Korean. When the command variant is 01 00, it is a singular/plural switch. The variant code is followed by the usual subsequent bytes count, then 00 CD, then 2 UTF-16 strings, each of them starting with a byte count.

Command types 32 00 and 33 00 are special characters. The command variant being some index. Followed by 00 00 as subsequent bytes count. In the ATR1 table, I could find these character codes:

0E00 3200 0200 0000 [Character1:male ]
0E00 3200 0300 0000 [Character1:female ]
0E00 3300 0200 0000 [Character2:L_DoubleQuot. ]
0E00 3300 0300 0000 [Character2:R_DoubleQuot. ]
0E00 3300 0600 0000 [Character2:StraightSingleQuot. ]
0E00 3300 0700 0000 [Character2:StraightDoubleQuot. ]
0E00 3300 0800 0000 [Character2:HalfSpace ]
0E00 3300 0900 0000 [Character2:QuarterSpace ]
0E00 3300 1200 0000 [Character2:null ]

shane-tw commented 3 years ago

I think ms_tags.h (included in some titles, e.g. 3ds system ones) tells you what the bytes mean - although they can vary

#define MSTAGGROUP_System 0x0
#define MSTAGGROUP_CTR_built_in 0x1

// tags in group "System"
#define MSTAG_System_Ruby 0x0
#define MSTAG_System_Font 0x1
#define MSTAG_System_Size 0x2
#define MSTAG_System_Color 0x3
#define MSTAG_System_PageBreak 0x4

The first byte after 0x0e is the tag group byte, e.g. 0x00 which is System. The next byte after that is the command itself. Font color (0x03) and font size (0x02) match up with what @edcrfv0 said above.
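
For anyone scripting this, the defines above could be turned into a lookup table roughly like the Python sketch below. The numeric values come straight from the header excerpt; the `tag_name` helper and its fallback format are my own assumptions, and the values may differ between titles.

```python
# Values copied from the ms_tags.h excerpt above; they can vary per title.
TAG_GROUPS = {0x0: "System", 0x1: "CTR_built_in"}
SYSTEM_TAGS = {
    0x0: "Ruby",
    0x1: "Font",
    0x2: "Size",
    0x3: "Color",
    0x4: "PageBreak",
}


def tag_name(group: int, tag: int) -> str:
    """Return a readable name for a (group, tag) pair, or hex if unknown."""
    if TAG_GROUPS.get(group) == "System" and tag in SYSTEM_TAGS:
        return f"System::{SYSTEM_TAGS[tag]}"
    # Unknown groups or tags fall back to their raw hex values.
    return f"{group:02X}:{tag:02X}"


if __name__ == "__main__":
    print(tag_name(0x0, 0x3))
    print(tag_name(0x32, 0x2))
```

Falling back to raw hex keeps unknown commands visible instead of silently dropping them, which seems safer while the meanings are still being worked out.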

I'm fairly certain Ruby is referring to ruby text (the small furigana annotations above kanji), but I've yet to see it being used in the wild so I'm not sure what bytes it expects. Best bet is to source a ton of Japanese Nintendo games using MSBT and hope one of them uses it. I've yet to see PageBreak be used.

matthew-e-brown commented 3 years ago

Thanks @edcrfv0 and @shanepm. When I eventually get around to working on AC stuff again I'll be sure to take a deeper look at this and add it to Notes.md. What you've found looks very promising and interesting.

I'm very happy to see a repository of mine actually see some use! ...I've just been a bit busy working on other projects and school. It doesn't go unnoticed, though. 😁

shane-tw commented 3 years ago

Some more notes: It definitely seems like 0x00 means ruby, and 0x04 looks like page break (0e00000004000000). A few examples of 0x00 are below (I added the hyphens/2d00 to make them easier to separate). Note: these contain more bytes than just the commands; I haven't trimmed the extra.

0e0000000000080002000400613044300f5c553044306a304c308930003068306b304b304f30
2d002d002d002d002d002d00
0e00000000000600020002005b30cc8073308c3001ff
2d002d002d002d002d002d00
0e00000000000a0004000600823088304630216ad8696e30
2d002d002d002d002d002d00
0e00000000000a00020006004b3089306030534f01ff
2d002d002d002d002d002d00
0e000000000008000200040044308d3072826e30d230ec3001ff
2d002d002d002d002d002d00
0e00000000000a00040006004b306e304630ef53fd806a30
2d002d002d002d002d002d00
0e00000000000800020004004b304e3050968a30
2d002d002d002d002d002d00
0e00000000000a000400060057305c309330ea8136716a30
2d002d002d002d002d002d00
0e00000000001600080012005b3044305f3044304b3093304d30873046301f754b61b07483586e30823068306730
2d002d002d002d002d002d00
0e00000000000a000200060059304c305f30ff599230
2d002d002d002d002d002d00
0e00000000000c00040008004b304f306b309330ba788d8a57305f3044306e30673059304c30
2d002d002d002d002d002d00
0e00000000000a0002000600573085309330ec6563306630
2d002d002d002d002d002d00
0e00000000000a00040006007e30753086301f77ac516a3093306730593088306d30
2d002d002d002d002d002d00
0e000000000008000200040055308030d25b44306e306f306130873063306830

Decoded from hex as UTF-16-LE text, that's:

......ちい小さいながら とにかく------......せ背びれ!------...
..もよう模様の------...
..からだ体!------......いろ色のヒレ!------...
..かのう可能な------......かぎ限り------...
..しぜん自然な------......せいたいかんきょう生態環境のもとで------...
..すがた姿を------......かくにん確認したいのですが------...
..しゅん旬って------...
..まふゆ真冬なんですよね------......さむ寒いのはちょっと

Paste that into Furigana Maker and note how the kana on the left of each kanji matches the reading it places above the kanji.

Looking at those bytes I see this command syntax:

  1. number of remaining bytes relating to this command (the two length fields plus the kana)
  2. kanji_len_u8
  3. kana_len_u8
  4. Kana
  5. Kanji

matthew-e-brown commented 3 years ago

Looks like you're right. Great find! Although, it also looks like that Furigana Converter is outputting some funky things. I think that's just an error with the site, though—it's repeating things.

If you notice, ちい小さいながら should be 小さいながら — いろ色のひれ! should be 色のひれ! — さむ寒いのはちょっと should be 寒いのはちょっと — etc... Handy I've been learning Japanese...

shane-tw commented 3 years ago

That's because I sent ちい小さいながら to the site - the raw text from the command plus some extra, so the site was right. The very first command is this:

0e00 - Marker
0000 - System
0000 - Ruby
0800 - 8 bytes from 0200 to 4430 below
0200 - kanji is 2 bytes
0400 - kana is 4 bytes
61304430 - ちい - kana
0f5c - 小 - kanji

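
Here is a rough Python sketch of that breakdown, in case it helps anyone parsing these. It assumes all counts are little-endian 16-bit values measured in bytes and that the text is UTF-16-LE; `parse_ruby` is a made-up helper name for illustration.

```python
import struct


def parse_ruby(data: bytes, offset: int = 0):
    """Parse one ruby tag and the kanji it annotates.

    Returns (kanji, kana, next_offset), where next_offset points at the
    text that follows the annotated kanji.
    """
    marker, group, tag, size = struct.unpack_from("<4H", data, offset)
    if (marker, group, tag) != (0x000E, 0, 0):
        raise ValueError("not a ruby tag (expected 0e00 0000 0000)")
    payload_start = offset + 8
    kanji_len, kana_len = struct.unpack_from("<2H", data, payload_start)
    kana_start = payload_start + 4
    kana_end = kana_start + kana_len
    # The declared size covers the two length fields plus the kana only.
    if kana_end != payload_start + size:
        raise ValueError("length fields do not add up to the payload size")
    kana = data[kana_start:kana_end].decode("utf-16-le")
    # The kanji is ordinary message text sitting right after the tag.
    kanji_end = kana_end + kanji_len
    kanji = data[kana_end:kanji_end].decode("utf-16-le")
    return kanji, kana, kanji_end


if __name__ == "__main__":
    raw = bytes.fromhex(
        "0e0000000000080002000400613044300f5c"
        "553044306a304c308930003068306b304b304f30"
    )
    kanji, kana, rest = parse_ruby(raw)
    print(kanji, kana, raw[rest:].decode("utf-16-le"))
```

Dropping the kana and keeping the kanji plus the remaining text gives the plain sentence, which is why the raw dump looked like it had the reading duplicated.
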
matthew-e-brown commented 3 years ago

Oh, that's very interesting... You'd think they'd just have to store ちい and 小 together, since the rest is just displayed as written...

You said you'd yet to see one of these appear "in the wild" in your original comment: is this example pulled from ACNH? If not, I'll see if I can find one in the Japanese MSBT's from it.

shane-tw commented 3 years ago

When I originally commented I'd only looked at 3DS system MSBTs. Those later examples are from ACNH 🙂

kenshiisod commented 3 years ago

Hi, it's me again (alt account). The wiki here was updated some more.

It turns out the control sequences all follow the same format, so you can convert them to a readable format. Examples: SP_owl_Comment_Insect.po, SP_ItemName_30_Insect.po, SYS_Get_Fish.po

Normally Nintendo formats them like [System::Color name="red" ] or in bytes e.g. [00:03 bytes="0100" ] or similar, so maybe reading those files will help understand what the sequences should be named / what they do?
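
A minimal Python sketch of that kind of conversion is below. It is not the tool that produced those .po files; it just assumes every tag is `0E 00`, group, tag, byte count (all little-endian 16-bit) followed by that many payload bytes, and prints unknown tags in the raw `[group:tag bytes="..."]` style. `to_readable` is a hypothetical name and the sample message is made up.

```python
import struct


def to_readable(data: bytes) -> str:
    """Convert a UTF-16-LE message with 0E 00 tags into bracketed text."""
    parts = []
    text_start = 0
    pos = 0
    while pos + 2 <= len(data):
        if data[pos:pos + 2] == b"\x0e\x00":
            # Flush the plain text collected before this tag.
            parts.append(data[text_start:pos].decode("utf-16-le"))
            group, tag, size = struct.unpack_from("<3H", data, pos + 2)
            payload = data[pos + 8:pos + 8 + size]
            parts.append(f'[{group:02X}:{tag:02X} bytes="{payload.hex()}"]')
            pos += 8 + size  # skip header and payload
            text_start = pos
        else:
            pos += 2  # step over one UTF-16 code unit
    parts.append(data[text_start:].decode("utf-16-le"))
    return "".join(parts)


if __name__ == "__main__":
    # Made-up message: text, a group 00 / tag 03 command, more text.
    message = (
        "Goldfish ".encode("utf-16-le")
        + bytes.fromhex("0e00000003000200ffff")
        + "are so cute".encode("utf-16-le")
    )
    print(to_readable(message))
```

Decoding text in runs rather than one code unit at a time keeps surrogate pairs intact, and skipping the payload by its declared size avoids misreading payload bytes as a new tag.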

paulzhn commented 1 year ago

Thanks, your issue helps me a lot!