unsound / hfsexplorer

HFSExplorer - An application for accessing HFS/HFS+/HFSX file systems. License: GPLv3+
https://www.catacombae.org/hfsexplorer/

Extract data problem when using MacJapanese encoding due to out-of-range characters included in filenames #26

Closed gingerbeardman closed 3 years ago

gingerbeardman commented 3 years ago

Reproduce:

  1. load this ISO https://archive.org/details/cd-rom-maclife-31
  2. set encoding to MacJapanese
  3. right click root node
  4. choose "Export data"

Screenshot

Screen shot 2021-10-26 at 13 49 53

Note:

Another ISO that has the same problem

An example that can be exported OK in MacJapanese

Aside:

gingerbeardman commented 3 years ago

MACLIFE31

The error above is when it gets to this folder:

Screen shot 2021-10-26 at 14 07 48

Here's the same error when I drill into the folder manually using the directory tree:

Screen shot 2021-10-26 at 14 07 19

In classic Mac OS System 7 that folder contains one file:

Screen shot 2021-10-26 at 14 09 40

We can see some interesting characters if I copy and paste the filename:

This problem, "byte 16" 0x7f57, refers to the 0x7F byte that comes before the 0x57 (W).

The bytes as seen in a hex editor:

Screen shot 2021-10-26 at 14 40 17

Of course there are two other similar instances in this one filename, many others elsewhere on this disc, and more across other discs that I have.

0x7F is "non-printable control character" DEL

from ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/JAPANESE.TXT

# Control character mappings are not shown in this table, following
# the conventions of the standard UTC mapping tables. However, the
# Mac OS Japanese encoding uses the standard control characters at
# 0x00-0x1F and 0x7F.
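
To locate such bytes in a raw filename, a scan along the following lines may help. This is a minimal sketch: the function name and the sample bytes are hypothetical, and it relies on the fact that Shift-JIS style trail bytes fall in 0x40-0x7E and 0x80-0xFC, so a control byte is never the second half of a valid two-byte character.

```python
def find_control_bytes(name: bytes) -> list[tuple[int, int]]:
    """Return (offset, byte) pairs for control bytes 0x00-0x1F and 0x7F."""
    return [(i, b) for i, b in enumerate(name) if b < 0x20 or b == 0x7F]


# Hypothetical raw filename: "ima", a stray DEL, then "ge".
print(find_control_bytes(b"ima\x7fge"))  # prints [(3, 127)]
```

The offsets are zero-based, so they may differ by one from the position a decoder reports.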

gingerbeardman commented 3 years ago

So I think the DEL 0x7f character should be ignored.

My workaround in my text files is to strip the characters: hls -1aR | perl -ne 's/[\x7f]//g;print $_;' > file.txt

Which gives me valid MacJapanese text afterwards. Woo! But only fixes some files. Boo!
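
For reference, the same stripping step can be sketched in Python. The function name is hypothetical; the important part is that it works on raw bytes, so nothing is decoded before the DEL bytes are removed.

```python
def strip_del(data: bytes) -> bytes:
    """Remove every 0x7F (DEL) byte, like the Perl substitution above."""
    return data.replace(b"\x7f", b"")


# Hypothetical listing line with a stray DEL inside the name.
cleaned = strip_del(b"ima\x7fge\n")
print(cleaned)  # prints b'image\n'
```

To use it on a listing, the bytes would be read from `sys.stdin.buffer` and written to `sys.stdout.buffer` rather than through the text-mode streams.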

gingerbeardman commented 3 years ago

Decided to blast through a bunch more ISOs with the help of some automation.

Bytes starting:

All generate errors with MacJapanese selected.

If you're interested I can share more files, info, etc.

gingerbeardman commented 3 years ago

Interestingly if I take the problematic text and convert it from MacJapanese to MacJapanese using macOS frameworks, the resulting text is free from strange characters.

I used https://github.com/andreberg/encodings-tool to do this.

So, it seems the macOS API for working with text does more than it might appear.

unsound commented 3 years ago

Thanks for debugging the issue this far, I'll check the provided example as soon as I have time.

unsound commented 3 years ago

@gingerbeardman I pushed some changes to the proposed branch that should be helpful when decoding these out-of-place filenames. I don't quite understand how these characters ended up in MacJapanese-encoded filenames, or any filenames for that matter (as they are control characters / characters outside the character set). However as a solution we now fall back to MacRoman mapping for those characters that don't have a mapping in MacJapanese.
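
The fallback idea can be illustrated roughly as follows. This is a sketch under stated assumptions, not HFSExplorer's actual code (which is Java): Python has no MacJapanese codec, so `shift_jis` stands in for it here, and the byte-by-byte fallback granularity is a guess.

```python
def decode_with_fallback(raw: bytes, primary: str = "shift_jis",
                         fallback: str = "mac_roman") -> str:
    """Decode with `primary`, using `fallback` for bytes it cannot map."""
    out = []
    i = 0
    while i < len(raw):
        b = raw[i]
        # Shift-JIS style lead bytes introduce a two-byte sequence.
        is_lead = 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC
        if is_lead and i + 1 < len(raw):
            try:
                out.append(raw[i:i + 2].decode(primary))
                i += 2
                continue
            except UnicodeDecodeError:
                pass  # no mapping for this pair; treat the byte on its own
        try:
            out.append(raw[i:i + 1].decode(primary))
        except UnicodeDecodeError:
            # The primary table has nothing here: use the fallback charset.
            out.append(raw[i:i + 1].decode(fallback))
        i += 1
    return "".join(out)


print(decode_with_fallback(b"\x82\xa0"))  # a mapped two-byte character
print(decode_with_fallback(b"\x85\x59"))  # unmapped pair, decoded per byte
```

Falling back one byte at a time keeps the decoder in sync with whatever follows an unmapped lead byte, at the cost of possibly splitting a sequence that a fuller table would have mapped.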

gingerbeardman commented 3 years ago

Just checked out the proposed branch and it works wonderfully, thank you. I can only assume that's what Mac OS does when it encounters funkiness like this?

Regarding how these characters get into the filenames: Japanese is typed in a strange way, using sequences of letter and modifier key presses to produce specific characters from each of its three scripts. There were also third-party input methods available that may have allowed typing invalid characters like this (DEL in particular is easy to type, but should not be accepted/allowed/processed as text). I've managed to type some funky stuff but have not reproduced one of these faulty filenames.

Next for me is to look at directory listing export. My goal from all of this!

gingerbeardman commented 3 years ago

Just to expand on this:

DEL in particular is easy to type, but should not be accepted/allowed/processed as text

This is very easy to reproduce in a filename:

  1. no Japanese input method required
  2. enter file renaming mode (press enter)
  3. press forward delete key
  4. use the cursor keys or backspace to go through the text and you'll feel the hidden character
  5. copy and paste the filename and inspect the raw data for 0x7F using Hex Editor (below)

Screen shot 2021-10-29 at 16 42 41

Screen recording:

  1. i step through with regular rhythm
  2. i step through the word image to show there are only 5 characters
  3. i press forward delete when cursor is next to the m character to insert a DEL
  4. i backspace through the word, note the delay when the invisible DEL character is deleted

https://user-images.githubusercontent.com/49612/139464296-8c1eb11f-4d54-4b1d-b808-a88e32b85ce7.mov

gingerbeardman commented 3 years ago

it looks like there are some Shift-JIS characters in these "MacJapanese" discs too 😩

From ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/JAPANESE.TXT

  • A user-defined range using Shift-JIS code points 0xF040-0xFCFC, providing 2444 code points.

Currently these are incorrectly decoded as MacRoman.

unsound commented 3 years ago

This is an omission, I guess I assumed the table would cover all the mapped code points but you have to read the comments carefully as well:

# 1. Mapping the user-defined range
#
#    The table below covers only the standard Mac OS Japanese encoding.
#    It does not include mappings for the Shift-JIS user-defined range;
#    this is mapped onto Unicodes 0xE000-0xE98B as follows:
#      0xF040-0xF07E -> 0xE000-0xE03E
#      0xF080-0xF0FC -> 0xE03F-0xE0BB
#      0xF140-0xF17E -> 0xE0BC-0xE0FA
#      0xF180-0xF1FC -> 0xE0FB-0xE177
#      ...
#      0xFC40-0xFC7E -> 0xE8D0-0xE90E
#      0xFC80-0xFCFC -> 0xE90F-0xE98B

unsound commented 3 years ago

@gingerbeardman I pushed some fixes earlier today, please check if they improve things for your images.
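
The user-defined range mapping quoted from JAPANESE.TXT above can be expressed as a small formula. The sketch below assumes the rows elided by "..." follow the same contiguous pattern as the listed ones, which is consistent with the stated end points and the 2444 code point total (13 lead bytes times 188 trail bytes). The function name is hypothetical.

```python
def map_user_defined(lead: int, trail: int) -> str:
    """Map a two-byte code in 0xF040-0xFCFC to the private use area."""
    if not 0xF0 <= lead <= 0xFC:
        raise ValueError(f"lead byte 0x{lead:02X} is outside 0xF0-0xFC")
    if 0x40 <= trail <= 0x7E:
        offset = trail - 0x40   # first 63 trail bytes
    elif 0x80 <= trail <= 0xFC:
        offset = trail - 0x41   # next 125 trail bytes; 0x7F is skipped
    else:
        raise ValueError(f"trail byte 0x{trail:02X} is not a valid trail byte")
    # Each lead byte covers 63 + 125 = 188 consecutive code points.
    return chr(0xE000 + (lead - 0xF0) * 188 + offset)


print(f"U+{ord(map_user_defined(0xF1, 0x80)):04X}")  # prints U+E0FB
```

The listed rows work as a spot check: 0xF17E gives U+E0FA and 0xFC40 gives U+E8D0.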

gingerbeardman commented 3 years ago

Thanks, that's much improved!

A few oddities:

The repeated digits strike me as odd.

These point to the symbols/dingbats block of Shift-JIS, but there are no characters at these specific locations. No idea what should be done with those.

unsound commented 3 years ago

Java's own Shift-JIS implementation fails to decode those, but iconv has possible matches:

$ printf "\x85\x59\n\x86\x6d\n\x87\x70\n" | iconv -f 'SHIFT_JISX0213' -t 'UTF-8'
Ã
ɚ̀
㎝

I just don't know where those matches come from...

Edit: Actually that's specifically for this X0213 version of Shift-JIS, which is an extension of regular Shift-JIS. The extension was introduced in 2004 so I doubt that HFS filenames are supposed to be interpreted in that charset. Even using MacJapanese through CoreFoundation doesn't find any match for these sequences.
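
The same comparison can be run against several decoders at once. The sketch below uses codec names from Python's standard library; none of them is MacJapanese, so the results only show which Shift-JIS variants have a mapping for each sequence. The helper name is hypothetical.

```python
def try_decode(raw: bytes, codec: str):
    """Return the decoded text, or None if the codec has no mapping."""
    try:
        return raw.decode(codec)
    except UnicodeDecodeError:
        return None


for seq in (b"\x85\x59", b"\x86\x6d", b"\x87\x70"):
    for codec in ("shift_jis", "cp932", "shift_jisx0213"):
        text = try_decode(seq, codec)
        # List code points rather than glyphs so nothing depends on fonts.
        points = None if text is None else [f"U+{ord(ch):04X}" for ch in text]
        print(seq.hex(), codec, points)
```

Printing the code points makes it easier to tell a genuine mapping from a glyph that merely renders as a placeholder.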

gingerbeardman commented 3 years ago

It's a mystery.

Classic Macintosh System 7 displays them as unknown character glyphs.

...I'm wondering if the repeated digits could be mastering errors? So they'd be more like:

Just a thought.

unsound commented 3 years ago

Well... I can't imagine how that kind of error would have been introduced because it effectively would have mirrored the least significant 4-bit part of the previous byte into the most significant 4-bit part of the next byte, effectively shifting the high 4 bits to the low 4 bits of that byte. It just doesn't follow any logical error pattern that I've ever seen.
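
The "repeated digit" pattern described here can be checked mechanically: in each of the three sequences from the iconv test, the low 4 bits of the first byte equal the high 4 bits of the second. A minimal sketch (the function name is hypothetical):

```python
def low_nibble_repeated(first: int, second: int) -> bool:
    """True if the low nibble of `first` equals the high nibble of `second`."""
    return (first & 0x0F) == (second >> 4)


for first, second in ((0x85, 0x59), (0x86, 0x6D), (0x87, 0x70)):
    print(f"{first:02X} {second:02X}", low_nibble_repeated(first, second))
```

This only confirms that the pattern is present in all three pairs. It says nothing about whether a mastering error produced it.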

gingerbeardman commented 3 years ago

Interesting update: I found that Tcl's encoding functions include MacJapanese (as "macJapan") and it deals with all of this without a single issue.

$ hls -1ablRN | ./convert2unicode.tcl 

convert2unicode.tcl: https://gist.github.com/gingerbeardman/4a3b66236e018b72b32ca17953474e12

So I thought you might be interested in how Tcl does things.

gingerbeardman commented 3 years ago

Closer inspection seems to show Tcl doesn't get it all correct, it just fails silently. A future version of Tcl will show errors, see here.

unsound commented 3 years ago

Alright then I think we have exhausted all possibilities for now to try and decode those sequences. I'm closing this but feel free to reopen if you think there's anything that can be improved here.

gingerbeardman commented 3 years ago

That's fine by me.

All that remains is for me to say thank you for your attention and assistance on this.

gingerbeardman commented 3 years ago

Tcl uses the following Apple technique https://opensource.apple.com/source/tcl/tcl-10/tcl/tools/encoding/macJapan.txt

It gets it all correct as long as you operate in binary mode (which I was neglecting to do yesterday)

unsound commented 3 years ago

@gingerbeardman So Tcl does find matches for 0x8559, 0x866d and 0x8770? I don't see any of those in the macJapan.txt table that you linked above. What Unicode code points does it map those sequences to? (Decoding to UTF-32 should yield the exact Unicode code points.)

gingerbeardman commented 3 years ago

Good question! Of course I should have made a note which discs had those bytes. It took a while to find them again.

By "correct" I meant "it processes non-Japanese characters without error (and probably incorrectly)" but all Japanese characters appear as expected. Apologies.

Tcl just added a new encoding for me (based on the Apple/Unicode file JAPANESE.TXT which is subtly different to the existing macJapan.txt) but I'm yet to test this. I will when I figure out where/how to get the current Tcl source commits.

Looking into these byte combinations, two out of three of them (0x866d, 0x8770) seem to be pairs of non-Japanese characters being misinterpreted as Japanese multi-byte combinations. Not sure what 0x8559 is supposed to be.

The existing macJapan Tcl encoding deals with them as follows:

0x8559

0x866d

0x8770

Supporting Files

  1. full RAW bytes output of hfsutils' hls command
  2. trimmed RAW bytes output of hfsutils' hls command
  3. UTF-8 generated by Tcl encoding encodefrom
  4. UTF-32 generated by iconv using above

Download: files.zip

gingerbeardman commented 3 years ago

Using https://github.com/gingerbeardman/encodings-tool I generated a list of what 0x8559 converts to in all supported encodings on modern macOS:

BYTES=`cat 8559.bin`; encodings $BYTES | grep -v "null" > file.txt

encodings-not-null-0x8559.txt

Not much help, perhaps, but worth a try.

gingerbeardman commented 2 years ago

Just seen this screenshot in the wild on a Japanese for-sale site. Thought you might like to see it!?

Screenshot of a screenshot: image