Extract data problem when using MacJapanese encoding due to out-of-range characters included in filenames

gingerbeardman commented 3 years ago

Reproduce:

load this ISO https://archive.org/details/cd-rom-maclife-31
set encoding to MacJapanese
right click root node
choose "Export data"

Screenshot

Note:

if the encoding is set to MacRoman the export works, but the filenames are incomprehensible
classic Mac OS System 7 does not complain about these discs/filenames

Another ISO that has the same problem

https://archive.org/details/nikkei-mac-cd-rom-1994-2-17

An example that can be exported OK in MacJapanese

https://archive.org/details/cd-rom-maclife-34

Aside:

could this disc be using something not quite MacJapanese? (I read that there are several variations but maybe it's something more than that)
I believe this is the same reason I can't get good text dumps of the directory listings from hfsutils

gingerbeardman commented 3 years ago

MACLIFE31

The error above is when it gets to this folder:

Here's the same error when I drill into the folder manually using the directory tree:

In classic Mac OS System 7 that folder contains one file:

We can see some interesting characters if I copy and paste the filename:

Micro Dry™ 3.0a W/O Prf.sea
filename.raw.txt

The problem reported at "byte 16", 0x7f57, refers to the 7F character before the 57 (W).

The bytes as seen in a hex editor:

Of course there are two other similar instances in this one filename, many others elsewhere on this disc, and more across other discs that I have.

0x7F is "non-printable control character" DEL

from ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/JAPANESE.TXT

# Control character mappings are not shown in this table, following
# the conventions of the standard UTC mapping tables. However, the
# Mac OS Japanese encoding uses the standard control characters at
# 0x00-0x1F and 0x7F.

gingerbeardman commented 3 years ago

So I think the DEL 0x7f character should be ignored.

My workaround in my text files is to strip the characters:

$ hls -1aR | perl -ne 's/[\x7f]//g;print $_;' > file.txt

Which gives me valid MacJapanese text afterwards. Woo! But only fixes some files. Boo!
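For reference, the same stripping step can be sketched in Python, working on the raw bytes before any decoding. This is only an illustration: `strip_del` is a hypothetical helper name and the sample bytes are made up, not taken from the disc. Since 0x7F is not a valid trail byte of a double-byte character (trail bytes run 0x40-0x7E and 0x80-0xFC), dropping it should not split a valid double-byte character.

```python
DEL = b"\x7f"  # ASCII DEL, the control character found in the filenames


def strip_del(raw: bytes) -> bytes:
    # Remove every 0x7F byte, leaving all other bytes untouched.
    return raw.replace(DEL, b"")


# Made-up sample: a DEL byte sitting between "Dry" and " W".
sample = b"Dry\x7f W"
print(strip_del(sample))  # b'Dry W'
```

As noted above, this only helps with the DEL case; other unmapped bytes need separate handling.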

gingerbeardman commented 3 years ago

Decided to blast through a bunch more ISOs with the help of some automation.

Byte sequences starting with:

0x00
0x7f
0x85
0x87
0x8e
0x9f
0xf0
0xfb

All generate errors with MacJapanese selected.

If you're interested I can share more files, info, etc.

gingerbeardman commented 3 years ago

Interestingly if I take the problematic text and convert it from MacJapanese to MacJapanese using macOS frameworks, the resulting text is free from strange characters.

from: Micro Dry™ 3.0a W/O Prf.sea
to: Micro Dry™ 3.0a W/O Prf.sea

I used https://github.com/andreberg/encodings-tool to do this.

So, it seems the macOS API for working with text does more than it might appear.

unsound commented 3 years ago

Thanks for debugging the issue this far, I'll check the provided example as soon as I have time.

unsound commented 3 years ago

@gingerbeardman I pushed some changes to the proposed branch that should be helpful when decoding these out-of-place filenames. I don't quite understand how these characters ended up in MacJapanese-encoded filenames, or any filenames for that matter (as they are control characters / characters outside the character set). However as a solution we now fall back to MacRoman mapping for those characters that don't have a mapping in MacJapanese.
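The fallback idea can be sketched roughly as follows. This is not the code on the branch: Python has no built-in MacJapanese codec, so its `shift_jis` codec stands in for the primary mapping purely as an assumption, and `decode_with_fallback` is a hypothetical name. Any byte the primary codec cannot decode, alone or as the start of a pair, is mapped through `mac_roman` instead.

```python
def decode_with_fallback(raw: bytes,
                         primary: str = "shift_jis",   # stand-in for MacJapanese (assumption)
                         fallback: str = "mac_roman") -> str:
    out = []
    i = 0
    while i < len(raw):
        # Try a single-byte character first, then a double-byte one.
        for width in (1, 2):
            chunk = raw[i:i + width]
            if len(chunk) < width:
                continue  # not enough bytes left for this width
            try:
                out.append(chunk.decode(primary))
            except UnicodeDecodeError:
                continue  # no mapping at this width
            i += width
            break
        else:
            # Neither width decoded: fall back for this one byte only.
            out.append(raw[i:i + 1].decode(fallback))
            i += 1
    return "".join(out)


# 0x93FA 0x967B is valid Shift-JIS; 0x866D has no mapping in Python's shift_jis codec.
print(decode_with_fallback(b"\x93\xfa\x96\x7b\x86\x6d"))  # 日本Üm
```

Falling back one byte at a time keeps the decoder in step with the input, so a valid double-byte character that follows an unmapped byte is still decoded normally.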

gingerbeardman commented 3 years ago

Just checked out the proposed branch and it works wonderfully, thank you. I can only assume that's what Mac OS does when it encounters funkiness like this?

Regarding how these characters get in the filenames, Japanese is typed in a strange way involving presses of sequences of letter and modifier keys to achieve specific characters from each of their three alphabets. There were also third party input methods available that may have allowed typing invalid characters like this (DEL in particular is easy to type, but should not be accepted/allowed/processed as text). I've managed to type some funky stuff but not reproduce one of these faulty filenames.

Next for me is to look at directory listing export. My goal from all of this!

gingerbeardman commented 3 years ago

Just to expand on this:

DEL in particular is easy to type, but should not be accepted/allowed/processed as text

This is very easy to reproduce in a filename:

no Japanese input method required
enter file renaming mode (press enter)
press forward delete key
use the cursor keys or backspace to go through the text and you'll feel the hidden character
copy and paste the filename and inspect the raw data for 0x7F using Hex Editor (below)

Screen recording:

I step through with regular rhythm
I step through the word image to show there are only 5 characters
I press forward delete when the cursor is next to the m character to insert a DEL
I backspace through the word, note the delay when the invisible DEL character is deleted

https://user-images.githubusercontent.com/49612/139464296-8c1eb11f-4d54-4b1d-b808-a88e32b85ce7.mov

gingerbeardman commented 3 years ago

it looks like there are some Shift-JIS characters in these "MacJapanese" discs too 😩

0xf052
0xfb43
0xfbe0

From ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/JAPANESE.TXT

A user-defined range using Shift-JIS code points 0xF040-0xFCFC, providing 2444 code points.

Currently these are incorrectly decoded as MacRoman.

unsound commented 3 years ago

This is an omission. I guess I assumed the table would cover all the mapped code points, but you have to read the comments carefully as well:

# 1. Mapping the user-defined range
#
#    The table below covers only the standard Mac OS Japanese encoding.
#    It does not include mappings for the Shift-JIS user-defined range;
#    this is mapped onto Unicodes 0xE000-0xE98B as follows:
#      0xF040-0xF07E -> 0xE000-0xE03E
#      0xF080-0xF0FC -> 0xE03F-0xE0BB
#      0xF140-0xF17E -> 0xE0BC-0xE0FA
#      0xF180-0xF1FC -> 0xE0FB-0xE177
#      ...
#      0xFC40-0xFC7E -> 0xE8D0-0xE90E
#      0xFC80-0xFCFC -> 0xE90F-0xE98B
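The arithmetic behind that comment can be sketched in Python. Each lead byte covers 63 trail bytes (0x40-0x7E) plus 125 trail bytes (0x80-0xFC), so 188 code points per lead byte, and 13 lead bytes (0xF0-0xFC) give the 2444 code points mentioned earlier. `user_defined_to_pua` is a hypothetical helper name, not part of any library.

```python
ROW_SIZE = 63 + 125  # trail bytes 0x40-0x7E plus 0x80-0xFC per lead byte


def user_defined_to_pua(code: int) -> int:
    # Map a Shift-JIS user-defined code (0xF040-0xFCFC) to U+E000-U+E98B.
    lead, trail = code >> 8, code & 0xFF
    if not 0xF0 <= lead <= 0xFC:
        raise ValueError(f"lead byte 0x{lead:02X} is outside 0xF0-0xFC")
    if 0x40 <= trail <= 0x7E:
        offset = trail - 0x40
    elif 0x80 <= trail <= 0xFC:
        offset = trail - 0x80 + 63
    else:
        raise ValueError(f"trail byte 0x{trail:02X} is not valid")
    return 0xE000 + (lead - 0xF0) * ROW_SIZE + offset


# The three sequences reported above.
for code in (0xF052, 0xFB43, 0xFBE0):
    print(f"0x{code:04X} -> U+{user_defined_to_pua(code):04X}")
# 0xF052 -> U+E012
# 0xFB43 -> U+E817
# 0xFBE0 -> U+E8B3
```

The results land in the Unicode Private Use Area, so they have no standard glyph; what they looked like depended on whatever user-defined characters were installed on the original system.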

unsound commented 3 years ago

@gingerbeardman I pushed some fixes earlier today, please check if they improve things for your images.

gingerbeardman commented 3 years ago

Thanks, that's much improved!

A few oddities:

0x8559
0x866d
0x8770

The repeated digits strike me as odd.

These point to the symbols/dingbats block of Shift-JIS, but there are no characters at these specific locations. No idea what should be done with those.

unsound commented 3 years ago

Java's own Shift-JIS implementation fails to decode those, but iconv has possible matches:

$ printf "\x85\x59\n\x86\x6d\n\x87\x70\n" | iconv -f 'SHIFT_JISX0213' -t 'UTF-8'
Ã
ɚ̀
㎝

I just don't know where those matches come from...

Edit: Actually that's specifically for this X0213 version of Shift-JIS, which is an extension of regular Shift-JIS. The extension was introduced in 2004 so I doubt that HFS filenames are supposed to be interpreted in that charset. Even using MacJapanese through CoreFoundation doesn't find any match for these sequences.

gingerbeardman commented 3 years ago

It's a mystery.

Classic Macintosh System 7 displays them as unknown character glyphs.

...I'm wondering if the repeated digits could be mastering errors? So they'd be more like:

0x859n
0x86dn
0x870n

Just a thought.

unsound commented 3 years ago

Well... I can't imagine how that kind of error would have been introduced, because it would have copied the low 4 bits of the previous byte into the high 4 bits of the next byte, pushing that byte's own high 4 bits down into its low 4 bits. It just doesn't follow any logical error pattern that I've ever seen.

gingerbeardman commented 3 years ago

Interesting update: I found that Tcl's encoding functions include MacJapanese (as "macJapan") and it deals with all of this without a single issue.

$ hls -1ablRN | ./convert2unicode.tcl

convert2unicode.tcl: https://gist.github.com/gingerbeardman/4a3b66236e018b72b32ca17953474e12

So I thought you might be interested in how Tcl does things.

gingerbeardman commented 3 years ago

Closer inspection seems to show Tcl doesn't get it all correct, it just fails silently. A future version of Tcl will show errors, see here.

unsound commented 3 years ago

Alright then I think we have exhausted all possibilities for now to try and decode those sequences. I'm closing this but feel free to reopen if you think there's anything that can be improved here.

gingerbeardman commented 3 years ago

That's fine by me.

All that remains is for me to say thank you for your attention and assistance on this.

gingerbeardman commented 3 years ago

Tcl uses the following Apple technique https://opensource.apple.com/source/tcl/tcl-10/tcl/tools/encoding/macJapan.txt

It gets it all correct as long as you operate in binary mode (which I was neglecting to do yesterday)

unsound commented 3 years ago

@gingerbeardman So Tcl does find matches for 0x8559, 0x866d and 0x8770? I don't see any of those in the macJapan.txt table that you linked above. What Unicode code points does it map those sequences to? (Decoding to UTF-32 should yield the exact Unicode code points.)
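One way to read off the code points of already-decoded text, sketched in Python. The helper names `code_points` and `utf32_hex` are made up for this example, and the sample character ⑫ (U+246B) is just a convenient test value.

```python
def code_points(text: str) -> list:
    # One "U+XXXX" label per character in the decoded string.
    return [f"U+{ord(ch):04X}" for ch in text]


def utf32_hex(text: str) -> str:
    # Big-endian UTF-32 without a BOM, shown as one 4-byte group per character.
    return text.encode("utf-32-be").hex(" ", 4)


sample = "\u246b"  # circled number twelve
print(code_points(sample))  # ['U+246B']
print(utf32_hex(sample))    # 0000246b
```

Because UTF-32 uses exactly four bytes per code point, each group in the hex dump is the code point itself.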

gingerbeardman commented 3 years ago

Good question! Of course I should have made a note of which discs had those bytes. It took a while to find them again.

By "correct" I meant "it processes non-Japanese characters without error (and probably incorrectly)" but all Japanese characters appear as expected. Apologies.

Tcl just added a new encoding for me (based on the Apple/Unicode file JAPANESE.TXT which is subtly different to the existing macJapan.txt) but I'm yet to test this. I will when I figure out where/how to get the current Tcl source commits.

Looking into these byte combinations, two out of three of them (0x866d, 0x8770) seem to be pairs of non-Japanese characters being misinterpreted as Japanese multi-byte combinations. Not sure what 0x8559 is supposed to be.

The existing macJapan Tcl encoding deals with them as follows:

0x8559

disc: MACLIFE04.ISO
filename: 41 半角⑧⑫\0x8559⑬→ 全角用
problem character: following ⑫
Tcl macJapan encoding: 0xC285 (0000 246B 0000 0085) incorrect
correct encoding scheme: ?
screenshot: from Japanese Macintosh System 7.5, shown as "unknown glyph rectangle"

0x866d

disc: HYPERLIB-1994-1-CD2.ISO
filename: Ümlåût Õmêléttè
problem character: Ü
Tcl macJapan encoding: 0xC286 (0000 FEFF 0000 0086) incorrect
correct encoding scheme: macRoman = Üm
screenshot: from English Macintosh System 7.5

0x8770

disc: MACBIN29.ISO
filename: Introducción rápida
problem character: á
Tcl macJapan encoding: 0xC287 (0000 0087) incorrect
correct encoding scheme: macRoman = áp
screenshot: from English Macintosh System 7.5
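A small Python sketch of why 0x866d and 0x8770 get swallowed as pairs: the first byte of each falls in the double-byte lead range (0x81-0x9F or 0xE0-0xFC), so a Japanese decoder treats the following ASCII letter as a trail byte. The MacRoman reading comes from Python's built-in `mac_roman` codec; `is_lead_byte` and `describe` are hypothetical helper names.

```python
def is_lead_byte(b: int) -> bool:
    # Lead-byte ranges for double-byte characters in Shift-JIS style encodings.
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def describe(raw: bytes) -> str:
    kind = "lead byte" if is_lead_byte(raw[0]) else "single byte"
    return f"0x{raw.hex()}: first byte is a {kind}; MacRoman reading = {raw.decode('mac_roman')}"


for raw in (b"\x86\x6d", b"\x87\x70"):
    print(describe(raw))
# 0x866d: first byte is a lead byte; MacRoman reading = Üm
# 0x8770: first byte is a lead byte; MacRoman reading = áp
```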

Supporting Files

full RAW bytes output of hfsutils' hls command
trimmed RAW bytes output of hfsutils' hls command
UTF-8 generated by Tcl's encoding convertfrom
UTF-32 generated by iconv using above

Download: files.zip

gingerbeardman commented 3 years ago

Using https://github.com/gingerbeardman/encodings-tool I generated a list of what 0x8559 converts to in all supported encodings on modern macOS:

BYTES=`cat 8559.bin`; encodings $BYTES | grep -v "null" > file.txt

encodings-not-null-0x8559.txt

Not much help, perhaps, but worth a try.

gingerbeardman commented 2 years ago

Just seen this screenshot in the wild on a Japanese for sale site. Thought you might like to see it!?

Screenshot of a screenshot:

unsound / hfsexplorer

Extract data problem when using MacJapanese encoding due to out-of-range characters included in filenames #26
