Closed gingerbeardman closed 3 years ago
MACLIFE31
The error above is when it gets to this folder:
Here's the same error when I drill into the folder manually using the directory tree:
In classic Mac OS System 7 that folder contains one file:
We can see some interesting characters if I copy and paste the filename:
This problem, "byte 16" 0x7f57, refers to the 0x7F byte before the 0x57 (W).
The bytes as seen in a hex editor:
Of course there are two other similar instances in this one filename, and many others elsewhere on this disc and across other discs that I have.
0x7F is "non-printable control character" DEL
from ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/JAPANESE.TXT
# Control character mappings are not shown in this table, following
# the conventions of the standard UTC mapping tables. However, the
# Mac OS Japanese encoding uses the standard control characters at
# 0x00-0x1F and 0x7F.
So I think the DEL 0x7f character should be ignored.
My workaround in my text files is to strip the characters:
hls -1aR | perl -ne 's/[\x7f]//g;print $_;' > file.txt
Which gives me valid MacJapanese text afterwards. Woo! But only fixes some files. Boo!
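For anyone who would rather do the stripping in Python, here is a minimal sketch of the same idea. It is not what any tool in this thread does: Python's standard library has no MacJapanese codec, so `shift_jis` stands in for it, and `strip_del` is a made-up helper name. Dropping every 0x7F byte is safe for Shift-JIS-style text because trail bytes only fall in 0x40-0x7E and 0x80-0xFC, so 0x7F can never be half of a double-byte character.

```python
def strip_del(raw: bytes) -> bytes:
    """Drop every DEL (0x7F) byte from a raw filename."""
    return raw.replace(b"\x7f", b"")


# Assumption: shift_jis is only a stand-in for MacJapanese here.
clean = "テスト".encode("shift_jis")
# Simulate a stray DEL between the first and second character.
dirty = clean[:2] + b"\x7f" + clean[2:]

print(strip_del(dirty).decode("shift_jis"))
```

Like the perl one-liner, this only helps with names whose sole problem is the DEL byte.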
Decided to blast through a bunch more ISOs with the help of some automation.
Bytes starting:
All generate errors with MacJapanese selected.
If you're interested I can share more files, info, etc.
Interestingly if I take the problematic text and convert it from MacJapanese to MacJapanese using macOS frameworks, the resulting text is free from strange characters.
I used https://github.com/andreberg/encodings-tool to do this.
So, it seems the macOS API for working with text does more than it might appear.
Thanks for debugging the issue this far, I'll check the provided example as soon as I have time.
@gingerbeardman I pushed some changes to the proposed branch that should be helpful when decoding these out-of-place filenames. I don't quite understand how these characters ended up in MacJapanese-encoded filenames, or any filenames for that matter (as they are control characters / characters outside the character set). However, as a solution we now fall back to the MacRoman mapping for those characters that don't have a mapping in MacJapanese.
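To illustrate the fallback idea (this is not the code on the proposed branch, just a rough Python sketch of the technique), a custom decode error handler can hand unmappable bytes to MacRoman. Assumptions: `shift_jis` stands in for MacJapanese because Python ships no MacJapanese codec, and the names `macroman_fallback` and `decode_filename` are invented for this example.

```python
import codecs


def _macroman_fallback(err):
    """Decode the undecodable bytes as MacRoman and resume after them."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return bad.decode("mac_roman", errors="replace"), err.end


# Register the handler under a name usable as the errors= argument.
codecs.register_error("macroman_fallback", _macroman_fallback)


def decode_filename(raw: bytes) -> str:
    # Assumption: shift_jis approximates MacJapanese for this sketch.
    return raw.decode("shift_jis", errors="macroman_fallback")


# 0xFF has no meaning in shift_jis, so it is decoded via MacRoman instead.
print(decode_filename("テスト".encode("shift_jis") + b"\xff"))
```

The handler only fires where the primary decoder gives up, so correctly encoded Japanese text is left alone.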
Just checked out the proposed branch and it works wonderfully, thank you. I can only assume that's what Mac OS does when it encounters funkiness like this?
Regarding how these characters get in the filenames, Japanese is typed in a strange way involving presses of sequences of letter and modifier keys to achieve specific characters from each of their three alphabets. There were also third-party input methods available that may have allowed typing invalid characters like this (DEL in particular is easy to type, but should not be accepted/allowed/processed as text). I've managed to type some funky stuff but not reproduce one of these faulty filenames.
Next for me is to look at directory listing export. My goal from all of this!
Just to expand on this:
DEL in particular is easy to type, but should not be accepted/allowed/processed as text
This is very easy to reproduce in a filename:
Screen recording:
https://user-images.githubusercontent.com/49612/139464296-8c1eb11f-4d54-4b1d-b808-a88e32b85ce7.mov
it looks like there are some Shift-JIS characters in these "MacJapanese" discs too 😩
From ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/JAPANESE.TXT
- A user-defined range using Shift-JIS code points 0xF040-0xFCFC, providing 2444 code points.
Currently these are incorrectly decoded as MacRoman.
This is an omission, I guess I assumed the table would cover all the mapped code points but you have to read the comments carefully as well:
# 1. Mapping the user-defined range
#
# The table below covers only the standard Mac OS Japanese encoding.
# It does not include mappings for the Shift-JIS user-defined range;
# this is mapped onto Unicodes 0xE000-0xE98B as follows:
# 0xF040-0xF07E -> 0xE000-0xE03E
# 0xF080-0xF0FC -> 0xE03F-0xE0BB
# 0xF140-0xF17E -> 0xE0BC-0xE0FA
# 0xF180-0xF1FC -> 0xE0FB-0xE177
# ...
# 0xFC40-0xFC7E -> 0xE8D0-0xE90E
# 0xFC80-0xFCFC -> 0xE90F-0xE98B
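The rows quoted above follow a regular pattern, which can be written as a small function. This is a sketch derived only from the quoted table, not taken from any implementation mentioned in this thread; the name `user_defined_to_unicode` is hypothetical. Each lead byte 0xF0-0xFC covers 63 + 125 = 188 trail bytes, and 13 × 188 = 2444 matches the count given in JAPANESE.TXT.

```python
def user_defined_to_unicode(lead: int, trail: int) -> int:
    """Map a Shift-JIS user-defined byte pair to its private-use code point."""
    if not 0xF0 <= lead <= 0xFC:
        raise ValueError("lead byte outside 0xF0-0xFC")
    if 0x40 <= trail <= 0x7E:
        column = trail - 0x40
    elif 0x80 <= trail <= 0xFC:
        column = trail - 0x41  # skip the unused 0x7F slot
    else:
        raise ValueError("trail byte outside 0x40-0x7E / 0x80-0xFC")
    # Each lead byte covers 63 + 125 = 188 code points.
    return 0xE000 + (lead - 0xF0) * 188 + column


print(hex(user_defined_to_unicode(0xF1, 0x80)))  # 0xe0fb
```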
@gingerbeardman I pushed some fixes earlier today, please check if they improve things for your images.
Thanks, that's much improved!
A few oddities:
The repeated digits strike me as odd.
These point to the symbols/dingbats block of Shift-JIS, but there are no characters at these specific locations. No idea what should be done with those.
Java's own Shift-JIS implementation fails to decode those, but iconv has possible matches:
$ printf "\x85\x59\n\x86\x6d\n\x87\x70\n" | iconv -f 'SHIFT_JISX0213' -t 'UTF-8'
Ã
ɚ̀
㎝
I just don't know where those matches come from...
Edit: Actually that's specifically for this X0213 version of Shift-JIS, which is an extension of regular Shift-JIS. The extension was introduced in 2004 so I doubt that HFS filenames are supposed to be interpreted in that charset. Even using MacJapanese through CoreFoundation doesn't find any match for these sequences.
It's a mystery.
Classic Macintosh System 7 displays them as unknown character glyphs.
...I'm wondering if the repeated digits could be mastering errors? So they'd be more like:
Just a thought.
Well... I can't imagine how that kind of error would have been introduced because it effectively would have mirrored the least significant 4-bit part of the previous byte into the most significant 4-bit part of the next byte, effectively shifting the high 4 bits to the low 4 bits of that byte. It just doesn't follow any logical error pattern that I've ever seen.
Interesting update: I found that Tcl's encoding functions include MacJapanese (as "macJapan") and it deals with all of this without a single issue.
$ hls -1ablRN | ./convert2unicode.tcl
convert2unicode.tcl: https://gist.github.com/gingerbeardman/4a3b66236e018b72b32ca17953474e12
So I thought you might be interested in how Tcl does things.
Closer inspection seems to show Tcl doesn't get it all correct, it just fails silently. A future version of Tcl will show errors, see here.
Alright then I think we have exhausted all possibilities for now to try and decode those sequences. I'm closing this but feel free to reopen if you think there's anything that can be improved here.
That's fine by me.
All that remains is for me to say thank you for your attention and assistance on this.
Tcl uses the following Apple technique https://opensource.apple.com/source/tcl/tcl-10/tcl/tools/encoding/macJapan.txt
It gets it all correct as long as you operate in binary mode (which I was neglecting to do yesterday)
@gingerbeardman So TCL does find matches for 0x8559, 0x866d and 0x8770? I don't see any of those in the macJapan.txt table that you linked above.
What Unicode code points does it map those sequences to? (Decoding to UTF-32 should yield the exact Unicode code points.)
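One low-tech way to get at the exact code points, sketched in Python rather than Tcl: decode the bytes and print each character's code point. The helper name `code_points` is made up for this example. It uses Python's `shift_jisx0213` codec only because that ships with the standard library; that is an assumption of convenience, it is not MacJapanese, and undecodable bytes come out as U+FFFD.

```python
def code_points(text: str) -> list[str]:
    """Return each character as a U+XXXX label."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Assumption: shift_jisx0213 is used only because it ships with Python;
# it is not MacJapanese, so treat the result as a hint at best.
for raw in (b"\x85\x59", b"\x86\x6d", b"\x87\x70"):
    decoded = raw.decode("shift_jisx0213", errors="replace")
    print(raw.hex(), code_points(decoded))
```

The same helper works on the output of any other decoder, which makes it easy to compare what different tools map a sequence to.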
Good question! Of course I should have made a note which discs had those bytes. It took a while to find them again.
By "correct" I meant "it processes non-Japanese characters without error (and probably incorrectly)" but all Japanese characters appear as expected. Apologies.
Tcl just added a new encoding for me (based on the Apple/Unicode file JAPANESE.TXT which is subtly different to the existing macJapan.txt) but I'm yet to test this. I will when I figure out where/how to get the current Tcl source commits.
Looking into these byte combinations, two out of three of them (0x866d, 0x8770) seem to be pairs of non-Japanese characters being misinterpreted as Japanese multi-byte combinations. Not sure what 0x8559 is supposed to be.
The existing macJapan Tcl encoding deals with them as follows:
0x8559
41 半角⑧⑫\0x8559⑬→ 全角用
0x866d
Ümlåût Õmêléttè
0x8770
Introducción rápida
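The two filenames above make the misinterpretation easy to check. This is a quick sketch using Python's `mac_roman` codec; the choice of which character pair to inspect in each name is my reading of the output above. Encoding the first two characters of "Ümlåût" and the "áp" of "rápida" as MacRoman gives exactly the byte pairs in question, and both 0x86 and 0x87 fall in the Shift-JIS lead-byte range 0x81-0x9F, so a Japanese decoder consumes the following ASCII letter as a trail byte.

```python
# Character pairs taken from the two filenames shown above.
samples = {"Üm": "Ümlåût Õmêléttè", "áp": "Introducción rápida"}

for pair, name in samples.items():
    raw = pair.encode("mac_roman")
    # A lead byte in 0x81-0x9F makes a Shift-JIS decoder expect a trail byte.
    is_lead = 0x81 <= raw[0] <= 0x9F
    print(f"{pair!r} in {name!r} -> 0x{raw.hex()} (lead byte: {is_lead})")
```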
Download: files.zip
Using https://github.com/gingerbeardman/encodings-tool I generated a list of what 0x8559 converts to in all supported encodings on modern macOS:
BYTES=`cat 8559.bin`; encodings $BYTES | grep -v "null" > file.txt
Not much help, perhaps, but worth a try.
Just saw this screenshot in the wild on a Japanese for-sale site. Thought you might like to see it!
Screenshot of a screenshot:
Reproduce:
Screenshot
Note:
Another ISO that has the same problem
An example that can be exported OK in MacJapanese
Aside:
hfsutils