thejoelpatrol / fusehfs

Update of FuseHFS for macFUSE on macOS 12 Monterey

Character encodings may not be respected #4

Open gingerbeardman opened 1 year ago

gingerbeardman commented 1 year ago

Following on from issue #2

Previously (hmm, I'm trying to think when exactly? a long time ago!) I could set my Mac to Japanese and reboot, mount an HFS disc that uses MacJapanese character encoding and see the filenames as intended. Reboot was essential, login was not enough.

Such foreign discs are tricky as they can contain filenames in multiple character sets. Files may have been copied from other discs or downloaded from the internet, so a single disc could contain many different encodings. Encodings are not stored anywhere: they have to be set manually, inferred using heuristics or some other map, or simply assumed.
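To illustrate why the encoding has to be supplied from outside the volume, here is a minimal Python sketch showing that the same filename bytes decode without error under two different encodings. It assumes Python's built-in `shift_jis` codec as a stand-in for MacJapanese, which Python does not ship, and the sample name is made up.

```python
# Raw filename bytes as they might sit in an HFS catalog record.
# Assumption: "shift_jis" stands in for MacJapanese, which Python lacks.
raw = "テスト".encode("shift_jis")

as_japanese = raw.decode("shift_jis")
as_roman = raw.decode("mac_roman")  # also succeeds, but yields mojibake

print(raw.hex(" "))
print("shift_jis:", as_japanese)
print("mac_roman:", as_roman)
```

Both decodes succeed, so nothing in the bytes themselves says which reading was intended.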

I have found that Tcl has good support for Apple encodings, most of them written by Apple themselves back in the mid-1990s when this stuff was still very much current. That said, macOS should respect and display the original encoding if the characters are correct and the system language is set correctly.

Another gotcha is that bugs/omissions in Japanese input methods (helper apps that assist typing of such complex scripts using multiple alphabets) allowed non-displayable characters to be typed in filenames! Example: when renaming a file in Finder, pressing the Delete key on an Extended Keyboard would insert an invisible character rather than deleting anything. This means that filenames can be quite dirty and contain invalid characters, which I guess should be resolved in some way or simply ignored?
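One possible way to handle such dirty names, sketched in Python: drop or replace control characters after decoding. The thread does not say which character the Delete key inserted, so the DEL (0x7F) in the sample name is an assumption, and `clean_name` is a hypothetical helper, not part of fusehfs.

```python
import unicodedata

def clean_name(name: str, placeholder: str = "") -> str:
    """Drop (or replace) characters in Unicode category Cc (controls)."""
    return "".join(
        placeholder if unicodedata.category(ch) == "Cc" else ch
        for ch in name
    )

# Hypothetical filename containing a stray DEL (0x7F) character.
dirty = "Read Me\x7f.txt"
print(repr(clean_name(dirty)))
print(repr(clean_name(dirty, "?")))
```

Replacing with a visible placeholder rather than deleting keeps two otherwise identical names from colliding silently.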

If you need more details please ask as I have the info in my notes that I can dig out.

Anecdotes

Sample HFS images

disk images that have a mix of MacRoman and MacJapanese:

This one looks 99% MacRoman, with a minimal file or two with MacJapanese names:

I have hundreds more discs of this type.

thejoelpatrol commented 1 year ago

I'm not sure macOS supports MacJapanese any more. The closest I can find with $ iconv -l is SHIFT_JISX0213. Forcing that encoding for the whole volume results in this:

image

AFAICT from Google Translate these appear to be roughly sensible filenames.

We don't want to force this manually and globally, but even forcing one encoding chosen via a heuristic or settings/preferences would be better than assuming everything is MacRoman, as we do now in the absence of a command-line flag. We could do something like you mention: get the system language and choose the appropriate classic Mac encoding based on that, e.g. MacJapanese if your language is set to Japanese, MacHebrew if it is set to Hebrew, or Mac OS Thai, Mac OS Ukrainian, etc. My first worry was that not all of these contain ASCII in the first 7 bits, in which case you couldn't make any sense of MacRoman disks while set to Thai or Ukrainian, for example. Actually, it seems these charsets do generally include ASCII in the first 7 bits, so it might be workable, but it would be a pain if you ever want more than one language plus ASCII.
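A rough Python sketch of the language-based default, together with a check of the "ASCII in the first 7 bits" question. The mapping table and both function names are hypothetical. It only uses codecs that Python ships, so `shift_jis` again stands in for MacJapanese, and encodings without a Python codec (MacHebrew, Mac OS Thai) are left out.

```python
# Hypothetical mapping from a system language code to a classic Mac
# encoding. "shift_jis" is a stand-in for MacJapanese.
LANGUAGE_TO_CODEC = {
    "en": "mac_roman",
    "ja": "shift_jis",
    "ru": "mac_cyrillic",
    "el": "mac_greek",
    "tr": "mac_turkish",
    "is": "mac_iceland",
}

def codec_for_language(lang: str, default: str = "mac_roman") -> str:
    """Pick a codec from a locale string such as 'ja_JP' or 'ja'."""
    return LANGUAGE_TO_CODEC.get(lang.split("_")[0].lower(), default)

def ascii_compatible(codec: str) -> bool:
    """True if bytes 0x20-0x7E decode to the same ASCII characters."""
    printable = bytes(range(0x20, 0x7F))
    return printable.decode(codec, errors="replace") == printable.decode("ascii")

for codec in sorted(set(LANGUAGE_TO_CODEC.values())):
    print(codec, ascii_compatible(codec))
```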

I wonder if a simple GUI application would be useful here, to drag and drop a disk image on it and you can set the language/encoding there for people who don't want to use the command line.

I don't know what kind of heuristic could detect files with multiple encodings per volume, though. That seems real tough. Would those have been handled correctly by classic Mac OS?
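For what it's worth, a crude per-filename heuristic could look like the Python sketch below. It is only an illustration of the idea, not a proposal for what fusehfs should do: `guess_codec` is a hypothetical helper, `shift_jis` stands in for MacJapanese, and the test is weak because many MacRoman byte sequences are also valid Shift-JIS.

```python
def guess_codec(raw: bytes) -> str:
    """Guess an encoding for one filename; deliberately simplistic."""
    if all(b < 0x80 for b in raw):
        return "ascii"
    try:
        text = raw.decode("shift_jis")
    except UnicodeDecodeError:
        # Not valid Shift-JIS at all, so fall back to MacRoman.
        return "mac_roman"
    # Require at least one hiragana/katakana or CJK ideograph.
    japanese = sum(
        1 for ch in text
        if "\u3040" <= ch <= "\u30ff" or "\u4e00" <= ch <= "\u9fff"
    )
    return "shift_jis" if japanese else "mac_roman"

for name in (b"ReadMe", "テスト".encode("shift_jis"), "Café".encode("mac_roman")):
    print(name, "->", guess_codec(name))
```

Names written only in half-width katakana, or short MacRoman names whose accented letters happen to form a valid double-byte pair, would be misclassified, which is why a manual override would still be needed.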

In the interim, if you would like to use a volume that you know has a particular encoding, you can mount it manually. Open it in Disk Utility, and unmount the volume but do not eject the image. Get the disk number, e.g. /dev/disk2s2. Then, run this:

$ /Library/Filesystems/fusefs_hfs.fs/Contents/Resources/mount_fusefs_hfs --encoding=${ENCODING_NAME} ${DISK_NUMBER} ${MOUNTPOINT}

e.g.:

$ /Library/Filesystems/fusefs_hfs.fs/Contents/Resources/mount_fusefs_hfs --encoding=SHIFT_JISX0213 /dev/disk2s2 /Users/joel/mnt
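If you mount many images this way, the command can be wrapped in a small script. The Python sketch below only assembles the argument list from the helper path and `--encoding` flag given above. `build_mount_command` and `mount` are hypothetical names, and `mount` is defined but not called here.

```python
import shlex
import subprocess

MOUNT_HELPER = (
    "/Library/Filesystems/fusefs_hfs.fs/Contents/Resources/mount_fusefs_hfs"
)

def build_mount_command(encoding: str, device: str, mountpoint: str) -> list:
    """Assemble the argument list for the manual mount described above."""
    return [MOUNT_HELPER, f"--encoding={encoding}", device, mountpoint]

def mount(encoding: str, device: str, mountpoint: str) -> None:
    """Run the helper; raises CalledProcessError if it exits non-zero."""
    subprocess.run(build_mount_command(encoding, device, mountpoint), check=True)

cmd = build_mount_command("SHIFT_JISX0213", "/dev/disk2s2", "/Users/joel/mnt")
print(shlex.join(cmd))
```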

gingerbeardman commented 1 year ago

MacJapanese is closely related to SHIFT-JIS but they're not the same, see here. That said, it may be close enough to be a workable solution.
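As a rough illustration of "close but not the same", the Python sketch below decodes with Shift-JIS while patching a handful of single-byte code points. The patch table reflects my reading of Apple's published JAPANESE.TXT mapping and should be treated as an assumption to verify against that file. Differences in the double-byte range are not handled, and `decode_mac_japanese_approx` is a hypothetical helper.

```python
# Single-byte code points believed to differ from plain Shift-JIS in
# MacJapanese (assumption: verify against Apple's JAPANESE.TXT).
MAC_JAPANESE_SINGLE = {
    0x5C: "\u00a5",  # yen sign
    0x80: "\\",      # backslash
    0xA0: "\u00a0",  # no-break space
    0xFD: "\u00a9",  # copyright sign
    0xFE: "\u2122",  # trade mark sign
    0xFF: "\u2026",  # horizontal ellipsis
}

def is_lead_byte(b: int) -> bool:
    """True for bytes that start a double-byte Shift-JIS character."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC

def decode_mac_japanese_approx(raw: bytes) -> str:
    out = []
    i = 0
    while i < len(raw):
        b = raw[i]
        if is_lead_byte(b) and i + 1 < len(raw):
            # Consume lead and trail together so a 0x5C trail byte
            # is not mistaken for a single-byte yen sign.
            out.append(raw[i:i + 2].decode("shift_jis", errors="replace"))
            i += 2
        elif b in MAC_JAPANESE_SINGLE:
            out.append(MAC_JAPANESE_SINGLE[b])
            i += 1
        else:
            out.append(raw[i:i + 1].decode("shift_jis", errors="replace"))
            i += 1
    return "".join(out)

print(decode_mac_japanese_approx(b"\x83\x5c\x83\x74\x83\x67 \xfd1995"))
```

Walking the bytes instead of doing a global replace matters because 0x5C is also a legal trail byte in Shift-JIS.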

I think your suggestions are all good. If the encoding could be remembered between mounts that would be great.

Or maybe the user could add something to the filename that would clue fusehfs in to the required encoding? That way nothing would need to be stored on disk.

I'll try manual mounting soon.

d235j commented 1 year ago

It looks like iconv doesn't support MacJapanese but CoreFoundation does — see https://developer.apple.com/documentation/coreservices/1399915-encoding_variants_for_macjapanes. The downside of rewriting the character encoding code using CF is that it would make fusehfs less portable to Linux.

I'm guessing something is stored on disk indicating encoding. Will need to investigate.

gingerbeardman commented 1 year ago

> I'm guessing something is stored on disk indicating encoding. Will need to investigate.

I don't believe there is, but I don't have any references to cite. The language of the host OS is responsible for interpreting the filenames according to its default encoding. Very old school.

That said, I would be interested to see what you find!

joevt commented 1 year ago

There's a text encoding hint in the Finder Info of the Master Directory Block. Maybe it's related?

See dumpencoding at: https://gist.github.com/joevt/a99e3af71343d8242e0078ab4af39b6c

See GET_HFS_TEXT_ENCODING at: https://github.com/apple-oss-distributions/hfs/blob/hfs-627.40.1/mount_hfs/mount_hfs.c
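For anyone who wants to poke at this, here is a minimal Python sketch that pulls the raw Finder Info words out of an HFS Master Directory Block. It assumes the input is the HFS volume itself (not a whole disk image with a partition map in front) and relies on the documented MDB layout: the MDB at byte 1024, signature 0x4244, and `drFndrInfo` at offset 92. It does not interpret the words. Which one carries the hint, and how it is packed, is defined by GET_HFS_TEXT_ENCODING in mount_hfs.c.

```python
import struct

MDB_OFFSET = 1024        # the MDB starts 1024 bytes into an HFS volume
HFS_SIGNATURE = 0x4244   # 'BD', the drSigWord of HFS Standard
FNDR_INFO_OFFSET = 92    # offset of drFndrInfo within the MDB

def read_finder_info(volume: bytes) -> tuple:
    """Return the eight big-endian 32-bit words of drFndrInfo."""
    mdb = volume[MDB_OFFSET:MDB_OFFSET + 512]
    (signature,) = struct.unpack_from(">H", mdb, 0)
    if signature != HFS_SIGNATURE:
        raise ValueError(f"not an HFS volume (signature {signature:#06x})")
    return struct.unpack_from(">8L", mdb, FNDR_INFO_OFFSET)

# Synthetic volume for demonstration; real use would read the first
# 2048 bytes of the volume from disk.
demo = bytearray(2048)
struct.pack_into(">H", demo, MDB_OFFSET, HFS_SIGNATURE)
struct.pack_into(">L", demo, MDB_OFFSET + FNDR_INFO_OFFSET + 16, 0x12345678)
print([hex(word) for word in read_finder_info(bytes(demo))])
```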

There are HFS encoding kexts for various macOS versions from Sierra to Catalina at: /System/Library/Filesystems/hfs.fs/Contents/Resources/Encodings/

HFS_MacArabic.kext
HFS_MacCentralEurRoman.kext
HFS_MacChineseSimp.kext
HFS_MacChineseTrad.kext
HFS_MacCroatian.kext
HFS_MacCyrillic.kext
HFS_MacGreek.kext
HFS_MacHebrew.kext
HFS_MacIcelandic.kext
HFS_MacJapanese.kext
HFS_MacKorean.kext
HFS_MacRomanian.kext
HFS_MacThai.kext
HFS_MacTurkish.kext

I think there might be source code at least for converting MacJapanese? https://github.com/apple-oss-distributions/hfs/blob/main/hfs_japanese/hfs_japanese.kmodproj/JapaneseConverter.c And also MacRoman: https://github.com/apple-oss-distributions/hfs/blob/main/hfs_encodings/hfs_encodings.c

The documentation archive has a note about an Encoding popup in the Finder's Get Info window for an HFS Standard volume: https://developer.apple.com/library/archive/qa/qa1173/_index.html#//apple_ref/doc/uid/DTS10001705 What does that popup look like? Is it changing the value in the Master Directory Block?

More about text encodings for HFS Plus: https://developer.apple.com/library/archive/technotes/tn/tn1150.html#//apple_ref/doc/uid/DTS10002989 It implies that HFS Standard does not have the per-file or per-folder text encoding settings that HFS Plus has and that the text encoding "varies depending on how the system software was localized and what language kits are installed".

d235j commented 1 year ago

Looks like if it exists at all, it’s stored in the Finder info word in the MDB? https://github.com/apple-oss-distributions/hfs/blob/4e3719273a0c670ef4aa7c77bb421c89f3473e14/mount_hfs/mount_hfs.c

gingerbeardman commented 1 year ago

Interesting!

Tcl has good conversion routines and encoding tables which were written by Apple themselves.