keeleysam / macfuse

Automatically exported from code.google.com/p/macfuse

Unicode-name weirdness #139

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Get a Linux computer :), with UTF-8 as the locale encoding. Mount a
FAT32 partition with the utf8 option. Create some directories on that
partition that contain accented characters in their names, for instance
e-grave: lumière
2. Mount the filesystem containing those directories on a Mac OS X system
through sshfs.
3. Try to copy the directories to the desktop

What is the expected output? What do you see instead?
Normally they should be copied. In my case, it didn't work; the cursor had
the "forbidden" circle attached.

What version of the product are you using? On what operating system?
Mac OS X 10.4.9 with sshfs 0.1.0 on a 13" MacBook, connecting to a Linux
(Ubuntu Feisty Fawn) computer.

Please provide any additional information below.
I tried looking at those directories in the terminal. ls shows "?" instead
of the accented characters. However, tab-completion generates escape
characters (\303\250 for e-grave) and cp can copy them. Also, Finder
windows display the files correctly, so I presume the bug is localized.

Original issue reported on code.google.com by bogd...@gmail.com on 2 Apr 2007 at 12:13

GoogleCodeExporter commented 9 years ago
By the way, copying the other way (from the OS X desktop to the sshfs-mounted
partition) works even for files/directories with accented characters.

Original comment by bogd...@gmail.com on 2 Apr 2007 at 12:15

GoogleCodeExporter commented 9 years ago
Unfortunately, this is a known problem that isn't fixed in MacFUSE.

First, a bit on Unicode normalization. Some characters can be represented in
more than one way. For example, è can be represented in precomposed form
("\303\250", the two UTF-8 bytes of U+00E8) or in decomposed form
("e\314\200", three bytes: the letter e followed by the combining grave accent
U+0300). Decomposed Unicode prefers a base letter followed by combining
versions of the accents, whereas precomposed Unicode prefers single characters
that represent both the base letter and the accent. A "normalization" process
converts all characters in a Unicode sequence to the preferred form (either
precomposed or decomposed).

The problem is that HFS+ requires filenames to be in decomposed Unicode (more
exactly, a variant of Unicode Normalization Form D), but other OSes (e.g.
Windows) prefer, but do not enforce, precomposed Unicode (more exactly, Unicode
Normalization Form C). The filenames on your Windows drive (e.g. "lumière") are
precomposed. Linux passes them to sshfs, which passes them to MacFUSE, which
passes them to Mac OS X, all precomposed. Mac OS X can display them; however,
you can't copy them to your HFS+ drive because HFS+ requires decomposed
filenames.

The "obvious" solution, at first, seems to be that MacFUSE should decompose
filenames that come in from FUSE daemons (like sshfs). However, this has some
drawbacks. First, the fs daemon may contain a mixture of precomposed and
decomposed filenames, and we don't want to have to keep a list (in memory or
elsewhere) of which ones we've normalized. Second, this wouldn't handle the
pathological case of a directory containing both precomposed and decomposed
versions of the same name (e.g., "lumie`re" and "lumière", where ` represents
the combining grave accent). Yes, Linux and Windows allow that. Furthermore,
decomposing filenames causes them to take up more Unicode characters, which may
push the length of a filename over the 255-character limit that is hard-coded
into the Mac OS X kernel.
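The last two drawbacks can be illustrated with a short Python sketch. The helper names are hypothetical, and counting the limit in UTF-16 code units is an assumption made for the illustration; the comment above only says "255 characters".

```python
import unicodedata
from collections import defaultdict

NAME_LIMIT = 255  # limit quoted in the comment above; the unit is an assumption


def find_nfd_collisions(names):
    """Group names that become identical once decomposed (NFD)."""
    groups = defaultdict(list)
    for n in names:
        groups[unicodedata.normalize("NFD", n)].append(n)
    # Only groups with more than one member are real collisions.
    return {k: v for k, v in groups.items() if len(v) > 1}


def utf16_units(s):
    """Length of s in UTF-16 code units."""
    return len(s.encode("utf-16-le")) // 2


# A directory holding both spellings of "lumière", as Linux allows.
listing = ["lumi\u00e8re", "lumie\u0300re", "plain.txt"]
collisions = find_nfd_collisions(listing)
print(collisions)

# A 200-character precomposed name doubles in length when decomposed.
long_name = "\u00e8" * 200
grown = unicodedata.normalize("NFD", long_name)
print(utf16_units(long_name), utf16_units(grown), utf16_units(grown) > NAME_LIMIT)
```

After blind decomposition the two "lumière" entries would map to the same name, and the long name would no longer fit within the limit.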

As a result, we decided for now to leave it up to the fs daemon to handle
Unicode normalization properly. This means that problems like the one you face
are known failure cases.

We'd like to fix this in MacFUSE somehow, but it's unclear which way is best.

Original comment by andrewde...@gmail.com on 2 Apr 2007 at 2:02

GoogleCodeExporter commented 9 years ago
You're not going to be able to avoid keeping a list, so why not just do that?
You can leave the case of multiple similar names, which will probably never
happen, as a failure case instead of failing for the very large number of
non-English users who'd get screwed by the current situation.

Original comment by paracel...@gmail.com on 9 Jun 2007 at 3:47

GoogleCodeExporter commented 9 years ago
Actually, I think we can avoid keeping a list, tho it has drawbacks. See:

http://code.google.com/p/macfuse/wiki/DesignDocFilenameEncodingSupportForMacFUSE

(I'd welcome your input)

Original comment by andrewde...@gmail.com on 9 Jun 2007 at 5:47

GoogleCodeExporter commented 9 years ago
I guess that would work, but it also fails instead of doing the right thing
when it encounters problems.

To really do the right thing, you should consider taking Mozilla's
UniversalDetector, which can auto-detect a large number of encodings, and using
it to automatically find which encoding is used. I've used this for similar
purposes in my program The Unarchiver, which (obviously) unarchives files from
a large variety of operating system locales.

Original comment by paracel...@gmail.com on 9 Jun 2007 at 6:05

GoogleCodeExporter commented 9 years ago
I agree with paracelsus's first comment. I'm having trouble with ntfs-3g on the
Mac because of this. Windows and Linux both create composed characters and
actually accept decomposed characters, but the Mac creates decomposed
characters and cannot really open filenames with composed characters because it
translates them to decomposed somewhere along the way.

So, since Windows, the native host OS for NTFS, uses composed characters, you
could just "translate" in such a way that the Mac sees decomposed names while
the real filesystem uses composed ones. I don't know about other filesystems,
but this seems very straightforward. In fact, the built-in read-only Mac NTFS
driver does this translation.

The problem is how to handle filenames that really are decomposed, especially
if an equivalent one exists with composed characters, but that should be a
problem only for previous MacFUSE users, not for newcomers. You could handle it
the same way ntfs-3g handled names in an unknown encoding, i.e. log an error
and ignore it. And/or make a tool that renames decomposed names to composed
ones, perhaps appending an underscore if an equivalent composed filename
already exists.
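The renaming tool suggested above could start from something like the following Python sketch. It only computes a rename plan for one directory listing and does not touch the disk; the function name and the exact underscore policy are assumptions for illustration, not an existing tool.

```python
import unicodedata


def plan_nfc_renames(names):
    """Return {old_name: new_name} for names that are not already composed (NFC).

    If the composed name is already taken, underscores are appended
    until the new name is free.
    """
    taken = set(names)
    plan = {}
    for old in sorted(names):
        new = unicodedata.normalize("NFC", old)
        if new == old:
            continue  # already composed, nothing to do
        while new in taken:
            new += "_"
        taken.add(new)
        plan[old] = new
    return plan


# One decomposed name with a composed twin, one decomposed name without.
listing = ["lumie\u0300re", "lumi\u00e8re", "cafe\u0301"]
plan = plan_nfc_renames(listing)
for old, new in plan.items():
    print(repr(old), "->", repr(new))
```

A real tool would then apply the plan with `os.rename`, directory by directory, and would probably want a dry-run mode first.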

The built-in Mac NTFS driver doesn't handle decomposed and equivalent filenames
very gracefully. Filenames in decomposed form get listed with a normal "ls" or
in Finder (icon or column views), but not if you "ls -l", "cat" or click on the
file in Finder (it "disappears") or list it in list view. If an equivalent
filename exists, the attributes and contents ("ls -l", "cat", Finder's details
and preview) are those of the file with the equivalent composed name.

Anyway, it seems MacFUSE should have done this translation from the start to
avoid this type of confusion. The problem would still be there, but it would be
mitigated and would have to be triggered in some other, much more complicated
way. I guess this holds true for sshfs and others.

So I vote for always translating to composed form for actual filesystem
operations or, if you see fit, always translating to decomposed form for the
Mac (or both; I don't really know how many implications there are).

Original comment by asstolav...@hotmail.com on 18 Jul 2007 at 4:25

GoogleCodeExporter commented 9 years ago
I wonder: could this be related to an inability to use sshfs to access a Windows
machine running OpenSSH when the username contains a space?

Original comment by matta...@gmail.com on 25 Aug 2007 at 6:43

GoogleCodeExporter commented 9 years ago

Original comment by dominion...@gmail.com on 19 Sep 2007 at 9:44

Attachments:

GoogleCodeExporter commented 9 years ago
Where may I find a "tool that renames decomposed to composed characters" ?
Please help me, I have hundreds of files that I'd like to rename with composed
characters.

Original comment by pierre.g...@gmail.com on 16 Nov 2007 at 11:24

GoogleCodeExporter commented 9 years ago
If you are looking for a solution that runs on Linux, I think you want convmv. 
http://www.j3e.de/linux/convmv/

I don't know of a solution that runs on Windows. There is no solution for Mac
because all filenames are stored as decomposed characters.

Original comment by a...@gmail.com on 16 Nov 2007 at 5:43

GoogleCodeExporter commented 9 years ago
Thanks adlr !
I downloaded convmv and ran it on Ubuntu :
./convmv -r -f utf-8 -t utf-8 --nfc --notest /media/LaCie/

Great!

Original comment by pierre.g...@gmail.com on 24 Nov 2007 at 1:33

GoogleCodeExporter commented 9 years ago
I have a proposal to solve this problem.

If we use sshfs with the "modules" option, we may be able to solve it.

command example:
    $ sshfs user@sftpserver:/dir /mount/point/ -o modules=iconv,from_code=UTF-8,to_code=UTF-8-MAC
    (from_code is sftpserver side charset, to_code is sshfs side charset)

I verified that this works on a Linux box
 (CentOS 4.2, kernel 2.6.9-55.0.12.EL, FUSE 2.7.1, SSHFS 1.8).

However, I am aware of the following:

1. GNU libiconv, as customized by Apple, provides a decomposed Unicode
encoding (UTF-8-MAC).

2. FUSE 2.7.x added a new feature: a filename charset conversion module.

3. It seems that "MacFuseCore 1.1.0" and "MacFusion 1.2 Beta 3" are not linked
against libiconv (on Mac OS X 10.4.x).

I hope this problem can be solved.

Thank you.

Original comment by Zeta...@gmail.com on 19 Dec 2007 at 11:00

GoogleCodeExporter commented 9 years ago
A bit more weirdness: 

I have noticed that it is possible to copy the files using the shell. I can
either SSH into the Mac or open a Terminal, then just do a simple "cp" from the
sshfs-mounted directory to the local disk (e.g., the Desktop), and it works
without complaint. The names are decomposed on the fly.

I suppose it is "cp" that does it. What I don't understand is why the same
thing doesn't happen with the graphical shell. Could one of the devs try to
trace the two cases and see whether sshfs is called differently?

Original comment by bogd...@gmail.com on 8 Jan 2008 at 12:04

GoogleCodeExporter commented 9 years ago
I've tried:
    ntfs-3g /dev/disk1s1 x -omodules=iconv,from_code=UTF-8,to_code=UTF-8-MAC
But it made no significant difference.
NFC characters still appear in the filenames, so Finder still refuses to copy
them.

MacFUSE 1.3.1 + ntfs-3g 1.1120, Leopard.
What did I do wrong?

Original comment by kei...@gmail.com on 9 Jan 2008 at 8:26

GoogleCodeExporter commented 9 years ago
Character encodings should almost certainly be handled by the filesystem
itself, not generally by MacFUSE.  No matter how good the heuristic is,
guessing character encodings is almost always a bad idea.

SSHFS has access to the local and remote environment, and should be able to
determine the character encodings to translate between.  I do hope this issue
is regarded as a bug in SSHFS, not MacFUSE (unless I'm missing something about
MacFUSE that makes it also suspect).

Original comment by forest...@gmail.com on 30 Apr 2008 at 1:18

GoogleCodeExporter commented 9 years ago
I should mention that it is also reasonable for end users to enable the iconv
module if the filesystem doesn't handle character encodings at all.  I just
really want to oppose the idea of MacFUSE automatically guessing character
encodings by default.

Original comment by forest...@gmail.com on 30 Apr 2008 at 1:19

GoogleCodeExporter commented 9 years ago
When will someone come up with a real solution? This thing is kind of annoying.
I know I can just delete the special characters from my filenames to make it
easier, or perhaps make a new FAT32 partition for my documents, but that takes
some time...
I hope the solution comes quickly... I would really love to help... but I don't
know much about all the programming languages used... XP

Original comment by esteban...@gmail.com on 17 May 2008 at 4:12

GoogleCodeExporter commented 9 years ago
The 'coder way' solution:

1/ Assume the local Mac OS X character encoding for filenames is UTF-8-MAC
(also known as the UTF-8 decomposed form).

2/ Assume the character encoding used for filenames by the file system embedded
in FUSE is the UTF-8 composed form (other character encodings will work too).

3/ Read the iconv() man page with 'man 3 iconv' (libiconv is located under
/usr/lib on Mac OS X).

4/ In your custom file system source code, for every incoming FUSE callback
that provides a full path to look up (this is the case for all callbacks except
readdir()), consider using:

transcoder = iconv_open("UTF-8","UTF-8-MAC");
 followed by
iconv(transcoder, &path, &srcBytesCount, &outpath, &dstBytesCount)

then work with the re-encoded 'outpath' string to look up information.

5/ In the readdir() callback, consider using:

transcoder = iconv_open("UTF-8-MAC","UTF-8");
 followed by
iconv(transcoder, &UTF8_filename, &srcBytesCount, &UTF8_MAC_filename,
&dstBytesCount)

before calling the filler() function with your freshly re-encoded
'UTF8_MAC_filename' string.

enjoy !
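The two directions described in steps 4/ and 5/ can be sketched in Python as follows. This is only an illustration of the idea, not a FUSE binding: `TranslatingFS` and its methods are hypothetical, and `unicodedata` NFC/NFD is used as a stand-in for iconv's UTF-8 and UTF-8-MAC, which is an approximation because Apple's decomposed form is a variant of NFD.

```python
import unicodedata


def to_backend(path):
    """Name coming from Mac OS X (decomposed) -> name used by the backend (composed)."""
    return unicodedata.normalize("NFC", path)


def to_mac(name):
    """Name stored by the backend (composed) -> name handed to Mac OS X (decomposed)."""
    return unicodedata.normalize("NFD", name)


class TranslatingFS:
    """Toy in-memory file system that translates names at its boundary."""

    def __init__(self, backend_files):
        # Backend names are kept exactly as the remote side stores them.
        self.files = dict(backend_files)

    def readdir(self):
        # Step 5/: re-encode each name before handing it to the caller.
        return [to_mac(name) for name in self.files]

    def read(self, path):
        # Step 4/: re-encode the incoming path before looking it up.
        return self.files[to_backend(path)]


fs = TranslatingFS({"lumi\u00e8re": b"data"})
listed = fs.readdir()
print([n.encode("utf-8") for n in listed])
print(fs.read(listed[0]))
```

The name returned by `readdir()` is decomposed, yet looking it up again still finds the composed entry in the backend, which is the round trip the two iconv calls are meant to provide.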

Original comment by franck.b...@gmail.com on 19 Jun 2008 at 1:14

GoogleCodeExporter commented 9 years ago
> The 'coder way' solution :
> ...

The coder would be well advised to look at the source code of the open source
MacFUSE Core. Since MacFUSE 1.0 (released October 2007), the user-space library
supports stacking of file system modules. One of the built-in modules is
"iconv". See lib/modules/iconv.c in the user-space library source. The module
takes two arguments: a "from" encoding name and a "to" encoding name. Then, for
each incoming operation, the library automatically does what you are
suggesting.

Original comment by si...@gmail.com on 19 Jun 2008 at 4:32

GoogleCodeExporter commented 9 years ago
Unfortunately, since '-omodules=iconv,from_code=UTF-8,to_code=UTF-8-MAC' is not
understood by the MacFUSE (v1.5.1) fuse_main() function, the 'coder way' seems
to be the only solution for now.

Original comment by franck.b...@gmail.com on 20 Jun 2008 at 8:13

GoogleCodeExporter commented 9 years ago
Can you clarify what exactly doesn't work?

I tried the following in the 1.5.1 tree and the arguments are received by the
iconv module as expected:

$ ./hello /tmp/hello -omodules=iconv,from_code=UTF-8,to_code=UTF-8-MAC

Besides, the iconv module has defaults for both the from_code and to_code
arguments. If you don't specify from_code, it should use UTF-8. If you don't
specify to_code, it should use the value of $LC_CTYPE. Does that not work?

Original comment by si...@gmail.com on 20 Jun 2008 at 3:52

GoogleCodeExporter commented 9 years ago
Performing more tests, I found some strange behaviour...

I usually call fuse_main using the following argv:

main /Volumes/Point_A -f -onoappledouble -ovolname=Point -ovolicon=/Users/bonin134/Documents/Projets/Point/Build/MacOs/XCode/Drive/Debug/Point.app/Contents/MacOS/../Resources/Point.icns

To use the iconv module, I tried this argv without success:

main /Volumes/Point_A -f -omodules=iconv,from_code=UTF-8,to_code=UTF-8-MAC -onoappledouble -ovolname=Point -ovolicon=/Users/bonin134/Documents/Projets/Point/Build/MacOs/XCode/Drive/Debug/Point.app/Contents/MacOS/../Resources/Point.icns

The console error output is 'fuse: unknown option `from_code=UTF-8''

!!!! BUT !!!, the following command line seems to work:

main /Volumes/Point_A -f -omodules=iconv,from_code=UTF-8,to_code=UTF-8-MAC -onoappledouble -ovolname=Point

I don't see why, but there is a problem between -ovolicon and -omodules...

Original comment by franck.b...@gmail.com on 23 Jun 2008 at 1:34

GoogleCodeExporter commented 9 years ago
> !!!! BUT !!!, following command line seems to work :
> don't see why, but there is a problem between -ovolicon and -omodules...

"-ovolicon=/path/to/icon" is a special option: it's a convenience shorthand for
"-omodules=volicon,iconpath=/path/to/icon". This works fine if you are using no
other modules, but if you are, it won't work because the library wants modules
specified as "-omodules=M1:M2:...:Mn". So, in that case, you will have to use
the longhand form. For example:

"-omodules=iconv:volicon,iconpath=/path/to/icon,from_code=UTF-8,to_code=UTF-8-MAC"

The order of arguments doesn't matter.
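For anyone generating mount arguments from a script, the longhand form can be assembled mechanically. The helper below is a hypothetical Python sketch, not part of MacFUSE; it only joins the module names with ":" and appends every module's options, following the format described above.

```python
def build_modules_option(modules):
    """Build a single -omodules=... argument.

    modules is a list of (module_name, {option: value}) pairs.
    """
    # Module names are joined with ":", e.g. "iconv:volicon".
    names = ":".join(name for name, _ in modules)
    # All per-module options follow as comma-separated key=value pairs.
    options = [f"{key}={value}" for _, opts in modules for key, value in opts.items()]
    return "-omodules=" + ",".join([names] + options)


arg = build_modules_option([
    ("iconv", {"from_code": "UTF-8", "to_code": "UTF-8-MAC"}),
    ("volicon", {"iconpath": "/path/to/icon"}),
])
print(arg)
```

The options come out in a different order than in the example above, which should be fine given that the order of arguments doesn't matter.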

I acknowledge that this should at least be documented. I don't expect end users
to figure this out. *But*, you sound like a developer, so why do black-box
debugging *and* reinvent/reimplement the functionality of the iconv module
within your file system? It's easy enough to look at the MacFUSE source.

Original comment by si...@gmail.com on 24 Jun 2008 at 12:51

GoogleCodeExporter commented 9 years ago
thanks, now it works perfectly.

>so why do black-box debugging *and* reinvent/reimplement

Because when I found that my UTF-8-D character problem might be solved by the
'omodules=iconv...' option, I couldn't imagine it could interfere with the
-ovolicon option. I thought it was a syntax problem of my own or in the help I
found on Google. Then I saw some people having the same problem, so I decided
to use libiconv myself.

Anyway, where should I post developer-side questions about libfuse usage that
are not issues with libfuse?

Original comment by franck.b...@gmail.com on 24 Jun 2008 at 1:08

GoogleCodeExporter commented 9 years ago
> Anyway, where should I post developer-side questions about libfuse usage
> that are not issues with libfuse?

The official MacFUSE forum is for both users and developers:

http://groups.google.com/group/macfuse-devel

Original comment by si...@gmail.com on 24 Jun 2008 at 9:37

GoogleCodeExporter commented 9 years ago
This has been open forever so I'm finally marking it as "WontFix".

Either use the iconv module that's built into the user-space library, or handle
it within the user-space file system.

Original comment by si...@gmail.com on 12 Nov 2008 at 5:02

GoogleCodeExporter commented 9 years ago
This should be treated as Open, as there are viable solutions. On the main
wiki, there is a proposal related to this bug report:
http://code.google.com/p/macfuse/wiki/FILENAME_ENCODING_PROPOSAL

Original comment by brod...@gmail.com on 3 May 2009 at 8:48

GoogleCodeExporter commented 9 years ago
Thanks. I had some songs with è in the title that Mac OSX would not copy from 
a Linux drive mounted over sshfs. Running this on Linux fixed the problem:

sudo apt-get install convmv
convmv -r -f utf-8 -t utf-8 --nfd --notest /path/to/music

Original comment by mat.sola...@gmail.com on 29 Feb 2012 at 9:11

GoogleCodeExporter commented 9 years ago
Hi, just for the sake of discussion, I was able to solve this issue based on
the previous comments.  My use case is simple: after 13+ years of using Linux
I'm moving to OS X and needed to replicate my previous Ubuntu configuration. I
access the company's server using sshfs and was having the character issue: OS
X told me that it was unable to find the application whenever I tried to open a
file with a strange character in its name.

I was able to mount the sshfs resource properly with this line:

sshfs -p XXX myuser@myserver:/share /Volumes/share -orw,nodev,allow_other,reconnect,uid=XXXXX,gid=XXX,max_read=65536,compression=yes,auto_cache,no_check_root,kernel_cache,umask=0002,workaround=rename,auto_cache,reconnect,defer_permissions,noappledouble,negative_vncache,intr,modules=iconv,from_code=UTF-8,to_code=UTF-8-MAC,volname=share

The modules=iconv,from_code=UTF-8,to_code=UTF-8-MAC part was what did the
trick.

best regards

Original comment by amuji...@gmail.com on 4 Aug 2014 at 12:59