Fuse paths should be encoded to UTF-8

GoogleCodeExporter commented 9 years ago

Fuse deals with UTF-8 paths. Since those paths arrive from C code as bytes, at 
least in fuse3, they should be decoded to Unicode strings before arriving at 
user code.

I can provide a patch for this if my assumptions are correct.

Original issue reported on code.google.com by anacrolix@gmail.com on 20 Feb 2011 at 3:14

GoogleCodeExporter commented 9 years ago

I can create paths with Unicode names without problems with fuse.py. Do you 
have any specific cases where it fails?

Original comment by verigak on 6 Mar 2012 at 3:44

GoogleCodeExporter commented 9 years ago

I think what anacrolix meant is that e.g. Operations.getattr gets a bytes 
instance rather than a str instance for a path.

Original comment by antic...@gmail.com on 6 Mar 2012 at 10:14

GoogleCodeExporter commented 9 years ago

That's right.

Original comment by anacrolix@gmail.com on 7 Mar 2012 at 7:02

GoogleCodeExporter commented 9 years ago

My patches on GitHub (https://github.com/terencehonles/fusepy) probably fix 
what you need. Ideally my patches will be pushed upstream.

Original comment by Terence....@gmail.com on 24 Apr 2012 at 8:12

GoogleCodeExporter commented 9 years ago

This is wrong! User code should be written to deal with bytes, not the other 
way around. On POSIX operating systems, file paths are NOT specified as being 
UTF-8 or any specific Unicode encoding. The only correct way to deal with 
filenames on Unix is to treat them as byte strings. Most software I've seen 
treats them as UTF-8, but at the file system level, they are binary strings and 
any FUSE implementation would be broken if it didn't support non-UTF-8 
filenames.

In other words, I need to be able to cd to a FUSE-mounted file system, open 
Python 3, and type this:

>>> import os
>>> open(b'd\xe9j\xe0_vu.txt', 'w').close()
>>> os.listdir(b'.')
[b'd\xe9j\xe0_vu.txt']

In most shells, if you ls this file, it will display as d?j?_vu.txt. But it is 
a perfectly valid Latin-1-encoded filename. If fusepy decoded the filename to a 
Unicode string before sending it to the user code, it would either throw an 
exception in this case, or corrupt the filename.
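
Both outcomes can be reproduced without FUSE at all. The following is a minimal 
sketch, assuming UTF-8 as the chosen encoding; the variable names are made up 
for illustration and nothing here is fusepy code:

```python
# The Latin-1 encoded filename from the example above; it is not valid UTF-8.
name = b'd\xe9j\xe0_vu.txt'

# Outcome 1: a strict decode raises an exception.
try:
    name.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Outcome 2: a lenient decode succeeds but replaces the offending bytes,
# so encoding the result again no longer yields the original filename.
lossy = name.decode('utf-8', errors='replace')
round_tripped = lossy.encode('utf-8')

print(strict_ok)
print(round_tripped == name)
```

Either way, the bytes that the kernel handed over can no longer be passed back 
unchanged.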

I have tested Terence's fork of fusepy and it breaks this requirement. He added 
an 'encoding' argument to the FUSE constructor. The fork decodes all bytes 
values to strs with this encoding before giving them to the user-supplied 
operations, and encodes all strs supplied by user code before giving them back 
to the operating system. Unfortunately, it isn't a correct solution to simply 
say "pick an encoding before you start". File systems must be able to support 
different files with different encodings on their names.

If you run Terence's version of memory.py and then perform my above example, 
you get this:

Traceback (most recent call last):
  File "fuse.py", line 402, in _wrapper
    return func(*args, **kwargs) or 0
  File "fuse.py", line 410, in getattr
    return self.fgetattr(path, buf, None)
  File "fuse.py", line 640, in fgetattr
    attrs = self.operations('getattr', path.decode(self.encoding), fh)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 2: invalid continuation byte

Alternatively, create the b'd\xe9j\xe0_vu.txt' file somewhere on a normal 
drive, and then run Terence's version of the loopback.py example on the 
directory containing that file. Attempting to 'ls' the directory results in 
this exception:

Traceback (most recent call last):
  File "fuse.py", line 402, in _wrapper
    return func(*args, **kwargs) or 0
  File "fuse.py", line 586, in readdir
    if filler(buf, name.encode(self.encoding), st, offset) != 0:
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce9' in position 1: surrogates not allowed
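
The '\udce9' in this traceback is a lone surrogate. On POSIX, Python 3 returns 
undecodable filename bytes as surrogates (the 'surrogateescape' error handler) 
when a directory is listed with a str path, which is presumably what the 
loopback example does. A strict encode of such a name then fails. A small 
sketch of that mechanism, using explicit decode/encode calls rather than any 
fusepy code:

```python
raw = b'd\xe9j\xe0_vu.txt'

# What a str-based directory listing hands back for this name on a
# UTF-8 system: each undecodable byte becomes a lone surrogate.
as_str = raw.decode('utf-8', errors='surrogateescape')

# A strict encode, like name.encode(self.encoding) in the traceback,
# rejects those surrogates.
try:
    as_str.encode('utf-8')
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True

# Only an encode with the same error handler gets the original bytes back.
restored = as_str.encode('utf-8', errors='surrogateescape')

print(strict_encode_failed)
print(restored == raw)
```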

The only solution is to make fuse.py deal with bytes throughout, then change 
all of the examples to also deal with bytes. I will upload my patch for this on 
the other bug (Issue 36).
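
As a rough illustration of what "bytes throughout" means for user code, here is 
a hypothetical in-memory store keyed by bytes paths. The class and method names 
are invented for this sketch; it is not fusepy's Operations API and not the 
patch itself:

```python
class BytesNameStore:
    """Hypothetical stand-in for user code that keeps paths as bytes."""

    def __init__(self):
        # Keys are raw byte paths, exactly as the kernel would pass them.
        self.files = {b'/': b''}

    def create(self, path):
        self.files[path] = b''

    def readdir(self, path):
        # Strip the leading slash; entries stay as bytes, never decoded.
        return [b'.', b'..'] + [p[1:] for p in self.files if p != b'/']


store = BytesNameStore()
# The same visible name, once Latin-1 encoded and once UTF-8 encoded.
store.create(b'/d\xe9j\xe0_vu.txt')
store.create('/d\xe9j\xe0_vu.txt'.encode('utf-8'))

names = store.readdir(b'/')
print(names)
```

Because nothing is ever decoded, the two names with different encodings 
coexist in the same directory and are returned byte for byte.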

Original comment by matt.gi...@gmail.com on 14 Jul 2012 at 1:03

rharder / fusepy
