UTF-8 filenames - Githubissues

Jonny007-MKD commented 8 years ago

Hi,

My test file was uploaded successfully :+1: Then I switched to real data and stumbled over the umlauts in my filenames (I guess that's the problem):

2016-03-13 15:40:46,663 - iceshelf@140 - INFO - Processing "music" (/raid/Multimedia/Audio/)
2016-03-13 15:44:20,814 - iceshelf@398 - DEBUG - Processing file structure changes
2016-03-13 15:44:20,851 - iceshelf@173 - INFO - Content is not likely to compress (0% chance), skipping compression.
2016-03-13 15:49:01,309 - iceshelf@202 - INFO - Creating archive
2016-03-13 15:49:01,311 - shutil.py@524 - DEBUG - changing into '/raid/Temp/iceshelf/20160313-144046-80c9f'
2016-03-13 15:49:01,327 - shutil.py@376 - INFO - Creating tar archive
2016-03-13 15:53:54,889 - shutil.py@552 - DEBUG - changing back to '/home/osmc/build/iceshelf'
2016-03-13 15:53:54,891 - iceshelf@207 - INFO - Removing temporary copies of files
Traceback (most recent call last):
  File "./iceshelf", line 452, in <module>
    files = gatherData()
  File "./iceshelf", line 227, in gatherData
    json.dump(manifest, fp)
  File "/usr/lib/python2.7/json/__init__.py", line 189, in dump
    for chunk in iterable:
  File "/usr/lib/python2.7/json/encoder.py", line 434, in _iterencode
    for chunk in _iterencode_dict(o, _current_indent_level):
  File "/usr/lib/python2.7/json/encoder.py", line 408, in _iterencode_dict
    for chunk in chunks:
  File "/usr/lib/python2.7/json/encoder.py", line 387, in _iterencode_dict
    yield _encoder(key)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc in position 77: invalid start byte

These are some of my files:

$ ls -la /raid/Multimedia/Audio/ 
    drwxrwxr-x  30 osmc users  4096 Dec 11 15:55 Hörbücher
    drwxrwxr-x 507 osmc users 20480 Mar  2 19:02 Musik
$ ls -la /raid/Multimedia/Audio/Hörbücher/
    drwxrwxr-x   3 osmc users  4096 Nov 11 13:40 Jo Nesbø

mrworf commented 8 years ago

Should work now if you pull the latest version. I also added åäö and other fine Unicode characters to the test script to weed this out. Please close the issue if this solves it.

Jonny007-MKD commented 8 years ago

Now I've got much the same error at a different place. Sorry for consuming so much of your time!

Traceback (most recent call last):
  File "./build/iceshelf/iceshelf", line 412, in <module>
    gotall = collectSources(config['sources'])
  File "./build/iceshelf/iceshelf", line 153, in collectSources
    for root, dirs, files in os.walk(path):
  File "/usr/lib/python2.7/os.py", line 296, in walk
    for x in walk(new_path, topdown, onerror, followlinks):
  File "/usr/lib/python2.7/os.py", line 296, in walk
    for x in walk(new_path, topdown, onerror, followlinks):
  File "/usr/lib/python2.7/os.py", line 286, in walk
    if isdir(join(top, name)):
  File "/usr/lib/python2.7/posixpath.py", line 80, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 16: ordinal not in range(128)

mrworf commented 8 years ago

No worries, this is great, I'm just happy someone is willing to put up with all of this :) I'm using this for email, so I don't get the more interesting combos like Unicode in my filenames :D ... I know why these errors happen though, I just need to find all the places where I'm missing Unicode support (Python 2 deals poorly with it, and some functions don't return unicode unless fed unicode :-P )
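The "unicode in, unicode out" behaviour mentioned above can be sketched roughly as follows. This is only an illustration and it is written for Python 3, not the Python 2.7 used in this thread (where the same split exists between `str` and `unicode`). The temporary directory and the file name are made up for the demo.

```python
import os
import tempfile

# Build a throwaway directory holding one file whose name contains "ö",
# written explicitly as UTF-8 bytes (hypothetical name, demo only).
base = tempfile.mkdtemp()
open(os.path.join(os.fsencode(base), b"H\xc3\xb6rbuch.txt"), "wb").close()

# A text path in gives text names out ...
text_names = [name for _, _, files in os.walk(base) for name in files]

# ... while a bytes path in gives the raw, undecoded names out.
byte_names = [name for _, _, files in os.walk(os.fsencode(base)) for name in files]

print(type(text_names[0]).__name__, type(byte_names[0]).__name__)
```

Mixing the two kinds of path in one `join` is what triggers the implicit ASCII decode seen in the Python 2.7 tracebacks.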

mrworf commented 8 years ago

This is frustrating, I can't reproduce the issue you're seeing. Would you mind emailing me (github@sensenet.nu) the output of "find /raid/Multimedia/Audio/" so I can recreate the structure here? Also, just in case, are you running the latest? The last commit headline is "Improved test script", just to make sure nothing got messed up :)

Jonny007-MKD commented 8 years ago

I pulled again, just in case, still the same issue. Mail is on the way. Sorry that I included no body ;)

mrworf commented 8 years ago

Hehe :) No worries, just wrote a script to recreate the structure locally here, hopefully this will allow me to see what the F is going on.

mrworf commented 8 years ago

AHHHH! Now it crashes here too :D Nice! Let's see what we can see

mrworf commented 8 years ago

ha@development:~/projects/iceshelf/tmp$ ../iceshelf config
First run, no previous checksums
Setting up the prep directory
Checking sources for changes
Processing "test" (raid/)
Creating archive
Creating tar archive
Removing temporary copies of files
Traceback (most recent call last):
  File "../iceshelf", line 469, in <module>
    files = gatherData()
  File "../iceshelf", line 213, in gatherData
    fileutils.deleteTree(config["archivedir"], True)
  File "/home/ha/projects/iceshelf/fileutils.py", line 9, in deleteTree
    for root, dirs, files in os.walk(tree, topdown=False):
  File "/usr/lib/python2.7/os.py", line 294, in walk
    for x in walk(new_path, topdown, onerror, followlinks):
  File "/usr/lib/python2.7/os.py", line 294, in walk
    for x in walk(new_path, topdown, onerror, followlinks):
  File "/usr/lib/python2.7/os.py", line 294, in walk
    for x in walk(new_path, topdown, onerror, followlinks):
  File "/usr/lib/python2.7/os.py", line 294, in walk
    for x in walk(new_path, topdown, onerror, followlinks):
  File "/usr/lib/python2.7/os.py", line 294, in walk
    for x in walk(new_path, topdown, onerror, followlinks):
  File "/usr/lib/python2.7/os.py", line 294, in walk
    for x in walk(new_path, topdown, onerror, followlinks):
  File "/usr/lib/python2.7/os.py", line 284, in walk
    if isdir(join(top, name)):
  File "/usr/lib/python2.7/posixpath.py", line 80, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 61: ordinal not in range(128)

mrworf commented 8 years ago

Urrgrgh... this is hopeless, Python 2.7 does not like Unicode, period... All file functions only work if you use bytes for paths and filenames, while json barfs on them (sigh) ... Converting the names to unicode breaks file naming; ignoring unicode fixes saving the data, but breaks loading.

Jonny007-MKD commented 8 years ago

That's bad :( Might the pathlib2 backport help?

Is the unicode support in python 3 that much better? How much work would it be to upgrade to python 3? Can I help somehow? :)

mrworf commented 8 years ago

Right now all the backup stuff works, I just can't serialize it to/from JSON. So I'm looking into encoding filenames with something that JSON likes. Just ran out of time yesterday :) Hoping to have it ready later this week, work is for some reason taking up my time now ;)
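One possible way to encode filenames "with something that JSON likes" is percent-encoding the raw bytes. The sketch below is an assumption about how that could look, not the approach iceshelf actually took. It is written for Python 3, and the helper names `name_to_json_key` and `json_key_to_name` are hypothetical.

```python
import json
from urllib.parse import quote_from_bytes, unquote_to_bytes

def name_to_json_key(raw_name):
    """Turn a raw bytes file name into an ASCII-only str that JSON accepts."""
    return quote_from_bytes(raw_name, safe="/")

def json_key_to_name(key):
    """Recover the exact original bytes from the percent-encoded key."""
    return unquote_to_bytes(key)

# One UTF-8 name and one Latin-1 name, as they might coexist on disk.
names = [b"H\xc3\xb6rb\xc3\xbccher/Jo Nesb\xc3\xb8", b"Pr\xe9lude.mp3"]

# Serialize a toy manifest and read it back (the "size" field is made up).
manifest = {name_to_json_key(n): {"size": 0} for n in names}
restored = [json_key_to_name(k) for k in json.loads(json.dumps(manifest))]
```

Because every byte survives the round trip, the scheme does not need to know which encoding a filename was written in, and plain ASCII names stay readable in the manifest.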

mrworf commented 8 years ago

This is very interesting. Linux represents filenames as bytes, which means that a single filesystem can hold both UTF-8 encoded filenames and Latin1 ones. The two encodings are not compatible with each other, and this is the reason your backup is failing: you have a file whose name was created using Latin1 or some other encoding.

I know, it sounds odd, but it's true :) ... Check out the files under "/raid/Multimedia/Audio/Musik/Yo-Yo Ma/Play Classical/" ... Some of the files there (and in other places) will have question marks instead of the character you'd expect.

I've pushed a new version which is resilient to this situation and skips such files with a warning. Renaming these files will correct the problem.
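A skip-with-a-warning check along these lines could look roughly like the sketch below. It is not the code that was pushed to iceshelf. It assumes Python 3, and the names `decode_or_skip`, `scan` and the logger name are hypothetical.

```python
import logging
import os

log = logging.getLogger("sketch")

def decode_or_skip(raw_paths):
    """Return the paths that are valid UTF-8 as text; warn about and skip the rest."""
    good = []
    for raw in raw_paths:
        try:
            good.append(raw.decode("utf-8"))
        except UnicodeDecodeError:
            log.warning("Skipping non-UTF-8 filename: %r", raw)
    return good

def scan(tree):
    """Walk `tree` using bytes paths so that every name comes back undecoded."""
    found = []
    for root, _dirs, files in os.walk(os.fsencode(tree)):
        found.extend(decode_or_skip(os.path.join(root, name) for name in files))
    return found

# The second name is "Prélude.mp3" stored as Latin-1: a lone 0xe9 is not valid UTF-8.
kept = decode_or_skip([b"Pr\xc3\xa9lude.mp3", b"Pr\xe9lude.mp3"])
```

Walking with a bytes path matters here: the names are only decoded inside `decode_or_skip`, so a badly encoded name is caught in one place instead of failing somewhere inside `os.walk`.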

Jonny007-MKD commented 8 years ago

That's really interesting! Those are files that I downloaded from Google Music and then uploaded via FTP (vsftpd). Here's a link to a similar problem.

[...] server's charset is UTF-8 [...] windows' [...] encoding is GB2312

The mentioned command also detects which filenames are already UTF-8 and doesn't destroy them :)

convmv -f gb2312 -t utf8 -r --notest * -r

Thank you!

EDIT: Great, after I ran the command, my filename changed from this

07 - Unaccompanied Cello Suite No. 1 in G Major, BWV 1007 Pr?lude.mp3

to this

07 - Unaccompanied Cello Suite No. 1 in G Major, BWV 1007 Pr�lude.mp3

Perhaps GB2312 wasn't correct for those (sigh)

mrworf commented 8 years ago

I'd guess it's Latin1 (ISO-8859-1), since it's German classical music. And the character you're missing is é :) Please close this issue if you're happy with the resolution on my end.
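The Latin1 guess fits the byte in the traceback: 0xe9 is é in ISO-8859-1. A minimal sketch of the re-encoding, assuming Python 3 and assuming that the affected names really are Latin1, is shown below. The helper name `to_utf8_name` is hypothetical, and like the convmv command quoted earlier it leaves names that are already valid UTF-8 untouched.

```python
def to_utf8_name(raw, assumed_encoding="latin-1"):
    """Re-encode a raw file name to UTF-8, leaving valid UTF-8 names alone."""
    try:
        raw.decode("utf-8")
        return raw  # already UTF-8, nothing to do
    except UnicodeDecodeError:
        # Assumption: anything that is not UTF-8 was written as Latin-1.
        return raw.decode(assumed_encoding).encode("utf-8")

broken = b"BWV 1007 Pr\xe9lude.mp3"
fixed = to_utf8_name(broken)

# A name from this thread that is already UTF-8 passes through unchanged.
already_ok = to_utf8_name(b"H\xc3\xb6rb\xc3\xbccher")
```

An actual rename would then pass both bytes paths to `os.rename`. Trying that on a copy of the directory first is probably wise, since a wrong guess about the source encoding produces valid but wrong characters.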

Jonny007-MKD commented 8 years ago

Well, as I just looked up, NTFS uses UTF-16 (wchar) for filenames, but that would have looked different on Unix (I suppose). So FileZilla must have done some conversion to something else, probably Latin1. I did change the FileZilla settings somewhat after installing it.

Anyway, your solution is excellent. I haven't seen such a warning since my conversion, and now it's uploading :) I'm missing a progress indicator, but in the end iceshelf will run in the background, so the indicator won't be needed then. Thanks!

mrworf commented 8 years ago

Yeah, NTFS is all Unicode, only Linux allows you to put in bytes and call it whatever you want :)

mrworf / iceshelf

UTF-8 filenames #2