Encoding problem with os.walk on Windows

laucianexones commented 7 years ago

The file name contains the Turkish characters ı and ğ. The same code runs without problems under Linux (Ubuntu/bash on Windows). Run under native Windows, it garbles the file name: while reading the name it reads ı as i and ğ as g, and when converting to UTF-8 it leaves those two letters untouched. As a result the file name comes out wrong.

C:\Users\kc\Documents\Python-Projects\encryptBackup>python listTree.py
Windows-10-10.0.15063 mbcs ascii cp850
testFolder/
adindan Tr karakter olan bir dosya-³þ÷-ig.txt 'adindan Tr karakter olan bir dosya-\xfc\xe7\xf6-ig.txt' <type 'str'>
adindan Tr karakter olan bir dosya-┬│├¥├À-ig.txt 'adindan Tr karakter olan bir dosya-\xc2\xb3\xc3\xbe\xc3\xb7-ig.txt' <type 'str'>
C:\Users\kc\Desktop\testFolder\adindan Tr karakter olan bir dosya-┬│├¥├À-ig.txt has type <type 'str'>
file C:\Users\kc\Desktop\testFolder\adindan Tr karakter olan bir dosya-┬│├¥├À-ig.txt not found

The documentation says that if the value you pass to os.walk is Unicode, it returns the file names as Unicode too, but that is not what happens: it complains that it cannot find the character in the charmap. https://stackoverflow.com/questions/1052225/convert-python-filenames-to-unicode

Changed in version 2.3: On Windows NT/2k/XP and Unix, if path is a Unicode object, the result will be a list of Unicode objects. Undecodable filenames will still be returned as string objects.
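The rule quoted above carries over to Python 3 in a slightly different form: os.walk returns names of the same type as the path it was given (str in, str out; bytes in, bytes out). A minimal sketch in Python 3, assuming a temporary directory and a made-up file name (both are illustrative and not from the thread):

```python
import os
import tempfile


def walk_names(top):
    """Collect every file name that os.walk yields under `top`."""
    names = []
    for _root, _dirs, files in os.walk(top):
        names.extend(files)
    return sorted(names)


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical test file whose name contains Turkish characters.
    open(os.path.join(tmp, "dosya-ığ.txt"), "w").close()

    str_names = walk_names(tmp)                 # str path in -> str names out
    bytes_names = walk_names(os.fsencode(tmp))  # bytes path in -> bytes names out

# ascii() keeps the output printable whatever codec the console uses.
print(ascii(str_names))
print(ascii(bytes_names))
```

Decoding the bytes names with os.fsdecode should give back the str names, which is a quick way to check that both views describe the same files.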

EyeCon commented 7 years ago

For some reason I can't reproduce the problem. I tried it on Windows, in ConEmu:

EyeCon@HERON2 C:\temp1\keremdeneme
> dir /s
 Volume in drive C is OS

 Directory of C:\temp1\keremdeneme

2017-08-06  00:54    <DIR>          .
2017-08-06  00:54    <DIR>          ..
2017-08-06  00:55    <DIR>          at_ğüp1
2017-08-06  00:54    <DIR>          ĞÜŞİÖÇ1
               0 File(s)              0 bytes

 Directory of C:\temp1\keremdeneme\at_ğüp1

2017-08-06  00:55    <DIR>          .
2017-08-06  00:55    <DIR>          ..
2017-08-06  00:55                 7 test1
2017-08-06  00:55                 7 test2_ĞÜ
               2 File(s)             14 bytes

 Directory of C:\temp1\keremdeneme\ĞÜŞİÖÇ1

2017-08-06  00:54    <DIR>          .
2017-08-06  00:54    <DIR>          ..
               0 File(s)              0 bytes

     Total Files Listed:
               2 File(s)             14 bytes
               8 Dir(s)  110,016,847,872 bytes free

> ipython
Python 3.6.2 |Anaconda custom (64-bit)| (default, Jul 20 2017, 12:30:02) [MSC v.1900 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 6.1.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import os

In [2]: for root, files, dirs in os.walk(r"."):
   ...:     print(root, files, dirs)
   ...:
. ['at_ğüp1', 'ĞÜŞİÖÇ1'] []
.\at_ğüp1 [] ['test1', 'test2_ĞÜ']
.\ĞÜŞİÖÇ1 [] []

In [3]:

laucianexones commented 7 years ago

Apart from it being python3, I can't see any difference. I passed the os.walk argument both with r (as str) and with u (as unicode); nothing changed.

By the way, do you get a notification when I reply in a comment?

EyeCon commented 7 years ago

Apparently I do; that's how I just saw this.

laucianexones commented 7 years ago

The encryption code you wrote had a function called transportsafe. It encoded the password taken from the user as UTF-8. I want to do the same thing here, but it doesn't work, and I get the same error I get when reading file names with os.walk. I don't understand what is going on.

Here I create the string as a str, not unicode. How is it encoded at this point? Which codec? I read the other day that it uses the system default; is that right?

>>> s = "ä"
>>> s
'\xe4'
>>> print s, type(s)
ä <type 'str'>

It won't let me encode or decode it as UTF-8.

>>> s.encode("utf-8")

Traceback (most recent call last):
  File "<pyshell#109>", line 1, in <module>
    s.encode("utf-8")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
>>> s.decode("utf-8")

Traceback (most recent call last):
  File "<pyshell#110>", line 1, in <module>
    s.decode("utf-8")
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe4 in position 0: unexpected end of data
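
A note on the two tracebacks above: in Python 2, calling .encode() on a byte str first decodes it implicitly with the default 'ascii' codec, which is why an encode call ends in a UnicodeDecodeError. The sketch below replays the decoding step explicitly in Python 3 syntax; the helper name try_decode and the list of codecs tried are illustrative choices, not something from the thread.

```python
raw = b"\xe4"  # the single byte that the Python 2 shell displayed as '\xe4'


def try_decode(data, codec):
    """Return the decoded text, or the exception class name on failure."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError as exc:
        return type(exc).__name__


results = {
    codec: try_decode(raw, codec)
    for codec in ("ascii", "utf-8", "cp1252", "cp1250")
}

# ascii() keeps the output printable whatever codec the console uses.
print(ascii(results))
```

The byte 0xE4 is outside ASCII and is only the start of a multi-byte sequence in UTF-8, so both of those codecs reject it, while the single-byte Windows code pages map it directly to ä.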

Yet your code contained the word 'ağlama'. I tried with the letter ğ; that can't be encoded with utf-8 either. For ı and ğ it fails in a different way.

>>> "ğ".encode("utf-8")
Unsupported characters in input

>>> "ı".encode("utf-8")
Unsupported characters in input

>>> "ü".encode("utf-8")

Traceback (most recent call last):
  File "<pyshell#141>", line 1, in <module>
    "ü".encode("utf-8")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 0: ordinal not in range(128)

As I said, it works when I decode with the codec I take to be the system default. I figured I needed a character table in which \xe4 is ä. http://scratchpad.wikia.com/wiki/Character_Encoding_Recommendation_for_Languages

>>> s.decode("windows-1250")
u'\xe4'
>>> s
'\xe4'

>>> decoded = s.decode("windows-1250")
>>> print decoded, type(decoded)
ä <type 'unicode'>

The type of decoded is unicode. Does that also mean it has been encoded as UTF-8? I want to take what I decoded and, as you did, encode it as utf-8 this time, and again I get garbage.

>>> safe_txt = decoded.encode("utf-8")
>>> safe_txt
'\xc3\xa4'
>>> print safe_txt
Ã¤
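
On the question above: a unicode object is abstract text and has no byte encoding of its own; encode("utf-8") is what turns it into bytes. The two characters printed at the end are most likely those two UTF-8 bytes being displayed through a single-byte code page. A small sketch in Python 3 syntax, assuming the display uses cp1252 (the locale reported in this thread):

```python
text = "\u00e4"                     # 'ä' as abstract text (Python 2: u'\xe4')
utf8_bytes = text.encode("utf-8")   # two bytes: 0xC3 0xA4

# What a cp1252 display renders when handed those two bytes one by one.
shown_by_console = utf8_bytes.decode("cp1252")

# Decoding with the codec that produced the bytes restores the text.
round_trip = utf8_bytes.decode("utf-8")

# ascii() keeps the output printable whatever codec the console uses.
print(ascii(utf8_bytes), ascii(shown_by_console), ascii(round_trip))
```

So '\xc3\xa4' is a correct UTF-8 encoding; only the way it is displayed is off.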

EyeCon commented 7 years ago

Could there be a general problem with your system? Which operating system and which console are you trying these on? What are your locale settings?

laucianexones commented 7 years ago

I'm trying this on two separate Windows 10 machines: one is my own, the other is a customer's. What I wrote above is from the Python IDLE shell window. In the same window the locale values look like this:

>>> locale.getdefaultlocale()
('en_US', 'cp1252')
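
For comparing machines, it can help to print all the relevant encodings in one go. The first output line of listTree.py in the opening post appears to report a similar set (platform, file system encoding, default encoding, console encoding), though that reading is an assumption. A minimal Python 3 sketch; the function name encoding_report is made up for illustration:

```python
import locale
import platform
import sys


def encoding_report():
    """Gather the encodings Python uses for file names, text and the console."""
    return {
        "platform": platform.platform(),
        "filesystem": sys.getfilesystemencoding(),
        "default": sys.getdefaultencoding(),
        # stdout may have been replaced by an object without an encoding.
        "stdout": getattr(sys.stdout, "encoding", None),
        "locale": locale.getpreferredencoding(False),
    }


for key, value in encoding_report().items():
    print(f"{key:>10}: {value}")
```

Running this in both IDLE and the plain console on the same machine should show whether the two differ in their console encoding.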

EyeCon commented 7 years ago

Very interesting. My IDLE results:

Python 3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 08:06:12) [MSC v.1900 64 bit (AMD64)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> import os
>>> for a, b, c in os.walk(r"c:\temp1\keremdeneme"):
    print(a, b, c)

c:\temp1\keremdeneme ['at_ğüp1', 'ĞÜŞİÖÇ1'] []
c:\temp1\keremdeneme\at_ğüp1 [] ['test1', 'test2_ĞÜ']
c:\temp1\keremdeneme\ĞÜŞİÖÇ1 [] []
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
>>> import locale
>>> locale.getdefaultlocale()
('en_US', 'cp1252')

Could you give me access to one of the machines so I can try?

EyeCon commented 7 years ago

Wait a second, you did say it but the penny only just dropped: are you trying all of this with python2? It's normal (!) for it not to work there. python3 is about to enter its eighth year; isn't it possible to switch to it?

laucianexones commented 7 years ago

Not a question, just writing this down as a note. If we try it directly in the shell interpreter, it says "Unsupported characters in input".

Python 2.7.13 (v2.7.13:a06454b1afa1, Dec 17 2016, 20:42:59) [MSC v.1500 32 bit (Intel)] on win32
>>> a = 'ığ'
Unsupported characters in input

If I save the same thing as a .py file and run it in the interpreter, there is no problem and I get the result I want.

s = "ığ"

print s, type(s)

#s_u = s.decode("utf-8")

s_u = unicode(s,"utf-8")
print s_u, type(s_u)

byte_str = s_u.encode("utf-8")
print byte_str, type(byte_str)

====================== RESTART: C:\Python27\encoding.py ======================
Ä±ÄŸ <type 'str'>
ığ <type 'unicode'>
Ä±ÄŸ <type 'str'>

laucianexones commented 7 years ago

I think I've figured it out. My problem is solved.

import os

path_byte_str = r'C:\Users\kcumhurx\Documents\testFolder-ığşçöü'
print path_byte_str, type(path_byte_str)

unicode_path = path_byte_str.decode("utf-8")
print unicode_path, type(unicode_path)

def list_files(startpath):
    print '--'*5
    if not os.path.exists(startpath):
        print '%s yoktir' % startpath

    for root, dirs, files in os.walk(startpath):
        for f in files:
            if os.path.exists(os.path.join(root,f)):
                print f, type(f)

list_files(path_byte_str)
list_files(unicode_path)

Python 2.7.13 (v2.7.13:a06454b1afa1, Dec 17 2016, 20:42:59) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> 
======================= RESTART: C:/Python27/listF.py =======================
C:\Users\kcumhurx\Documents\testFolder-Ä±ÄŸÅŸÃ§Ã¶Ã¼ <type 'str'>
C:\Users\kcumhurx\Documents\testFolder-ığşçöü <type 'unicode'>
----------
C:\Users\kcumhurx\Documents\testFolder-Ä±ÄŸÅŸÃ§Ã¶Ã¼ yoktir
----------
ığ.txt <type 'unicode'>
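
A likely reading of the output above: the byte-string path holds the UTF-8 bytes of the folder name, and Python 2 on Windows hands byte paths to the system in the ANSI code page, so the system looks for a folder with the mis-decoded name, which does not exist. The sketch below reproduces that mismatch in Python 3 under the assumption that the code page is cp1252; the temporary directory is only there for illustration.

```python
import os
import tempfile

correct = "testFolder-ığşçöü"

# The UTF-8 bytes of the name, read back as if they were cp1252 text.
misread = correct.encode("utf-8").decode("cp1252")

with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, correct))
    correct_exists = os.path.exists(os.path.join(tmp, correct))
    misread_exists = os.path.exists(os.path.join(tmp, misread))

# ascii() keeps the output printable whatever codec the console uses.
print(ascii(misread), correct_exists, misread_exists)
```

Decoding the byte path to unicode before passing it to os.walk, as done in listF.py, avoids the mismatch because the name then reaches the system as text rather than as bytes in the wrong code page.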

laucianexones / encryptBackup

Encoding problem with os.walk on Windows #1