iambus / xunlei-lixian

Xunlei offline download script

Character set problem #105

Closed zend2 closed 11 years ago

zend2 commented 12 years ago

After running your program I get the following error:

output_name = escape_filename(task['name']).encode(default_encoding)
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-4: ordinal not in range(256)

This is a default character set (default_encoding) problem: latin-1 needs to be changed to utf-8. How do I change it?

iambus commented 12 years ago

lx config encoding utf-8
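
The error reported above can be reproduced in a few lines. This is a minimal Python 3 sketch (the project itself targets Python 2), and `task_name` is a made-up file name used only for illustration. It shows why latin-1 cannot hold a Chinese name while utf-8 can.

```python
# Hypothetical task name; any string with characters above U+00FF behaves the same.
task_name = "迅雷离线.rmvb"

try:
    task_name.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError as exc:
    # latin-1 only covers code points 0-255, so Chinese characters are rejected.
    latin1_ok = False
    print("latin-1 failed:", exc)

# utf-8 can represent every Unicode code point, so this always succeeds.
utf8_bytes = task_name.encode("utf-8")
print("utf-8 byte length:", len(utf8_bytes))
```

Setting the tool's encoding to utf-8, as in the command above, avoids the first branch.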

elboble commented 12 years ago

This software is great and fits my setup especially well: my laptop's hard disk is too small, and on my Linux server I could previously only use mldonkey, but in China Xunlei is still the fastest. I think a web interface would make it more convenient, and I'd like to try building one. What do you think?

About the utf8 problem: it still fails even after running the config encoding utf8 command above. Adding the task directly on the Xunlei web page works; after that it shows up in list and can be downloaded.

root@iconnect:~/iambus-xunlei-lixian-a3bd7c0# ./lixian_cli.py list
111464630785 陨落星辰.Falling.Skies.S02E02.Chi_Eng.HDTVrip.624X352-YYeTs人人影视.rmvb completed
111464531201 陨落星辰.Falling.Skies.S02E01.Chi_Eng.HDTVrip.624X352-YYeTs人人影视.rmvb completed

root@iconnect:~/iambus-xunlei-lixian-a3bd7c0# locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE=zh_CN.UTF8
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

root@iconnect:~/iambus-xunlei-lixian-a3bd7c0# ./lixian_cli.py download thunder://QUFlZDJrOi8vfGZpbGV81MnC5NDHs70uRmFsbGluZy5Ta2llcy5TMDJFMDEuQ2hpX0VuZy5IRFRWcmlwLjYyNFgzNTItWVllVHPIy8jL07DK0y5ybXZifDE3NzQ3NTcyNHxmMWRmZmRiYjhmYzgxMDY3NTUxOTQ1N2ViZjlkYjA4OXxoPWJwaDU0bGprcXd6Y2kzaXVhY3didmxmbnJ2YnRweWNvfC9aWg==
Traceback (most recent call last):
  File "./lixian_cli.py", line 529, in <module>
    execute_command()
  File "./lixian_cli.py", line 526, in execute_command
    commands[command](args)
  File "./lixian_cli.py", line 318, in download_task
    tasks = find_tasks_to_download(client, args)
  File "/root/iambus-xunlei-lixian-a3bd7c0/lixian_tasks.py", line 254, in find_tasks_to_download
    return find_normal_tasks_to_download(client, links)
  File "/root/iambus-xunlei-lixian-a3bd7c0/lixian_tasks.py", line 218, in find_normal_tasks_to_download
    found, missing, all = search_in_tasks(all_tasks, links)
  File "/root/iambus-xunlei-lixian-a3bd7c0/lixian_tasks.py", line 125, in search_in_tasks
    task = find_task_by_url_or_path(tasks, x)
  File "/root/iambus-xunlei-lixian-a3bd7c0/lixian_tasks.py", line 59, in find_task_by_url_or_path
    return find_task_by_url(tasks, url)
  File "/root/iambus-xunlei-lixian-a3bd7c0/lixian_tasks.py", line 54, in find_task_by_url
    if link_equals(t['original_url'], url):
  File "/root/iambus-xunlei-lixian-a3bd7c0/lixian_tasks.py", line 36, in link_equals
    return link_normalize(x1) == link_normalize(x2)
  File "/root/iambus-xunlei-lixian-a3bd7c0/lixian_tasks.py", line 28, in link_normalize
    return lixian_hash_ed2k.parse_ed2k_id(url)
  File "/root/iambus-xunlei-lixian-a3bd7c0/lixian_hash_ed2k.py", line 50, in parse_ed2k_id
    return parse_ed2k_link(link)[1:]
  File "/root/iambus-xunlei-lixian-a3bd7c0/lixian_hash_ed2k.py", line 47, in parse_ed2k_link
    return unquote_url(name), hash_hex.lower(), int(file_size)
  File "/root/iambus-xunlei-lixian-a3bd7c0/lixian_url.py", line 70, in unquote_url
    return x.decode('utf-8') if type(x) == str else x
  File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd4 in position 0: invalid continuation byte

iambus commented 12 years ago

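The URL itself is encoded in a non-standard way. Converted to a normal URL, that thunder:// link is:
ed2k://|file|%D4%C9%C2%E4%D0%C7%B3%BD.Falling.Skies.S02E01.Chi_Eng.HDTVrip.624X352-YYeTs%C8%CB%C8%CB%D3%B0%CA%D3.rmvb|177475724|f1dffdbb8fc810675519457ebf9db089|h=bph54ljkqwzci3iuacwbvlfnrvbtpyco|/

The %D4%C9%C2%E4%D0%C7%B3%BD part is GBK-encoded, whereas the more standard practice is to use UTF-8. If URLs like this turn out to be common, an extra check could be added in the code.

As for the web interface, I don't mind either way. It is certainly technically feasible, but I don't see a particularly good way to do it. If you're interested, feel free to try it yourself.

PS: I didn't expect Falling Skies to get a second season. I thought it had been cancelled.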
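
The conversion described here can be sketched as follows, in Python 3 rather than the project's Python 2. The `AA`/`ZZ` wrapper comes from `xunlei_url_encode` in `lixian_url.py`; the helper names `thunder_to_url_bytes` and `ed2k_name_bytes` are hypothetical and not part of the project. The sketch treats the decoded link as raw bytes, so the file name can be inspected before any character set is assumed.

```python
import base64
from urllib.parse import unquote_to_bytes

def thunder_to_url_bytes(link):
    """Strip 'thunder://', base64-decode, and drop the 'AA'...'ZZ' wrapper."""
    prefix = "thunder://"
    if not link.startswith(prefix):
        raise ValueError("not a thunder:// link")
    raw = base64.b64decode(link[len(prefix):])
    if not (raw.startswith(b"AA") and raw.endswith(b"ZZ")):
        raise ValueError("missing AA...ZZ wrapper")
    return raw[2:-2]

def ed2k_name_bytes(url_bytes):
    """Return the unquoted file-name bytes of an ed2k://|file|name|size|hash|/ link."""
    fields = url_bytes.split(b"|")
    if len(fields) < 5 or fields[1] != b"file":
        raise ValueError("not an ed2k file link")
    # Percent-escapes are resolved to bytes; already-raw bytes pass through unchanged.
    return unquote_to_bytes(fields[2])

# The percent-encoded prefix quoted above, with a shortened name for brevity.
name = ed2k_name_bytes(b"ed2k://|file|%D4%C9%C2%E4%D0%C7%B3%BD.rmvb|177475724|f1dffdbb8fc810675519457ebf9db089|/")
print(name.decode("gbk"))          # decodes cleanly as GBK
try:
    name.decode("utf-8")
except UnicodeDecodeError as exc:  # same failure as in the traceback
    print("utf-8 failed:", exc)
```

A raw GBK name can in principle contain the byte `|` as a trail byte, which this simple split does not handle; a real check would need to account for that.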

elboble commented 12 years ago

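Ha, I only saw it today too. The plot drags a bit, but there has been nothing to watch lately, so I'll make do.

I've run into several problematic links like this on yyets.

Any ideas for the web interface? I think mldonkey's looks good, but it is probably complicated. Someone also recommended yaaw by 跳蚤 (http://binux.github.com/yaaw/), but I'm not very familiar with either.

I found a fairly capable platform, Iomega's iConnect: 1G CPU, 256M RAM, 512M flash, plus 1 GE and 4 USB ports. I'm now running a mostly complete Debian on it from a USB stick, so resources shouldn't be much of a constraint, haha.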

iambus commented 12 years ago

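I've hardly watched any American TV in the past six months and have a backlog of dozens of episodes. I only watched the two standout TV series, Sherlock and Game of Thrones. I've been watching more anime.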

elboble commented 12 years ago

Do you have an email address or some other contact? Chatting off-topic on git is getting a bit much :-)

iambus commented 12 years ago

Either a GitHub private message or iambus@gmail.com is fine.

yy0c commented 12 years ago

My shell doesn't support Chinese, so the wget download fails. Is there a solution?

root@TP-LINK:~# lx list
41292857660 Drop.Dead.Diva.S04E03.Chi_Eng.HR-HDTV.AAC.1024X576.x264-YYeTs.mkv completed
41292412220 罗马.Rome.S02E01.CN.BluRay.HR-HDTV.AC3.1024X576.x264-YYeTs人人影视.mkv completed
41242856252 诉讼åŒé›„.Suits.s02e01.Chi_Eng.WEB-HR.AC3.1024X576.x264-YYeTs人人影视.mkv completed

root@TP-LINK:~# lx download --delete --output-dir /mnt/sda1/ 41242856252
Downloading 诉讼åŒé›„.Suits.s02e01.Chi_Eng.WEB-HR.AC3.1024X576.x264-YYeTs人人影视.mkv ...
/mnt/sda1/诉讼åŒé›„.Suits.s02e01.Chi_Eng.WEB-HR.AC3.1024X576.x264-YYeTs人人影视.mkv: Invalid argument
Traceback (most recent call last):
  File "/usr/bin/lx", line 529, in <module>
    execute_command()
  File "/usr/bin/lx", line 526, in execute_command
    commands[command](args)
  File "/usr/bin/lx", line 323, in download_task
    download_multiple_tasks(client, download, tasks, download_args)
  File "/usr/bin/lx", line 298, in download_multiple_tasks
    download_single_task(client, download, task, options)
  File "/usr/bin/lx", line 291, in download_single_task
    download2(client, download_url, output_path, task)
  File "/usr/bin/lx", line 208, in download2
    download1(client, url, path, size)
  File "/usr/bin/lx", line 190, in download1
    download(client, url, path)
  File "/usr/bin/lx", line 108, in wget_download
    raise Exception('wget exited abnormaly')
Exception: wget exited abnormaly

elboble commented 12 years ago

TP-LINK? Can't you add a zh_CN.utf8 locale? If all else fails, just rename the task on the Xunlei web page :-)

iambus commented 12 years ago

@younyang Could you run lx diagnostics and paste the output?

yy0c commented 12 years ago

@iambus
root@TP-LINK:/mnt/sda1# lx diagnostics
sys.getdefaultencoding() -> ascii
sys.getfilesystemencoding() -> ASCII
print u'\u4e2d\u6587'.encode('utf-8') -> 中文
print u'\u4e2d\u6587'.encode('gbk') -> 中文

@elboble How do I do that?

elboble commented 12 years ago

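My tricks are all hacks; for the proper approach, listen to bus.

If your TP-LINK has room to run Debian, run dpkg-reconfigure locales and select zh_CN.utf8. If not, manually copying a locale over would probably work too.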

iambus commented 12 years ago

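@younyang First try lx config encoding gbk

If there is still a problem, paste the full error. (Sometimes it is also related to the encoding the file system is mounted with; if it still fails, we will probably need to look at the encoding options used for the mount.)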
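
As a rough companion to the diagnostics output above, the following Python 3 sketch checks whether the encodings the interpreter reports can represent a given file name. The helpers `encodable` and `encoding_report` are hypothetical names, not part of `lx`, and the sketch says nothing about the mount options themselves, which have to be checked separately.

```python
import sys

def encodable(name, encoding):
    """Return True if `name` can be represented in `encoding`."""
    try:
        name.encode(encoding)
        return True
    except (UnicodeEncodeError, LookupError):
        # LookupError covers encodings this Python build does not know.
        return False

def encoding_report(name):
    """Map each encoding the interpreter reports to whether it can hold `name`."""
    reported = {
        "sys.getdefaultencoding()": sys.getdefaultencoding(),
        "sys.getfilesystemencoding()": sys.getfilesystemencoding(),
    }
    return {label: (enc, encodable(name, enc)) for label, enc in reported.items()}

if __name__ == "__main__":
    for label, (enc, ok) in encoding_report("中文").items():
        print(label, "->", enc, "(ok)" if ok else "(cannot encode sample)")
```

An ASCII file system encoding, as in the report above, would fail this check for any Chinese name regardless of which encoding the tool is configured with.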

oTnTh commented 12 years ago
diff --git a/lixian_url.py b/lixian_url.py
index aa7bb76..e5eaea5 100644
--- a/lixian_url.py
+++ b/lixian_url.py
@@ -1,6 +1,7 @@

 import base64
 import urllib
+import locale

 def xunlei_url_encode(url):
        return 'thunder://'+base64.encodestring('AA'+url+'ZZ')
@@ -67,5 +68,11 @@ def normalize_unicode_link(url):

 def unquote_url(x):
        x = urllib.unquote(x)
-       return x.decode('utf-8') if type(x) == str else x
+       if type(x) == str:
+               try:
+                       return x.decode('utf-8')
+               except UnicodeDecodeError:
+                       return x.decode(locale.getdefaultlocale()[1])
+       else:
+               return x

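@iambus I've been browsing 人人 (YYeTs) a lot lately and found GBK-encoded links all over the place, hence this patch. I don't think it's a great solution, but at least it works.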

iambus commented 12 years ago

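@oTnTh I don't think this fix is of much use, because the encoding a link uses depends on the website and has nothing to do with the default locale on your machine. It just happens that your machine uses GBK.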

oTnTh commented 12 years ago

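@iambus I used getdefaultlocale because Traditional Chinese users might have a similar problem, but locale.getdefaultlocale()[1] is indeed wrong, since I didn't account for Linux using utf-8 everywhere. Under the logic I originally intended, it should check locale.getdefaultlocale()[0]: try cp936 for zh_CN and cp950 for zh_TW. But I have no objection to the way you've changed it now; at least it works…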
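
The logic described in this comment (try utf-8 first, then a legacy code page chosen by language: cp936 for zh_CN, cp950 for zh_TW) might look like the Python 3 sketch below. It is not the change that was actually committed, which this thread does not show, and the names `LEGACY_BY_LANGUAGE`, `candidate_encodings` and `unquote_with_fallback` are hypothetical. The language is passed in explicitly rather than read from the locale, since the link's encoding depends on the website and not on the local machine.

```python
from urllib.parse import unquote_to_bytes

# Legacy code pages to try after utf-8, keyed by language code (assumed mapping
# taken from the comment above).
LEGACY_BY_LANGUAGE = {"zh_CN": "cp936", "zh_TW": "cp950"}

def candidate_encodings(language=None):
    """utf-8 first, then the legacy code page for `language`, if one is known."""
    candidates = ["utf-8"]
    legacy = LEGACY_BY_LANGUAGE.get(language)
    if legacy:
        candidates.append(legacy)
    return candidates

def unquote_with_fallback(quoted, language=None):
    """Percent-decode `quoted` and decode the bytes with the first encoding that fits."""
    raw = unquote_to_bytes(quoted)
    for encoding in candidate_encodings(language):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Nothing matched: keep going with replacement characters instead of crashing.
    return raw.decode("utf-8", errors="replace")

print(unquote_with_fallback("%E5%A4%B1%E6%81%8B33%E5%A4%A9"))       # utf-8 link
print(unquote_with_fallback("%D4%C9%C2%E4%D0%C7%B3%BD", "zh_CN"))   # GBK link
```

This remains a heuristic: some legacy byte sequences are also valid utf-8 and would be decoded as utf-8 without any error being raised.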

kai-chen-ustc commented 12 years ago

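@iambus Thanks to the author for sharing. I'd long wanted to write something like this, and a search finally showed that it already exists :)

I also have a character set problem here. I had already run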

lx config encoding utf8 

However, the names of the downloaded files are still GBK-encoded:

convmv -f gbk -t utf-8 *
mv "./[ʧ��33��].Love.Is.Not.Blind.2011.CHINESE.DVDRip.XviD-WZW.[dybee.com].avi""./[失恋33天].Love.Is.Not.Blind.2011.CHINESE.DVDRip.XviD-WZW.[dybee.com].avi"
$ lx diagnostic
default_encoding -> utf8
sys.getdefaultencoding() -> ascii
sys.getfilesystemencoding() -> UTF-8
print u'\u4e2d\u6587'.encode('utf-8') -> 中文
print u'\u4e2d\u6587'.encode('gbk') -> ����

$ locale
LANG=en_US.UTF-8
LANGUAGE=zh_CN:zh:en_US:en
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8

iambus commented 12 years ago

Could you provide the original link to the resource?

kai-chen-ustc commented 12 years ago

@iambus Thanks for the reply.

ed2k://|file|[%E5%A4%B1%E6%81%8B33%E5%A4%A9].Love.Is.Not.Blind.2011.CHINESE.DVDRip.XviD-WZW.[dybee.com].avi|728666112|BF418BC3AE4B4503071EE4CD07DC49C8|/

iambus commented 12 years ago

I can't reproduce this problem. Which download tool are you using (wget/asyn/aria2)? Are all resources with Chinese names saved as GBK?

kai-chen-ustc commented 12 years ago

Sorry, it worked fine after I downloaded again. It seems I had previously set the character set to gbk by mistake.
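
For files already saved with GBK-encoded names on a UTF-8 file system, the convmv call shown earlier in the thread does the repair. A minimal Python 3 sketch of the same idea follows; `repair_name` is a hypothetical helper, and it assumes Python exposes undecodable file-name bytes through the `surrogateescape` error handler, as it does for names returned by `os.listdir` on POSIX systems.

```python
def repair_name(name, actual_encoding="gbk", fs_encoding="utf-8"):
    """Recover the intended name of a file whose bytes were written in `actual_encoding`.

    `name` is the string Python shows for the file on a `fs_encoding` file system,
    where undecodable bytes appear as surrogate escapes.
    """
    # Recover the exact bytes stored on disk, then decode them as they were meant.
    raw = name.encode(fs_encoding, "surrogateescape")
    return raw.decode(actual_encoding)

# Simulate what a UTF-8 system shows for a GBK-encoded name.
on_disk = "[失恋33天].avi".encode("gbk")
shown = on_disk.decode("utf-8", "surrogateescape")
print(repair_name(shown))
```

Passing the repaired name to `os.rename` would then store it in the file system's own encoding. If the name was already valid UTF-8, decoding it as GBK produces different garbage, so the helper should only be applied to names known to be GBK.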