Closed Cologler closed 3 years ago
Thank you for the report. It would be nice if you could attach one of these files.
Uploading the file to GitHub always fails for me, so I uploaded it to OneDrive: http://1drv.ms/1O1lW9K
Got it. I'll check that out this week.
So it appears that BitComet somehow misinterprets the BitTorrent spec, namely:
All character string values are UTF-8 encoded.
And
encoding: (optional) the string encoding format used to generate the pieces part of the info dictionary in the .torrent metafile
This torrent is using the encoding param to encode all string values using GBK encoding.
I'm thinking of some kind of compatibility mode for such torrents. Maybe in the next version. But before it can be implemented we need to investigate the reasoning behind such a misinterpretation.
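For illustration, a compatibility mode along these lines could try UTF-8 first and only then fall back to the torrent's declared encoding. This is a minimal sketch, not torrentool's actual code: the `decode_text` helper and the sample filename are assumptions made up for the example.

```python
from typing import Optional, Union


def decode_text(raw: bytes, declared_encoding: Optional[str] = None) -> Union[str, bytes]:
    """Hypothetical helper: decode a bencoded string value leniently."""
    # The spec says string values are UTF-8, so try that first.
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        pass
    # Fall back to the torrent's optional top-level "encoding" value, if any.
    if declared_encoding:
        try:
            return raw.decode(declared_encoding)
        except (UnicodeDecodeError, LookupError):
            # LookupError covers encoding names Python does not know.
            pass
    # Nothing worked: hand the raw bytes back unchanged.
    return raw


# Sample value, invented for the example: a GBK-encoded filename.
gbk_name = '中文.txt'.encode('gbk')
print(decode_text(gbk_name, 'GBK'))
print(decode_text(gbk_name))
```

One caveat of this approach: a byte sequence in a legacy encoding can occasionally also be valid UTF-8, in which case the first attempt succeeds with the wrong text.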
This is still not fixed 5 years later...
This is still not fixed 5 years later...
And probably never will be, since:
But before it could be implemented we need to try to investigate reasoning for such a misinterpretation.
I switched to bencode.py since then, which works fine for my use case (getting total size of files in a torrent). The thing I fail to understand here is, why does this library need to meddle with the filenames, making some of them strings and others bytes? Keeping them all consistently either strings or bytes would solve the issue, wouldn't it?
I'm probably missing something, because the issue is about handling of files made with clients that do not store all character string values in UTF-8. Are we talking about the same thing?
With some print() statements added:
>>> from torrentool.api import Torrent
>>> Torrent.from_file('[nCore][xvidser_hun]Joban.Rosszban.2021.02.15.WEB-DLRip.x264.AAC2.0.Hun-TheMilkyWay.torrent')
<torrentool.torrent.Torrent object at 0x7f643be92c70>
>>> torrent = Torrent.from_file('[nCore][xvidser_hun]Joban.Rosszban.2021.02.15.WEB-DLRip.x264.AAC2.0.Hun-TheMilkyWay.torrent')
>>> torrent.total_size()
Joban.Rosszban.2021.02.15.WEB-DLRip.x264.AAC2.0.Hun-TheMilkyWay ['Joban.Rosszban.2021.02.15.WEB-DLRip.x264.AAC2.0.Hun-TheMilkyWay.mkv'] 176088721
Joban.Rosszban.2021.02.15.WEB-DLRip.x264.AAC2.0.Hun-TheMilkyWay ['Joban.Rosszban.2021.02.15.WEB-DLRip.x264.AAC2.0.Hun-TheMilkyWay.nfo'] 18570
Joban.Rosszban.2021.02.15.WEB-DLRip.x264.AAC2.0.Hun-TheMilkyWay [b'J\xf3ban Rosszban [2005] Bor\xedt\xf3.jpg'] 116213
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/nyuszika7h/.pyenv/versions/ncannounce/lib/python3.9/site-packages/torrentool/torrent.py", line 107, in total_size
    return reduce(lambda prev, curr: prev + curr[1], self.files, 0)
  File "/home/nyuszika7h/.pyenv/versions/ncannounce/lib/python3.9/site-packages/torrentool/torrent.py", line 97, in files
    files.append(TorrentFile(join(base, *f['path']), f['length']))
  File "/home/nyuszika7h/.pyenv/versions/3.9.1/lib/python3.9/posixpath.py", line 90, in join
    genericpath._check_arg_types('join', a, *p)
  File "/home/nyuszika7h/.pyenv/versions/3.9.1/lib/python3.9/genericpath.py", line 155, in _check_arg_types
    raise TypeError("Can't mix strings and bytes in path components") from None
TypeError: Can't mix strings and bytes in path components
>>>
Here's the .torrent file: https://femto.pw/ui8x
The filename that contains an accented character is suddenly bytes instead of str like the rest, which seems to cause the error. If they were all str or all bytes, this wouldn't happen as far as I understand.
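The failure can be reproduced without any torrent file, since it comes from the path join itself. The sketch below uses the names from the traceback above. The `to_bytes` helper is hypothetical and only shows one way of keeping every component the same type.

```python
import posixpath

base = 'Joban.Rosszban.2021.02.15.WEB-DLRip.x264.AAC2.0.Hun-TheMilkyWay'
parts = [b'J\xf3ban Rosszban [2005] Bor\xedt\xf3.jpg']

# Mixing a str base with a bytes component is what raises the TypeError.
error = None
try:
    posixpath.join(base, *parts)
except TypeError as exc:
    error = str(exc)
print(error)


def to_bytes(component, encoding='utf-8'):
    """Hypothetical helper: coerce a path component to bytes."""
    if isinstance(component, bytes):
        return component
    return component.encode(encoding)


# With every component as bytes the join succeeds.
joined = posixpath.join(to_bytes(base), *(to_bytes(p) for p in parts))
print(joined)
```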
So that's the same issue.
ruTorrent (PHP Class - Adrien Gibrat) created this torrent file where string value of J\xf3ban Rosszban [2005] Bor\xedt\xf3.jpg is not UTF-8 as per spec (should rather be J\xc3\xb3ban Rosszban [2005] Bor\xc3\xadt\xc3\xb3.jpg).
So torrentool follows the spec and expects strings (including filepaths) to be UTF-8 encoded. If it fails to decode a value, it keeps the bytes as they are.
Moreover, in contrast to BitComet used by the issue starter, ruTorrent (PHP Class - Adrien Gibrat) doesn't even give any hint about which encoding is used.
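The byte values quoted above can be checked directly. In this sketch the choice of Latin-1 as the source encoding is a guess. The file itself does not declare one, and other single-byte encodings map these particular bytes to the same characters.

```python
raw = b'J\xf3ban Rosszban [2005] Bor\xedt\xf3.jpg'

# The stored bytes are not valid UTF-8, so a strict decode fails.
try:
    raw.decode('utf-8')
    is_utf8 = True
except UnicodeDecodeError:
    is_utf8 = False

# Assumption: the creating client wrote the name in Latin-1.
name = raw.decode('latin-1')

# Re-encoding the same text as UTF-8 gives the bytes the spec expects.
utf8_bytes = name.encode('utf-8')

print(is_utf8)
print(name)
print(utf8_bytes)
```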
If they were all str or all bytes, this wouldn't happen as far as I understand.
Rather, if the strings were UTF-8 this wouldn't happen, yes. We could try to apply additional postprocessing, for example skipping bytes we cannot decode, but in many cases it'll leave us with empty strings.
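To make the empty-string concern concrete, here is a small sketch using Python's `errors='ignore'` decoding mode. The GBK sample name is invented for the example. The second name is the one from the traceback.

```python
# Sample GBK-encoded name, invented for the example.
gbk_name = '中文'.encode('gbk')

# Every byte is invalid as UTF-8 here, so ignoring errors drops them all.
ignored = gbk_name.decode('utf-8', errors='ignore')

# For mostly-ASCII names only the accented letters are lost.
hungarian = b'J\xf3ban Rosszban [2005] Bor\xedt\xf3.jpg'
partly_kept = hungarian.decode('utf-8', errors='ignore')

print(repr(ignored))
print(repr(partly_kept))
```

Either way the resulting name no longer matches the file on disk, so this kind of postprocessing only helps operations that never use the path.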
Maybe the file is non-compliant, but when I just want to get the total size of files in a torrent the encoding is irrelevant. Maybe an "override encoding" or "leave everything as bytes" option could be helpful (though in my case bencode.py does the job and it's not much more extra work for me to get the total size of all files). Or at least improve the error message, like "unable to decode [filename]". I'm guessing it leaves it as bytes in case some operations don't need to care about that part, but as far as I can see size shouldn't either.
I'm guessing it leaves it as bytes in case some operations don't need to care about that part, but as far as I can see size shouldn't either.
Yes, maybe we could special-case the size operation. I'll try to look into it in a week.
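As a rough illustration of why size does not depend on filename encoding, the sketch below decodes bencoded data while keeping every string as bytes and then sums the length fields. It is not the solution that ended up in torrentool and not the bencode.py API. The `bdecode`, `_decode` and `total_size` names and the sample metafile are assumptions for the example.

```python
def _decode(data: bytes, pos: int):
    """Decode one bencoded value starting at pos; return (value, next_pos)."""
    lead = data[pos:pos + 1]
    if lead == b'i':  # integer: i<digits>e
        end = data.index(b'e', pos)
        return int(data[pos + 1:end]), end + 1
    if lead == b'l':  # list: l<items>e
        pos += 1
        items = []
        while data[pos:pos + 1] != b'e':
            item, pos = _decode(data, pos)
            items.append(item)
        return items, pos + 1
    if lead == b'd':  # dictionary: d<key><value>...e
        pos += 1
        result = {}
        while data[pos:pos + 1] != b'e':
            key, pos = _decode(data, pos)
            value, pos = _decode(data, pos)
            result[key] = value
        return result, pos + 1
    # byte string: <length>:<bytes>, kept as bytes with no decoding attempt
    colon = data.index(b':', pos)
    length = int(data[pos:colon])
    start = colon + 1
    return data[start:start + length], start + length


def bdecode(data: bytes):
    """Hypothetical minimal decoder that leaves all strings as bytes."""
    value, _ = _decode(data, 0)
    return value


def total_size(meta: dict) -> int:
    """Sum file lengths without ever touching the path components."""
    info = meta[b'info']
    if b'files' in info:  # multi-file torrent
        return sum(entry[b'length'] for entry in info[b'files'])
    return info[b'length']  # single-file torrent


# Tiny made-up metafile with one non-UTF-8 path component.
sample = (
    b'd4:infod5:filesl'
    b'd6:lengthi10e4:pathl5:a.mkvee'
    b'd6:lengthi5e4:pathl6:J\xf3.jpgee'
    b'eee'
)
print(total_size(bdecode(sample)))
```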
Implemented an alternative solution.
You may want to give it a try in master.
Yeah, that seems to work, it returned the total size now.
1.1.1 is out. Considered closed. Feel free to reopen if required.
Error code:
I added this print statement
print(type(base), type(*f['path']))
after line 34; it prints: