Failed to parse torrents with all string values encoded with `encoding`

Cologler commented 8 years ago

Error output:

Traceback (most recent call last):
  File "...\torrentool-0.2.0\torrentool-0.2.0\Untitled-1.py", line 15, in <module>
    print(torrent.total_size())
  File "...\torrentool-0.2.0\torrentool-0.2.0\torrentool\torrent.py", line 45, in total_size
    return reduce(lambda prev, curr: prev + curr[1], self.files, 0)
  File "...\torrentool-0.2.0\torrentool-0.2.0\torrentool\torrent.py", line 35, in files
    files.append((join(base, *f['path']), f['length']))
  File "...\AppData\Local\Programs\Python\Python35\lib\ntpath.py", line 113, in join
    genericpath._check_arg_types('join', path, *paths)
  File "...\AppData\Local\Programs\Python\Python35\lib\genericpath.py", line 145, in _check_arg_types
    raise TypeError("Can't mix strings and bytes in path components") from None
TypeError: Can't mix strings and bytes in path components
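
The error does not depend on the torrent itself. It comes from the standard library's path join whenever `str` and `bytes` components are mixed. A minimal sketch of that behaviour, using `posixpath.join` (`ntpath.join` performs the same check); the helper `try_join` and the sample names are hypothetical:

```python
import posixpath


def try_join(base, parts):
    """Return the joined path, or the TypeError message if types are mixed."""
    try:
        return posixpath.join(base, *parts)
    except TypeError as exc:
        return str(exc)


# All-str and all-bytes components join without problems.
print(try_join("album", ["cover.jpg"]))
print(try_join(b"album", [b"cover.jpg"]))
# Mixing the two triggers the same TypeError as in the traceback.
print(try_join(b"album", ["cover.jpg"]))
```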

I added `print(type(base), type(*f['path']))` after line 34; it prints:

<class 'bytes'> <class 'bytes'>
<class 'bytes'> <class 'str'>

idlesign commented 8 years ago

Thank you for the report. Would be nice if you attach one of such files.

Cologler commented 8 years ago

Uploading the file to GitHub keeps failing, so I uploaded it to OneDrive: http://1drv.ms/1O1lW9K

idlesign commented 8 years ago

Got it. I'll check that out this week.

idlesign commented 8 years ago

So it appears that BitComet somehow misinterprets the BitTorrent spec, namely:

All character string values are UTF-8 encoded.

And

encoding: (optional) the string encoding format used to generate the pieces part of the info dictionary in the .torrent metafile

This torrent uses the encoding param to encode all string values in GBK.

I'm thinking of some kind of compatibility mode for such torrents. Maybe in the next version. But before it could be implemented we need to try to investigate reasoning for such a misinterpretation.
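
One way such a compatibility mode might look, as a rough sketch only and not torrentool's actual code: try UTF-8 first, as the spec requires, then fall back to whatever the torrent's `encoding` key declares. The function name `decode_text` and its parameters are hypothetical.

```python
def decode_text(raw, declared_encoding=None):
    """Decode a bencoded byte string, preferring UTF-8 as the spec requires.

    Falls back to the codec named by the torrent's optional `encoding` key,
    and returns the original bytes if nothing works.
    """
    candidates = ["utf-8"]
    if declared_encoding:
        candidates.append(declared_encoding)
    for codec in candidates:
        try:
            return raw.decode(codec)
        except (UnicodeDecodeError, LookupError):
            # Wrong codec for these bytes, or a codec name Python doesn't know.
            continue
    return raw  # undecodable: keep the bytes untouched


# Hypothetical file name stored in GBK, as BitComet appears to do.
gbk_raw = "\u4e2d\u6587.txt".encode("gbk")
print(decode_text(gbk_raw, "GBK"))
print(decode_text(gbk_raw))
```

A caveat of this design: some non-UTF-8 byte sequences happen to be valid UTF-8, so trying UTF-8 first can occasionally produce a wrong but successful decode.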

ghost commented 3 years ago

This is still not fixed 5 years later...

idlesign commented 3 years ago

This is still not fixed 5 years later...

And probably never will be, since:

But before it could be implemented we need to try to investigate reasoning for such a misinterpretation.

ghost commented 3 years ago

I switched to bencode.py since then which works fine for my use case (getting total size of files in a torrent). The thing I fail to understand here is, why does this library need to meddle with the filenames, making some of them strings and others bytes? Keeping them all consistently either strings or bytes would solve the issue, wouldn't it?

idlesign commented 3 years ago

I'm probably missing something, because the issue is about handling of files made with clients that do not store all character string values in UTF-8. Are we talking about the same thing?

ghost commented 3 years ago

With some print() statements added:

>>> from torrentool.api import Torrent
>>> Torrent.from_file('[nCore][xvidser_hun]Joban.Rosszban.2021.02.15.WEB-DLRip.x264.AAC2.0.Hun-TheMilkyWay.torrent')
<torrentool.torrent.Torrent object at 0x7f643be92c70>
>>> torrent = Torrent.from_file('[nCore][xvidser_hun]Joban.Rosszban.2021.02.15.WEB-DLRip.x264.AAC2.0.Hun-TheMilkyWay.torrent')
>>> torrent.total_size()
Joban.Rosszban.2021.02.15.WEB-DLRip.x264.AAC2.0.Hun-TheMilkyWay ['Joban.Rosszban.2021.02.15.WEB-DLRip.x264.AAC2.0.Hun-TheMilkyWay.mkv'] 176088721
Joban.Rosszban.2021.02.15.WEB-DLRip.x264.AAC2.0.Hun-TheMilkyWay ['Joban.Rosszban.2021.02.15.WEB-DLRip.x264.AAC2.0.Hun-TheMilkyWay.nfo'] 18570
Joban.Rosszban.2021.02.15.WEB-DLRip.x264.AAC2.0.Hun-TheMilkyWay [b'J\xf3ban Rosszban [2005] Bor\xedt\xf3.jpg'] 116213
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/nyuszika7h/.pyenv/versions/ncannounce/lib/python3.9/site-packages/torrentool/torrent.py", line 107, in total_size
    return reduce(lambda prev, curr: prev + curr[1], self.files, 0)
  File "/home/nyuszika7h/.pyenv/versions/ncannounce/lib/python3.9/site-packages/torrentool/torrent.py", line 97, in files
    files.append(TorrentFile(join(base, *f['path']), f['length']))
  File "/home/nyuszika7h/.pyenv/versions/3.9.1/lib/python3.9/posixpath.py", line 90, in join
    genericpath._check_arg_types('join', a, *p)
  File "/home/nyuszika7h/.pyenv/versions/3.9.1/lib/python3.9/genericpath.py", line 155, in _check_arg_types
    raise TypeError("Can't mix strings and bytes in path components") from None
TypeError: Can't mix strings and bytes in path components
>>>

Here's the .torrent file: https://femto.pw/ui8x

The filename that contains an accented character is suddenly bytes instead of str like the rest, which seems to cause the error. If they were all str or all bytes, this wouldn't happen as far as I understand.
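
To illustrate the "all str" idea, here is a sketch that normalizes every component to `str` before joining. It is not what torrentool does; the helpers `to_text` and `join_as_text` are hypothetical. It relies on the standard `surrogateescape` error handler, which maps undecodable bytes to lone surrogates so the original bytes can be recovered later.

```python
import posixpath


def to_text(component):
    """Return component as str; undecodable bytes become lone surrogates."""
    if isinstance(component, bytes):
        return component.decode("utf-8", errors="surrogateescape")
    return component


def join_as_text(base, parts):
    """Join path components after converting each of them to str."""
    return posixpath.join(to_text(base), *(to_text(p) for p in parts))


path = join_as_text("dir", [b"J\xf3.jpg"])
# The original bytes survive a round trip through the same error handler.
restored = path.encode("utf-8", errors="surrogateescape")
```

The resulting string is safe to join and compare, but printing it or writing it to a strict UTF-8 stream may raise `UnicodeEncodeError`, so this only moves the problem to the point where the name is displayed.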

idlesign commented 3 years ago

So that's the same issue.

ruTorrent (PHP Class - Adrien Gibrat) created this torrent file, in which the string value J\xf3ban Rosszban [2005] Bor\xedt\xf3.jpg is not UTF-8 as the spec requires (it should rather be J\xc3\xb3ban Rosszban [2005] Bor\xc3\xadt\xc3\xb3.jpg).

So torrentool follows the spec and expects strings (including file paths) to be UTF-8 encoded. If decoding fails, it keeps the bytes as they are.
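
The difference between the two byte sequences can be checked directly. This sketch assumes the name was written in a single-byte Latin codepage; ISO-8859-1 is used here, which is a guess, since the torrent declares no encoding.

```python
raw = b"J\xf3ban Rosszban [2005] Bor\xedt\xf3.jpg"

# The stored bytes are not valid UTF-8, so a strict decode fails.
try:
    raw.decode("utf-8")
    is_utf8 = True
except UnicodeDecodeError:
    is_utf8 = False

# Assumption: the creator used ISO-8859-1 (Latin-1) for the file name.
text = raw.decode("latin-1")
# Re-encoding the text as UTF-8 gives the spec-compliant byte sequence.
spec_compliant = text.encode("utf-8")

print(is_utf8)
print(spec_compliant)
```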

Moreover, in contrast to BitComet (used by the issue starter), ruTorrent (PHP Class - Adrien Gibrat) doesn't even give a hint about which encoding is used.

If they were all str or all bytes, this wouldn't happen as far as I understand.

Rather, if strings were UTF-8 this wouldn't happen, yes. We could try to apply additional postprocessing, for example skipping bytes we cannot decode, but in many cases it'll leave us with empty strings.
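
A small sketch of what skipping undecodable bytes would produce, using the standard `errors="ignore"` handler. The helper `decode_lossy` and the GBK sample name are hypothetical.

```python
def decode_lossy(raw):
    """Decode as UTF-8, silently dropping bytes that are not valid UTF-8."""
    return raw.decode("utf-8", errors="ignore")


# The name from the ruTorrent file: only the accented letters are lost.
latin_name = b"J\xf3ban Rosszban [2005] Bor\xedt\xf3.jpg"
# A hypothetical two-character Chinese name stored in GBK: nothing survives.
gbk_name = "\u4e2d\u6587".encode("gbk")

print(repr(decode_lossy(latin_name)))
print(repr(decode_lossy(gbk_name)))
```

For mostly ASCII names the result is still recognizable, but for names written entirely in a multi-byte legacy encoding it can collapse to an empty string, and distinct names can collide.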

ghost commented 3 years ago

Maybe the file is non-compliant, but when I just want to get the total size of files in a torrent the encoding is irrelevant. Maybe an "override encoding" or "leave everything as bytes" option could be helpful (though in my case bencode.py does the job and it's not much more extra work for me to get the total size of all files). Or at least improve the error message, like "unable to decode [filename]". I'm guessing it leaves it as bytes in case some operations don't need to care about that part, but as far as I can see size shouldn't either.
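
For the "leave everything as bytes" idea, a self-contained sketch of a minimal bencode decoder that never decodes strings, plus a size calculation on top of it. This is neither torrentool nor bencode.py code; `bdecode`, `total_size`, and the sample data are hypothetical, and the decoder does no validation of malformed input.

```python
def bdecode(data, pos=0):
    """Decode one bencoded value starting at pos; return (value, next_pos).

    Byte strings are returned as bytes and are never decoded to str.
    """
    lead = data[pos:pos + 1]
    if lead == b"i":  # integer: i<digits>e
        end = data.index(b"e", pos)
        return int(data[pos + 1:end]), end + 1
    if lead == b"l":  # list: l<items>e
        items, pos = [], pos + 1
        while data[pos:pos + 1] != b"e":
            item, pos = bdecode(data, pos)
            items.append(item)
        return items, pos + 1
    if lead == b"d":  # dictionary: d<key><value>...e
        result, pos = {}, pos + 1
        while data[pos:pos + 1] != b"e":
            key, pos = bdecode(data, pos)
            result[key], pos = bdecode(data, pos)
        return result, pos + 1
    # byte string: <length>:<payload>
    colon = data.index(b":", pos)
    length = int(data[pos:colon])
    start = colon + 1
    return data[start:start + length], start + length


def total_size(torrent_bytes):
    """Sum the file lengths; file names are never inspected."""
    meta, _ = bdecode(torrent_bytes)
    info = meta[b"info"]
    if b"files" in info:  # multi-file torrent
        return sum(f[b"length"] for f in info[b"files"])
    return info[b"length"]  # single-file torrent


# Hypothetical two-file torrent; the second name is not valid UTF-8.
sample = (
    b"d4:infod5:filesl"
    b"d6:lengthi10e4:pathl5:a.txtee"
    b"d6:lengthi5e4:pathl6:J\xf3.jpgee"
    b"e4:name3:diree"
)
print(total_size(sample))
```

Because the size only depends on the `length` fields, the encoding of the paths never comes into play.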

idlesign commented 3 years ago

I'm guessing it leaves it as bytes in case some operations don't need to care about that part, but as far as I can see size shouldn't either.

Yes, maybe we could special-case the size operation. I'll try to look into it in a week.

idlesign commented 3 years ago

Implemented an alternative solution. You may want to give it a try in master.

ghost commented 3 years ago

Yeah, that seems to work, it returned the total size now.

idlesign commented 3 years ago

1.1.1 is out. Considered closed. Feel free to reopen if required.

idlesign / torrentool
