Torrents with non-UTF-8 strings are improperly decoded

squidneypoitier commented 7 years ago

Versions

qBittorrent version and Operating System: 3.3.10, Arch Linux

libtorrent: 1.1.1.0

Qt: 5.7.1

What is the problem:

Cross-posting this from arvidn/libtorrent#1780, because this project is affected and likely something needs to be done here even if changes are made upstream.

Quoting from that thread for convenience:

Some .torrent files (possibly not made with libtorrent) contain filenames and names encoded using a string encoding other than UTF-8. However, libtorrent.torrent_info seems to assume all byte strings are UTF-8 encoded, and will in fact silently convert bytes that do not form a valid UTF-8 sequence into a replacement character (seems to be _ from what I see).

Transmission properly decodes iso8859_2 strings, but qBittorrent fails on this score.
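
To make the failure mode in the quote concrete, here is a small standard-library sketch (no libtorrent involved; the variable names are illustrative only). It shows that the ISO 8859-2 bytes for a name like Böser.txt are not valid UTF-8, so a strict decode fails and a lenient decode substitutes a replacement character. Note that Python substitutes U+FFFD here, whereas the torrent_info output in this report shows an underscore.

```python
# Standard library only; names are illustrative, not part of libtorrent.
raw = 'Böser.txt'.encode('iso8859_2')  # the 'ö' becomes the single byte 0xF6

try:
    raw.decode('utf-8')  # strict decoding rejects the lone 0xF6 byte
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

# Lenient decoding swaps the invalid byte for U+FFFD instead of raising
lenient = raw.decode('utf-8', errors='replace')

# Decoding with the codec the name was actually written in recovers it
correct = raw.decode('iso8859_2')

print(is_valid_utf8)
print(lenient)
print(correct)
```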

What is the expected behavior:

Copied from the libtorrent issue, here is an MWE using a torrent with filenames encoded in iso8859_2:

import os

import libtorrent as lt

E_TORR_LOC = 'example.torrent'
with open(E_TORR_LOC, 'rb') as f:
    torr_in = lt.bdecode(f.read())

info = lt.torrent_info(E_TORR_LOC)

# Decode the raw bencoded names ourselves, using the encoding they were written in
base_name = torr_in[b'info'][b'name'].decode()
fnames_raw = [f[b'path'][0].decode('iso8859_2') for f in torr_in[b'info'][b'files']]
fpaths_raw = [os.path.join(base_name, fname) for fname in fnames_raw]

fpaths_info = [f.path for f in info.files()]

print('{: ^30} | {: ^30}'.format('bdecode', 'torrent_info'))
print('-' * 63)

for fraw, finfo in zip(fpaths_raw, fpaths_info):
    print('{: ^30} | {: ^30}'.format(fraw, finfo))

Here is the result, with the proper behavior on the left and qBittorrent's behavior on the right:

           bdecode             |          torrent_info         
---------------------------------------------------------------
  example_torrent/Böser.txt    |     example_torrent/B_.txt    
  example_torrent/ümlaut.txt   |   example_torrent/_mlaut.txt

Steps to reproduce:

The following Python 3 script will create a minimal working .torrent file that exhibits the improper behavior (also copied from the libtorrent thread):

#! /usr/bin/env python3

# Make the data for a torrent containing some data with unicode filenames
import os

TORR_DIR = 'example_torrent'
TORR_FILES = {
    'Böser.txt': 'Filename contains umlauts',
    'ümlaut.txt': 'Plain unicode'
}
E_TORR_LOC = 'example.torrent'

if not os.path.exists(TORR_DIR):
    os.makedirs(TORR_DIR)

for fname, contents in TORR_FILES.items():
    fpath = os.path.join(TORR_DIR, fname)
    with open(fpath, 'w') as f:
        f.write(contents)

# Create the torrent itself
import libtorrent as lt

fs = lt.file_storage()
lt.add_files(fs, './' + TORR_DIR)

t = lt.create_torrent(fs)
torr_out = t.generate()

# Since libtorrent bindings will use UTF-8 to encode, we need to
# modify the output before writing it to simulate use of another encoding
for file_info in torr_out[b'info'][b'files']:
    path_to_mod = file_info[b'path']
    path_to_mod[0] = path_to_mod[0].decode().encode('iso8859_2')

with open(E_TORR_LOC, 'wb') as f:
    f.write(lt.bencode(torr_out))

Extra info (if any):

This issue may be related to #4479.

sledgehammer999 commented 7 years ago

Thanks for the report. But this is totally a libtorrent "issue". qBittorrent doesn't do the decoding. In fact, all the bittorrent stuff is handled by libtorrent. We just query it for info and set the appropriate settings for it to work. And as people already noted in the libtorrent issue you opened, non-UTF-8 data isn't actually compliant.

squidneypoitier commented 7 years ago

@sledgehammer999 Actually, I think this is really more of a qBittorrent issue than a libtorrent issue. As I mentioned in the thread over there, I think that libtorrent should expose a mechanism for detecting these sorts of errors and a mechanism for getting at the raw bytestring, but beyond that it would likely be a bad idea for something so low-level to try to get "smart" about encodings.

Regarding the compliance issue: these torrents are not compliant with the current standard, but older, legacy torrents still exist among long-lived torrents. As I mentioned, Transmission handles these torrents just fine, likely because it uses some heuristic method to detect incorrect encodings.

My suggested solution is something in between what happens now and what Transmission is doing. I think that qBittorrent should detect when a torrent is improperly encoded, then in the "add torrent" dialog, it should show an "encoding warning" - possibly as a pop-up with a list of strings that failed to properly decode. Then the user can be presented with a list of alternate encodings (possibly ordered by whatever heuristic Transmission is using, like putting iso-8859-1 high on the list, possibly filtered by ones that decode the strings without issue). This will simultaneously solve the encoding problem and put users on notice that the torrents they are seeding / downloading are non-compliant (and prompt them to complain about it to whoever created the torrent, hopefully).
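
A rough sketch of the detection and filtering steps suggested above, using only the Python standard library. The function names and the candidate list are hypothetical, and this is not how Transmission or qBittorrent actually work; it only illustrates "find the strings that fail to decode as UTF-8, then keep the encodings that decode all of them".

```python
def failed_strings(raw_strings):
    """Return the byte strings that are not valid UTF-8."""
    bad = []
    for raw in raw_strings:
        try:
            raw.decode('utf-8')
        except UnicodeDecodeError:
            bad.append(raw)
    return bad


def usable_encodings(raw_strings, candidates):
    """Return the candidates, in the given order, that decode every string."""
    usable = []
    for enc in candidates:
        try:
            for raw in raw_strings:
                raw.decode(enc)
        except UnicodeDecodeError:
            continue  # at least one string is invalid in this encoding
        usable.append(enc)
    return usable


# Names from the example torrent in this report, plus one plain ASCII name
names = ['Böser.txt'.encode('iso8859_2'),
         'ümlaut.txt'.encode('iso8859_2'),
         b'plain.txt']

bad = failed_strings(names)
if bad:
    # Hypothetical candidate list; the ordering heuristic is left open
    choices = usable_encodings(bad, ['ascii', 'iso8859_1', 'iso8859_2'])
    for enc in choices:
        print(enc, [raw.decode(enc) for raw in bad])
```

One caveat with this approach: single-byte encodings such as iso8859_1 accept any byte sequence, so the filter narrows the list without picking a winner. That is why showing the decoded previews and letting the user choose still seems necessary.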

In terms of implementation, this can already be done a bit hackily without any changes in libtorrent by doing what I've done above - load the file paths from libtorrent's interface, then load them again from the bdecode of the .torrent file and compare the two to make sure that they match. Ideally, libtorrent will expose some interface for detecting decoding errors and it would be unnecessary to do the hack.
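
As a minimal sketch of that comparison, assuming the raw names and the reported paths have already been collected the way the MWE above does it (raw bytes from lt.bdecode, paths from lt.torrent_info): find_mismatches is a hypothetical helper, it only looks at a single path component per file like the MWE, and the sample values are taken from the output table in this report plus one made-up ASCII name.

```python
import os


def find_mismatches(base_name, raw_names, reported_paths):
    """Return (raw_name, reported_path) pairs where the reported path does not
    match a strict UTF-8 decode of the raw bytes from the .torrent file."""
    mismatches = []
    for raw, reported in zip(raw_names, reported_paths):
        try:
            expected = os.path.join(base_name, raw.decode('utf-8'))
        except UnicodeDecodeError:
            expected = None  # the raw bytes are not valid UTF-8 at all
        if expected != reported:
            mismatches.append((raw, reported))
    return mismatches


base = 'example_torrent'
# Raw names as they would appear in the bdecoded dictionary
raw_names = [b'B\xf6ser.txt', b'\xfcmlaut.txt', b'ok.txt']
# Paths as torrent_info reported them in this issue ('ok.txt' is made up)
reported = [os.path.join(base, n) for n in ('B_.txt', '_mlaut.txt', 'ok.txt')]

mismatches = find_mismatches(base, raw_names, reported)
for raw, path in mismatches:
    print(raw, '->', path)
```

Any pair returned here is a string that would need the encoding warning; an empty list means the torrent's names decoded cleanly as UTF-8.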
