torrust / torrust-index

This repository serves as the backend for the Torrust Index project.
https://torrust.com
GNU Affero General Public License v3.0
53 stars 19 forks source link

Improve error message for v2-only torrents #300

Open josecelano opened 1 year ago

josecelano commented 1 year ago

If you upload a torrent with only the field for version 2, you get a generic "Internal Server Error".

We should show a message like "Invalid torrent. BitTorrent version 2 is not supported."

josecelano commented 1 year ago

I did some work related to parsing the torrent file in this repo:

https://github.com/torrust/torrust-parse-torrent

One of the things I wanted to test was a more verbose parser:

https://github.com/torrust/torrust-parse-torrent/blob/main/src/utils/parse_torrent_verbose.rs

The idea is to get an exact error message is one mandatory field is missing.

josecelano commented 1 year ago

Anyway, I opened this issue becuase I wanted to show a different message if the user is trying to upload an only-V2 torrent (not hybrid).

mario-nt commented 12 months ago

I think we can have two issues, one for the v2 torrents and other for the hybrids (in case we want to support them, as they don't seem to be very popular, both v2 and hybrids).

An important feat is to give more concrete errors on mandatory fields.

mario-nt commented 12 months ago

When the user uploads a torrent we have three different scenarios:

The torrent file is a V1 torrent The torrent file is a V2 torrent The torrent file is a Hybrid torrent

Right now, we only support V1 torrents, and we create an info hash which can be different from the original info hash as we only take into account some of the fields (mandatory fields), discarding the rest.

If the user tries to upload a non valid torrent, a generic error message appears, but it would be a good idea to show more concrete error messages.

To do that, we need to check if the torrent is a V1 | V2 | Hybrid and then check for mandatory fields.

First, I want to work on creating concrete error messages for the V2 torrents, as they are more popular than hybrids (even though both are not widely used).

Then work on creating concrete messages for hybrid torrents.

And lastly, work on concrete messages for mandatory fields (relates to https://github.com/torrust/torrust-parse-torrent that Jose made).

For the V2 torrents, I want to propose two ways of implementing the logic to the team before writing the code.

I came up with these solutions:

  1. Creating a new struct for V2 torrents with the mandatory fields, and after deserialize it, checking if the info dictionary field "meta version" has a value of 2, because V2 torrents have that field with a value of 2.

  2. Checking the length of the info hash of the torrent file, and also checking if the torrent file has one or two info hashes, as hybrid torrents have two info hashes, one for the V1 torrent (SHA-1, 32-40 Hexadecimal characters) and another for the V2 torrent (SHA2/256 with 64 hexadecimal characters).

The Bittorrent client QBitTorrent uses the info hash for differentiate V1 or V2 torrents:

https://github.com/qbittorrent/qBittorrent/blob/298e4ba852a7764dc4e5783878f51b93581937a4/src/base/bittorrent/torrentdescriptor.cpp#L49-L88

namespace
{
    // BEP9 Extension for Peers to Send Metadata Files

    bool isV1Hash(const QString &string)
    {
        // There are 2 representations for BitTorrent v1 info hash:
        // 1. 40 chars hex-encoded string
        //      == 20 (SHA-1 length in bytes) * 2 (each byte maps to 2 hex characters)
        // 2. 32 chars Base32 encoded string
        //      == 20 (SHA-1 length in bytes) * 1.6 (the efficiency of Base32 encoding)
        const int V1_HEX_SIZE = SHA1Hash::length() * 2;
        const int V1_BASE32_SIZE = SHA1Hash::length() * 1.6;

        return ((((string.size() == V1_HEX_SIZE)) && !string.contains(QRegularExpression(u"[^0-9A-Fa-f]"_s)))
                || ((string.size() == V1_BASE32_SIZE) && !string.contains(QRegularExpression(u"[^2-7A-Za-z]"_s))));
    }

    bool isV2Hash(const QString &string)
    {
        // There are 1 representation for BitTorrent v2 info hash:
        // 1. 64 chars hex-encoded string
        //      == 32 (SHA-2 256 length in bytes) * 2 (each byte maps to 2 hex characters)
        const int V2_HEX_SIZE = SHA256Hash::length() * 2;

        return (string.size() == V2_HEX_SIZE) && !string.contains(QRegularExpression(u"[^0-9A-Fa-f]"_s));
    }

    lt::load_torrent_limits loadTorrentLimits()
    {
        const auto *pref = Preferences::instance();

        lt::load_torrent_limits limits;
        limits.max_buffer_size = static_cast<int>(pref->getTorrentFileSizeLimit());
        limits.max_decode_depth = pref->getBdecodeDepthLimit();
        limits.max_decode_tokens = pref->getBdecodeTokenLimit();

        return limits;
    }
}

I think the second option is more appropriate, and it might even be faster as the index only has to check if there is one or two info hashes, and then the length of the info hash. While the first approach, requires deserializing all the torrent fields. But this is only an assumption and benchmarks might be needed, and performance wise both approaches should not have a big hit in the overall performance of the index.

mario-nt commented 12 months ago

What is your opinion on the above? @da2ce7 @WarmBeer @josecelano

josecelano commented 12 months ago

What is your opinion on the above? @da2ce7 @WarmBeer @josecelano

Hi @mario-nt, correct me if I'm wrong:

  1. In all of them (v1, v2 and hybrid) you calculate the info-hash by hashing the whole info key, but using a different hash function.
  2. We need to know first if the torrent is v2 or hybrid to calculate the info-hash with the SHA-256 function. We first check the content and then we calculate SHA-1 for v1 and SHA256 for v2 and hybrid torrents. So even if you use the hash to know the version, so need to know first the version.

Other considerations

I would like to avoid making the upload even slower adding an extra parsing or hash function if possible. Although I'm not especially worried about performance for this endpoint. But maybe I should. We have been testing with small torrents, for big torrent files parsing twice can take too long.

So I would prefer not to parse the torrent with a V2 struct just to know if it fits into a V2 torrent.

Alternative approach 1

We are already parsing the torrent twice:

  1. First time to calculate the actual info-hash (SHA-1) because we want to include all fields in the info key.-
  2. Secondly to actually map the torrent file into the Rust Torrent struct.

This is just an idea.

The function to calculate the actual info-hash actually parses the whole file into this struct:

struct ParsedInfoDictFromMetainfoFile {
    pub info: Value,
}

Because we only want the info key. If the info key contains the meta version we can assume that the torrent is either a v1 or hybrid torrent.

The calculate_info_hash could be renamed to preparse_torrent and return both:

I think we only need to take the info: Value and extract the meta version from it. So that we do not need to parse again the torrent file.

Finally in the decode_and_validate_torrent_file, when we parse the torrent to build the Torrent strcut if we have an error we can return a new error:

With a new error, we can change the error messages to something like:

Alternative approach 2

On the other hand, I think the "Alternative approach 1" is too complex. I would implement the straight one (try to parse both TorrentV1 and Torrent V2 structs) even if te performance is worse. Maybe it's just a couple of milliseconds more and it does not worth it adding that complexity.

Alternative approach 3

We create a new strcut abstraction GenericTorrent where we include all the possible fields in a torrent file: v1, v2, non-standard fields, etcetra. We only parse this once, and them we try to build an v1 torrent from this generic strcut. If we cannot build the torrent v1 then we can check without parsing again if the torrent is v2. In fact, this solution can also be used to show more concrete messages when a mandatory field is missing. Maybe it's even a better option than the one I mentioned avobe.

Besides, when we add the log probably we are going to need something like this to extract all the info from the original torrent file.

Conclusion

I would explore alternative approach 3 a little bit. If it not feasible for some reason I can't see now I would go for parsing again for TorrentV2 strcut.

josecelano commented 12 months ago

Article about BitTorrent V2: https://blog.libtorrent.org/2020/09/bittorrent-v2/

josecelano commented 8 months ago

I'm moving this to the next big milestone because it's not clear to me if we should support V2 torrents. It looks like other projects are even removing support for V2 torrents.