Open josecelano opened 1 year ago
I did some work related to parsing the torrent file in this repo:
https://github.com/torrust/torrust-parse-torrent
One of the things I wanted to test was a more verbose parser:
https://github.com/torrust/torrust-parse-torrent/blob/main/src/utils/parse_torrent_verbose.rs
The idea is to get an exact error message is one mandatory field is missing.
Anyway, I opened this issue becuase I wanted to show a different message if the user is trying to upload an only-V2 torrent (not hybrid).
I think we can have two issues, one for the v2 torrents and other for the hybrids (in case we want to support them, as they don't seem to be very popular, both v2 and hybrids).
An important feat is to give more concrete errors on mandatory fields.
When the user uploads a torrent we have three different scenarios:
The torrent file is a V1 torrent The torrent file is a V2 torrent The torrent file is a Hybrid torrent
Right now, we only support V1 torrents, and we create an info hash which can be different from the original info hash as we only take into account some of the fields (mandatory fields), discarding the rest.
If the user tries to upload a non valid torrent, a generic error message appears, but it would be a good idea to show more concrete error messages.
To do that, we need to check if the torrent is a V1 | V2 | Hybrid and then check for mandatory fields.
First, I want to work on creating concrete error messages for the V2 torrents, as they are more popular than hybrids (even though both are not widely used).
Then work on creating concrete messages for hybrid torrents.
And lastly, work on concrete messages for mandatory fields (relates to https://github.com/torrust/torrust-parse-torrent that Jose made).
For the V2 torrents, I want to propose two ways of implementing the logic to the team before writing the code.
I came up with these solutions:
Creating a new struct for V2 torrents with the mandatory fields, and after deserialize it, checking if the info dictionary field "meta version" has a value of 2, because V2 torrents have that field with a value of 2.
Checking the length of the info hash of the torrent file, and also checking if the torrent file has one or two info hashes, as hybrid torrents have two info hashes, one for the V1 torrent (SHA-1, 32-40 Hexadecimal characters) and another for the V2 torrent (SHA2/256 with 64 hexadecimal characters).
The Bittorrent client QBitTorrent uses the info hash for differentiate V1 or V2 torrents:
namespace
{
// BEP9 Extension for Peers to Send Metadata Files
bool isV1Hash(const QString &string)
{
// There are 2 representations for BitTorrent v1 info hash:
// 1. 40 chars hex-encoded string
// == 20 (SHA-1 length in bytes) * 2 (each byte maps to 2 hex characters)
// 2. 32 chars Base32 encoded string
// == 20 (SHA-1 length in bytes) * 1.6 (the efficiency of Base32 encoding)
const int V1_HEX_SIZE = SHA1Hash::length() * 2;
const int V1_BASE32_SIZE = SHA1Hash::length() * 1.6;
return ((((string.size() == V1_HEX_SIZE)) && !string.contains(QRegularExpression(u"[^0-9A-Fa-f]"_s)))
|| ((string.size() == V1_BASE32_SIZE) && !string.contains(QRegularExpression(u"[^2-7A-Za-z]"_s))));
}
bool isV2Hash(const QString &string)
{
// There are 1 representation for BitTorrent v2 info hash:
// 1. 64 chars hex-encoded string
// == 32 (SHA-2 256 length in bytes) * 2 (each byte maps to 2 hex characters)
const int V2_HEX_SIZE = SHA256Hash::length() * 2;
return (string.size() == V2_HEX_SIZE) && !string.contains(QRegularExpression(u"[^0-9A-Fa-f]"_s));
}
lt::load_torrent_limits loadTorrentLimits()
{
const auto *pref = Preferences::instance();
lt::load_torrent_limits limits;
limits.max_buffer_size = static_cast<int>(pref->getTorrentFileSizeLimit());
limits.max_decode_depth = pref->getBdecodeDepthLimit();
limits.max_decode_tokens = pref->getBdecodeTokenLimit();
return limits;
}
}
I think the second option is more appropriate, and it might even be faster as the index only has to check if there is one or two info hashes, and then the length of the info hash. While the first approach, requires deserializing all the torrent fields. But this is only an assumption and benchmarks might be needed, and performance wise both approaches should not have a big hit in the overall performance of the index.
What is your opinion on the above? @da2ce7 @WarmBeer @josecelano
What is your opinion on the above? @da2ce7 @WarmBeer @josecelano
Hi @mario-nt, correct me if I'm wrong:
info
key, but using a different hash function.I would like to avoid making the upload even slower adding an extra parsing or hash function if possible. Although I'm not especially worried about performance for this endpoint. But maybe I should. We have been testing with small torrents, for big torrent files parsing twice can take too long.
So I would prefer not to parse the torrent with a V2 struct just to know if it fits into a V2 torrent.
We are already parsing the torrent twice:
info
key.-Torrent
struct.This is just an idea.
The function to calculate the actual info-hash actually parses the whole file into this struct:
struct ParsedInfoDictFromMetainfoFile {
pub info: Value,
}
Because we only want the info
key. If the info
key contains the meta version
we can assume that the torrent is either a v1 or hybrid torrent.
The calculate_info_hash
could be renamed to preparse_torrent
and return both:
I think we only need to take the info: Value
and extract the meta version
from it. So that we do not need to parse again the torrent file.
Finally in the decode_and_validate_torrent_file
, when we parse the torrent to build the Torrent
strcut if we have an error we can return a new error:
DecodeTorrentFileError::InvalidBencodeDataV1
DecodeTorrentFileError::InvalidBencodeDataV2OrHybrid
With a new error, we can change the error messages to something like:
On the other hand, I think the "Alternative approach 1" is too complex. I would implement the straight one (try to parse both TorrentV1 and Torrent V2 structs) even if te performance is worse. Maybe it's just a couple of milliseconds more and it does not worth it adding that complexity.
We create a new strcut abstraction GenericTorrent
where we include all the possible fields in a torrent file: v1, v2, non-standard fields, etcetra. We only parse this once, and them we try to build an v1 torrent from this generic strcut. If we cannot build the torrent v1 then we can check without parsing again if the torrent is v2. In fact, this solution can also be used to show more concrete messages when a mandatory field is missing. Maybe it's even a better option than the one I mentioned avobe.
Besides, when we add the log probably we are going to need something like this to extract all the info from the original torrent file.
I would explore alternative approach 3 a little bit. If it not feasible for some reason I can't see now I would go for parsing again for TorrentV2
strcut.
Article about BitTorrent V2: https://blog.libtorrent.org/2020/09/bittorrent-v2/
I'm moving this to the next big milestone because it's not clear to me if we should support V2 torrents. It looks like other projects are even removing support for V2 torrents.
If you upload a torrent with only the field for version 2, you get a generic "Internal Server Error".
We should show a message like "Invalid torrent. BitTorrent version 2 is not supported."