torrust / teps

Torrust Enhancement Proposals

TEP Draft: Conversion Between Bencode and JSON Using Hexadecimal Encoding #16

Open josecelano opened 1 month ago

josecelano commented 1 month ago

Discussed in https://github.com/torrust/teps/discussions/15

Originally posted by **josecelano** February 1, 2024

- Draft: ChatGPT
- Todo: research

### Draft

**Abstract:** This document proposes a standard for converting data between Bencode, the encoding format used by the BitTorrent protocol, and JSON (JavaScript Object Notation), a widely used data interchange format. The primary challenge addressed is the representation of binary data in JSON, which is inherently text-based and encodes strings in UTF-8. This proposal recommends using hexadecimal encoding for binary data within JSON and provides a JSON schema for all valid converted objects.

1. Introduction

Bencode is a binary format widely used in peer-to-peer file sharing systems, particularly BitTorrent. JSON, on the other hand, is a text-based format used for data interchange on the web. Converting between these two formats requires careful handling of binary data, because JSON does not natively support raw binary data.

2. Hexadecimal Encoding for Binary Data

The primary method proposed for handling binary data in Bencode when converting it to JSON is hexadecimal encoding. Each byte of binary data is represented as a two-digit hexadecimal number. For example, a byte with the value `0x1F` would be represented as the string `"1F"` in JSON.

**Advantages:**

- Hexadecimal encoding is a straightforward, widely understood method.
- It ensures compatibility with JSON's text-based format.
- The encoded data is somewhat human-readable, which can be beneficial for debugging.

**Disadvantages:**

- Increased data size: each byte of binary data becomes two characters in JSON, so the encoded data is twice as long.

3. Alternative Methods (Discarded)

Other methods considered and discarded include:

**a. Base64 Encoding:** Converts binary data into a base-64 representation. It is more space-efficient than hexadecimal (four characters for every three bytes), but it is less human-readable and can complicate encoding and decoding.

**b. Array Representation:** Represents binary data as an array of byte values in JSON. This method is inefficient in terms of space and handling.

**c. Escape Non-UTF8 Sequences:** Attempts to represent binary data as UTF-8 strings by escaping invalid sequences. This approach is complex and not universally applicable.

**d. Custom Encoding Scheme:** Uses a custom scheme for specific types of binary data. This method would require custom parsing logic and is less generalizable.

4. JSON Schema for Valid Objects

The JSON schema for representing Bencode data in JSON is as follows:

```json
{
  "type": "object",
  "properties": {
    "integers": {"type": "integer"},
    "strings": {"type": "string"},
    "lists": {
      "type": "array",
      "items": {"$ref": "#"}
    },
    "dictionaries": {
      "type": "object",
      "additionalProperties": {"$ref": "#"}
    },
    "binary": {"type": "string", "pattern": "^[0-9A-Fa-f]*$"}
  }
}
```

5. Examples of Conversion

**Example 1: Bencode to JSON Conversion**

Bencode: `d3:bar4:spam3:fooi42ee`

JSON: `{"bar": "spam", "foo": 42}`

**Example 2: Handling Binary Data**

Bencode: `4:\x8A\xE2\x9C\x93` (binary data in a Bencode string)

JSON: `{"binary": "8AE29C93"}` (hexadecimal encoded)

6. Conclusion

This proposal provides a standardized method for converting between Bencode and JSON, with a focus on the proper representation of binary data. Hexadecimal encoding ensures compatibility with JSON's text-based format while preserving the binary data from Bencode.

7. Links

Other approaches:

- https://chocobo1.github.io/bencode_online/

cc @da2ce7

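
The conversion described in the draft can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the proposal's reference implementation: the function names `decode`, `to_json_value` and `bencode_to_json` are hypothetical, byte strings that are valid UTF-8 are assumed to be emitted as plain JSON strings, and only byte strings that fail UTF-8 decoding are hex-encoded. Unlike Example 2 in the draft, the sketch returns the bare hex string rather than wrapping it in a `"binary"` key.

```python
import json


def decode(data: bytes, i: int = 0):
    """Decode one Bencode value starting at offset i; return (value, next_offset)."""
    c = data[i:i + 1]
    if c == b"i":  # integer: i<digits>e
        end = data.index(b"e", i)
        return int(data[i + 1:end]), end + 1
    if c == b"l":  # list: l<values>e
        i += 1
        items = []
        while data[i:i + 1] != b"e":
            item, i = decode(data, i)
            items.append(item)
        return items, i + 1
    if c == b"d":  # dictionary: d<key><value>...e
        i += 1
        result = {}
        while data[i:i + 1] != b"e":
            key, i = decode(data, i)
            value, i = decode(data, i)
            result[key] = value
        return result, i + 1
    if c.isdigit():  # byte string: <length>:<bytes>
        colon = data.index(b":", i)
        length = int(data[i:colon])
        start = colon + 1
        return data[start:start + length], start + length
    raise ValueError(f"invalid Bencode at offset {i}")


def to_json_value(value):
    """Map decoded Bencode values to JSON-compatible Python values."""
    if isinstance(value, bytes):
        try:
            # Assumption: valid UTF-8 byte strings stay readable text.
            return value.decode("utf-8")
        except UnicodeDecodeError:
            # Otherwise fall back to uppercase hexadecimal, two digits per byte.
            return value.hex().upper()
    if isinstance(value, list):
        return [to_json_value(v) for v in value]
    if isinstance(value, dict):
        return {to_json_value(k): to_json_value(v) for k, v in value.items()}
    return value  # integers pass through unchanged


def bencode_to_json(data: bytes) -> str:
    value, _ = decode(data)
    return json.dumps(to_json_value(value))


print(bencode_to_json(b"d3:bar4:spam3:fooi42ee"))
print(bencode_to_json(b"4:\x8a\xe2\x9c\x93"))
```

The first call reproduces the JSON from Example 1; the second yields the hex string from Example 2, because the bytes `8A E2 9C 93` are not valid UTF-8.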
josecelano commented 3 weeks ago

Someone found an issue with this representation; it's not reversible:

https://github.com/torrust/bencode2json/issues/7
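
One way a plain hex encoding can fail to round-trip is sketched below. It may not be the exact case reported in the linked issue. It assumes, hypothetically, that UTF-8 byte strings are emitted as ordinary JSON strings and that only non-UTF-8 byte strings are hex-encoded, with no marker distinguishing the two; `encode_string` is an illustrative name, not an existing API.

```python
def encode_string(raw: bytes) -> str:
    """Hypothetical rule: text stays text, non-UTF-8 bytes become hex."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.hex().upper()


text = encode_string(b"FF")      # Bencode 2:FF, a two-character text string
binary = encode_string(b"\xff")  # Bencode 1:\xff, a single non-UTF-8 byte

# Both inputs produce the same JSON string, so a decoder cannot tell
# which Bencode value the JSON string "FF" came from.
collision = text == binary
print(text, binary, collision)
```

Under these assumptions, two different Bencode strings map to the same JSON value, so the JSON output alone is not enough to recover the original bytes.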