Support file uploads - Githubissues

nmlorg commented 3 years ago

This is a doozy.

Right now, things like bot.send_photo(chat_id=1234, photo=2345) end up as effectively:

requests.post(bot.url + 'sendphoto', timeout=12, json={'chat_id': 1234, 'photo': 2345})

which is sent as:

Content-Type: application/json

{"chat_id": 1234, "photo": 2345}

This alternatively could be:

requests.post(bot.url + 'sendphoto', timeout=12, data={'chat_id': 1234, 'photo': 2345})

which would be sent as:

Content-Type: application/x-www-form-urlencoded

chat_id=1234&photo=2345

or:

requests.post(bot.url + 'sendphoto', timeout=12, files={'chat_id': (None, '1234'), 'photo': (None, '2345')})

which would be sent as:

Content-Type: multipart/form-data; boundary=BoUnDaRy

--BoUnDaRy
Content-Disposition: form-data; name="chat_id"

1234
--BoUnDaRy
Content-Disposition: form-data; name="photo"

2345
--BoUnDaRy--

all of which work perfectly fine with Telegram.

However, to actually upload a file, only the third form (files= using multipart/form-data) works:

requests.post(bot.url + 'sendphoto', timeout=12, files={
    'chat_id': (None, '1234'),
    'photo': (None, 'attach://file1'),
    'unused': ('file1', open('myfile.jpg'),
})

which would be sent as:

Content-Type: multipart/form-data; boundary=BoUnDaRy

--BoUnDaRy
Content-Disposition: form-data; name="chat_id"

1234
--BoUnDaRy
Content-Disposition: form-data; name="photo"

attach://file1
--BoUnDaRy
Content-Disposition: form-data; name="unused"; filename="file1"

(contents of myfile.jpg)
--BoUnDaRy--

This can be done relatively painlessly—except when you're mixing both complex data types and arbitrary data. For example:

requests.post(bot.url + 'sendphoto', timeout=12, files={
    'chat_id': (None, '1234'),
    'photo': (None, 'attach://file1'),
    'unused': ('file1', open('myfile.jpg'),
    'caption': (None, 'my text'),
    'entities': (None, ... some representation of [{'type': 'bold', 'offset': 0, 'length': 2}] ...),
})

My initial testing seems to show that telegram-bot-api just magically accepts JSON-encoded data for things like entities, even without Content-Type set to application/json; and similarly does not accept JSON-encoded data for things like text, even with Content-Type set to application/json. That is:

--BoUnDaRy
Content-Disposition: form-data; name="text"

Hello there

is interpreted as the string 'Hello there' while:

--BoUnDaRy
Content-Disposition: form-data; name="text"
Content-Type: application/json

"Hello there"

is interpreted as the string '"Hello there"' (it is not decoded from JSON). Consistently, both:

--BoUnDaRy
Content-Disposition: form-data; name="entities"
Content-Type: application/json

[{"type": "bold", "offset": 0, "length": 2}]

and:

--BoUnDaRy
Content-Disposition: form-data; name="entities"

[{"type": "bold", "offset": 0, "length": 2}]

are interpreted as the structure [{'type': 'bold', 'offset': 0, 'length': 2}] (not the string '[{"type": "bo...).

I've tried digging through TBA's code, which seems to just use TDLib's HttpReader.cpp, which does not seem to have any logic to magically decode entities, values that start with '[', or anything else that would lead to this behavior.

There's relevant discussion somewhat acknowledging this works in tdlib/telegram-bot-api#177, but no explanation of how (or what the rules for an encoder should be—that is, I can't just always JSON-encode values before putting them in files= because strings are not decoded, so I need to know if this is a hard-coded list of fields, or anything that starts with '[', or what).

levlam commented 3 years ago

While parsing multipart/form-data, HttpReader uses special handling only for Content-Type "application/x-www-form-urlencoded". All other values are treated as strings and are used as exact values of the corresponding fields.

The entities field works in the specified way, because it is expected to be a JSON-serialized array as any other Array/Object method parameter.

nmlorg commented 3 years ago

Thanks! The thing I'm having trouble with is that entities seems to work both as a field within a JSON-encoded structure and as a JSON-encoded structure itself — I can send both:

Content-Type: application/json

{"chat_id": 1234, "text": "my text", "entities": [{"type": "bold", "offset": 0, "length": 2}]}

which decodes to:

{
    'chat_id': 1234,
    'entities': [
        {
            'length': 2, 
            'offset': 0, 
            'type': 'bold',
        },
    ],
    'text': 'my text', 
}

and:

Content-Type: application/json

{"chat_id": 1234, "text": "my text", "entities": "[{\"type\": \"bold\", \"offset\": 0, \"length\": 2}]"}

which decodes to:

{
    'chat_id': 1234,
    'entities': '[{"type": "bold", "offset": 0, "length": 2}]',
    'text': 'my text', 
}

I don't see where that second form is being handled in HttpReader (or maybe HttpReader is passing the JSON-encoded string along, and it's being decoded somewhere else?), so I'm not sure if it's specific to entities, or if maybe any string that begins with a '[' gets double decoded, or if I'm misunderstanding what's going on.

nmlorg commented 3 years ago

Just checkpointing where I'm at: As best as I can follow so far, HttpReader::next_read sees Content-Type: application/json and hands off to HttpReader::parse_json_parameters, which decodes '"aaa"' to:

{
    'content': 'aaa',
}

and '{"aaa": bbb, "ccc": ddd}' to:

{
    'aaa': 'bbb',
    'ccc': 'ddd',
}

(i.e. it doesn't actually decode the values in a top-level object, it stores them as their JSON representation). This leads me to believe the first form would be stored as:

{
    'entities': '[{"type": "bold", "offset": 0, "length": 2}]',
}

and the second would be:

{
    'entities': r'"[{\"type\": \"bold\", \"offset\": 0, \"length\": 2}]"',
}

Later, telegram-bot-api/telegram-bot-api/Client.cpp's Client::get_input_message_text(query) calls Client::get_input_entities(query, "entities"), which pulls the still-JSON-encoded string out, and passes that into td/tdutils/td/utils/JsonBuilder.h's json_decode.

I thought there might be a place where json_decode called into itself or one of its helpers in a way that double decoded, but it looks like json_decode calls do_json_decode, which sees the leading '"' and passes control through to json_string_decode.

So for the second form it should be passing a JsonValue with a string value of '[{"type": "bold", "offset": 0, "length": 2}]' to Client::get_formatted_text (as input_entities). Then, since input_entities.type() != JsonValue::Type::Array, it should just be ignored. But it's not. So somewhere before that point there's an extra decode that I'm still not seeing.

levlam commented 3 years ago

This is done in HttpReader::parse_json_parameters. Namely, in

    auto r_value = [&]() -> Result<MutableSlice> {
      if (parser.peek_char() == '"') {
        return json_string_decode(parser);
      } else {
        const int32 DEFAULT_MAX_DEPTH = 100;
        auto begin = parser.ptr();
        auto result = do_json_skip(parser, DEFAULT_MAX_DEPTH);
        if (result.is_ok()) {
          return MutableSlice(begin, parser.ptr());
        } else {
          return result.move_as_error();
        }
      }
    }();

which decodes all JSON strings and keeps any other value as is.

nmlorg commented 3 years ago

Of course! I must have read through that a dozen times and just been blinded once I realized the second case was passing strings through undecoded.

So I think that means, for my request assembler, the rule is: any value in the top-level object that isn't itself a string (so, including bools, for example) can be JSON-encoded before being being [potentially JSON-encoded again when] added to the request.

That should be very simple. Thanks!

nmlorg / ntelebot

Support file uploads #7