tdlib / telegram-bot-api

Telegram Bot API server
https://core.telegram.org/bots
Boost Software License 1.0

Inconsistencies in handling of block quotes and pre-formatted code blocks #515

Closed by Bibo-Joshi 8 months ago

Bibo-Joshi commented 9 months ago

Hi there. First of all, let me say that I'm excited for the API 7.0 release! It brings functionality that's been eagerly awaited by the community. The team of python-telegram-bot.org is currently working on integrating these changes into the Python library. During this process, I noticed that the new block quote formatting options show several inconsistencies in the handling of line breaks. The key points are

The second point is unfortunate, but I can understand that HTML is in general just more flexible than MD. The first point may not seem too bad at first glance, because a user can still generate the rendered result they like. However, the inconsistencies make it rather difficult to work with updating messages, parsing & combining content of several messages, and similar use cases. As a notable example, python-telegram-bot provides utility functionality that takes a Message object and computes text including formatting markers in the desired markup language, with the idea being that

await bot.send_message(chat_id=chat_id, text=message.text_html/markdown_v2, parse_mode=ParseMode.HTML/MARKDOWN_V2)

produces the same rendered result as is visible for the message itself. This output can then be used to process the message further within Telegram or e.g. render the message in external programs. The observed inconsistencies make the implementation hard for us (see also the discussion in https://github.com/python-telegram-bot/python-telegram-bot/pull/4038).

Let me describe the observed inconsistencies in code. The reference implementation is written for python-telegram-bot version 20.7, but ofc the same results can be achieved with plain HTTP requests.

import asyncio

from telegram import Bot, Message
from telegram.constants import ParseMode

async def main():
    chat_id = 123456789  # placeholder: replace with the ID of the target chat
    data: list[list[tuple[str, ParseMode]]] = [
        # These produce the same rendered result but different entities
        [
            (">A\nB", ParseMode.MARKDOWN_V2),
            ("<blockquote>A</blockquote>\nB", ParseMode.HTML),
        ],
        # These produce the same rendered result but different entities
        # The first HTML version also produces a different reported message text
        [
            ("ABC\n>DEF\nGHI", ParseMode.MARKDOWN_V2),
            ("ABC<blockquote>DEF</blockquote>GHI", ParseMode.HTML),
            ("ABC\n<blockquote>DEF</blockquote>\nGHI", ParseMode.HTML),
        ],
        # The Markdown V2 results are different from each other and from the HTML result
        # In particular, the HTML result can not be achieved with Markdown V2
        # The HTML versions produce the same rendered result but different entities
        # The reported message text coincides only for the first and third case
        [
            (">ABC\n>DEF\n>GHI", ParseMode.MARKDOWN_V2),
            (">ABC\n\n>DEF\n\n>GHI", ParseMode.MARKDOWN_V2),
            (
                "<blockquote>ABC</blockquote>\n<blockquote>DEF</blockquote>\n<blockquote>GHI"
                "</blockquote>",
                ParseMode.HTML,
            ),
            (
                "<blockquote>ABC</blockquote><blockquote>DEF</blockquote><blockquote>GHI"
                "</blockquote>",
                ParseMode.HTML,
            ),
        ],
    ]

    async with Bot("TOKEN") as bot:
        for case in data:
            messages: list[Message] = []
            for text, parse_mode in case:
                messages.append(
                    await bot.send_message(
                        chat_id=chat_id,
                        text=text,
                        parse_mode=parse_mode,
                    )
                )

            for message in messages:
                print(repr(message.text))
            for message in messages:
                print(message.entities)

            await bot.send_message(
                chat_id=chat_id,
                text=20 * "-",
            )
            print(20 * "-")

if __name__ == "__main__":
    asyncio.run(main())

Output:

'A\nB'
'A\nB'
(MessageEntity(length=2, offset=0, type=<MessageEntityType.BLOCKQUOTE>),)
(MessageEntity(length=1, offset=0, type=<MessageEntityType.BLOCKQUOTE>),)
--------------------
'ABC\nDEF\nGHI'
'ABCDEFGHI'
'ABC\nDEF\nGHI'
(MessageEntity(length=4, offset=4, type=<MessageEntityType.BLOCKQUOTE>),)
(MessageEntity(length=3, offset=3, type=<MessageEntityType.BLOCKQUOTE>),)
(MessageEntity(length=3, offset=4, type=<MessageEntityType.BLOCKQUOTE>),)
--------------------
'ABC\nDEF\nGHI'
'ABC\n\nDEF\n\nGHI'
'ABC\nDEF\nGHI'
'ABCDEFGHI'
(MessageEntity(length=11, offset=0, type=<MessageEntityType.BLOCKQUOTE>),)
(MessageEntity(length=4, offset=0, type=<MessageEntityType.BLOCKQUOTE>), MessageEntity(length=4, offset=5, type=<MessageEntityType.BLOCKQUOTE>), MessageEntity(length=3, offset=10, type=<MessageEntityType.BLOCKQUOTE>))
(MessageEntity(length=3, offset=0, type=<MessageEntityType.BLOCKQUOTE>), MessageEntity(length=3, offset=4, type=<MessageEntityType.BLOCKQUOTE>), MessageEntity(length=3, offset=8, type=<MessageEntityType.BLOCKQUOTE>))
(MessageEntity(length=3, offset=0, type=<MessageEntityType.BLOCKQUOTE>), MessageEntity(length=3, offset=3, type=<MessageEntityType.BLOCKQUOTE>), MessageEntity(length=3, offset=6, type=<MessageEntityType.BLOCKQUOTE>))
--------------------
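
To make the differences in the output above easier to see, the following minimal sketch slices the reported message text by each entity's offset and length. It assumes that offsets and lengths are counted in UTF-16 code units, as the Bot API documents for MessageEntity. `entity_text` is a hypothetical helper name, not part of python-telegram-bot, and the sample values are copied from the output above.

```python
def entity_text(text: str, offset: int, length: int) -> str:
    """Return the part of `text` covered by an entity (UTF-16 code units)."""
    data = text.encode("utf-16-le")  # 2 bytes per UTF-16 code unit
    return data[2 * offset : 2 * (offset + length)].decode("utf-16-le")


# (reported message text, [(offset, length), ...]) copied from the output above
samples = [
    ("A\nB", [(0, 2)]),  # ">A\nB" sent as MarkdownV2
    ("A\nB", [(0, 1)]),  # "<blockquote>A</blockquote>\nB" sent as HTML
    ("ABC\n\nDEF\n\nGHI", [(0, 4), (5, 4), (10, 3)]),  # ">ABC\n\n>DEF\n\n>GHI"
]

for text, entities in samples:
    print([entity_text(text, offset, length) for offset, length in entities])
```

In these samples, the MarkdownV2 entities include the trailing line break, while the HTML entity covers only the quoted characters.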

Screenshot from Telegram Desktop Windows (Version 4.14.3 x64)

image

While investigating these inconsistencies, I became aware that some of them in fact already apply to pre-formatted code blocks, which we just hadn't noticed so far. Let me demonstrate this as well.

In the above reference implementation, replace the data with

data: list[list[tuple[str, ParseMode]]] = [
        # These produce the same rendered result and the same entities
        [
            ("```A```\nB", ParseMode.MARKDOWN_V2),
            ("```\nA```\nB", ParseMode.MARKDOWN_V2),
            ("<pre>A</pre>\nB", ParseMode.HTML),
        ],
        # These produce the same rendered result but the second version produces different entities
        # and different reported message text
        [
            ("ABC\n```DEF```\nGHI", ParseMode.MARKDOWN_V2),
            ("ABC<pre>DEF</pre>GHI", ParseMode.HTML),
            ("ABC\n<pre>DEF</pre>\nGHI", ParseMode.HTML),
        ],
        # These produce the same rendered result, but the reported entities and message text depend
        # on whether the newline characters are present
        [
            ("```ABC```\n```DEF```\n```GHI```", ParseMode.MARKDOWN_V2),
            ("```ABC``````DEF``````GHI```", ParseMode.MARKDOWN_V2),
            (
                "<pre>ABC</pre>\n<pre>DEF</pre>\n<pre>GHI</pre>",
                ParseMode.HTML,
            ),
            (
                "<pre>ABC</pre><pre>DEF</pre><pre>GHI</pre>",
                ParseMode.HTML,
            ),
        ],
    ]

Output:

'A\nB'
'A\nB'
'A\nB'
(MessageEntity(length=1, offset=0, type=<MessageEntityType.PRE>),)
(MessageEntity(length=1, offset=0, type=<MessageEntityType.PRE>),)
(MessageEntity(length=1, offset=0, type=<MessageEntityType.PRE>),)
--------------------
'ABC\nDEF\nGHI'
'ABCDEFGHI'
'ABC\nDEF\nGHI'
(MessageEntity(length=3, offset=4, type=<MessageEntityType.PRE>),)
(MessageEntity(length=3, offset=3, type=<MessageEntityType.PRE>),)
(MessageEntity(length=3, offset=4, type=<MessageEntityType.PRE>),)
--------------------
'ABC\nDEF\nGHI'
'ABCDEFGHI'
'ABC\nDEF\nGHI'
'ABCDEFGHI'
(MessageEntity(length=3, offset=0, type=<MessageEntityType.PRE>), MessageEntity(length=3, offset=4, type=<MessageEntityType.PRE>), MessageEntity(length=3, offset=8, type=<MessageEntityType.PRE>))
(MessageEntity(length=3, offset=0, type=<MessageEntityType.PRE>), MessageEntity(length=3, offset=3, type=<MessageEntityType.PRE>), MessageEntity(length=3, offset=6, type=<MessageEntityType.PRE>))
(MessageEntity(length=3, offset=0, type=<MessageEntityType.PRE>), MessageEntity(length=3, offset=4, type=<MessageEntityType.PRE>), MessageEntity(length=3, offset=8, type=<MessageEntityType.PRE>))
(MessageEntity(length=3, offset=0, type=<MessageEntityType.PRE>), MessageEntity(length=3, offset=3, type=<MessageEntityType.PRE>), MessageEntity(length=3, offset=6, type=<MessageEntityType.PRE>))
--------------------

Screenshot from Telegram Desktop Windows (Version 4.14.3 x64)

image


I'm aware that the formatting functionality can be viewed as "working as expected" and one can argue that parsing problems of the entities should be handled by any Bot API wrapper itself. Still, I want to point out these discrepancies to you and emphasize that, IMO, consistent parsing (same rendered results have the same entities & text) would improve the usability of the Bot API.

levlam commented 9 months ago

> This output can then be used to process the message further within Telegram or e.g. render the message in external programs.

It is very unlikely that you can use Markdown output in external programs. For internal usages, in most cases it is much better to manually specify text entities instead of trying to construct corresponding Markdown/HTML markup. This has been supported for more than 3 years now and should be used instead of text_markdown/text_html.

> # These produce the same rendered result but different entities

Blockquotes and pre-formatted blocks should start on a new line and end before a new line. If they don't, apps will still show them as if they did, but it is up to the app how this is achieved. You may see a different number of empty lines in different places in different apps in the latter case.
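
As a rough illustration of this rule, the sketch below checks whether a block entity starts on a new line and ends before one. The helper name `is_line_aligned` is hypothetical, and offsets are assumed to be UTF-16 code units. Treating an entity whose last character is itself a line break as "aligned" is an assumption made here, not something the Bot API specifies.

```python
NEWLINE = "\n".encode("utf-16-le")


def is_line_aligned(text: str, offset: int, length: int) -> bool:
    """Check that a block entity starts at a line start and ends at a line end."""
    data = text.encode("utf-16-le")  # 2 bytes per UTF-16 code unit
    total = len(data) // 2

    def unit(index: int) -> bytes:
        # Bytes of the UTF-16 code unit at position `index`
        return data[2 * index : 2 * index + 2]

    end = offset + length
    starts_on_new_line = offset == 0 or unit(offset - 1) == NEWLINE
    ends_before_new_line = (
        end >= total
        or unit(end) == NEWLINE
        # Assumption: a trailing "\n" inside the entity also counts as a line end
        or (length > 0 and unit(end - 1) == NEWLINE)
    )
    return starts_on_new_line and ends_before_new_line


# Text and entity values as reported for two of the HTML inputs in this issue
print(is_line_aligned("ABC\nDEF\nGHI", 4, 3))  # sent with explicit line breaks
print(is_line_aligned("ABCDEFGHI", 3, 3))      # sent without line breaks
```

A wrapper library could use a check like this to warn about entities whose rendering is left up to the client app.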

Bibo-Joshi commented 9 months ago

Thanks for the swift reply.

> It is very unlikely that you can use Markdown output in external programs.

Copy-pasting the code snippets from https://core.telegram.org/bots/api#formatting-options to both GitHub and StackOverflow, I can see that two widely-used interpreters can display most parts of TG's formatting options correctly without a need for changes. In fact, the need to adapt/leave out some of TG's formatting options for processing in external programs actually highlights the use cases of methods like text_md/html. If you want to display TG messages in reStructuredText, AsciiDoc, or other formats, you'll have to translate the entities into markup symbols or at least into other entity-like data structures that the external program can reliably understand.

> For internal usages, in most cases it is much better to manually specify text entities instead of trying to construct corresponding Markdown/HTML markup. This has been supported for more than 3 years now and should be used instead of text_markdown/text_html.

I see an "in most cases" there 😉 A use case that I have in mind is adding a prefix-text to an existing message, where both the prefix-text and the message contain formatting entities. Having to shift the entities in the message by a computed offset and calculating offset+length for the prefix-text, all in UTF-16, is far more implementation effort than simply writing html/md_formatted_prefix + message.text_markdown/html. Nevertheless, if this is TG's official standpoint, we'll have to evaluate how much maintenance overhead text_html/markdown is worth for us.

Markdown

*bold \*text*
_italic \*text_
__underline__
~strikethrough~
||spoiler||
*bold _italic bold ~italic bold strikethrough ||italic bold strikethrough spoiler||~ __underline italic bold___ bold*
[inline URL](http://www.example.com/)
[inline mention of a user](tg://user?id=123456789)
![👍](tg://emoji?id=5368324170671202286)
`inline fixed-width code` ``` pre-formatted fixed-width code block ``` ```python pre-formatted fixed-width code block written in the Python programming language ```
>Block quotation started
>Block quotation continued
>The last line of the block quotation

---

HTML

bold, bold
italic, italic
underline, underline
strikethrough, strikethrough, strikethrough
spoiler, spoiler
bold italic bold italic bold strikethrough italic bold strikethrough spoiler underline italic bold bold
inline URL
inline mention of a user
👍
inline fixed-width code
pre-formatted fixed-width code block
pre-formatted fixed-width code block written in the Python programming language
Block quotation started\nBlock quotation continued\nThe last line of the block quotation

---

Same on SO

![image](https://github.com/tdlib/telegram-bot-api/assets/22366557/f23a1c91-401f-4c8b-bd7c-aa53a2180938)

> Blockquotes and pre-formatted blocks should start on a new line and end before a new line. If they don't, apps will still show them as if they did, but it is up to the app how this is achieved.

I would be very glad if this was documented at https://core.telegram.org/bots/api#formatting-options. This would alleviate at least some of the inconsistencies. Frankly, questions about the Bot API that arise from shifting responsibility to the client without documenting this in the API are establishing themselves as a pattern (see also #429 and #428) :/

levlam commented 9 months ago

> can display most parts of TG's formatting options correctly without a need for changes.

But definitely not all of them. There will be multiple issues, for example, with bold/italic, blockquotes, or character escaping. Not to mention that classic Markdown and HTML are space-agnostic and the Bot API isn't.

> A use case that I have in mind is adding a prefix-text to an existing message

This can easily be done by shifting entities, and a library can provide a helper for that, which receives two texts with entities and returns their concatenation. Implementing such a function is much simpler than trying to revert Markdown formatting.
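
A minimal sketch of such a concatenation helper is shown below. `Entity` is a simplified stand-in for the Bot API's MessageEntity, and `utf16_len` and `concat_with_entities` are hypothetical names; none of these are part of the Bot API or python-telegram-bot. Offsets and lengths are assumed to be counted in UTF-16 code units.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Entity:
    """Simplified stand-in for a Bot API MessageEntity."""

    type: str
    offset: int  # in UTF-16 code units
    length: int  # in UTF-16 code units


def utf16_len(text: str) -> int:
    """Length of `text` in UTF-16 code units."""
    return len(text.encode("utf-16-le")) // 2


def concat_with_entities(first_text, first_entities, second_text, second_entities):
    """Concatenate two texts, shifting the second text's entities accordingly."""
    shift = utf16_len(first_text)
    shifted = [replace(e, offset=e.offset + shift) for e in second_entities]
    return first_text + second_text, [*first_entities, *shifted]


text, entities = concat_with_entities(
    "👍 Note: ", [Entity("bold", 3, 5)],  # the emoji takes two UTF-16 code units
    "ABC\nDEF\nGHI", [Entity("blockquote", 4, 3)],
)
print(repr(text))
print(entities)
```

The shift has to be computed in UTF-16 code units rather than with Python's `len`, because characters outside the Basic Multilingual Plane (such as most emoji) count as two units.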

> I would be very glad if this was documented at https://core.telegram.org/bots/api#formatting-options.

This isn't a strong requirement; otherwise, it would be checked server-side. I also can't guarantee that non-official apps will correctly move the blocks to a new paragraph as intended.

Bibo-Joshi commented 9 months ago

> But definitely not all of them. There will be multiple issues

I agree that displaying TG messages in external programs cannot reliably work without the need for some adaptation. But as explained above, this makes text_md/html more valuable, not less valuable.

> a library can provide a helper for that, which receives two texts with entities and returns their concatenation.

This still leaves the user with having to construct entity objects for the prefix and calling a method with 4 arguments, while string concatenation is a simpler operation.

> This isn't a strong requirement; otherwise, it would be checked server-side. I also can't guarantee that non-official apps will correctly move the blocks to a new paragraph as intended.

If non-official clients are expected to have problems with block entities that do not start/end on a new line, wouldn't this be even more reason to document this, at least as a guideline?


In any case, it has become clear that you don't see a need to address the inconsistencies and we'll have to live with that.

levlam commented 9 months ago

> In any case, it has become clear that you don't see a need to address the inconsistencies and we'll have to live with that.

There is no way to improve the way the resulting message may look, because this depends on the user's app. But consecutive block quotes without a separating blank line could be supported in the Bot API.

levlam commented 8 months ago

The ability to create consecutive quotes using MarkdownV2 was added in Bot API 7.1, using a zero-length entity or separators between the entities. The corresponding examples were added to the documentation.