php-telegram-bot / core

PHP Telegram Bot based on the official Telegram Bot API
MIT License

Detect URL in caption? #544

Open sefidpardazesh opened 7 years ago

sefidpardazesh commented 7 years ago

In the Telegram Bot API, text messages have entity types for detecting url, mention, and text_mention. But for a photo or video with a caption, how do we detect a URL or a mention? In other words, how can we use entity types in the caption of a photo or video?

jacklul commented 7 years ago

Entities are there only for cases when updating messages (that are either HTML-formatted or use Markdown) so they can be reformatted properly.

There is no such thing for caption, you will have to write a regex for this...

sefidpardazesh commented 7 years ago

Thanks. What is the regex for mention and text_mention?
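
A minimal sketch of the regex approach suggested above, assuming usernames are 5 to 32 characters made of letters, digits, and underscores. The function name `findCaptionLinksAndMentions` is hypothetical and not part of php-telegram-bot. Note that a text_mention cannot be found this way: it refers to a user without a username, so nothing in the plain text marks it.

```php
<?php

/**
 * Hypothetical helper: find URLs and @mentions in a caption with regexes.
 *
 * @return array{urls: string[], mentions: string[]}
 */
function findCaptionLinksAndMentions(string $caption): array
{
    // Simplified URL pattern: http(s) scheme followed by non-space characters.
    // Trailing punctuation such as a final "." would be captured as well.
    preg_match_all('~https?://[^\s<>"]+~iu', $caption, $urls);

    // "@" followed by 5-32 word characters, not preceded by a word character
    // (so the "@example" in "me@example.com" is not treated as a mention).
    preg_match_all(
        '/(?<![A-Za-z0-9_])@([A-Za-z0-9_]{5,32})(?![A-Za-z0-9_])/',
        $caption,
        $mentions
    );

    return ['urls' => $urls[0], 'mentions' => $mentions[1]];
}

$found = findCaptionLinksAndMentions('Photo by @some_user, see https://example.com/a?b=1 or mail me@example.com');
```

Here `$found['mentions']` holds the usernames without the leading `@`.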

KilluaFein commented 7 years ago

Entities are there only for cases when updating messages (that are either HTML-formatted or use Markdown) so they can be reformatted properly.

@jacklul I'm trying to reformat an edited message, but without success. How can I use the entities to properly reformat?

jacklul commented 7 years ago

@KilluaFein proof of concept:

    private function parseEntitiesString($text, $entities)
    {
        $global_incr = 0;
        foreach ($entities as $entity) {
            if ($entity->getType() == 'italic') {
                $start = $global_incr + $entity->getOffset();
                $end = 1 + $start + $entity->getLength();

                $text = $this->mb_substr_replace($text, '_', $start, 0);
                $text = $this->mb_substr_replace($text, '_', $end, 0);

                $global_incr = $global_incr + 2;
            } elseif ($entity->getType() == 'bold') {
                $start = $global_incr + $entity->getOffset();
                $end = 1 + $start + $entity->getLength();

                $text = $this->mb_substr_replace($text, '*', $start, 0);
                $text = $this->mb_substr_replace($text, '*', $end, 0);

                $global_incr = $global_incr + 2;
            } elseif ($entity->getType() == 'code') {
                $start = $global_incr + $entity->getOffset();
                $end = 1 + $start + $entity->getLength();

                $text = $this->mb_substr_replace($text, '`', $start, 0);
                $text = $this->mb_substr_replace($text, '`', $end, 0);

                $global_incr = $global_incr + 2;
            } elseif ($entity->getType() == 'pre') {
                $start = $global_incr + $entity->getOffset();
                $end = 3 + $start + $entity->getLength();

                $text = $this->mb_substr_replace($text, '```', $start, 0);
                $text = $this->mb_substr_replace($text, '```', $end, 0);

                $global_incr = $global_incr + 6;
            } elseif ($entity->getType() == 'text_link') {
                $start = $global_incr + $entity->getOffset();
                $end = 1 + $start + $entity->getLength();
                $url = '(' . $entity->getUrl() . ')';

                $text = $this->mb_substr_replace($text, '[', $start, 0);
                $text = $this->mb_substr_replace($text, ']' . $url, $end, 0);

                $global_incr = $global_incr + 2 + mb_strlen($url);
            } elseif ($entity->getType() == 'code') {
                // Unreachable as written: 'code' is already handled by an earlier branch.
                $start = $global_incr + $entity->getOffset();

                $text = mb_substr($text, 0, $start);
            }
        }

        return $text;
    }

I never managed to make it work in 100% of cases. Multibyte characters break the offsets.

KilluaFein commented 7 years ago

Multibyte characters break offsets.

Like emoji, right?

And what is mb_substr_replace()?

KilluaFein commented 7 years ago

offset and length are counted in UTF-16 code units. Maybe converting between UTF-16 and UTF-8 is a way to solve this?
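
A minimal sketch of that idea, assuming the mbstring extension is available. The helper name `utf16Slice` is hypothetical. It converts the text to UTF-16LE, where every code unit is exactly 2 bytes, cuts the entity out by bytes, and converts the result back to UTF-8.

```php
<?php

/**
 * Hypothetical helper: extract the text covered by an entity whose offset
 * and length are given in UTF-16 code units.
 */
function utf16Slice(string $text, int $offset, int $length): string
{
    // In UTF-16LE every code unit is 2 bytes, so unit positions map to byte positions.
    $utf16 = mb_convert_encoding($text, 'UTF-16LE', 'UTF-8');
    $slice = substr($utf16, $offset * 2, $length * 2);

    return mb_convert_encoding($slice, 'UTF-8', 'UTF-16LE');
}

// The emoji takes 2 UTF-16 code units, so "bold" starts at offset 3.
$entity_text = utf16Slice('😀 bold', 3, 4);

// For comparison, mb_substr() counts the emoji as a single character
// and therefore cuts one position too far to the right.
$shifted = mb_substr('😀 bold', 3, 4, 'UTF-8');
```

The same trick works in the other direction: insert the markup into the UTF-16LE string at `offset * 2` bytes and convert the whole string back at the end.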

jacklul commented 7 years ago

The mb_XXX functions are for multibyte strings (that is what "mb" stands for, I guess).

I spent a lot of time thinking about this and I NEVER found a solution that gets it to work properly.
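
PHP itself has no built-in mb_substr_replace(), so the method called in the proof of concept has to be user-defined. A minimal sketch of what such a helper might look like, under the assumption that `$start` and `$length` are non-negative; the name `mbSubstrReplace` is hypothetical:

```php
<?php

/**
 * Hypothetical multibyte-aware counterpart of substr_replace().
 * Positions are counted in characters (code points), not bytes.
 * Only non-negative $start and $length are handled in this sketch.
 */
function mbSubstrReplace(string $string, string $replacement, int $start, int $length = 0): string
{
    return mb_substr($string, 0, $start, 'UTF-8')
        . $replacement
        . mb_substr($string, $start + $length, null, 'UTF-8');
}

// Insert a marker without removing anything (length 0), as the proof of concept does.
$inserted = mbSubstrReplace('héllo', '_', 1, 0);

// Replace one character.
$replaced = mbSubstrReplace('héllo', 'E', 1, 1);
```

Because this counts code points while Telegram counts UTF-16 code units, characters outside the Basic Multilingual Plane (most emojis) still shift the positions by one each.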

f77 commented 6 years ago

    public static function processEntities (string $_text, array $_message_raw): string
    {
        $preset = [
            'bold'      => '<b>%text</b>',
            'italic'    => '<i>%text</i>',
            'text_link' => '<a href="%url">%text</a>',
            'code'      => '<code>%text</code>',
            'pre'       => '<pre>%text</pre>',
        ];

        if (!isset ($_message_raw['entities']))
        {
            return $_text;
        }

        $iterationText = $_text;
        $globalDiff    = 0;
        foreach ($_message_raw['entities'] as $entity)
        {
            $type   = $entity['type'];
            $offset = $entity['offset'] + $globalDiff;
            $length = $entity['length'];

            $pBefore = \mb_substr ($iterationText, 0, $offset);
            $pText   = \mb_substr ($iterationText, $offset, $length);
            $pAfter  = \mb_substr ($iterationText, ($offset + $length));

            // Note: str_replace() works fine with UTF-8 in recent PHP versions.
            if (isset ($preset[$type]))
            {
                // Get pattern from the preset.
                $replacedContent = $preset[$type];

                // Replace the URL first, so that a literal '%url' inside the message text is not substituted afterwards.
                if (!empty ($entity['url']))
                {
                    $replacedContent = \str_replace ('%url', $entity['url'], $replacedContent);
                }

                // Replace main text.
                $replacedContent = \str_replace ('%text', $pText, $replacedContent);

                $newText       = $pBefore . $replacedContent . $pAfter;
                $globalDiff    += (\mb_strlen ($newText) - \mb_strlen ($iterationText));
                $iterationText = $newText;
            }
        }

        return $iterationText;
    }

akalongman commented 6 years ago

@jacklul what is the actual problem? And how can it be reproduced?

jacklul commented 6 years ago

I believe the point of this issue is to have a way to edit and reformat messages using the entities field. Because the message text does not contain formatting, we have to use the 'entities' field for that. I never managed to create a function that could parse this and put it into the message string correctly, because of multibyte strings...

One of the simplest examples would be a button under a message that removes or adds text to the message while keeping the message contents (and that content cannot be obtained or generated in any other way than grabbing it from the Message object).

ParachainsDev commented 4 years ago

Any news on this issue? Emojis + text formatting using entities info (offset, length)

noplanman commented 4 years ago

I have a working version (I think), needs some further testing and then I'll release it :+1:

noplanman commented 4 years ago

My latest experiment, which I'll pack into a small package when it works 100%.

Try the class below, and use it like this:

$entity_decoder = new EntityDecoder($message, 'markdown'); // or 'html'
$decoded_text   = $entity_decoder->decode();

<?php

use Longman\TelegramBot\Entities\Message;
use Longman\TelegramBot\Entities\MessageEntity;

class EntityDecoder
{
    private $entities;
    private $text;
    private $style;
    private $without_cmd;
    private $offset_correction = 0;

    /**
     * @param Message $message     Message object to reconstruct Entities from.
     * @param string  $style       Either 'html' or 'markdown'.
     * @param bool    $without_cmd If the bot command should be included or not.
     */
    public function __construct(Message $message, string $style = 'html', bool $without_cmd = false)
    {
        $this->entities    = $message->getEntities();
        $this->text        = $message->getText($without_cmd);
        $this->style       = $style;
        $this->without_cmd = $without_cmd;
    }

    public function decode(): string
    {
        if (empty($this->entities)) {
            return $this->text;
        }

        $this->fixBotCommandEntity();

        // Reverse entities and start replacing bits from the back, to preserve offset positions.
        foreach (array_reverse($this->entities) as $entity) {
            $this->text = $this->decodeEntity($entity, $this->text);
        }

        return $this->text;
    }

    protected function fixBotCommandEntity(): void
    {
        // First entity would be the bot command, remove if necessary.
        $first_entity = reset($this->entities);
        if ($this->without_cmd && $first_entity->getType() === 'bot_command') {
            $this->offset_correction = ($first_entity->getLength() + 1);
            array_shift($this->entities);
        }
    }

    /**
     * @param MessageEntity $entity
     *
     * @return array
     */
    protected function getOffsetAndLength(MessageEntity $entity): array
    {
        // Split the text into Unicode characters and record how many UTF-16 code units
        // each one occupies (2 for characters outside the BMP, like most Emojis, else 1).
        // Computed per call: a function-level static would be shared between instances.
        // https://www.php.net/manual/en/function.str-split.php#115703
        $chars       = preg_split('/(.)/us', $this->text, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
        $unit_counts = array_map(function ($char) {
            return strlen(mb_convert_encoding($char, 'UTF-16', 'UTF-8')) / 2;
        }, $chars);

        // Telegram reports offset and length in UTF-16 code units.
        $start_unit = $entity->getOffset() - (int) $this->offset_correction;
        $end_unit   = $start_unit + $entity->getLength();

        // Walk the characters and convert both unit positions into character positions.
        $offset = $end = count($unit_counts);
        $units  = 0;
        foreach ($unit_counts as $index => $count) {
            if ($units >= $start_unit && $index < $offset) {
                $offset = $index;
            }
            if ($units >= $end_unit) {
                $end = $index;
                break;
            }
            $units += $count;
        }

        return [$offset, $end - $offset];
    }

    /**
     * @param string $style
     * @param string $type
     *
     * @return string
     */
    protected function getFiller(string $style, string $type): string
    {
        $fillers = [
            'html'     => [
                'text_mention' => '<a href="tg://user?id=%2$s">%1$s</a>',
                'text_link'    => '<a href="%2$s">%1$s</a>',
                'bold'         => '<b>%s</b>',
                'italic'       => '<i>%s</i>',
                'code'         => '<code>%s</code>',
                'pre'          => '<pre>%s</pre>',
            ],
            'markdown' => [
                'text_mention' => '[%1$s](tg://user?id=%2$s)',
                'text_link'    => '[%1$s](%2$s)',
                'bold'         => '*%s*',
                'italic'       => '_%s_',
                'code'         => '`%s`',
                'pre'          => '```%s```',
            ],
        ];

        return $fillers[$style][$type] ?? '';
    }

    /**
     * Decode an entity into the passed string.
     *
     * @param MessageEntity $entity
     * @param string        $text
     *
     * @return string
     */
    private function decodeEntity(MessageEntity $entity, string $text): string
    {
        [$offset, $length] = $this->getOffsetAndLength($entity);

        $text_bit = $this->getTextBit($entity, $offset, $length);

        // Replace text bit.
        return mb_substr($text, 0, $offset) . $text_bit . mb_substr($text, $offset + $length);
    }

    /**
     * @param MessageEntity $entity
     * @param int           $offset
     * @param int           $length
     *
     * @return false|string
     */
    private function getTextBit(MessageEntity $entity, $offset, $length)
    {
        $type     = $entity->getType();
        $filler   = $this->getFiller($this->style, $type);
        $text_bit = mb_substr($this->text, $offset, $length);

        switch ($type) {
            case 'text_mention':
                $text_bit = sprintf($filler, $text_bit, $entity->getUser()->getId());
                break;
            case 'text_link':
                $text_bit = sprintf($filler, $text_bit, $entity->getUrl());
                break;
            case 'bold':
            case 'italic':
            case 'code':
            case 'pre':
                $text_bit = sprintf($filler, $text_bit);
                break;
            default:
                break;
        }

        return $text_bit;
    }
}

ParachainsDev commented 4 years ago

My latest experiment, which I'll pack into a small package when it works 100%.

Tested, and I do not see any problems. Lots of emojis and different formatting work fine at first glance.

Oreolek commented 6 months ago

All code snippets in this thread utterly fail on underlined text inside spoilers. (HTML mode)

UPD: Use https://packagist.org/packages/lucadevelop/telegram-entities-decoder
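
For illustration only, here is a rough sketch of how nested entities (such as underline inside a spoiler) might be handled in HTML mode with a stack. It is not the linked package's implementation. It assumes entities arrive as plain arrays with `type`, `offset`, and `length` keys, that they are properly nested (no partial overlaps), and that PHP 7.4+ with mbstring is available. The function name `entitiesToHtml` is hypothetical, and only a few simple entity types are mapped.

```php
<?php

/**
 * Hypothetical sketch: rebuild HTML from text plus (possibly nested) entities.
 * Offsets and lengths are UTF-16 code units, so the text is handled as UTF-16LE.
 */
function entitiesToHtml(string $text, array $entities): string
{
    $tags = [
        'bold' => 'b', 'italic' => 'i', 'underline' => 'u',
        'strikethrough' => 's', 'spoiler' => 'tg-spoiler', 'code' => 'code',
    ];

    // Keep only supported types; sort by start, outermost (longest) entity first.
    $entities = array_values(array_filter($entities, fn($e) => isset($tags[$e['type']])));
    usort($entities, fn($a, $b) => [$a['offset'], -$a['length']] <=> [$b['offset'], -$b['length']]);

    $utf16 = mb_convert_encoding($text, 'UTF-16LE', 'UTF-8');

    // Return the escaped UTF-8 text between two UTF-16 unit positions.
    $slice = function (int $from, int $to) use ($utf16): string {
        $part = substr($utf16, $from * 2, ($to - $from) * 2);
        return htmlspecialchars(mb_convert_encoding($part, 'UTF-8', 'UTF-16LE'), ENT_NOQUOTES);
    };

    $html  = '';
    $pos   = 0;   // current position in UTF-16 units
    $stack = [];  // open entities as [end position, tag]

    // Close every open entity that ends at or before $limit, innermost first.
    $closeUpTo = function (int $limit) use (&$html, &$pos, &$stack, $slice): void {
        while ($stack && end($stack)[0] <= $limit) {
            [$end, $tag] = array_pop($stack);
            $html .= $slice($pos, $end) . "</$tag>";
            $pos = $end;
        }
    };

    foreach ($entities as $e) {
        $closeUpTo($e['offset']);
        $html .= $slice($pos, $e['offset']) . '<' . $tags[$e['type']] . '>';
        $pos = $e['offset'];
        $stack[] = [$e['offset'] + $e['length'], $tags[$e['type']]];
    }
    $closeUpTo(PHP_INT_MAX);

    return $html . $slice($pos, intdiv(strlen($utf16), 2));
}

// Underlined "secret" inside a spoiler that starts at the emoji.
$html = entitiesToHtml('a 😀 secret', [
    ['type' => 'underline', 'offset' => 5, 'length' => 6],
    ['type' => 'spoiler', 'offset' => 2, 'length' => 9],
]);
```

Opening tags are emitted outermost first and closing tags innermost first, which is what keeps the nesting valid when two entities end at the same position.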