Offset value is incorrect in case Message text contains emoji

Hello everyone! I ran into a problem when trying to apply tags based on TD::Types::TextEntity. For example, I send the message 😀qwerty with "werty" in bold, and I receive a Message with this content:

_content=#<TD::Types::MessageContent::Text text=#<TD::Types::FormattedText text="😀qwerty" entities=[#<TD::Types::TextEntity offset=3 length=5 type=#<TD::Types::TextEntityType::Bold>>]> web_page=nil> reply_markup=nil>

It says that the bold text in my message begins at position 3. So I insert the <b> tag at index 3 of the Ruby string, and the result is 😀qw<b>erty instead of the expected 😀q<b>werty: the tag lands one character too late.

If there is no emoji at the beginning of the string, the offset is correct. I send the message qwerty (again with "werty" in bold) and receive a Message with this content:

_content=#<TD::Types::MessageContent::Text text=#<TD::Types::FormattedText text="qwerty" entities=[#<TD::Types::TextEntity offset=1 length=5 type=#<TD::Types::TextEntityType::Bold>>]> web_page=nil> reply_markup=nil>

So the bold text starts at index 1 (counting from zero), which is correct.
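
The mismatch can be reproduced in plain Ruby without TDLib. This is a minimal sketch; the variable names are illustrative only:

```ruby
text = "😀qwerty"

# Ruby counts characters (Unicode code points).
char_count = text.length

# TDLib counts UTF-16 code units; each unit is 2 bytes in UTF-16LE.
utf16_units = text.encode('UTF-16LE').bytesize / 2

# The emoji U+1F600 lies outside the Basic Multilingual Plane,
# so it takes two UTF-16 code units (a surrogate pair).
emoji_units = "😀".encode('UTF-16LE').bytesize / 2

puts "characters: #{char_count}, UTF-16 units: #{utf16_units}, emoji units: #{emoji_units}"
# prints: characters: 7, UTF-16 units: 8, emoji units: 2
```

Every surrogate pair before an entity therefore shifts the TDLib offset by one relative to the Ruby character index.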

I dug into this problem and came up with the solution below. It doesn't look good, I know :) If you have a better one, let me know.

messages = client.get_chat_history(chat_id: 'chat_id', from_message_id: 0, offset: 0, limit: 5, only_local: false).wait
message = messages.value.messages.last
ApplyEntities.new(message).call

# https://core.telegram.org/api/entities
#
# The difference in how TDLib and Ruby count emoji length comes from
# the handling of surrogate pairs in UTF-16.
# TDLib calculates offsets in UTF-16 code units.
# In UTF-16, characters outside the Basic Multilingual Plane (most emoji)
# are encoded as surrogate pairs.
# TDLib counts a surrogate pair as 2 units, while Ruby counts it as one character.
class ApplyEntities
  attr_accessor :message, :initial_text

  def initialize(message)
    @message = message
    @initial_text = message.content.text.text
  end

  def call
    surrogate_ind = find_surrogate_pair_ind
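    entities = message.content.text.entities.map do |e|
      opening_tag = entity_types(e)[:opening]
      closing_tag = entity_types(e)[:closing]

      # Plain nil checks instead of ActiveSupport's `present?`, so Rails is not required.
      if opening_tag && closing_tag
        # surrogate_ind holds Ruby character indices; the k-th surrogate pair
        # starts at UTF-16 offset (index + k). Count the pairs that lie entirely
        # before a boundary, separately for the opening and the closing tag.
        to_char_index = lambda do |utf16_offset|
          utf16_offset - surrogate_ind.each_with_index.count { |i, k| i + k < utf16_offset }
        end
        {
          to_char_index.call(e.offset + e.length) => closing_tag,
          to_char_index.call(e.offset) => opening_tag
        }
      end
    end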

    tags = entities.compact.reduce({}) do |h, i|
      h.merge(i) { |_index, prev_tag, next_tag| prev_tag + next_tag }
    end
    tags = tags.sort.reverse.to_h

    splice(initial_text, tags)
  end

  def entity_types(entity)
    url = entity.type.url if entity.type.respond_to?(:url)

    tags =
      {
        'TD::Types::TextEntityType::Bold' => {
          opening: '<b>',
          closing: '</b>'
        },
        'TD::Types::TextEntityType::TextUrl' => {
          opening: "<a href=\"#{url}\">",
          closing: '</a>'
        },
        'TD::Types::TextEntityType::Underline' => {
          opening: '<u>',
          closing: '</u>'
        }
        # etc...
      }

    tags.fetch(entity.type.class.to_s, {})
  end

  def splice(string, tags)
    return string if tags.empty?

    result = ''
    string.each_char.with_index do |char, ind|
      result << tags[ind].to_s << char
    end
    # covers the case when tag should be placed in the end of string.
    # e.g. string length is 8 chars, tag's place is 8.
    # string.each_char returns chars with indexes 0..7, there's no 8th place in string.
    (string.length..tags.keys.max).each do |ind|
      result << tags[ind].to_s
    end

    result
  end

  def find_surrogate_pair_ind
    # Encode the text to UTF-16
    utf16_text = initial_text.encode('UTF-16LE')
    indices = []
    byte_index = 0
    char_index = 0

    # Go through each 16-bit word in a string
    while byte_index < utf16_text.bytesize
      high_surrogate = utf16_text.getbyte(byte_index) | (utf16_text.getbyte(byte_index + 1) << 8)
      if byte_index + 2 < utf16_text.bytesize
        low_surrogate = utf16_text.getbyte(byte_index + 2) | (utf16_text.getbyte(byte_index + 3) << 8)
      end

      # Check if current symbol is a surrogate pair
      if high_surrogate.between?(0xD800, 0xDBFF) && low_surrogate&.between?(0xDC00, 0xDFFF)
        indices << char_index
        byte_index += 4 # Skip two 16-bit words (4 bytes)
      else
        byte_index += 2 # Skip one 16-bit word (2 bytes)
      end
      char_index += 1
    end

    indices
  end
end
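
An alternative to locating surrogate pairs by hand is to let Ruby do the conversion. Below is a minimal sketch, assuming the text is valid UTF-8 and that entity boundaries never fall inside a surrogate pair. The helpers `utf16_offset_to_char_index` and `entity_text` are hypothetical names, not part of tdlib-ruby:

```ruby
# Convert a UTF-16 code unit offset (as reported by TDLib) into a Ruby
# character index by decoding the UTF-16 prefix and counting its characters.
def utf16_offset_to_char_index(text, utf16_offset)
  utf16 = text.encode('UTF-16LE')
  # Each UTF-16 code unit is 2 bytes.
  utf16.byteslice(0, utf16_offset * 2).encode('UTF-8').length
end

# Extract the substring covered by an entity, given its UTF-16 offset and length.
def entity_text(text, utf16_offset, utf16_length)
  text.encode('UTF-16LE').byteslice(utf16_offset * 2, utf16_length * 2).encode('UTF-8')
end

text = "😀qwerty"
start_index = utf16_offset_to_char_index(text, 3)     # entity offset
end_index = utf16_offset_to_char_index(text, 3 + 5)   # entity offset + length
bold_part = entity_text(text, 3, 5)

tagged = text[0...start_index] + '<b>' + text[start_index...end_index] + '</b>' + text[end_index..-1]
puts tagged
# prints: 😀q<b>werty</b>
```

This re-encodes the string on every call, so for messages with many entities it may be worth encoding once and reusing the UTF-16 copy.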

UPD: there's also a helpful method: client.get_markdown_text(text: message.content.text).value!

southbridgeio / tdlib-ruby

Offset value is incorrect in case Message text contains emoji #66