Hello everyone!
I ran into a problem when trying to apply tags from TD::Types::TextEntity.
For example, I send a message:
😀qwerty
I receive a Message with content:
_content=#<TD::Types::MessageContent::Text text=#<TD::Types::FormattedText text="😀qwerty" entities=[#<TD::Types::TextEntity offset=3 length=5 type=#>]> web_page=nil> replymarkup=nil>
It says that the bold text in my message begins at symbol number 3. So I insert a <b> tag at the third symbol of the string, and the result is
😀qwerty.
If there is no emoji at the beginning of the string, the offset is correct.
I send a message: qwerty
I receive a Message with content:
_content=#<TD::Types::MessageContent::Text text=#<TD::Types::FormattedText text="qwerty" entities=[#<TD::Types::TextEntity offset=1 length=5 type=#>]> web_page=nil> replymarkup=nil>
So the bold text starts at symbol 1, which is correct.
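The mismatch is easy to reproduce without TDLib. Here is a minimal sketch, assuming the entity offsets are UTF-16 code units as the Telegram entities docs describe (`utf16_length` is just a helper name made up for this example):

```ruby
# Hypothetical helper: number of UTF-16 code units in a string.
def utf16_length(str)
  str.encode('UTF-16LE').bytesize / 2
end

text = "😀qwerty"

chars = text.length        # Ruby counts characters (code points)
units = utf16_length(text) # TDLib counts UTF-16 code units

puts "characters: #{chars}, UTF-16 units: #{units}"

# The emoji takes two UTF-16 units, so UTF-16 offset 3 points at "w",
# while Ruby character index 3 points at "e".
char_at_ruby_index = text[3]
puts char_at_ruby_index
```

So any offset that comes after the emoji is shifted by one when it is used directly as a Ruby string index.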
I dug into this problem and came up with this solution. It doesn't look good, I know :) If you have a better one, let me know.
# https://core.telegram.org/api/entities
#
# TDLib and Ruby measure string positions differently.
# TDLib calculates entity offsets and lengths in UTF-16 code units,
# while Ruby indexes strings by character (code point).
# In UTF-16, characters above U+FFFF, which includes most emoji, are encoded as surrogate pairs.
# TDLib counts a surrogate pair as 2 units; Ruby counts it as one character.
class ApplyEntities
  attr_accessor :message, :initial_text

  def initialize(message)
    @message = message
    @initial_text = message.content.text.text
  end

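  def call
    surrogate_ind = find_surrogate_pair_ind
    # Convert a UTF-16 offset to a Ruby character index. The k-th surrogate pair
    # (character index ci) starts at UTF-16 offset ci + k, so subtract the number
    # of pairs that start before the given offset.
    to_char_index = lambda do |utf16_offset|
      utf16_offset - surrogate_ind.each_with_index.count { |ci, k| ci + k < utf16_offset }
    end

    entities = message.content.text.entities.map do |e|
      tag = entity_types(e)
      opening_tag = tag[:opening]
      closing_tag = tag[:closing]
      # Skip entity types that have no tag mapping
      next unless opening_tag && closing_tag

      # Convert start and end separately: surrogate pairs inside the entity
      # shift the closing tag but not the opening one.
      {
        to_char_index.call(e.offset + e.length) => closing_tag,
        to_char_index.call(e.offset) => opening_tag
      }
    end

    # Merge everything into one hash keyed by character index;
    # tags that share an index are concatenated in entity order.
    tags = entities.compact.reduce({}) do |h, i|
      h.merge(i) { |_k, prev_tag, next_tag| prev_tag + next_tag }
    end

    splice(initial_text, tags)
  end
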
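  def entity_types(entity)
    url = entity.type.url if entity.type.respond_to?(:url)
    tags = {
      'TD::Types::TextEntityType::Bold' => {
        opening: '<b>',
        closing: '</b>'
      },
      'TD::Types::TextEntityType::TextUrl' => {
        # Quote the attribute value; an unquoted value breaks on URLs containing spaces
        opening: "<a href=\"#{url}\">",
        closing: '</a>'
      },
      # The key must be the full class name, because the lookup below uses entity.type.class.to_s
      'TD::Types::TextEntityType::Underline' => {
        opening: '<u>',
        closing: '</u>'
      }
      # etc...
    }
    tags.fetch(entity.type.class.to_s, {})
  end
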
  def splice(string, tags)
    return string if tags.empty?

    result = +''
    string.each_char.with_index do |char, ind|
      result << tags[ind].to_s << char
    end
    # Covers the case when a tag must be placed at the end of the string.
    # E.g. the string is 8 chars long and the tag's position is 8:
    # string.each_char yields indexes 0..7, so index 8 never comes up.
    (string.length..tags.keys.max).each do |ind|
      result << tags[ind].to_s
    end
    result
  end

  # Returns the Ruby character indices of all characters that are
  # encoded as a surrogate pair in UTF-16.
  def find_surrogate_pair_ind
    # Encode the text to UTF-16 (little-endian, no BOM)
    utf16_text = initial_text.encode('UTF-16LE')
    indices = []
    byte_index = 0
    char_index = 0
    # Go through each 16-bit word in the string
    while byte_index < utf16_text.bytesize
      high_surrogate = utf16_text.getbyte(byte_index) | (utf16_text.getbyte(byte_index + 1) << 8)
      # Reset on every iteration so a value from the previous word is never reused
      low_surrogate = nil
      if byte_index + 2 < utf16_text.bytesize
        low_surrogate = utf16_text.getbyte(byte_index + 2) | (utf16_text.getbyte(byte_index + 3) << 8)
      end
      # Check if the current character is a surrogate pair
      if high_surrogate.between?(0xD800, 0xDBFF) && low_surrogate&.between?(0xDC00, 0xDFFF)
        indices << char_index
        byte_index += 4 # Skip two 16-bit words (4 bytes)
      else
        byte_index += 2 # Skip one 16-bit word (2 bytes)
      end
      char_index += 1
    end
    indices
  end
end
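For comparison, the offset conversion can probably be done in a shorter way by slicing the UTF-16 encoded string. This is only a sketch, not something tdlib-ruby provides: `utf16_offset_to_char_index` is a made-up helper name, and it assumes TDLib never reports an offset that falls in the middle of a surrogate pair.

```ruby
# Hypothetical helper (not part of tdlib-ruby): converts an offset given in
# UTF-16 code units into a Ruby character index.
def utf16_offset_to_char_index(text, utf16_offset)
  utf16 = text.encode('UTF-16LE')
  # Every UTF-16 code unit is 2 bytes, so cut the prefix by bytes
  # and count how many characters it contains.
  utf16.byteslice(0, utf16_offset * 2).length
end

text = "😀qwerty"
start_index = utf16_offset_to_char_index(text, 3)     # entity offset
end_index   = utf16_offset_to_char_index(text, 3 + 5) # entity offset + length

# Insert the closing tag first so the opening tag's index stays valid.
tagged = text.dup.insert(end_index, '</b>').insert(start_index, '<b>')
puts tagged
# 😀q<b>werty</b>
```

With several entities the same idea applies: convert every start and end offset first, then insert the tags from the highest index down.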
UPD: there's also a helpful method: client.get_markdown_text(text: message.content.text).value!