taoqf / node-html-parser

A very fast HTML parser, generating a simplified DOM, with basic element query support.
MIT License
1.11k stars 107 forks

Emojis are counted as one char, how to make them count as two? #192

Closed VityaSchel closed 2 years ago

VityaSchel commented 2 years ago

Hello, I'm using this parser to build Telegram MTProto message entities. Telegram requires an offset and a length for each entity (such as bold or underlined text). However, Telegram counts each emoji as two chars, I suppose because it counts code units instead of actual characters. How can I make emojis count as two chars?
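
To illustrate the difference the question describes, here is a minimal sketch in plain JavaScript (no parser involved) that measures the same string in UTF-16 code units and in Unicode code points. The variable names are only illustrative.

```javascript
// One emoji outside the Basic Multilingual Plane, preceded by two ASCII letters.
const sample = 'Hi🌚'

// String.prototype.length reports UTF-16 code units,
// so the emoji contributes 2 (a surrogate pair).
const codeUnits = sample.length

// The string iterator walks whole code points, so the emoji contributes 1.
const codePoints = [...sample].length

console.log(codeUnits)  // 4
console.log(codePoints) // 3
```

If offsets are expected in code units, as the question supposes, a count based on code points comes up one short for every such emoji that precedes the entity.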

VityaSchel commented 2 years ago

Closing because I found a solution: before parsing, replace each emoji with any three regular chars, then swap those three chars back to the emoji in the text that is sent to Telegram:

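import { parse } from 'node-html-parser'
import { stripHtml } from 'string-strip-html'

// Assumed mapping from tag names to MTProto entity constructors;
// the original post does not show it, so adapt it to your client library.
const entitiesMapping = {
  b: 'messageEntityBold',
  i: 'messageEntityItalic',
  u: 'messageEntityUnderline',
  a: 'messageEntityTextUrl'
}

// The emoji has already been replaced with the placeholder '&!M'
const originalText = 'Hello&!M<b>world</b>'

function convertHTMLToEntities(root, element = root) {
  let entities = []
  for(let child of element.childNodes) {
    if(child.constructor.name === 'HTMLElement') {
      // How many characters stripHtml removes from the HTML up to `start`
      const difference = start => {
        const htmlBeforeStart = root.outerHTML.substring(0, start+1)
        return htmlBeforeStart.length - stripHtml(htmlBeforeStart).result.length
      }

      entities.push({
        _: entitiesMapping[child.rawTagName],
        // Offset into the plain text, i.e. the HTML position minus the markup before it
        offset: child.range[0] - difference(child.range[0]),
        length: child.innerText.length,
        // Links also carry their target URL
        ...(child.rawTagName === 'a' && { url: child.getAttribute('href') })
      })
      // Recurse so nested tags produce their own entities
      if(child.childNodes) entities.push(...convertHTMLToEntities(root, child))
    }
  }
  return entities
}

const parsedText = parse(originalText)
const entities = convertHTMLToEntities(parsedText)
// Put the emoji back in place of the placeholder before sending
const textToSend = parsedText.innerText.replaceAll('&!M', '🌚')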
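
A variation on the placeholder idea, sketched under the assumption that the placeholder should occupy the same number of UTF-16 code units as the character it stands in for (two), so that lengths measured on the masked text still hold once the emojis are restored. The helper names `maskAstral` and `unmaskAstral` are hypothetical and not part of node-html-parser. The sketch also assumes the input never contains the private-use character U+E000.

```javascript
// Two private-use characters: one per UTF-16 code unit of a surrogate pair.
const PLACEHOLDER = '\uE000\uE000'

// Replace every code point above U+FFFF (which includes most emojis)
// with the placeholder, remembering the originals in order.
function maskAstral(text) {
  const originals = []
  const masked = text.replace(/[\u{10000}-\u{10FFFF}]/gu, ch => {
    originals.push(ch)
    return PLACEHOLDER
  })
  return { masked, originals }
}

// Put the remembered characters back, left to right.
function unmaskAstral(masked, originals) {
  let i = 0
  return masked.replace(/\uE000\uE000/g, () => originals[i++])
}

const { masked, originals } = maskAstral('Hello🌚world')
const restored = unmaskAstral(masked, originals)

console.log(masked.length === 'Hello🌚world'.length) // true
console.log(restored === 'Hello🌚world')             // true
```

Because the masked string contains only single-unit characters, character counts and code-unit counts agree on it, whichever one a tool reports.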