html-to-text / node-html-to-text

Advanced html to text converter

Could encodeCharacters support emojis? #267

Closed webstech closed 1 year ago

webstech commented 1 year ago

I thought it might be handy to use the encodeCharacters option to change emojis into short codes.

Tried the following:

    encodeCharacters: {
            "😀": ":smiley:"
    },

The output was :smiley:�. Tried changing the code to use codePointAt with no success. I may have done it wrong but the regex looked somewhat reasonable.

Can emojis be supported for this? Their code points are larger numbers (i.e. more than 4 hex digits).

Full disclosure: I have no actual requirement for this and just thought I would try it in the test code. I am raising the question so anyone else with the same idea will find the answer.

KillyMXI commented 1 year ago

The purpose of this feature is to safely escape characters that might have special meaning in the output text. More specifically, I needed a way to keep markdown from being broken by a rogue unescaped _ or |, for example. A perfect solution would escape characters only in contexts where they are ambiguous, but I don't know how to achieve that yet.

The reason it doesn't work with smileys: the regex takes only a single character (one UTF-16 code unit) from each key. Smileys are not single characters, and some of them are not even a single code point. Properly isolating any of them would require significant effort. (Intl.Segmenter only comes with Node 16, and using it would mean segmenting each text fragment just to check whether there is any symbol to replace.)
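The reported output can be reproduced with a minimal sketch, assuming a replacer that encodes only the first UTF-16 code unit of each key. The helper name naiveReplace is hypothetical and is not the library's code; it only illustrates why a stray character is left behind.

```javascript
// Hypothetical stand-in for a replacer that looks only at the first
// UTF-16 code unit of each dictionary key. Not the library's actual code.
function naiveReplace(text, dictionary) {
  let result = text;
  for (const [key, value] of Object.entries(dictionary)) {
    // Encode the first code unit as \uXXXX so the regex is always valid.
    const unit = key.charCodeAt(0).toString(16).padStart(4, '0');
    const pattern = new RegExp('\\u' + unit, 'g');
    // A function replacer avoids special handling of "$" in values.
    result = result.replace(pattern, () => value);
  }
  return result;
}

const smiley = '\u{1F600}'; // 😀, stored as the surrogate pair D83D DE00
const naiveOutput = naiveReplace(smiley, { [smiley]: ':smiley:' });

console.log(smiley.length); // 2 code units for one visible symbol
console.log(naiveOutput === ':smiley:\uDE00'); // true: the low surrogate is left behind
```

The leftover low surrogate has no valid rendering on its own, which is why it shows up as � in the output.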

How the replacer is constructed: https://github.com/html-to-text/node-html-to-text/blob/2b18dacdaa5a356afe24487e4ba12556be1376f3/packages/base/src/index.js#L164-L185

The reason to encode characters is to produce a guaranteed valid regex without worrying about specific escape rules. I could potentially encode whole keys, which should make it work with smileys.

There is one caveat, though, and it won't be obvious to the programmer: the order of keys becomes important. If any key is a substring of another key (or can partially overlap another key), it must not appear first. That includes basic and compound emojis. Perhaps I can sort the entries array by key length (descending) to avoid the risk of substrings matching first. But the risk of partial overlap remains, even if it is less likely in practice. It would also make the name encodeCharacters inaccurate for what the option does.
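The whole-key idea with length-descending sorting could look roughly like the sketch below. The helpers encodeKey and buildReplacer are hypothetical names introduced for illustration, and this is not the implementation at the link above.

```javascript
// Hypothetical helpers; a sketch of the idea, not html-to-text's implementation.

// Encode every UTF-16 code unit of a key as \uXXXX, so no regex escaping rules apply.
function encodeKey(key) {
  let encoded = '';
  for (let i = 0; i < key.length; i++) {
    encoded += '\\u' + key.charCodeAt(i).toString(16).padStart(4, '0');
  }
  return encoded;
}

function buildReplacer(dictionary) {
  // Longest keys first, so a key never loses to one of its own substrings.
  const entries = Object.entries(dictionary)
    .filter(([key]) => key.length > 0)
    .sort(([a], [b]) => b.length - a.length);
  if (entries.length === 0) {
    return (text) => text; // nothing to replace
  }
  const pattern = new RegExp(entries.map(([key]) => encodeKey(key)).join('|'), 'g');
  const lookup = new Map(entries);
  return (text) => text.replace(pattern, (match) => lookup.get(match));
}

const MAN = '\u{1F468}';
const FAMILY = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}'; // man + ZWJ + woman + ZWJ + girl

const replaceWholeKeys = buildReplacer({
  [MAN]: ':man:', // listed first on purpose; sorting fixes the order
  [FAMILY]: ':family:',
  '_': '\\_',
});
const wholeKeyOutput = replaceWholeKeys('a_b ' + FAMILY + ' ' + MAN);

console.log(wholeKeyOutput === 'a\\_b :family: :man:'); // true
```

Sorting handles the substring case only. Two keys that partially overlap without one containing the other can still match in a surprising order.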

I may go for it once I see more practical demand.

For now, replacing smileys can be safely done after conversion, because at that point it is less likely to mess with text markup. The only issue is that the wordwrap length might not be respected afterwards. (In fact, the line length is computed from the string length, so for compound emojis it can be miscounted in the opposite direction.)
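A rough sketch of the post-conversion approach, and of why line lengths shift, is shown here. The string wrappedLine stands in for text already produced by the converter, and replaceAfterConversion is a hypothetical helper, not a library feature.

```javascript
// Hypothetical post-processing step applied to already converted text.
function replaceAfterConversion(converted, shortcodes) {
  let result = converted;
  for (const [emoji, code] of shortcodes) {
    // split/join replaces every occurrence without any regex escaping.
    result = result.split(emoji).join(code);
  }
  return result;
}

const shortcodeMap = new Map([['\u{1F600}', ':smiley:']]);
const wrappedLine = 'Hi \u{1F600}'; // stand-in for one line of converter output
const postProcessed = replaceAfterConversion(wrappedLine, shortcodeMap);

console.log(wrappedLine.length);   // 5: the emoji counts as two code units
console.log(postProcessed.length); // 11: the line grew after wrapping was done

// A compound emoji is one visible symbol but many code units.
const familyEmoji = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}';
console.log(familyEmoji.length);   // 8
```

The first two numbers show how a line can exceed the wrap width after replacement. The last one shows the opposite miscount, where a single visible symbol is counted as eight characters.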

KillyMXI commented 1 year ago

I have now figured out that, while taking one character is sufficient for my purpose, with a bit more effort I can take and encode the first code point. That sounds like the right thing to do. Coincidentally, it will make this work for many emojis, but not for compound ones.
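The difference between a simple and a compound emoji under a first-code-point rule can be sketched as follows. The function firstCodePointReplacer is a hypothetical illustration of the idea and is not taken from the library.

```javascript
// Hypothetical sketch: build each pattern from the first code point of a key.
function firstCodePointReplacer(dictionary) {
  const rules = Object.entries(dictionary)
    .filter(([key]) => key.length > 0)
    .map(([key, value]) => ({
      // \u{...} with the "u" flag matches a whole code point, surrogate pairs included.
      pattern: new RegExp('\\u{' + key.codePointAt(0).toString(16) + '}', 'gu'),
      value,
    }));
  return (text) => rules.reduce(
    (result, { pattern, value }) => result.replace(pattern, () => value),
    text
  );
}

const GRIN = '\u{1F600}';
const FAMILY_ZWJ = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}'; // man + ZWJ + woman + ZWJ + girl

const simpleOutput = firstCodePointReplacer({ [GRIN]: ':smiley:' })(GRIN + '!');
const compoundOutput = firstCodePointReplacer({ [FAMILY_ZWJ]: ':family:' })(FAMILY_ZWJ);

console.log(simpleOutput === ':smiley:!'); // true: single code point handled fully
// Only the leading "man" code point is replaced; the rest of the sequence remains.
console.log(compoundOutput === ':family:\u200D\u{1F469}\u200D\u{1F467}'); // true
```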

After switching to Node 16 next year, it will be possible to apply Intl.Segmenter to the dictionary keys to get the first symbol. That might solve it for all emojis without noticeable performance impact.
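Taking the first symbol of a key with Intl.Segmenter might look like the sketch below. It assumes a runtime that provides Intl.Segmenter (Node 16 or later), and firstGrapheme is a hypothetical helper name.

```javascript
// Requires a runtime with Intl.Segmenter (Node 16 or later).
// Hypothetical helper: returns the first user-perceived character of a key.
function firstGrapheme(key) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  for (const { segment } of segmenter.segment(key)) {
    return segment; // the first grapheme cluster
  }
  return ''; // empty key
}

const FAMILY_SEQ = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}'; // man + ZWJ + woman + ZWJ + girl

console.log(firstGrapheme(FAMILY_SEQ) === FAMILY_SEQ); // true: the whole sequence is one symbol
console.log(firstGrapheme('_x')); // "_"
```

Because the segmenter runs only over the dictionary keys, and not over every text fragment, the cost is paid once when the replacer is built.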

The risk of replacing part of a compound emoji when only that part is specified in the dictionary remains either way. I'm not sure this is fixable.

KillyMXI commented 1 year ago

I published 9.0.2 with the change. Now it takes the first code point.

While at it, I also allowed values in the dictionary to be string | false. This is mainly for the html to md package, which will come with some defaults, to provide a cleaner way to disable them.
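One way a false value could switch off a default entry is sketched here. The default dictionary and the mergeEncodings helper are assumptions made up for illustration; the thread does not show the actual defaults or the merge logic.

```javascript
// Hypothetical defaults and merge logic, for illustration only.
const defaultEncodings = { '_': '\\_', '|': '\\|' };

function mergeEncodings(defaults, overrides) {
  const merged = { ...defaults, ...overrides };
  // Entries set to false are treated as disabled and dropped.
  return Object.fromEntries(
    Object.entries(merged).filter(([, value]) => value !== false)
  );
}

// Disable the default for "|" and add a new entry for "*".
const effectiveEncodings = mergeEncodings(defaultEncodings, { '|': false, '*': '\\*' });

console.log(Object.keys(effectiveEncodings)); // [ '_', '*' ]
```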

webstech commented 1 year ago

Thank you very much. It is working. I appreciate the quick response.