shshaw / Splitting

JavaScript microlibrary to split an element by words, characters, children and more, populated with CSS variables!
https://splitting.js.org
MIT License

Emoji/Unicode Range Support #25

Open notoriousb1t opened 5 years ago

notoriousb1t commented 5 years ago

Splitting does not appear to work with certain Unicode ranges. ⚡️ works, but some other emoji do not. Maybe this is an issue with "".split()
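A minimal sketch of one likely cause, assuming the characters are produced by splitting on an empty string: split("") cuts between UTF-16 code units, so an emoji outside the Basic Multilingual Plane is torn into two lone surrogates, while ⚡️ (U+26A1 plus a variation selector, both in the BMP) survives. The variable names are illustrative only and are not taken from the Splitting source.

```javascript
// ⚡️ is U+26A1 followed by U+FE0F (variation selector); both fit in one code unit.
const bolt = "\u26A1\uFE0F";
// 😀 is U+1F600, a single code point stored as a surrogate pair (two code units).
const grin = "\u{1F600}";

// split("") works on code units: the bolt yields two intact pieces,
// the grin yields two lone surrogates that cannot render on their own.
const boltUnits = bolt.split("");
const grinUnits = grin.split("");

// Array.from iterates by code point, so the surrogate pair stays together.
const grinPoints = Array.from(grin);

console.log(boltUnits.length, grinUnits.length, grinPoints.length);
```

Iterating by code point fixes the surrogate-pair case, but it still separates multi-code-point sequences such as emoji joined with a zero-width joiner.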

shshaw commented 5 years ago

Yes; I assume there are some Unicode issues. It may be best to split on /\S/. Not sure if that would help.

shshaw commented 5 years ago

Created an emoji-support branch to work through this.

From our conversation:

shshaw [10:30 AM] Lodash does seem to work: https://codepen.io/shshaw/pen/451e393401663892e0fee944575d4bd2
In total, lodash is ~4kb gzipped… I wonder how small just toArray could be with treeshaking:
chars = _.toArray(wholeText)

notoriousb1t [10:36 AM] it probably wouldn't add a whole lot

shshaw [10:37 AM] We could potentially simplify the logic overall with it

shshaw [10:37 AM] There’s a lodash-es for an ES6 version of lodash. That may help with treeshaking. It may be this simple: https://www.neontsunami.com/posts/allow-treeshaking-with-lodash

shshaw commented 5 years ago

This seems to be the source for lodash's emoji-processing toArray, for reference: https://github.com/lodash/lodash/blob/4ea8c2ec249be046a0f4ae32539d652194caf74f/.internal/unicodeToArray.js

In theory we could probably simplify from that, but ideally we could just import that and/or the stringToArray function with treeshaking and not have to maintain the unicode/RegEx: https://github.com/lodash/lodash/blob/4ea8c2ec249be046a0f4ae32539d652194caf74f/.internal/stringToArray.js

jhnsnc commented 5 years ago

Unfortunately if you want to capture the full nuance of emoji sequences, you end up needing to do something at least as complex as lodash's unicodeToArray.

You can go with some simpler options if you're okay with some broken edge cases.
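As a rough illustration of such a simpler option, here is a sketch that splits by code point and then glues zero-width-joiner sequences and variation selectors back onto the preceding character. splitChars is a hypothetical helper, not lodash's unicodeToArray and not Splitting's actual implementation, and it knowingly leaves skin-tone modifiers, flags and combining marks broken.

```javascript
// Hypothetical helper: a deliberately small, incomplete emoji-aware splitter.
function splitChars(text) {
  const ZWJ = "\u200D";  // zero-width joiner
  const VS16 = "\uFE0F"; // emoji variation selector
  const out = [];
  let joinNext = false;  // true when the previous code point was a ZWJ

  // Array.from iterates by code point, so surrogate pairs stay intact.
  for (const cp of Array.from(text)) {
    if (out.length && (cp === ZWJ || cp === VS16 || joinNext)) {
      out[out.length - 1] += cp; // attach to the previous character
    } else {
      out.push(cp);
    }
    joinNext = cp === ZWJ;
  }
  return out;
}

// Family emoji: man + ZWJ + woman + ZWJ + girl, kept as one character.
const family = splitChars("a\u{1F468}\u200D\u{1F469}\u200D\u{1F467}b");
// Thumbs up + skin-tone modifier: a known broken edge case, split in two.
const thumbs = splitChars("\u{1F44D}\u{1F3FD}");

console.log(family.length, thumbs.length);
```

The trade-off is size versus coverage: each additional class of sequence (modifiers, regional indicators, combining marks) needs its own rule, which is how the logic grows toward the size of lodash's regex.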

notoriousb1t commented 5 years ago

I think the first goal is to improve support for it, not necessarily to support all nuances of it.

shshaw commented 5 years ago

Yes. Goal wouldn't necessarily be complete support of all permutations, but widest support at the smallest file size. Looks like Lodash's regex could compress down to about 628 bytes (223 gzipped), so that's the goal to beat.

jhnsnc commented 5 years ago

:shipit:

shshaw commented 5 years ago

For further research: https://emojipedia.org/zero-width-joiner/

shshaw commented 5 years ago

https://emojipedia.org/emoji-zwj-sequences/

bastienrobert commented 4 years ago

I think this could be an interesting point about splitting characters for Unicode/emoji: https://stackoverflow.com/a/38901550/7355534

shshaw commented 4 years ago

Great reference! Thank you.

shshaw commented 4 years ago

Reference https://thekevinscott.com/emojis-in-javascript/

shshaw commented 3 years ago

Reference: https://github.com/davatron5000/Lettering.js/blob/a4c6b18c28ecc50675937b10e88328473dbb15ce/jquery.lettering.js#L34

ste-vg commented 1 month ago

This is fixed in version 1.1.0!

https://codepen.io/ste-vg/pen/QWRpGxj