/what-every-javascript-developer-should-know-about-unicode/

panzerdp commented 3 years ago

Written on 09/04/2016 16:26:14

URL: https://dmitripavlutin.com/what-every-javascript-developer-should-know-about-unicode/

panzerdp commented 3 years ago

Comment written by paulettom on 09/15/2016 08:30:49

Thank you so much for this 👍

panzerdp commented 3 years ago

Comment written by Dmitri Pavlutin on 09/15/2016 09:09:36

Thanks, enjoy the Unicode! 👽

panzerdp commented 3 years ago

Comment written by Tauque on 09/19/2016 07:19:30

Really thank you for this thorough article, it helps so much.
Your blog is a pleasure to read!
It is not very practical for an Asian developer to keep his code ASCII-only, though.

panzerdp commented 3 years ago

Comment written by Dmitri Pavlutin on 09/19/2016 11:11:08

Hello @tauque:disqus,
Indeed, Unicode solves almost everything related to non-ASCII characters. It just requires some level of understanding.
Thank you!

panzerdp commented 3 years ago

Comment written by Vadim on 09/21/2016 10:26:12

Brilliant article. Thank you.

PS: There is one little typo:
"[...smile] returns an array of symbols that omega string contains"
should be replcaed to:
"[...omega] returns an array of symbols that omega string contains"

panzerdp commented 3 years ago

Comment written by Dmitri Pavlutin on 09/21/2016 10:40:07

Thanks @disqus_AUv5qFwctk:disqus for catching the typo! Article updated.

panzerdp commented 3 years ago

Comment written by Binh Thanh Nguyen on 09/23/2016 09:04:36

Thanks, a really helpful article!

panzerdp commented 3 years ago

Comment written by Dmitri Pavlutin on 09/23/2016 15:02:07

Awesome :) I'm glad you find it helpful.

panzerdp commented 3 years ago

Comment written by Abhisek P. on 09/23/2016 21:42:53

I feel like a master of Unicode ;)

panzerdp commented 3 years ago

Comment written by Abhisek P. on 09/23/2016 21:46:59

> The **confusing** appears when developer thinks that strings are composed of graphemes (or symbols), ignoring the code unit sequence concept.

[confusing] -> confusion

> **Its** shorter than indicating the high-surrogate and low-surrogate pair

[Its] -> It's

panzerdp commented 3 years ago

Comment written by Dmitri Pavlutin on 09/24/2016 05:34:16

Thanks @abhisekp:disqus, the typos are fixed.

panzerdp commented 3 years ago

Comment written by Dmitri Pavlutin on 09/24/2016 05:34:29

That's great!

panzerdp commented 3 years ago

Comment written by Demian Uberti on 09/28/2016 12:56:13

you have a tipo there too
"replcaed" :P

panzerdp commented 3 years ago

Comment written by nevf on 09/28/2016 22:41:10

Thank you for this detailed and excellent article on understanding Unicode in JavaScript.

panzerdp commented 3 years ago

Comment written by Dmitri Pavlutin on 10/10/2016 13:23:21

That's great @nevf:disqus! I'm glad you like the article.

panzerdp commented 3 years ago

Comment written by Dmitri Pavlutin on 02/21/2017 17:46:37

Indeed, code point escape sequences are much easier to write, especially for characters outside the BMP.
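
A minimal sketch of the comparison, using U+1F607 as an arbitrary example of a character outside the BMP (the variable names are illustrative only):

```javascript
// U+1F607 lies outside the BMP, so UTF-16 needs two code units for it.
const viaCodePoint = '\u{1F607}';      // code point escape sequence (ES2015)
const viaSurrogates = '\uD83D\uDE07';  // explicit high- and low-surrogate pair

console.log(viaCodePoint === viaSurrogates); // => true
console.log(viaCodePoint.length);            // => 2 (code units, not symbols)
```

Both literals produce the same string; the code point form just avoids working out the surrogate pair by hand.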

panzerdp commented 3 years ago

Comment written by Антон Шарыгин on 03/04/2018 12:03:22

Dima, thank you for your articles! Accessible, clear, and with a touch of irony :) A deep bow to you.

panzerdp commented 3 years ago

Comment written by Ishan Test on 06/17/2018 10:25:29

Brilliant article, it really helped me understand the basics. Thank you very much.

panzerdp commented 3 years ago

Comment written by M Patil on 07/05/2018 10:12:25

Good 1.. Gr8

panzerdp commented 3 years ago

Comment written by Mithoon Kumar on 07/21/2018 14:10:41

Awesome article.

panzerdp commented 3 years ago

Comment written by Kostya Hmelnitski on 09/23/2018 19:03:11

//Get code point number using number = myString.codePointAt(index), then...

--Actually, codePointAt(index) is only partially aware of code points at U+10000 and above: it starts reading at an offset of 2*index bytes into the string, exactly as charCodeAt(index) does, no matter what comes before that position: ordinary two-byte characters or supplementary characters made of two surrogate halves.
So codePointAt() interprets "index" in a surprising way: not as the number of preceding code points, but as the number of preceding two-byte code units (characters and/or surrogate halves).
The index can therefore point at the low-surrogate code unit of an SMP character, and then the return value is that surrogate's own code in the range [\uDC00-\uDFFF]:

var omega = '\u{1D6C0}\u{1D6C0}\u{1D6C0}\u{1D6C0}!'; // => "𝛀𝛀𝛀𝛀!"
String.fromCodePoint(omega.codePointAt(0)) // => "𝛀"
String.fromCodePoint(omega.codePointAt(1)) // => "\udec0"
...
String.fromCodePoint(omega.codePointAt(6)) // => "𝛀"
String.fromCodePoint(omega.codePointAt(7)) // => "\udec0"
String.fromCodePoint(omega.codePointAt(8)) // => "!"
Compare:
[...omega][3] // => "𝛀", [...omega].join('+'): "𝛀+𝛀+𝛀+𝛀+!"
[...omega][4] // => "!"
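
One way to index by code point rather than by code unit, sketched with a hypothetical helper named codePointAtSymbol (it is not a built-in; it relies only on the string iterator, which yields whole code points):

```javascript
const omega = '\u{1D6C0}\u{1D6C0}\u{1D6C0}\u{1D6C0}!';

// Hypothetical helper: returns the code point of the n-th symbol,
// counting code points instead of UTF-16 code units.
function codePointAtSymbol(str, n) {
  let i = 0;
  for (const symbol of str) { // the string iterator yields whole code points
    if (i === n) return symbol.codePointAt(0);
    i++;
  }
  return undefined; // n is out of range
}

console.log(omega.length);                                      // => 9 (code units)
console.log(omega.codePointAt(1).toString(16));                 // => 'dec0', a lone low surrogate
console.log(codePointAtSymbol(omega, 1).toString(16));          // => '1d6c0'
console.log(String.fromCodePoint(codePointAtSymbol(omega, 4))); // => '!'
```

The helper walks the string from the start, so it is linear in n, and it counts code points, not graphemes: a base letter followed by a combining mark still counts as two.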

panzerdp commented 3 years ago

Comment written by tomwang on 09/30/2018 09:24:20

very nice!

panzerdp commented 3 years ago

Comment written by Eric Xu on 03/22/2019 16:28:46

Brilliant!

panzerdp commented 3 years ago

Comment written by Matt Langston on 06/17/2019 18:08:00

This was a very helpful article; thank you so much for taking the time to write it!

panzerdp commented 3 years ago

Comment written by Dmitri Pavlutin on 06/18/2019 09:04:11

Thanks @mattlangston:disqus!

panzerdp commented 3 years ago

Comment written by Jason Khanlar on 06/22/2019 05:17:58

This is my favorite web page!

panzerdp commented 3 years ago

Comment written by Dmitri Pavlutin on 07/05/2019 16:59:32

Thank you!

panzerdp commented 3 years ago

Comment written by Dmitri Pavlutin on 07/05/2019 17:00:02

Thanks!

panzerdp commented 3 years ago

Comment written by Dmitri Pavlutin on 07/05/2019 17:00:15

Thanks.

panzerdp commented 3 years ago

Comment written by Dmitri Pavlutin on 07/05/2019 17:00:50

Good to know! Thanks for describing this special case.

panzerdp commented 3 years ago

Comment written by Solomon Rutzky on 07/23/2019 06:58:59

Hi Dmitri. Nice post. Just a few notes:

1) Please replace all occurrences of "BPM" with "BMP".

2) Please replace all occurrences of "astral" with "supplementary" (a few might need to be removed). The term "astral" is never used in Unicode. It was suggested as a possible name back in 2000 before Unicode officially settled on "supplementary" in late September of 2000. Please see the following two archived list-serve posts:
a) from Kenneth Whistler, 2000-09-12
b) from Asmus Freytag, 2000-09-29

3) While \uD83D\uDE07 is probably easier, for completeness you could mention String.fromCharCode(0xD83D, 0xDE07)

4) For more info on Unicode escape sequences (though I think you covered everything for JavaScript), please visit: Unicode Escape Sequences Across Various Languages and Platforms (including Supplementary Characters)
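
As a sketch of point 3, the surrogate pair can be derived from the code point and passed to String.fromCharCode. The helper name toSurrogatePair is hypothetical; the arithmetic is the standard UTF-16 encoding of a supplementary code point:

```javascript
// Split a supplementary code point (U+10000..U+10FFFF) into its UTF-16
// surrogate pair. toSurrogatePair is a hypothetical helper name.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;    // a 20-bit value
  const high = 0xD800 + (offset >> 10);  // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F607);
console.log(high.toString(16), low.toString(16)); // prints: d83d de07

const fromUnits = String.fromCharCode(high, low); // built from code units
const fromPoint = String.fromCodePoint(0x1F607);  // built from the code point
console.log(fromUnits === fromPoint);             // => true
console.log(fromUnits === '\uD83D\uDE07');        // => true
```

String.fromCharCode takes code units, so a supplementary character needs both halves; String.fromCodePoint accepts the code point directly.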

panzerdp commented 3 years ago

Comment written by Dmitri Pavlutin on 07/23/2019 10:45:24

Thanks @solomonrutzky:disqus for the nice tips! I will update the post.

panzerdp commented 3 years ago

Comment written by Solomon Rutzky on 07/23/2019 13:37:11

You are welcome. Thanks for providing a comprehensive resource 😺. Also, I just noticed that the "latest version" info is now out of date. As of May, 2019 the most recent version is 12.1 which contains 137,929 characters. Please see the official announcement:

Unicode Version 12.1 released in support of the Reiwa Era

panzerdp commented 3 years ago

Comment written by S. Hristov on 08/24/2020 11:38:57

Nice article, thank you!
const str = 'cat\u{1F639}';
console.log(str); // => 'cat😹'
console.log(str.length); // => 5
When str string is rendered, it contains 4 symbols cat😹. However ?smile?.length

panzerdp commented 3 years ago

Comment written by case on 09/19/2020 10:47:27

Thank you for this thorough breakdown and explanation. I learnt a lot and managed to pinpoint where the seemingly weird behavior in my code came from. Now it's all clear and makes sense (within the strange boundaries of Unicode :-). I really appreciate that you took the time to write this page after reading through all that background material.

panzerdp commented 3 years ago

Comment written by Dmitri Pavlutin on 09/19/2020 12:46:36

That's great! I'm glad you learned a lot from the post.

panzerdp commented 3 years ago

Comment written by Ali Hasani on 10/05/2020 13:44:31

Very nice and helpful.
Thanks, friend!

panzerdp commented 3 years ago

Comment written by Dmitri Pavlutin on 10/05/2020 14:01:43

You're welcome amigo!

lavenderlens commented 3 years ago

Fantastic run-down Dmitri, many thanks! Under your sub-heading "Length and surrogate pairs" the variable name in the code box is str so should smile.length in the following be changed to str.length?

> When str string is rendered, it contains 4 symbols cat😹. However smile.length evaluates to 5, because U+1F639 is an astral code point encoded with 2 code units (a surrogate pair).

azakharo commented 2 years ago

A fundamental piece of work! Thank you!

WangNingning1994 commented 2 years ago

Great post. I have one question about 'astral code points require 21 bits to save the information'. Take U+1F600 for example: if I convert it to binary, I get 11111011000000000, which is only 17 bits long.
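
A quick way to check the arithmetic, as a sketch (bitLength is a hypothetical helper name). U+1F600 does fit in 17 bits; 21 bits is what the largest code point, U+10FFFF, needs, which is presumably what the quoted sentence refers to:

```javascript
// Number of significant bits in a code point's binary form.
// bitLength is a hypothetical helper name.
const bitLength = (codePoint) => codePoint.toString(2).length;

console.log((0x1F600).toString(2)); // => '11111011000000000'
console.log(bitLength(0x1F600));    // => 17
console.log(bitLength(0xFFFF));     // => 16, the largest BMP code point
console.log(bitLength(0x10FFFF));   // => 21, the largest Unicode code point
```

So a fixed-width field that can hold any code point has to be 21 bits wide, even though many supplementary code points need fewer significant bits.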

heeringa0 commented 2 years ago

Thanks! Your explanation solved the problems I got when comparing strings.

panzerdp commented 2 years ago

> Thanks! Your explanation solved the problems I got when comparing strings.

That's great @heeringa0!

aderchox commented 1 year ago

This is simply amazing. One quick question though. Let's say a Node.js buffer contains some UTF-8 bytes, and we decode it using buf.toString(). (According to the Node.js documentation, it decodes from UTF-8 by default: [buf.toString on Node.js docs](https://nodejs.org/api/buffer.html#buftostringencoding-start-end).) Now the question is: what will be the encoding of the decoded result? Based on your article, I think it must become UTF-16, but I'm not sure.
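
A small sketch that may help with this question, assuming it runs under Node.js (Buffer is a Node.js global). The decoded result is an ordinary JavaScript string, and JavaScript strings expose their content as UTF-16 code units, whatever encoding the source bytes used:

```javascript
// UTF-8 bytes of 'cat' followed by U+1F639 (4 bytes in UTF-8).
const buf = Buffer.from([0x63, 0x61, 0x74, 0xF0, 0x9F, 0x98, 0xB9]);
const decoded = buf.toString(); // decodes as UTF-8 by default

console.log(buf.length);                         // => 7 (bytes)
console.log(decoded.length);                     // => 5 (UTF-16 code units)
console.log(decoded === 'cat\u{1F639}');         // => true
console.log(decoded.charCodeAt(3).toString(16)); // => 'd83d', a high surrogate
```

How an engine stores the string internally is an implementation detail; what the language guarantees is the code-unit view seen through length, charCodeAt() and indexing.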

binderup commented 1 year ago

Cool - I have been "fighting" with a script where I want to change accented filenames into web-friendly filenames. The Danish letters, except Å, were never translated correctly. I fed your description of that letter into ChatGPT and boom chatgpt implemented that into a codesnippet that worked.
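
The snippet itself is not shown in this thread, so the following is only a sketch of one possible approach. Å decomposes into A plus a combining ring under NFD, while Æ and Ø have no decomposition and need an explicit table. The function name toWebFriendly and the replacement spellings are assumptions:

```javascript
// Hypothetical replacement table: the target spellings are an assumption,
// pick whatever convention the site needs.
const DANISH = { 'æ': 'ae', 'ø': 'oe', 'å': 'aa', 'Æ': 'Ae', 'Ø': 'Oe', 'Å': 'Aa' };

// Hypothetical helper, not taken from the thread.
function toWebFriendly(name) {
  return name
    .normalize('NFC')                         // compose e.g. 'A' + U+030A into 'Å'
    .replace(/[æøåÆØÅ]/g, (ch) => DANISH[ch]) // letters handled explicitly
    .normalize('NFD')                         // decompose remaining accented letters
    .replace(/[\u0300-\u036f]/g, '')          // drop combining diacritical marks
    .replace(/[^A-Za-z0-9._-]+/g, '-');       // anything else becomes '-'
}

console.log(toWebFriendly('Blåbærgrød.txt'));       // => 'Blaabaergroed.txt'
console.log(toWebFriendly('A\u030Arhus café.jpg')); // => 'Aarhus-cafe.jpg'
```

Normalizing to NFC first means the table matches both the precomposed and the decomposed spelling of Å.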

panzerdp commented 1 year ago

> I fed your description of that letter into ChatGPT and boom chatgpt implemented that into a codesnippet that worked.

Cool!

rhasan082 commented 1 year ago

Hi, thanks for such an amazing write-up. I am confused about one thing, though: the Chrome (Version 114.0.5735.133 (Official Build) (arm64)) console returns 1 for console.log('é'.length), whereas I was expecting 2 according to your doc. Any idea?
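
One possible explanation, offered as an assumption since the thread does not show how the character was typed: é can be written either as the single precomposed code point U+00E9 or as e followed by U+0301 COMBINING ACUTE ACCENT. A keyboard usually produces the precomposed form. A sketch of the difference:

```javascript
const precomposed = '\u00E9'; // é as a single code point
const decomposed = 'e\u0301'; // e + COMBINING ACUTE ACCENT, renders the same

console.log(precomposed.length);                          // => 1
console.log(decomposed.length);                           // => 2
console.log(precomposed === decomposed);                  // => false
console.log(precomposed === decomposed.normalize('NFC')); // => true
console.log(precomposed.normalize('NFD').length);         // => 2
```

Normalizing both strings to the same form (NFC or NFD) before comparing or measuring them removes the discrepancy.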
