Comment written by paulettom on 09/15/2016 08:30:49
Thank you so much for this 👍
Comment written by Dmitri Pavlutin on 09/15/2016 09:09:36
Thanks, enjoy the Unicode! 👽
Comment written by Tauque on 09/19/2016 07:19:30
Really thank you for this thorough article, it helps so much.
Your blog is a pleasure to read!
It is not very practical for an Asian developer to keep his code ASCII-only, though.
Comment written by Dmitri Pavlutin on 09/19/2016 11:11:08
Hello @tauque:disqus,
Indeed Unicode solves almost everything related to extra-ASCII chars. It just requires some level of understanding.
Thank you!
Comment written by Vadim on 09/21/2016 10:26:12
Brilliant article. Thank you.
PS: There is one little typo:
"[...smile] returns an array of symbols that omega string contains"
should be replcaed to:
"[...omega] returns an array of symbols that omega string contains"
Comment written by Dmitri Pavlutin on 09/21/2016 10:40:07
Thanks @disqus_AUv5qFwctk:disqus for catching the typo! Article updated.
Comment written by Binh Thanh Nguyen on 09/23/2016 09:04:36
Thanks, a really helpful article!
Comment written by Dmitri Pavlutin on 09/23/2016 15:02:07
Awesome :) I'm glad you find it helpful.
Comment written by Abhisek P. on 09/23/2016 21:42:53
I feel like a master of Unicode ;)
Comment written by Abhisek P. on 09/23/2016 21:46:59
> The **confusing** appears when developer thinks that strings are composed of graphemes (or symbols), ignoring the code unit sequence concept.
[confusing] -> confusion
> **Its** shorter than indicating the high-surrogate and low-surrogate pair
[Its] -> It's
Comment written by Dmitri Pavlutin on 09/24/2016 05:34:16
Thanks @abhisekp:disqus, the typos fixed.
Comment written by Dmitri Pavlutin on 09/24/2016 05:34:29
That's great!
Comment written by Demian Uberti on 09/28/2016 12:56:13
you have a tipo there too
"replcaed" :P
Comment written by nevf on 09/28/2016 22:41:10
Thank you for this detailed and excellent article on understanding Unicode in JavaScript.
Comment written by Dmitri Pavlutin on 10/10/2016 13:23:21
That's great @nevf:disqus! I'm glad you like the article.
Comment written by Dmitri Pavlutin on 02/21/2017 17:46:37
Indeed Code point escape sequences are much easier to write, especially for characters outside BMP.
Comment written by Антон Шарыгин on 03/04/2018 12:03:22
Dima, thank you for your articles! Accessible, clear, and with a touch of irony :) A deep bow to you.
Comment written by Ishan Test on 06/17/2018 10:25:29
Brilliant article, helped really well to understand basics. Thank you very much
Comment written by M Patil on 07/05/2018 10:12:25
Good 1.. Gr8
Comment written by Mithoon Kumar on 07/21/2018 14:10:41
Awesome article.
Comment written by Kostya Hmelnitski on 09/23/2018 19:03:11
//Get code point number using number = myString.codePointAt(index), then...
--Actually, codePointAt(index) is only partially aware of code points at U+10000 and above: it starts reading at a byte offset of (2*index) into the string, exactly like charCodeAt(index) does, no matter what comes before it: ordinary two-byte characters or supplementary characters made of two surrogate halves.
So codePointAt() treats "index" as the number of preceding two-byte code units (surrogate halves included), not as the number of preceding code points.
It can therefore point to the low-surrogate code unit of an SMP character, in which case you get its [\uDC00-\uDFFF] code as the return value:
var omega = '\u{1D6C0}\u{1D6C0}\u{1D6C0}\u{1D6C0}!'; // => "𝛀𝛀𝛀𝛀!"
String.fromCodePoint(omega.codePointAt(0)) // => "𝛀"
String.fromCodePoint(omega.codePointAt(1)) // => "\udec0"
...
String.fromCodePoint(omega.codePointAt(6)) // => "𝛀"
String.fromCodePoint(omega.codePointAt(7)) // => "\udec0"
String.fromCodePoint(omega.codePointAt(8)) // => "!"
Compare:
[...omega][3] // => "𝛀", [...omega].join('+'): "𝛀+𝛀+𝛀+𝛀+!"
[...omega][4] // => "!"
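A minimal sketch of the difference described above, assuming you want to address a string by symbol position rather than by code unit position. The helper name codePointAtSymbol is hypothetical and not part of the article or of any built-in API; it relies only on the string iterator, which yields whole code points.

```javascript
// Hypothetical helper (not a built-in): returns the code point of the n-th
// symbol, counting code points rather than UTF-16 code units.
function codePointAtSymbol(str, n) {
  let i = 0;
  for (const symbol of str) {        // the string iterator yields whole code points
    if (i === n) return symbol.codePointAt(0);
    i++;
  }
  return undefined;                  // n is past the end of the string
}

const omega = '\u{1D6C0}\u{1D6C0}\u{1D6C0}\u{1D6C0}!';

console.log(omega.length);                                       // => 9 (code units)
console.log(omega.codePointAt(1).toString(16));                  // => 'dec0' (lone low surrogate)
console.log(codePointAtSymbol(omega, 1).toString(16));           // => '1d6c0'
console.log(String.fromCodePoint(codePointAtSymbol(omega, 4)));  // => '!'
```

The loop walks the string from the start, so it is linear in n; for repeated lookups, spreading once with [...omega] and indexing the resulting array may be simpler.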
Comment written by tomwang on 09/30/2018 09:24:20
very nice!
Comment written by Eric Xu on 03/22/2019 16:28:46
Brilliant!
Comment written by Matt Langston on 06/17/2019 18:08:00
This was a very helpful article; thank you so much for taking the time to write it!
Comment written by Dmitri Pavlutin on 06/18/2019 09:04:11
Thanks @mattlangston:disqus!
Comment written by Jason Khanlar on 06/22/2019 05:17:58
This is my favorite web page!
Comment written by Dmitri Pavlutin on 07/05/2019 16:59:32
Thank you!
Comment written by Dmitri Pavlutin on 07/05/2019 17:00:02
Thanks!
Comment written by Dmitri Pavlutin on 07/05/2019 17:00:15
Thanks.
Comment written by Dmitri Pavlutin on 07/05/2019 17:00:50
Good to know! Thanks for describing this special case.
Comment written by Solomon Rutzky on 07/23/2019 06:58:59
Hi Dmitri. Nice post. Just a few notes:
1) Please replace all occurrences of "BPM" with "BMP".
2) Please replace all occurrences of "astral" with "supplementary" (a few might need to be removed). The term "astral" is never used in Unicode. It was suggested as a possible name back in 2000 before Unicode officially settled on "supplementary" in late September of 2000. Please see the following two archived list-serve posts:
a) from Kenneth Whistler, 2000-09-12
b) from Asmus Freytag, 2000-09-29
3) While \uD83D\uDE07 is probably easier, for completeness you could mention String.fromCharCode(0xD83D, 0xDE07)
4) For more info on Unicode escape sequences (though I think you covered everything for JavaScript), please visit: Unicode Escape Sequences Across Various Languages and Platforms (including Supplementary Characters)
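For point 3, a short sketch showing that the different spellings produce the same string. The variable names are illustrative only; String.fromCharCode and String.fromCodePoint are standard built-ins.

```javascript
// Four ways to write U+1F607; all yield the same two-code-unit string.
const viaSurrogatePair = '\uD83D\uDE07';                       // explicit surrogate pair
const viaCodePointEscape = '\u{1F607}';                        // code point escape (ES2015)
const viaFromCharCode = String.fromCharCode(0xD83D, 0xDE07);   // takes code units
const viaFromCodePoint = String.fromCodePoint(0x1F607);        // takes a code point

console.log(viaSurrogatePair === viaCodePointEscape);  // => true
console.log(viaSurrogatePair === viaFromCharCode);     // => true
console.log(viaSurrogatePair === viaFromCodePoint);    // => true
console.log(viaFromCharCode.length);                   // => 2
```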
Comment written by Dmitri Pavlutin on 07/23/2019 10:45:24
Thanks @solomonrutzky:disqus for the nice tips! I will update the post.
Comment written by Solomon Rutzky on 07/23/2019 13:37:11
You are welcome. Thanks for providing a comprehensive resource 😺. Also, I just noticed that the "latest version" info is now out of date. As of May, 2019 the most recent version is 12.1 which contains 137,929 characters. Please see the official announcement:
Comment written by S. Hristov on 08/24/2020 11:38:57
Nice article, thank you!
const str = 'cat\u{1F639}';
console.log(str); // => 'cat😹'
console.log(str.length); // => 5
When str string is rendered, it contains 4 symbols cat😹. However ?smile?.length
Comment written by case on 09/19/2020 10:47:27
Thank you for this thorough breakdown and explanation. I learnt a lot and managed to pinpoint where the seemingly weird behavior in my code came from. Now it's all clear and makes sense (within the strange boundaries of unicode :-). Really appreciate you took the time to write this page after reading through all that background material.
Comment written by Dmitri Pavlutin on 09/19/2020 12:46:36
That's great! I'm glad you learned a lot from the post.
Comment written by Ali Hasani on 10/05/2020 13:44:31
Very nice and helpful.
Thanks friend !
Comment written by Dmitri Pavlutin on 10/05/2020 14:01:43
You're welcome amigo!
Fantastic run-down Dmitri, many thanks!
Under your sub-heading "Length and surrogate pairs" the variable name in the code box is str, so should smile.length in the following be changed to str.length?
When str string is rendered, it contains 4 symbols cat😹. However smile.length evaluates to 5, because U+1F639 is an astral code point encoded with 2 code units (a surrogate pair).
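The quoted sentence can be checked directly. This sketch uses the variable name str, as the comment suggests; counting symbols with the spread operator is one common approach, not necessarily the one the article settles on.

```javascript
const str = 'cat\u{1F639}';

console.log(str.length);        // => 5 (UTF-16 code units)
console.log([...str].length);   // => 4 (code points, via the string iterator)

// The last symbol occupies two code units: a high and a low surrogate.
console.log(str.charCodeAt(3).toString(16));  // => 'd83d'
console.log(str.charCodeAt(4).toString(16));  // => 'de39'
```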
Fundamental! Thank you!
Great post. I have one question on 'astral code points require 21 bits to save the information', take U+1F600 for example, if I transform it into binary format, it would be 11111011000000000, and the length is 17
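One way to read the 21-bit figure, offered as an interpretation rather than the author's own answer: it is the width needed for the largest Unicode code point, U+10FFFF, so smaller supplementary code points such as U+1F600 fit in fewer bits. The helper bitLength is a hypothetical name used only for this check.

```javascript
// Hypothetical helper: number of binary digits needed to write a code point.
const bitLength = (codePoint) => codePoint.toString(2).length;

console.log(bitLength(0x1F600));   // => 17
console.log(bitLength(0xFFFF));    // => 16 (largest BMP code point)
console.log(bitLength(0x10000));   // => 17 (first supplementary code point)
console.log(bitLength(0x10FFFF));  // => 21 (largest Unicode code point)
```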
Thanks! Your explanation solved the problems I got when comparing strings.
That's great @heeringa0!
This is simply amazing. One quick question though. Let's say a Node.js buffer contains some UTF-8 bytes, then we decode it using buf.toString(). (Based on the Node.js documentation, it decodes from UTF-8 by default: [buf.toString on Node.js docs](https://nodejs.org/api/buffer.html#buftostringencoding-start-end).) Now the question is, what will be the decoded result's encoding? Based on your article, I think it must become UTF-16? But I'm not sure.
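A small sketch that bears on this question, assuming a Node.js environment where Buffer is available as a global. The result of buf.toString() is an ordinary JavaScript string, and the language exposes strings as sequences of 16-bit code units, so its length is counted in UTF-16 code units regardless of the encoding of the source bytes. How an engine stores the string internally is a separate matter that this example does not show.

```javascript
// Assumes Node.js (Buffer is a Node global, not part of the JavaScript language).
const buf = Buffer.from([0xF0, 0x9F, 0x98, 0xB9]);  // UTF-8 bytes of U+1F639
const decoded = buf.toString();                      // decodes as UTF-8 by default

console.log(buf.length);                   // => 4 (bytes of UTF-8)
console.log(decoded.length);               // => 2 (UTF-16 code units)
console.log(decoded === '\u{1F639}');      // => true
console.log(Buffer.from(decoded, 'utf16le').length);  // => 4 (bytes when re-encoded as UTF-16)
```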
Cool - I have been "fighting" with a script where I want to change accented filenames into web-friendly filenames. The Danish letters except Å was never translated correctly. I fed your description of that letter into ChatGPT and boom, ChatGPT implemented that into a code snippet that worked.
Cool!
Hi, thanks for such an amazing writeup. But I am confused about one thing. The Chrome console (Version 114.0.5735.133 (Official Build) (arm64)) returns 1 for console.log('é'.length), whereas I was expecting 2 according to your doc. Any idea?
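One likely explanation, stated as an assumption since the exact characters typed into the console are not shown: the same visible symbol can be a single precomposed code point (U+00E9) or a base letter followed by a combining mark (U+0065 U+0301). Keyboards usually produce the precomposed form, which has length 1.

```javascript
const precomposed = '\u00E9';   // é as one code point
const decomposed = 'e\u0301';   // e followed by COMBINING ACUTE ACCENT

console.log(precomposed.length);                            // => 1
console.log(decomposed.length);                             // => 2
console.log(precomposed === decomposed);                    // => false
console.log(precomposed === decomposed.normalize('NFC'));   // => true
console.log(precomposed.normalize('NFD').length);           // => 2
```

Normalizing both strings to the same form before comparing or measuring them avoids this kind of surprise.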
Written on 09/04/2016 16:26:14
URL: https://dmitripavlutin.com/what-every-javascript-developer-should-know-about-unicode/