untunt / kanbunHTML

A kanbun-kundoku (漢文訓読) HTML display solution
https://phesoca.com/kanbun-html/
GNU Affero General Public License v3.0

Fix: non-BMP kanji not handled properly #3

Closed syimyuzya closed 2 years ago

syimyuzya commented 2 years ago

str.split('') does NOT split the string by Unicode code points but by UTF-16 code units, so each non-BMP kanji is broken into the two halves of its surrogate pair.

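The correct way to turn a string into an array of code points is Array.from(str) or [...str]. (Calling str.split() with no argument returns a one-element array containing the whole string.)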
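A minimal sketch of the difference, for illustration only; the sample string and variable names are assumptions and are not taken from the kanbunHTML source. U+20BB7 is used as the non-BMP kanji.

```javascript
// U+20BB7 lies outside the BMP, so UTF-16 stores it as two code units
// (the surrogate pair D842 DFB7). The other two kanji are in the BMP.
const text = "\u{20BB7}野家";

// split('') cuts between UTF-16 code units, tearing the pair apart.
const byCodeUnit = text.split("");

// The string iterator used by spread and Array.from walks code points.
const byCodePoint = [...text];
const viaArrayFrom = Array.from(text);

console.log(text.length);         // 4 (code units)
console.log(byCodeUnit.length);   // 4
console.log(byCodePoint.length);  // 3
console.log(viaArrayFrom.length); // 3
```

Note that both approaches iterate by code point, not by grapheme cluster, so combining sequences such as a kanji followed by a variation selector would still be separated.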