jsha / blocktogether

Share your blocks and subscribe to others'
GNU General Public License v3.0

Handle non-ASCII in display names #66

Closed jsha closed 10 years ago

jsha commented 10 years ago

Right now they are transformed to ???. Probably an encoding issue - i.e. not properly marking output encoding as UTF-8.

BooDoo commented 10 years ago

When testing this, my build is just dropping astral characters when writing to the database: [image: lyrpic-emoji] On the left is the display in My Blocks, which is also what's stored in the TwitterUsers table; on the right is the display name on the account's profile page.

You have emoji in the db, but they render as ??? when displayed in My Blocks/Show Blocks?

jsha commented 10 years ago

I hadn't dug into it yet. My first guess was output encoding, since I assumed both node-twitter-api and Sequelize handled encoding correctly. But taking a second look, I see that I have all ??? in the DB, even for the non-astral Japanese characters. Which leads me to think that some component is downconverting to ASCII. See for example https://blocktogether.org/show-blocks/f9ecc402d2e5f5d29199364b91509696b89e539c1e5152a685ca5a01dac752cf22ecdd24f8c100d6e7137968223750da.

On the prod instance and my local dev machine, LC_ALL is set to en_US.UTF-8.

BooDoo commented 10 years ago

In prod it looks like the astral characters are dropped entirely (A??Z???? = A[soccer ball][airplane]Z[de][shi][yo][u]) and the non-ASCII characters are downconverted.
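To make the mapping above concrete, here is a minimal sketch that reproduces the `A??Z????` result. It only simulates a lossy conversion to a single-byte charset; the function name `downconvertToLatin1` is hypothetical, and this is not the code path MySQL or Sequelize actually runs.

```javascript
// Simulation only: mimics a lossy conversion to a single-byte charset
// (latin1 here) by replacing every code point above U+00FF with '?'.
// This is an illustration, not MySQL's actual conversion code.
function downconvertToLatin1(text) {
  // Array.from iterates by code point, so an astral character
  // (one code point, two UTF-16 code units) becomes a single '?'.
  return Array.from(text, (ch) => (ch.codePointAt(0) <= 0xff ? ch : '?')).join('');
}

// 'A', soccer ball (U+26BD), airplane (U+2708), 'Z', then four hiragana.
const displayName = 'A\u26BD\u2708Z\u3067\u3057\u3087\u3046';
console.log(downconvertToLatin1(displayName)); // A??Z????
```

In this sketch every character outside the target charset turns into one `?`, which matches what the stored name looks like.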

Storing emoji or other non-BMP characters in MySQL may require utf8mb4 or utf16 where needed, instead of just utf8, and it looks like Sequelize may have spotty handling of these charsets without some futzing [see: sequelize/sequelize#1220]
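For context on why the charset choice matters: MySQL's `utf8` stores at most three bytes per character, while `utf8mb4` allows four. The sketch below (the helper name `describeCodePoints` is made up for illustration) reports how many UTF-8 bytes each code point needs, so you can see which characters would fit in three-byte `utf8` and which need `utf8mb4`.

```javascript
// Reports, for each code point in a string, how many bytes UTF-8 needs
// and whether the code point lies outside the BMP (is "astral").
function describeCodePoints(text) {
  return Array.from(text, (ch) => ({
    codePoint: 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'),
    utf8Bytes: Buffer.byteLength(ch, 'utf8'),
    astral: ch.codePointAt(0) > 0xffff,
  }));
}

// Hiragana "de" (U+3067), soccer ball (U+26BD), grinning face (U+1F600).
console.log(describeCodePoints('\u3067\u26BD\u{1F600}'));
```

Japanese kana and kanji in the BMP take three bytes, so they fit in plain `utf8`; only code points above U+FFFF need the fourth byte.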

High investment, very sensitive to platform, low payoff? Shelved imo.

jsha commented 10 years ago

Agreed. Closing for now, can spelunk and reopen if necessary. Thanks for helping evaluate!

jsha commented 10 years ago

Reopening as 'support non-ASCII,' since this heavily affects Japanese users.

I realized it may be as simple as this:

show create database blocktogether;
CREATE DATABASE blocktogether /*!40100 DEFAULT CHARACTER SET latin1 */

So I think I need to change the default character set on my DBs.
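A rough sketch of what that change might involve, written as a small Node helper that only builds the SQL strings. The helper name `charsetMigrationSql`, the choice of `utf8mb4`, and the `utf8mb4_unicode_ci` collation are assumptions for illustration, not something decided in this thread; `TwitterUsers` is the table mentioned earlier.

```javascript
// Hypothetical helper: builds the statements one might run to move a
// MySQL database and its existing tables off latin1. It only returns
// strings; nothing is executed against a database here.
function charsetMigrationSql(database, tables, charset = 'utf8mb4', collation = 'utf8mb4_unicode_ci') {
  // Quote identifiers with backticks, doubling any embedded backtick.
  const q = (name) => '`' + name.replace(/`/g, '``') + '`';
  return [
    // Changes the default used for tables created from now on.
    `ALTER DATABASE ${q(database)} CHARACTER SET ${charset} COLLATE ${collation};`,
    // Existing tables keep their old charset until converted explicitly.
    ...tables.map((t) => `ALTER TABLE ${q(t)} CONVERT TO CHARACTER SET ${charset} COLLATE ${collation};`),
  ];
}

console.log(charsetMigrationSql('blocktogether', ['TwitterUsers']).join('\n'));
```

Two caveats under these assumptions: names already stored as `?` cannot be recovered by converting the table and would have to be fetched again, and the client connection charset also has to match or new writes may still be downconverted.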

BooDoo commented 10 years ago

That seems likely/easy! I'm using mariadb (on Arch) and it looks like my databases defaulted to utf8 character set with no intervention, so this hadn't even come up in my environment.