montera34 / pageonex

PageOneX. Analyzing front pages
http://pageonex.com
GNU Affero General Public License v3.0

Wrong encoding in the csv that lists the newspapers in dev + production #120

Closed numeroteca closed 11 years ago

numeroteca commented 11 years ago

Characters in newspaper names like "Egypt - Al-Tahrir - التحري" are not displayed properly; the name shows up as "Egypt - Al-Tahrir - ??????". Maybe the wrong encoding is used when writing from the csv into the mysql database?
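To illustrate the suspected cause, here is a minimal Python sketch of what happens when a name is forced into latin1. It only simulates the effect; the helper `store_as_latin1` is hypothetical, and it assumes the database substitutes "?" for every character that latin1 cannot represent.

```python
# Simulate writing a UTF-8 newspaper name into a latin1 column.
# Assumption: unrepresentable characters are replaced with "?",
# which matches the symptom reported above.
name = "Egypt - Al-Tahrir - التحري"


def store_as_latin1(text: str) -> str:
    """Round-trip text through latin1, replacing characters it cannot hold."""
    return text.encode("latin-1", errors="replace").decode("latin-1")


stored = store_as_latin1(name)
print(stored)  # the Latin part survives, the Arabic part becomes question marks

# Names that only use Latin characters pass through unchanged.
print(store_as_latin1("El País"))
```

Once the characters have been replaced this way the original text cannot be recovered from the database, so the names have to be copied again from the csv.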

elplatt commented 11 years ago

For some reason the mysql databases were created with latin1 encoding even though config/database.yml specifies utf8. I converted the media table to utf8 and recopied the descriptions from the csv. The media names are correct now, but the rest of the db is still in latin1, so thread names etc. can only contain Latin characters.

These posts were helpful:

http://stackoverflow.com/questions/1049728/how-do-i-see-what-character-set-a-database-table-column-is-in-mysql
http://docs.moodle.org/23/en/Converting_your_MySQL_database_to_UTF8
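For reference, a rough sketch of the kind of statements involved in checking and converting a single table, written as Python helpers that only build the SQL strings. This is not necessarily what was run here: the helper names and the schema name `pageonex_dev` are hypothetical, and the table and schema names are assumed to be trusted values rather than user input.

```python
def charset_check_sql(schema: str, table: str) -> str:
    """Build a query listing the character set and collation of each column."""
    return (
        "SELECT column_name, character_set_name, collation_name "
        "FROM information_schema.columns "
        f"WHERE table_schema = '{schema}' AND table_name = '{table}';"
    )


def convert_table_sql(table: str, charset: str = "utf8") -> str:
    """Build a statement converting a table and its text columns to charset."""
    return f"ALTER TABLE `{table}` CONVERT TO CHARACTER SET {charset};"


# "pageonex_dev" is a made-up schema name used only for illustration.
print(charset_check_sql("pageonex_dev", "media"))
print(convert_table_sql("media"))
```

The first statement shows which columns are still latin1; the second converts one table, after which the affected rows still need to be reloaded from the csv.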

numeroteca commented 11 years ago

Should we convert the entire database to utf8 now that we're fresh and young?

elplatt commented 11 years ago

That's a good question. Here are the options I see:

  1. Change the encoding by hand. This will corrupt any text (other than media) containing non-ASCII characters, and will be difficult to fix without rolling back to an old version of the database.
  2. Rename all tables (e.g. image -> image_latin1), recreate all tables as utf8, and write a script to copy data over. This seems like a safe option, and should work just as well later.
  3. Wait until we have import/export, then export data, recreate the database and re-import.

I'm leaning towards 2, especially if non-latin characters aren't needed immediately.
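A minimal sketch of what option 2 could look like for one table, again as a Python helper that only generates the SQL. The function name `migration_sql` is hypothetical, and the sketch assumes the existing rows are valid latin1 so that MySQL can convert them while copying; it has not been run against the PageOneX schema.

```python
from typing import List


def migration_sql(table: str, charset: str = "utf8") -> List[str]:
    """Build the statements for a rename / recreate / copy migration."""
    backup = f"{table}_latin1"
    return [
        # Keep the original data under a new name as a fallback.
        f"RENAME TABLE `{table}` TO `{backup}`;",
        # Recreate an empty table with the same columns and indexes.
        f"CREATE TABLE `{table}` LIKE `{backup}`;",
        # Switch the empty copy to the target character set.
        f"ALTER TABLE `{table}` CONVERT TO CHARACTER SET {charset};",
        # Copy the rows over from the old table into the new one.
        f"INSERT INTO `{table}` SELECT * FROM `{backup}`;",
    ]


for statement in migration_sql("image"):
    print(statement)
```

Because the `_latin1` tables stay in place, the copy can be checked and repeated before the old tables are dropped.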

numeroteca commented 11 years ago

Agreed: option 2 looks like the safe choice, as long as we have a backup at hand.