tetsuo13 / MediaWiki-to-DokuWiki-Importer

Convert MediaWiki to DokuWiki

Cyrillic titles are converted to "???????" #26

Closed YSmetana closed 11 years ago

YSmetana commented 11 years ago

Hi!

All Cyrillic characters in the titles are converted to question marks:

Processing ????????????_???????_NAS_(Openfiler)... 
Processing ????????????_???????_???????????_????????_(Proxmox)... 
Processing ????????????_???????_???????????????... 

My database collation is "utf8_general_ci". All tables are "utf8_general_ci".

My dirty hack was (in Environment.php):

$mediaWikiSettings = new MediaWiki2DokuWiki_MediaWiki_Settings($settings['mediawiki_localsettings_file']);
$db = $mediaWikiSettings->dbConnect();
// Added by me:
$db->exec("set names utf8");

According to this: http://stackoverflow.com/questions/4361459/php-pdo-charset-set-names?answertab=votes#tab-top .
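A possible alternative to issuing SET NAMES by hand, sketched under assumptions: since PHP 5.3.6 the pdo_mysql DSN accepts a charset parameter, so the connection character set can be requested when the connection is opened. The helper names buildMysqlDsn and connectUtf8, and the host and database values, are hypothetical and not part of the importer.

```php
<?php
// Hypothetical helper (not part of MediaWiki2DokuWiki): builds a pdo_mysql
// DSN that requests the given connection character set.
function buildMysqlDsn(string $host, string $dbName, string $charset = 'utf8'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbName, $charset);
}

// Hypothetical helper: opens a connection whose charset is fixed up front,
// so no separate "SET NAMES" query is needed afterwards.
function connectUtf8(string $host, string $dbName, string $user, string $password): PDO
{
    return new PDO(buildMysqlDsn($host, $dbName), $user, $password, [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]);
}

// Placeholder values for illustration only.
echo buildMysqlDsn('localhost', 'wikidb'), PHP_EOL;
// prints: mysql:host=localhost;dbname=wikidb;charset=utf8
```

On PHP versions older than 5.3.6 the charset part of the DSN is ignored, so the explicit "set names utf8" query above remains the safer choice there.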

tetsuo13 commented 11 years ago

I had to do a bit of research on this one to see where the disconnect lies.

According to the MediaWiki manual entry for the $wgDBTableOptions configuration option, MediaWiki writes its data in UTF-8 encoding by default, and all recent versions of DokuWiki do the same. I created a page in MediaWiki on Google's philosophy, using Cyrillic for the page title and text.

When I ran MediaWiki2DokuWiki from the command line the page shows up as:

Processing Десять_базовых_принципов_Google...

This was after checking that PuTTY was using UTF-8 encoding. Running it under the default encoding of ISO-8859-1:1998 (Latin-1, West Europe) produced the following output from the conversion process:

Processing ÐеÑÑÑÑ_базовÑÑ_пÑинÑипов_Google...

Regardless of which encoding PuTTY used, when I checked DokuWiki afterward the page displayed correctly.

So I'm not sure where the problem may be. Did the characters of the page in DokuWiki display correctly?

YSmetana commented 11 years ago

No, DokuWiki shows the same "????_??????".

This is a common problem in PHP. In other self-made projects I often have to set the default charset right after the MySQL connection is established.

Probably it depends on the MySQL collation settings, which are not UTF-8 by default. Some people recommend changing the MySQL settings to:

[mysqld]
init_connect='SET collation_connection = utf8_unicode_ci'
character-set-server = utf8
collation-server = utf8_unicode_ci

[client]
default-character-set = utf8

But in my case (Ubuntu, standard MySQL settings from ports) I did not change the settings; instead I adjusted the PHP script to work correctly (as described in my first post).


In your case:

Processing ÐеÑÑÑÑ_базовÑÑ_пÑинÑипов_Google...

it looks like the database collation was correctly UTF-8. It is just a console encoding problem.

But a standard MySQL install assumes that you send Latin-1 rather than UTF-8 queries.
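The two symptoms in this thread can be reproduced without a database. The sketch below is only an illustration, assuming the mbstring extension is available and the file is saved as UTF-8. It does not use the importer's code, and the link to MySQL's behaviour is an assumption: a Latin-1 connection presumably triggers the same kind of lossy conversion on the server side.

```php
<?php
// Requires the mbstring extension; this file is assumed to be saved as UTF-8.
mb_substitute_character(0x3F); // use "?" for characters that cannot be converted

$title = 'Десять';

// Symptom 1 (console): the UTF-8 bytes are correct but read as Latin-1.
$mojibake = mb_convert_encoding($title, 'UTF-8', 'ISO-8859-1');

// The damage is only cosmetic: mapping back restores the original bytes.
$restored = mb_convert_encoding($mojibake, 'ISO-8859-1', 'UTF-8');

// Symptom 2 (connection): text is converted into Latin-1, which has no Cyrillic.
$lossy = mb_convert_encoding($title, 'ISO-8859-1', 'UTF-8');

var_dump($restored === $title); // bool(true)
var_dump($lossy);               // string(6) "??????"
```

Garbled Latin-looking output therefore still carries the original data, while question marks mean the characters were already lost before they reached the script.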

Tnx.

tetsuo13 commented 11 years ago

Interesting. I see no reason why forcing the UTF-8 character set would be a bad thing: for most users it will have no effect, and for some it will fix issues such as this one. I noticed no difference in my testing. Please give the latest code a try.

YSmetana commented 11 years ago

I can confirm that the problem is gone now:

Processing Дзеркалювання_томів_резервних_копій_Bacula... 
Processing Додавання_нового_VPN_сервера... 

Thank you!