techouse / mysql-to-sqlite3

Transfer data from MySQL to SQLite
https://techouse.github.io/mysql-to-sqlite3/
MIT License
217 stars 31 forks source link

Conversion from MariaDB in encoding utf8mb4 to sqlite with utf8 produces invalid encoding output #19

Closed Sahara150 closed 3 years ago

Sahara150 commented 3 years ago

When I convert my MySQL database to a SQLite Database, it produces some weird output and my IDE tells me, that this doesn´t seem to be UTF-8 for sure. The output begins like this:

SQLite format 3   @     G  ·           9                                                 G .4    ñ   ™ûöñë    .Çãp&·“*"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             g=indexfrench_index_French_typefrenchCREATE INDEX "french_index_French_type" ON "french" ("type")pC

And then continues with the text, having weird symbols from time to time.

Versions:

OS: Windows 10 Home
MariaDB: 10.4.11
Python: 3.8.2

MySQL-Encoding info from dump:

/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8mb4 */;

All the tables are in utf8mb4 encoding, since it also contains letters like é, ç, ā, ä, etc. (most of them are also part of utf8, but I am not sure about all of them)

techouse commented 3 years ago

Hey,

Thanx for the report.

Can you provide me with a partial dump of the database so that I can investigate, unless it's too sensitive data.

Also, can you please run mysql2sqlite --version so that I can see all the versions of the dependencies?

On top of that, can you open the SQLite dump with a tool like DBeaver and check if it also shows weird encoding?

Sahara150 commented 3 years ago

SQLite database dump:

SQLite format 3   @     G  ·           9                                                 G .4    ñ   ™ûöñë    .Çãp&·“*"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             g=indexfrench_index_French_typefrenchCREATE INDEX "french_index_French_type" ON "french" ("type")pCindexfrench_index_French_articlefrenchCREATE INDEX "french_index_French_article" ON "french" ("article")ƒ/
†9tablefrenchfrenchCREATE TABLE "french" (
    "id" INTEGER NOT NULL  ,
    "article" TEXT NULL  ,
    "word" TEXT NOT NULL  ,
    "type" INTEGER NOT NULL  ,
    PRIMARY KEY ("id"),
    FOREIGN KEY("id") REFERENCES "word" ("id") ON UPDATE NO ACTION ON DELETE CASCADE,
    FOREIGN KEY("id") REFERENCES "word" ("id") ON UPDATE NO ACTION ON DELETE CASCADE,
    FOREIGN KEY("type") REFERENCES "wordtypes" ("id") ON UPDATE NO ACTION ON DELETE NO ACTION
)mAindexenglish_index_English_typeenglish
CREATE INDEX "english_index_English_type" ON "english" ("type")‚G„etableenglishenglishCREATE TABLE "english" (
    "id" INTEGER NOT NULL  ,
    "word" TEXT NOT NULL  ,
    "type" INTEGER NOT NULL  ,
    PRIMARY KEY ("id"),
    FOREIGN KEY("id") REFERENCES "word" ("id") ON UPDATE NO ACTION ON DELETE CASCADE,
    FOREIGN KEY("type") REFERENCES "wordtypes" ("id") ON UPDATE NO ACTION ON DELETE NO ACTION
)q
7%indexdeclinations_strengthdeclinationsCREATE INDEX "declinations_strength" ON "declinations" ("strength")a %‚%indexdeclinations_index_Declinations_genus_case_strength_numerusdeclinations
CREATE UNIQUE INDEX "declinations_index_Declinations_genus_case_strength_numerus" ON "declinations" ("genus", "case", "strength", "numerus")e/%indexdeclinations_casedeclinations CREATE INDEX "declinations_case" ON "declinations" ("case")…:%%Š7tabledeclinationsdeclinationsCREATE TABLE "declinations" (
    "id" INTEGER NOT NULL  ,
    "genus" INTEGER NOT NULL  ,
    "case" INTEGER NOT NULL  ,
    "strength" INTEGER NOT NULL  ,
    "numerus" INTEGER NOT NULL DEFAULT '0' ,
    "ending" TEXT NOT NULL  ,
    "alternativeEnding" TEXT NOT NULL  ,
    "lettersWithAlternative" TEXT NOT NULL  ,
    "lettersWithoutEnding" TEXT NOT NULL  ,
    "article" TEXT NOT NULL  ,
    PRIMARY KEY ("id"),
    FOREIGN KEY("genus") REFERENCES "articles" ("id") ON UPDATE NO ACTION ON DELETE NO ACTION,
    FOREIGN KEY("case") REFERENCES "cases" ("id") ON UPDATE NO ACTION ON DELETE NO ACTION,
    FOREIGN KEY("strength") REFERENCES "strengths" ("id") ON UPDATE NO ACTION ON DELETE NO ACTION
)zStablecasescasesCREATE TABLE "cases" (
    "id" INTEGER NOT NULL  ,
    "caseName" TEXT NOT NULL  ,
    PRIMARY KEY ("id")
)Z+{indexbeorhave_wordIdbeorhaveCREATE INDEX "beorhave_wordId" ON "beorhave" ("wordId")‚a…tablebeorhavebeorhaveCREATE TABLE "beorhave" (
    "entryId" INTEGER NOT NULL  ,
    "wordId" INTEGER NOT NULL  ,
    "person" INTEGER NOT NULL  ,
    "numerus" INTEGER NOT NULL  ,
    "word" TEXT NOT NULL  ,
    "transitive" INTEGER NOT NULL  ,
    PRIMARY KEY ("entryId"),
    FOREIGN KEY("wordId") REFERENCES "german" ("id") ON UPDATE NO ACTION ON DELETE NO ACTION
)WCindexarticles_index_Articles_germanArticlearticlesCREATE UNIQUE INDEX "articles_index_Articles_germanArticle" ON "articles" ("germanArticle")WCindexarticles_index_Articles_frenchArticlearticlesCREATE UNIQUE INDEX "articles_index_Articles_frenchArticle" ON "articles" ("frenchArticle")&‚tablearticlesarticlesCREATE TABLE "articles" (
    "id" INTEGER NOT NULL  ,
    "germanArticle" TEXT NOT NULL  ,
    "frenchArticle" TEXT NULL  ,
    PRIMA  ‰+      
   Ô õêáÔ                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      der/die  das    diela   derle
   é îéóú                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           la    le
   Ý éùÝñ                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             der/diedasdie der
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
   Ê ñäÙÊ                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
 !accusative   dative genitive
 !nominative
...

MySQL Database Dump:

-- phpMyAdmin SQL Dump
-- version 5.0.2
-- https://www.phpmyadmin.net/
--
-- Host: 127.0.0.1
-- Erstellungszeit: 02. Aug 2021 um 10:03
-- Server-Version: 10.4.11-MariaDB
-- PHP-Version: 7.4.4

SET SQL_MODE = "NO_AUTO_VALUE_ON_ZERO";
START TRANSACTION;
SET time_zone = "+00:00";

/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8mb4 */;

--
-- Datenbank: `dictionarydb`
--

-- --------------------------------------------------------

--
-- Tabellenstruktur für Tabelle `articles`
--

CREATE TABLE `articles` (
  `id` int(11) NOT NULL,
  `germanArticle` text NOT NULL,
  `frenchArticle` text DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

--
-- Daten für Tabelle `articles`
--

INSERT INTO `articles` (`id`, `germanArticle`, `frenchArticle`) VALUES
(1, 'der', 'le'),
(2, 'die', 'la'),
(3, 'das', NULL),
(4, 'der/die', NULL);

-- --------------------------------------------------------

--
-- Tabellenstruktur für Tabelle `beorhave`
--

CREATE TABLE `beorhave` (
  `entryId` int(11) NOT NULL,
  `wordId` int(11) NOT NULL,
  `person` int(11) NOT NULL,
  `numerus` int(11) NOT NULL,
  `word` text NOT NULL,
  `transitive` int(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

-- --------------------------------------------------------

--
-- Tabellenstruktur für Tabelle `cases`
--

CREATE TABLE `cases` (
  `id` int(11) NOT NULL,
  `caseName` text NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

--
-- Daten für Tabelle `cases`
--

INSERT INTO `cases` (`id`, `caseName`) VALUES
(1, 'nominative'),
(2, 'genitive'),
(3, 'dative'),
(4, 'accusative');

-- --------------------------------------------------------

--
-- Tabellenstruktur für Tabelle `declinations`
--

CREATE TABLE `declinations` (
  `id` int(11) NOT NULL,
  `genus` int(11) NOT NULL,
  `case` int(11) NOT NULL,
  `strength` int(11) NOT NULL,
  `numerus` int(11) NOT NULL DEFAULT 0,
  `ending` text NOT NULL,
  `alternativeEnding` text NOT NULL,
  `lettersWithAlternative` text NOT NULL,
  `lettersWithoutEnding` text NOT NULL,
  `article` text NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
...

Output from mysql2sqlite --version:

| software               | version                              |
|------------------------|--------------------------------------|
| mysql-to-sqlite3       | 1.4.4                                |
|                        |                                      |
| Operating System       | Windows 10                           |
| Python                 | CPython 3.8.2                        |
| MySQL                  | MySQL client not found on the system |
| SQLite                 | 3.28.0                               |
|                        |                                      |
| click                  | 8.0.1                                |
| mysql-connector-python | 8.0.26                               |
| python-slugify         | 5.0.2                                |
| pytimeparse            | 1.1.8                                |
| simplejson             | 3.17.3                               |
| six                    | 1.16.0                               |
| tabulate               | 0.8.9                                |
| tqdm                   | 4.62.0                               |

MariaDB is part of shipped Apache server, I guess that´s why it´s not able to show the version. (My xampp folder is not part of the programs folder) However since it generates a file of the db at all, I don´t think that´s an issue.

Apache versions:

+ Apache 2.4.43
  + MariaDB 10.4.11
  + PHP 7.4.4 (VC15 X86 64bit thread safe) + PEAR
  + phpMyAdmin 5.0.2
  + OpenSSL 1.1.0g
  + ADOdb 518a
  + Mercury Mail Transport System v4.63 (not included in the portable version)
  + FileZilla FTP Server 0.9.41 (not included in the portable version)
  + Webalizer 2.23-04 (not included in the portable version)
  + Strawberry Perl 5.16.3.1 Portable
  + Tomcat 7.0.103
  + XAMPP Control Panel Version 3.2.4.
  + XAMPP mailToDisk 1.0 (write emails via PHP on local disk in <xampp>\mailoutput. Activated in the php.ini as mail default.)

Unfortunately I can´t download the tool in a reasonable time, since I am in a country with veeeery bad internet. Right now it´s showing 16hours download time for 90MB. I´m still downloading it, but I won´t know the result today.

Note: I just found out, that apparently the utf8mb4 encoding of MySQL is the real utf8 encoding, since their "utf-8" encoding only supports 3 bytes, not 4 as the standard should. Which makes it even more important, that the conversion in that encoding works and even more weird, that it doesn´t seem to do so. (I would expect that there´s no additional code needed for that, since it actually is the same encoding, just a different name. :O)

techouse commented 3 years ago

Wait what do you mean by "SQLite database dump"?

I think you missunderstand what that file is. 😄 The generated file is not a dump, it's a database binary file which you open with SQLite.

Try opening it with sqlite.exe or wait for DBeaver to finish downloading (so sorry to hear about your internet problems - reminds me of the 90s and 56k dial-up when everything was painfully slow 😢).

Yea, utf8mb4 basically added support for storing emojis in the database. 🤣

Sahara150 commented 3 years ago

Oooooh, so it´s not meant to be a .sql file, but a .db file? :OOO That would explain a lot. :D Alright, I will try to open it with DBeaver, when it finally finished downloading. Yep, the internet connection is indeed in the state of the 90s. :D But you really learn a lot about data optimization, when you are coding here. ;)

techouse commented 3 years ago

Hehe, yes, that's not a dump file but an actual working database 😄 Check out the details here https://www.sqlite.org/fileformat.html

Let me know if you still have problems once you have opened it in a database viewer, like DBeaver etc.