thombashi / sqlitebiter

A CLI tool to convert CSV / Excel / HTML / JSON / Jupyter Notebook / LDJSON / LTSV / Markdown / SQLite / SSV / TSV / Google-Sheets to a SQLite database file.
https://sqlitebiter.rtfd.io/
MIT License
850 stars 50 forks source link

decode utf8 error #41

Closed wby238 closed 7 years ago

wby238 commented 7 years ago

Env:x64 win7, python3.6.2

I used 'sqlitebiter file test.json -o test.sqlite', the test.json is encoded as utf-8 without bom, but I get the following error:

[ERROR] sqlitebiter file: invalid json data format: path=test.json, message='gbk' codec can't decode byte 0xa6 in position 336: illegal multibyte sequence
thombashi commented 7 years ago

Thank you for your report.

Could you possibly provide me the problematic JSON data? I've tried to reproduce the problem with UTF-8 (without BOM) JSON file, but I couldn't (sqlitebiter completed without errors).

wby238 commented 7 years ago

I'm sorry that now I can't use my pc. That json has many chinese words, you can write a simple json to test it, such as:

{
"sites": [
{ "name":"菜鸟教程" , "url":"m.runoob.com" }, 
{ "name":"google" , "url":"www.google.com" }, 
{ "name":"微博" , "url":"www.weibo.com" }
]
}

The reason may be sqlitebiter read the json as gbk codec, although that json is utf8.

wby238 commented 7 years ago

The following method can get the original json:

  1. git clone this: https://github.com/Binaryify/NeteaseCloudMusicApi
  2. cd the path and node app.js (nodejs 6.0+)
  3. save the json from this address: http://localhost:3000/playlist/detail?id=612879229

Thank you for your reply.

wby238 commented 7 years ago

debug info:

[DEBUG] sqlitebiter file: converting 'test.json'
[DEBUG] pytablereader: TableFileLoaderFactory.create_from_path: extension=json
[DEBUG] pytablereader: loading file: format=json, path=test.json, encoding=utf-8

[ERROR] sqlitebiter file: invalid json data format: path=test.json, message='gbk' codec can't decode byte 0xa6 in position 336: illegal multibyte sequence
[DEBUG] sqlitebiter file: ----- sqlitebiter completed -----
[DEBUG] sqlitebiter file: database path: test.sqlite
[DEBUG] sqlitebiter file: number of created table: 0
[DEBUG] sqlitebiter file:
[DEBUG] sqlitebiter file: ----- database schema -----
[DEBUG] SimpleSQLie: connect to a SQLite database: path='test.sqlite', mode=r
[DEBUG] SimpleSQLie: close connection to a SQLite database: path='G:\test.sqlite'
[DEBUG] sqlitebiter file:
[DEBUG] SimpleSQLie: close connection to a SQLite database: path='G:\test.sqlite'

original json: test.zip

thombashi commented 7 years ago

Thank you for your additional information. I could reproduce the problem.

It looks like there are two problems, encoding and JSON format:

Problems

1. encoding

sqlitebiter not properly opening for utf8 input files some formats (include JSON files). This cause an encoding error at Windows environment.

2. JSON file format

Original JSON file (test.json) that you mentioned in the previous comment is not convertible as it is by sqlitebiter. Convertible JSON formats for sqlitebiter are very limited for now (details are described in here).

Solution

1. encoding

I've fix a dependency package that loading files. Please try upgrade the sqlitebiter dependency packages with pip install sqlitebiter --upgrade. You could convert utf8 files properly.

>type utf.json
{
"sites": [
{ "name":"闖憺ク滓蕗遞・ , "url":"m.runoob.com" },
{ "name":"google" , "url":"www.google.com" },
{ "name":"蠕ョ蜊・ , "url":"www.weibo.com" }
]
}
>sqlitebiter file utf.json
[INFO] sqlitebiter file: convert 'utf.json' to 'sites' table

>sqlite3 out.sqlite
sqlite> .schema
CREATE TABLE IF NOT EXISTS 'sites' (name TEXT, url TEXT);

2. JSON file format

You could extract convertible JSON data by jq command from the JSON, and convert it by sqlitebiter. If you have a certain database schema that you would like to extract from the JSON file.

wby238 commented 7 years ago

Thank you for your reply!