ratal / mdfreader

Read Measurement Data Format (MDF) versions 3.x and 4.x file formats in python

Comment error when comment is encoded by ‘GBK’ #97

Closed aqkfatmtvvfb closed 6 years ago

aqkfatmtvvfb commented 6 years ago

Environment:

  1. OS: Windows 7 (Chinese)
  2. Python version: 3.5.2

    Comment as shown in the OS:

    '贴马路牙子左右转向'

    Comment as read by mdfreader:

    'ÌùÂíÂ·ÑÀ×Ó×óÓÒ×ªÏò'

    Error analysis and solution:

    The raw comment is encoded in GBK:

    '贴马路牙子左右转向'.encode('GBK')
    b'\xcc\xf9\xc2\xed\xc2\xb7\xd1\xc0\xd7\xd3\xd7\xf3\xd3\xd2\xd7\xaa\xcf\xf2'

    Solution:

    'ÌùÂíÂ·ÑÀ×Ó×óÓÒ×ªÏò'.encode('latin-1').decode('GBK')
    '贴马路牙子左右转向'
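
The round trip above can be wrapped in a small helper. This is only a sketch: `fix_mojibake` and its `actual_encoding` parameter are hypothetical names, not part of mdfreader, and it assumes the text was produced by decoding non-latin-1 bytes as latin-1.

```python
def fix_mojibake(text, actual_encoding="gbk"):
    """Undo a latin-1 decode of bytes that were really in actual_encoding."""
    try:
        # latin-1 maps code points 0-255 one-to-one back to the original bytes
        raw = text.encode("latin-1")
        return raw.decode(actual_encoding)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not recoverable this way, so keep the text as it was read
        return text


original = "贴马路牙子左右转向"
garbled = original.encode("gbk").decode("latin-1")  # simulates the wrong decode
restored = fix_mojibake(garbled)
```

The helper only works when the real encoding is known in advance, which is the limitation discussed in the replies below.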

danielhrisca commented 6 years ago

Hello Yu,

Strings should be encoded as 'UTF-8' in MDF version 4, and as 'latin-1' in MDF version 3.

The application that generates the file has to use those encodings:

bytestring = '贴马路牙子左右转向'.encode('utf-8')
string = bytestring.decode('utf-8')

aqkfatmtvvfb commented 6 years ago

@danielhrisca thanks for this information! I think the encoding should be detected after the raw comment is read from an MDF3 file and before it is saved to the data structure. An existing library named 'chardet' can solve this problem.

chardet.detect(comment.encode('latin-1'))
{'confidence': 0.99, 'language': 'Chinese', 'encoding': 'GB2312'}
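
A minimal sketch of how this detection step might be applied to a comment that was already decoded as latin-1. The function name `redecode_comment`, the `detect` parameter, and the 0.9 confidence threshold are assumptions made for illustration; none of them exist in mdfreader. Only `chardet.detect`, which takes bytes and returns a dict with `encoding` and `confidence` keys, comes from the chardet library.

```python
def redecode_comment(comment, detect=None, min_confidence=0.9):
    """Re-decode a latin-1-decoded comment using a detected encoding."""
    if detect is None:
        import chardet  # third-party package: pip install chardet
        detect = chardet.detect
    try:
        raw = comment.encode("latin-1")  # recover the bytes stored in the file
    except UnicodeEncodeError:
        return comment  # already contains non-latin-1 text, leave it alone
    guess = detect(raw)
    encoding = guess.get("encoding")
    confidence = guess.get("confidence") or 0.0
    if not encoding or confidence < min_confidence:
        return comment  # detection too uncertain, keep the latin-1 reading
    try:
        return raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        return comment  # detected codec failed or is unknown to Python
```

Calling `redecode_comment(comment)` without a `detect` argument uses chardet when it is installed. Detection on very short strings can be unreliable, which is why the sketch falls back to the unchanged comment below the threshold.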

danielhrisca commented 6 years ago

Basically you are getting a file that does not comply with the MDF standard. The tool developer should make sure to follow the standard.

ratal commented 6 years ago

I agree with Daniel's point: the specification forces UTF-8 usage. However, Yu, you are also right: a user could add a new channel or modify comments, and mdfreader does not check compliance with the spec, including character encoding, when writing a file. So introducing chardet could be an interesting option to consider, if you have time to contribute :)
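
As a rough illustration of the write-side check mentioned above, the sketch below encodes text with the encoding Daniel gives for each MDF major version and fails loudly when the text cannot be represented. `MDF_TEXT_ENCODINGS` and `encode_text_for_mdf` are hypothetical names and are not mdfreader's API.

```python
# Encodings per MDF major version, as stated earlier in this thread
MDF_TEXT_ENCODINGS = {3: "latin-1", 4: "utf-8"}


def encode_text_for_mdf(text, mdf_major_version):
    """Encode text for writing, rejecting characters the version cannot store."""
    encoding = MDF_TEXT_ENCODINGS[mdf_major_version]
    try:
        return text.encode(encoding)
    except UnicodeEncodeError as exc:
        raise ValueError(
            "text cannot be stored as %s in MDF %d" % (encoding, mdf_major_version)
        ) from exc
```

With this check, a Chinese comment is accepted for MDF 4 and raises `ValueError` for MDF 3 instead of silently producing a non-compliant file.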