4 byte limitation of database management systems

HolQue commented 1 year ago

Problem: Database management systems with older language versions are not able to handle 4 byte characters.

Details in attached PDF. Database4ByteLimitation.pdf

test-fullautomation commented 1 year ago

Hi Holger,

I don't know if a filtering is required here. In principle the whole tool chain supports UTF-8. We need to check why we run in errors here. I made this Prio 3, because we don't have a requirement for this till now. Nevertheless it should be fixed.

Thank you, Thomas

HolQue commented 1 year ago

Hi Thomas,

the problem is not UTF-8 itself, but 4 Byte characters together with older language versions of database management systems. Therefore the filtering is required. The database is the problem, not the UTF-8 handling inside our Python sources.

HolQue commented 1 year ago

https://adamhooper.medium.com/in-mysql-never-use-utf8-use-utf8mb4-11761243e434

test-fullautomation commented 1 year ago

Hi Holger,

O.K. got the point.

Thank you, Thomas

ngoan1608 commented 1 year ago

Hi Thomas, Hi Holger,

This issue requires a change in https://github.com/test-fullautomation/testresultwebapp insteads of this repo. So I moved this issue to testresultwebapp.

Thank you, Ngoan

HolQue commented 1 year ago

I am not sure.

The replacement of 4 Byte characters has to happen before the SQL interface is accessed!

Inside RobotLog2DB\CDataBase.py / __arExec() / c.execute(command,values): values must contain the already converted content - otherwise crash.

I gave it a try with this quick&dirty hack in CDataBase.py:

For every value:

if isinstance(value, str):
   value = re.sub('[\U00010000-\U0010ffff]', '\ufffd', value)

'\ufffd' is a question mark within a black diamond; it is used as a replacement character to indicate problems when a system is unable to render a stream of data to a correct symbol.

Input in XML result file of Robot Framework (with last character is a 4 Byte one):

<meta name="project">Ä.ß.€.考.𠼭</meta>

With the hack RobotLog2DB is able to handle this:

Created test execution result for variant 'Ä.ß.€.考.𠼭' - version 'QT-VSW-050' successfully: 7741aea1-10b0-4d09-aabc-c8ebab7acc6b The name of this variant is displayed properly in dashboard variant selection list of test database.

HolQue commented 1 year ago

Entire __arExec() hack:

try:
   regexHighPoints = re.compile('[\U00010000-\U0010ffff]')
except re.error:
   # UCS-2 build
   regexHighPoints = re.compile('[\uD800-\uDBFF][\uDC00-\uDFFF]')

# '\ufffd' is a question mark within a black diamond; it is used as a replacement character to indicate problems
# when a system is unable to render a stream of data to a correct symbol.
# The byte in the middle of the three bytes that are needed to code this character, is not valid for UTF-8.
# Therefore the literal cannot be used within this file.
# The unicode escape sequence \ufffd is used here instead.

values_new = []
if values is not None:
   for value in values:
      if ( (value is not None) and (isinstance(value, str)) ):
         value = regexHighPoints.sub('\ufffd', value)
      values_new.append(value)
   c.execute(command,values_new)
else:
   c.execute(command,values)

HolQue commented 1 year ago

Background knowledge: https://stackoverflow.com/questions/10798605/warning-raised-by-inserting-4-byte-unicode-to-mysql https://stackoverflow.com/questions/33404752/removing-emojis-from-a-string-in-python

test-fullautomation commented 1 year ago

Hi Ngoan, can you take this over to all log2db tools? Please confirm when done... Thank you, Thomas

ngoan1608 commented 1 year ago

Hi Holger, Hi Thomas

Thank a lot for Holger's investigations and proposal! They are very useful for me to approach the solution for this issue. :)

But my idea is our tool chain (import tools, database and WebApp) will support 4 byte characters instead of masking as question mark.

The feature requires changes of charset in our database and all connections which communicate with db (webapp, import tools). It means that the 4 byte chars will be maintained from the result file (command line argument, config json file) to database and would be displayed properly on WebApp as below image:

Here is all changes that I have successfully tried on my system:

TestResultWebApp database:
- DB schema charset should be change from utf8 (alias of utf8mb3) to utf8mb4.
- Store procedures and functions which defined charset utf8 insides should be update to utf8mb4
Import tools: charset of DB connection should be added as utf8mb4 (instead of using default value utf8)
WebApp server: charset of DB connection should be set as utf8mb4

Please give your idea about this issue. Then I will try with the testdb then create all related pull requests.

Thank you,

HolQue commented 1 year ago

Hi Ngoan,

a really good solution! And much better than filtering out critical characters.

I like!

test-fullautomation commented 1 year ago

Hi Ngoan,

thank you for further investigation.

We implement your proposal.

Thank you, Thomas

test-fullautomation commented 1 year ago

solved with #18

test-fullautomation / testresultwebapp

4 byte limitation of database management systems #16