mff-uk / odcs

ODCleanStore

Please test national characters in configuration for mysql/virtuoso7 #1229

Closed tomas-knap closed 10 years ago

tomas-knap commented 10 years ago

Create a new pipeline with, e.g., an RDF loader. Then put some national characters into the path in the RDF loader detail and into the description of the RDF loader. Store it to the database and reopen it. Please test this for MySQL and Virtuoso 7 and tell me the result for both fields (description and path to file).

Motivation: Currently, there is a problem with national characters in Virtuoso 6. It can be solved by switching the type of the configuration column from NVARCHAR to VARCHAR. But before doing that, it would be good to know how it behaves in Virtuoso 7 and MySQL.

tomas-knap commented 10 years ago

It is urgent, because national chars are wrongly handled on odcs/odcs-test currently

tomas-knap commented 10 years ago

fine for mysql 5.6, tested on odcs.xrg.cz

janvojt commented 10 years ago

It seems accented characters do not work at all for MySQL. I cannot even save an empty pipeline; I get the following error:

java.sql.SQLException: Incorrect string value: '\xC4\xBE\xC5\xA1\xC4\x8D...' for column 'description' at row 1
janvojt commented 10 years ago

Virtuoso 7 works fine, I also tested serializing configuration.

tomas-knap commented 10 years ago

Please also test on the Virtuoso 7 instance the ODCS server is running on; see #1235.

On Sun, Feb 23, 2014 at 11:25 PM, Jan Vojt notifications@github.com wrote:

Virtuoso 7 works fine, I also tested serializing configuration.

janvojt commented 10 years ago

It seems accented characters do not work at all for MySQL.

This actually happened only on the database I copied from odcs.xrg. The problem is that this database uses latin1 as its default charset. It obviously must use UTF-8 to store UTF-8 strings.
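To illustrate why a latin1 column rejects the value, here is a minimal sketch that decodes the byte prefix shown in the error message above and checks whether the result can be encoded in a single-byte Latin-1 charset. The class and method names are hypothetical, and Java's ISO-8859-1 is used only as a stand-in for MySQL's latin1, which is closer to windows-1252.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of ODCS.
final class Latin1Check {
    // Byte prefix from the MySQL error message; the rest of the value is elided there.
    static final byte[] REPORTED_PREFIX = {
        (byte) 0xC4, (byte) 0xBE, (byte) 0xC5, (byte) 0xA1, (byte) 0xC4, (byte) 0x8D
    };

    // Interpret the bytes as UTF-8 text.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // True only if every character of the text exists in ISO-8859-1.
    static boolean fitsLatin1(String text) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        return encoder.canEncode(text);
    }

    public static void main(String[] args) {
        String text = decodeUtf8(REPORTED_PREFIX);
        System.out.println("characters: " + text.length());
        System.out.println("fits ISO-8859-1: " + fitsLatin1(text));
    }
}
```

The six bytes decode to three accented characters (U+013E, U+0161, U+010D), all of which lie outside ISO-8859-1, so a column restricted to that repertoire cannot hold the whole string.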

Virtuoso 7 on ODCS is also fine.

janvojt commented 10 years ago

I have to reopen. National characters on Virtuoso 7 do NOT work in DPU configuration. They are displayed correctly when the DPU configuration dialog is closed and reopened, but they stop working when the pipeline detail is closed and reopened: accented characters are replaced with question marks.

janvojt commented 10 years ago

The behavior is DPU-specific.

The string saved in the database is persisted correctly when inspected through Conductor. @bogo777 Do you think this may be caused by some TextField not supporting national characters?

bogo777 commented 10 years ago

Yes, this could be the source. I know that RichTextArea in Log detail displays ? instead of accented characters, while elsewhere they are shown correctly. But I didn't find any setting that could change the charset or anything like that...

janvojt commented 10 years ago

Actually, it must be caused by the Virtuoso database/JDBC, because with MySQL the same fields work well.

janvojt commented 10 years ago

What I do not understand is that when testing the XSLT transformer, the text fields do not work while the text area for the XSLT template does. Both are stored in the same column as serialized XML, in which accented characters are stored properly when checked through Conductor. I am not sure we will be able to solve this before the defence; it seems to be some "speciality" of Virtuoso...

I just tested on odcs.xrg.cz, and there all fields saved in the configuration are mangled. What version of Virtuoso is running there? Also, @tomas-knap, please check in Conductor whether nvarchar columns are used there.

tomas-knap commented 10 years ago

On ODCS, we use:

Version 6.1.8-dev.3127-pthreads as of Nov 11 2013 Compiled for Linux (x86_64-unknown-linux-gnu)

Regarding the dpu_template table:

[Screenshot of the dpu_template table definition, 2014-03-05]

dpu_instance:

[Screenshot of the dpu_instance table definition, 2014-03-05]

What is quite surprising is that in the case of description (VARCHAR), national characters work properly, but in the case of configuration (NVARCHAR), they do not. I would understand it if it were the other way around. Any idea?

janvojt commented 10 years ago

From the Virtuoso documentation about n(var)char datatype:

All the Unicode types are equivalent to their corresponding "narrow" type - CHAR, VARCHAR and LONG VARCHAR - except that instead of storing data as one byte they allow Unicode characters. Their lengths are defined and returned in characters instead of bytes. They collate according to the active wide character collation, if any.

From this excerpt I understand that there is no difference between the data stored in varchar and nvarchar. The type simply gives Virtuoso a hint that the column stores a string whose characters may take more than one byte, so it needs to count characters instead of just bytes when determining the length. Collation is also performed differently, but that has nothing to do with our problem.
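The difference between counting characters and counting bytes can be seen in a small sketch. It only illustrates the two notions of length for a UTF-8 encoded string; it makes no claim about how Virtuoso stores either column type internally, and the class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not part of ODCS or Virtuoso.
final class LengthDemo {
    // Length in characters (Unicode code points).
    static int charCount(String s) {
        return s.codePointCount(0, s.length());
    }

    // Length in bytes once the string is encoded as UTF-8.
    static int utf8ByteCount(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String accented = "\u013e\u0161\u010d"; // three accented Latin letters
        System.out.println("characters: " + charCount(accented));
        System.out.println("UTF-8 bytes: " + utf8ByteCount(accented));
    }
}
```

For the three accented letters used here, the character count is 3 while the UTF-8 byte count is 6, so a length limit means different things depending on which unit it is expressed in.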

tomas-knap commented 10 years ago

On http://odcs.xrg.cz:8080/odcleanstore-test/, there is the version using Virtuoso 7 and the latest JDBC driver (compiled a couple of days ago). The problem is still there.

janvojt commented 10 years ago

In an email conversation, Petr mentioned that Virtuoso works well with UTF-8 if we use a byte array for serializedConfiguration. I tested it and he is correct. The problem is that with MySQL the TEXT data type is mapped to String, which results in a ConversionException. We can overcome this simply by changing the data type of the serializedConfiguration column to BLOB for MySQL (BLOB is mapped to a byte array). I did some early testing and it seems to work. I will do more thorough testing with different combinations, and if everything is OK (which I expect), I will probably commit this on Friday.
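A minimal sketch of the byte-array idea, assuming the configuration is held as an XML string and explicitly encoded as UTF-8 before it reaches the database layer. The class and method names are hypothetical and do not come from the ODCS code base. The last helper shows one common way question marks can appear, namely a round trip through a charset that lacks the characters; this is not confirmed to be the cause of the behavior in this issue.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not the actual ODCS implementation.
final class ConfigCodec {
    // Encode the serialized configuration explicitly as UTF-8 bytes.
    static byte[] toBytes(String serializedConfiguration) {
        return serializedConfiguration.getBytes(StandardCharsets.UTF_8);
    }

    // Decode bytes read back from the database, again as UTF-8.
    static String fromBytes(byte[] stored) {
        return new String(stored, StandardCharsets.UTF_8);
    }

    // Encode and decode through an arbitrary charset; unmappable
    // characters are replaced by the charset's replacement byte.
    static String roundTripVia(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String xml = "<path>/data/\u013e\u0161\u010d.rdf</path>";
        System.out.println(fromBytes(toBytes(xml)).equals(xml));
        System.out.println(roundTripVia("\u013e\u0161\u010d", StandardCharsets.ISO_8859_1));
    }
}
```

Because the encoding is fixed in application code, the result no longer depends on how the driver or the column type converts character data.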

tomas-knap commented 10 years ago

OK, sounds good, please try to fix this tomorrow. The only drawback is that when the configuration is stored as a BLOB, admins cannot easily read it in a database administration tool, right?

tomas-knap commented 10 years ago

An alternative would be to serialize/deserialize the configuration either as a string or as a byte array, depending on the database used. The string variant may be used as the default (as it is now), and the byte variant (as it was before we started working with MySQL) may be used only when Virtuoso 6/7 is used as the relational database?

tomas-knap commented 10 years ago

We also have to bear in mind that it should be possible to export the configuration from MySQL and import it into Virtuoso, and vice versa, without any conversion or refinement.

ghost commented 10 years ago

Don't blame me, but won't this help? http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VirtODBCJDBCUTF8Set

http://docs.openlinksw.com/virtuoso/VirtuosoDriverJDBC.html

/CHARSET= This allows the client to specify a character set for data encoding. When this option is set then all Java strings, natively Unicode, are converted to the character set specified here.

If we do not use it yet, change the JDBC URL to include the charset parameter:

jdbc:virtuoso://:<Port#>/DATABASE=/UID=/PWD=/CHARSET=utf-8
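As a rough sketch, the URL could be assembled in code as follows. The host, port, database name and credentials are placeholders chosen for illustration, not values from the ODCS deployment, and the class name is hypothetical. The code only builds the string and does not open a connection.

```java
// Hypothetical helper, not part of ODCS.
final class VirtuosoUrl {
    // Build a Virtuoso JDBC URL with the CHARSET option appended.
    static String build(String host, int port, String database,
                        String user, String password, String charset) {
        return "jdbc:virtuoso://" + host + ":" + port
            + "/DATABASE=" + database
            + "/UID=" + user
            + "/PWD=" + password
            + "/CHARSET=" + charset;
    }

    public static void main(String[] args) {
        // All argument values below are placeholders.
        System.out.println(build("localhost", 1111, "exampledb", "user", "secret", "UTF-8"));
    }
}
```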

tomas-knap commented 10 years ago

We already use the charset parameter.

janvojt commented 10 years ago

Hmmm, we are using the charset option, but we have it in lowercase. Maybe this is the problem. I will try changing it to uppercase.

tomas-knap commented 10 years ago

Please test that before the meeting

janvojt commented 10 years ago

I tried all combinations of lowercase/uppercase charset/utf-8. It seems that using utf-8 in lowercase causes the connection to Virtuoso to fail, with the following error on the Virtuoso side:

13:24:09 Malformed data received from IP [127.0.0.1] : Box length too large. Disconnecting the client

Making charset uppercase does not seem to make any difference.

I will therefore change the data type for MySQL and use a byte array, as discussed before.