Closed tomas-knap closed 10 years ago
It is urgent, because national chars are wrongly handled on odcs/odcs-test currently
fine for mysql 5.6, tested on odcs.xrg.cz
It seems accented characters do not work at all for MySQL. I cannot even save an empty pipeline; I get the following error.
java.sql.SQLException: Incorrect string value: '\xC4\xBE\xC5\xA1\xC4\x8D...' for column 'description' at row 1
Virtuoso 7 works fine, I also tested serializing configuration.
Please also test on the Virtuoso 7 server that ODCS is running on, see #1235
On Sun, Feb 23, 2014 at 11:25 PM, Jan Vojt notifications@github.com wrote:
Virtuoso 7 works fine, I also tested serializing configuration.
It seems accented characters do not work at all for MySQL.
This actually happened only on the database I copied from odcs.xrg. The problem is that this database uses latin1 as its default charset. It must use utf-8 to store UTF-8 strings.
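For reference, a minimal sketch of why a latin1 column rejects that value. It decodes the bytes quoted in the SQLException as UTF-8 and checks whether the result fits a single-byte Latin charset. The class and method names are made up for illustration, and Java's ISO-8859-1 is used only as a stand-in for MySQL's latin1 (an assumption; the two are not identical).

```java
import java.nio.charset.StandardCharsets;

final class CharsetCheck {
    // Bytes quoted in the "Incorrect string value" error message.
    static final byte[] ERROR_BYTES = {
        (byte) 0xC4, (byte) 0xBE, (byte) 0xC5, (byte) 0xA1, (byte) 0xC4, (byte) 0x8D
    };

    /** Decodes the error bytes as UTF-8 text. */
    static String decoded() {
        return new String(ERROR_BYTES, StandardCharsets.UTF_8);
    }

    /** True if every character of the text can be encoded in ISO-8859-1. */
    static boolean fitsLatin1(String text) {
        // ISO-8859-1 is an assumed stand-in for MySQL's latin1 charset.
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String text = decoded();
        System.out.println(text + " fits latin1: " + fitsLatin1(text));
    }
}
```

The six bytes are three two-byte UTF-8 sequences (U+013E, U+0161, U+010D), none of which exists in ISO-8859-1.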
Virtuoso 7 on ODCS is also fine.
I have to reopen. National characters on Virtuoso 7 do NOT work in DPU configuration. They are displayed correctly when the DPU configuration dialog is closed and reopened, but they stop working when the pipeline detail is closed and reopened: accented characters are replaced with question marks.
Behavior is DPU specific.
[Image: "?" instead of accented character in path]
The string saved in the database is persisted correctly when inspected through Conductor. @bogo777 Do you think this may be caused by some TextField not supporting national characters?
Yes, this could be the source. I know that RichTextArea in Log detail displays ? instead of accented characters, while elsewhere they are shown correctly. But I didn't find any setting which could change the charset or something like that...
Actually, it must be caused by Virtuoso database/JDBC, because with MySQL same fields work well.
What I do not understand is that when testing the XSLT transformer, text fields do not work while the text area for the XSLT template works. Both are serialized in the same column as serialized XML, in which accented characters are stored properly when checking through Conductor. I am not sure we will be able to solve this before the defence; it seems to be some "speciality" of Virtuoso...
I just tested on odcs.xrg.cz, and here all fields saved in configuration are mangled. What version of Virtuoso is there? Also @tomas-knap please check in conductor if nvarchar columns are used here.
On ODCS, we use:
Version 6.1.8-dev.3127-pthreads as of Nov 11 2013 Compiled for Linux (x86_64-unknown-linux-gnu)
Regarding the dpu_template table:
dpu_instance:
What is quite surprising is that in the case of description (VARCHAR), national chars work properly, but in the case of configuration (NVARCHAR), they do not. I would understand if it were the opposite. Any idea?
From the Virtuoso documentation about n(var)char datatype:
All the Unicode types are equivalent to their corresponding "narrow" type - CHAR, VARCHAR and LONG VARCHAR - except that instead of storing data as one byte they allow Unicode characters. Their lengths are defined and returned in characters instead of bytes. They collate according to the active wide character collation, if any.
From this excerpt I understand there is no difference between the data stored in varchar vs. nvarchar. It is used simply to give a hint to virtuoso that column stores a string with variable character length, so it needs to actually count the characters instead of just bytes when determining length. Also, collation is performed differently, which does not have anything to do with our problem.
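To make the "length in characters vs. length in bytes" distinction concrete, here is a small sketch in plain Java. It only illustrates the two ways of counting that the quoted documentation mentions; it says nothing about how Virtuoso actually stores either type. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class LengthDemo {
    /** Length counted in Unicode characters (code points). */
    static int charCount(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Length counted in bytes once the text is encoded as UTF-8. */
    static int utf8ByteCount(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String text = "\u013E\u0161\u010D"; // three accented characters
        System.out.println("characters: " + charCount(text));
        System.out.println("UTF-8 bytes: " + utf8ByteCount(text));
    }
}
```

For the three accented characters from the error message, the character count is 3 while the UTF-8 byte count is 6.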
On http://odcs.xrg.cz:8080/odcleanstore-test/, there is the version using Virtuoso 7 and the latest JDBC driver (compiled a couple of days ago). The problem is still there.
In an email conversation, Petr mentioned that Virtuoso works well with UTF-8 if we use a byte array for serializedConfiguration. I tested it and he is correct. The problem is that MySQL maps the TEXT data type to String, which results in a ConversionException. We can overcome this simply by changing the serializedConfiguration column data type to BLOB for MySQL (BLOB is mapped to a byte array). I did some early testing and it seems to work. I will do more thorough testing with different combinations, and if everything is OK (which I expect) I will probably commit this on Friday.
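A rough sketch of the byte-array idea, assuming the serialized configuration is XML text encoded explicitly as UTF-8 before it reaches the driver. ConfigCodec and its methods are hypothetical names, not the actual ODCS code, and the JPA/column mapping is left out.

```java
import java.nio.charset.StandardCharsets;

final class ConfigCodec {
    /** Encodes the serialized configuration as UTF-8 bytes for a BLOB-like column. */
    static byte[] toBytes(String serializedXml) {
        return serializedXml.getBytes(StandardCharsets.UTF_8);
    }

    /** Decodes bytes read back from the column into the original text. */
    static String fromBytes(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical configuration fragment with accented characters in a path.
        String xml = "<path>/data/\u013E\u0161\u010D.rdf</path>";
        System.out.println(fromBytes(toBytes(xml)).equals(xml));
    }
}
```

Because the charset is fixed in the application, the round trip does not depend on any charset conversion done by the driver or the database.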
Ok, sounds good, please try to correct this tomorrow. The only drawback is that when stored as BLOB, admins cannot easily read the stored configuration when using some dbAdmin tool, right?
An alternative would be to serialize/deserialize the configuration either as a string or as a byte array, based on the database used. The string variant may be used as the default (as it is now) and the byte variant (as it was before playing with MySQL) may be used only when Virtuoso 6/7 is used for the relational db?
We have to also bear in mind that it should be possible to export configuration from mysql and import to virtuoso and vice versa, without any conversion/refinement needed.
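One possible shape for that alternative, purely as a sketch: a per-database storage mode that turns the configuration text into the column value and back. ConfigStorage, toColumn and fromColumn are hypothetical names, and how the mode would be selected from the application configuration is not shown.

```java
import java.nio.charset.StandardCharsets;

enum ConfigStorage {
    /** Default variant: the column holds the text itself. */
    STRING {
        @Override Object toColumn(String xml) { return xml; }
        @Override String fromColumn(Object value) { return (String) value; }
    },
    /** Byte variant: the column holds the UTF-8 bytes of the text. */
    BYTES {
        @Override Object toColumn(String xml) {
            return xml.getBytes(StandardCharsets.UTF_8);
        }
        @Override String fromColumn(Object value) {
            return new String((byte[]) value, StandardCharsets.UTF_8);
        }
    };

    abstract Object toColumn(String xml);
    abstract String fromColumn(Object value);

    public static void main(String[] args) {
        String xml = "<path>\u013E\u0161\u010D</path>";
        for (ConfigStorage mode : values()) {
            System.out.println(mode + ": " + mode.fromColumn(mode.toColumn(xml)).equals(xml));
        }
    }
}
```

Both modes decode to the same Java string, so in this sketch an export/import tool would only need to read with one mode and write with the other; whether that counts as "no conversion needed" is a design question for the thread.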
Don't blame me, but won't this help? http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VirtODBCJDBCUTF8Set
http://docs.openlinksw.com/virtuoso/VirtuosoDriverJDBC.html
/CHARSET=
This allows the client to specify a character set for data encoding. When this option is set then all Java strings, natively Unicode, are converted to the character set specified here.
If we do not use it yet, change the JDBC URL to include the charset parameter:
jdbc:virtuoso://<Host>:<Port#>/DATABASE=<Database>/UID=<User>/PWD=<Password>/CHARSET=utf-8
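As a sketch, the URL above could be assembled like this. The host, port, database name and credentials in the usage are placeholders made up for illustration, and VirtuosoUrl is a hypothetical helper, not part of the Virtuoso driver. The charset value is a parameter because the exact spelling and case turned out to matter in testing.

```java
final class VirtuosoUrl {
    /** Builds a Virtuoso JDBC URL with an explicit CHARSET option. */
    static String build(String host, int port, String database,
                        String user, String password, String charset) {
        return "jdbc:virtuoso://" + host + ":" + port
                + "/DATABASE=" + database
                + "/UID=" + user
                + "/PWD=" + password
                + "/CHARSET=" + charset;
    }

    public static void main(String[] args) {
        // All argument values here are placeholders, not real settings.
        System.out.println(build("localhost", 1111, "DB", "dba", "secret", "UTF-8"));
    }
}
```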
We already use the charset param
Hmmm, we are using charset option, but we have it in lowercase. Maybe this is the problem. I will try to change it to uppercase.
Please test that before the meeting
I tried all combinations of lower/uppercase charset/utf-8. It seems that using utf-8 in lowercase causes the connection to Virtuoso to fail, with the following error on the Virtuoso side:
13:24:09 Malformed data received from IP [127.0.0.1] : Box length too large. Disconnecting the client
Making charset uppercase seems not to make any difference.
I will therefore change the datatype for MySQL and use byte array as discussed before.
Create a new pipeline with e.g. RDF loader. Then put some national chars into the path in the RDF loader detail and into the description of the RDF loader. Store to the db and reopen. Please test that for MySQL and Virtuoso 7 and tell me the result (for both fields: description, path to file).
Motivation: Currently, there is a problem with national chars in Virtuoso6. It can be solved by switching the type of configuration column from NVARCHAR to VARCHAR. But before that, it would be good to know how it behaves in Virtuoso7/mysql