kobotoolbox / koboform

A Java/GWT based formbuilder (no longer supported). Replaced by dkobo.

KoBoSync fails to handle special characters. #95

Open Dylan-Gillespie opened 11 years ago

Dylan-Gillespie commented 11 years ago

From neil.hen...@kobotoolbox.org on January 11, 2013 16:16:24

What steps will reproduce the problem?

KoBoSync 0.93 corrupts instances containing non-ANSI characters (such as French accents) when transcribing them into CSV. Here's the pattern:

1) KoBoForm transcription of instances containing non-ANSI characters: instances are transcribed correctly from the XML folder into the CSV file in the CSV folder. Special characters such as é in the XML are transcribed properly (the é is preserved in this example; see column 302 on line 2).

2) Click on 'Convert to CSV' again and check the CSV: each single special character in an existing line of the CSV file has been replaced by two other non-ANSI characters (e.g. é became Ã©).

3) At every subsequent transcription, each of those new non-ANSI characters is replaced by yet another string of non-ANSI characters, so a single non-ANSI character grows exponentially into an ever-longer string (e.g. a single é becomes a string of 65,000 characters after just 15 syncs).

Step (3) happens every time the 'transcribe' button is hit, even if there are no new XML instances being transcribed.

Original from the XML: Handicapée mentale
First iteration of column AV: HandicapÃ©e mentale
Second iteration of column AV: HandicapÃƒÂ©e mentale
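To make the growth pattern concrete, here is a minimal standalone sketch (this is not KoBoSync code; it assumes the instance data is UTF-8 and the JRE's default charset is a single-byte encoding such as ISO-8859-1/windows-1252). Each pass decodes the UTF-8 bytes with the wrong charset and re-encodes the result as UTF-8, so every accented character multiplies on every sync:

```java
// Illustrative only: simulates re-reading a UTF-8 CSV with the wrong charset
// and writing it back out, once per "sync".
public class MojibakeDemo {
    public static void main(String[] args) throws Exception {
        byte[] bytes = "Handicapée mentale".getBytes("UTF-8");
        for (int pass = 1; pass <= 4; pass++) {
            String misread = new String(bytes, "ISO-8859-1"); // wrong decode (platform default)
            bytes = misread.getBytes("UTF-8");                // re-encoded and saved back to the CSV
            System.out.println("pass " + pass + " (" + bytes.length + " bytes): " + misread);
        }
    }
}
```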

Attachment: a2j_JURI_2012-08-29_10-53-20.xml JURI.csv

Original issue: http://code.google.com/p/kobo/issues/detail?id=94

Dylan-Gillespie commented 11 years ago

From neil.hen...@kobotoolbox.org on January 11, 2013 15:17:02

Labels: -Priority-Medium -app-KoBoForm Priority-High app-KoBoSync

Dylan-Gillespie commented 11 years ago

From gary.hendrick on February 19, 2013 08:52:10

This issue does not seem to apply with KoboSync 0.93 running on:

java version "1.6.0_27"
OpenJDK Runtime Environment (IcedTea6 1.12.1) (6b27-1.12.1-2ubuntu0.12.04.2)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

Using the above XML file, we end up with the resulting CSV file (attached).

Attachment: JURI.csv

Dylan-Gillespie commented 11 years ago

From gary.hendrick on February 19, 2013 09:29:14

The issue can be reproduced using the following JRE on Windows. See the attached JURI.csv file to observe the results:

C:\Users\gary>java -version
java version "1.7.0_11"
Java(TM) SE Runtime Environment (build 1.7.0_11-b21)
Java HotSpot(TM) 64-Bit Server VM (build 23.6-b04, mixed mode)

Attachment: JURI.csv

Dylan-Gillespie commented 11 years ago

From gary.hendrick on March 11, 2013 09:00:21

The issue is related to the default character set of the JRE running the application. The java.io.FileReader used to instantiate the SuperCSV CsvMapReader in the CSV transcription process uses the system's default encoding. If that encoding is not compatible with the input characters, the characters are mangled as the file is read in, and when the file is written back out the mangling is preserved.
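For illustration, the problematic pattern looks roughly like this (a hypothetical sketch, not the actual KoBoSync source; class, method, and file names are made up):

```java
import java.io.FileReader;
import java.io.IOException;
import org.supercsv.io.CsvMapReader;
import org.supercsv.io.ICsvMapReader;
import org.supercsv.prefs.CsvPreference;

// FileReader always uses the JRE's default charset, so on a windows-1252 JRE
// the UTF-8 CSV is decoded with the wrong encoding and the accents mangle.
public class BuggyCsvOpen {
    static ICsvMapReader openCsv(String path) throws IOException {
        return new CsvMapReader(new FileReader(path), // platform default encoding
                CsvPreference.STANDARD_PREFERENCE);
    }
}
```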

The java.io.FileReader javadoc provides an appropriate solution to the issue: "The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream."

After replacing the FileReader with an InputStreamReader we have a working solution. The attached JURI.csv is the desired result, produced after several "transcribe" commands were executed.
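A sketch of that fix, under the same hypothetical names as above (the explicit UTF-8 charset is an assumption, chosen because the instance XML is UTF-8):

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import org.supercsv.io.CsvMapReader;
import org.supercsv.io.ICsvMapReader;
import org.supercsv.prefs.CsvPreference;

// Wrapping a FileInputStream in an InputStreamReader with an explicit charset
// makes the CSV decoding independent of the JRE's default encoding.
public class FixedCsvOpen {
    static ICsvMapReader openCsv(String path) throws IOException {
        return new CsvMapReader(
                new InputStreamReader(new FileInputStream(path), "UTF-8"),
                CsvPreference.STANDARD_PREFERENCE);
    }
}
```

The matching writer side would need the same treatment (an OutputStreamWriter over a FileOutputStream with an explicit charset) so the file is written back in the same encoding it was read with.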

Attachment: JURI.csv

Dylan-Gillespie commented 11 years ago

From gary.hendrick on March 11, 2013 09:01:32

Note that the file attached to comment #4 is trimmed down to contain just the requisite fields. This made development a little simpler while preserving the core factors of the failure case.