thagale / google-refine

Automatically exported from code.google.com/p/google-refine

Refine messes up quotation marks on import and export. #402

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
I noticed some problems with google-refine dealing with quotes.

== SETUP ==

$ cat /etc/redhat-release 
CentOS release 5.6 (Final)

$ java -version
java version "1.6.0_17"
OpenJDK Runtime Environment (IcedTea6 1.7.10) (rhel-1.21.b17.el5-x86_64)
OpenJDK 64-Bit Server VM (build 14.0-b16, mixed mode)

I tried both Google Refine Version 2.0 [r1836] and Version 2.1 [TRUNK].

== TEST FILE ==

The test I did uses the attached file which contains:

First column|Second column
"Quotes" with words after|02134
"Quotes on each side"|6789

I experienced problems both at import and export of the file.

== DESCRIPTION OF THE PROBLEM ==

=== Importing the file ===

I leave all the default options except for setting the pipe as the column 
separator. I tried both with and without "Ignore quotation marks" checked and 
it made no difference. Here is the result I get:

+-----+--------------------------+---------------+
| All | First column             | Second column |
+-----+--------------------------+---------------+
| 1.  | Quotes" with words after | 2134          |
| 2.  | Quotes on each side      | 6789          |
+-----+--------------------------+---------------+

As you see, the first quote disappeared on the first line and both quotes 
disappeared on the second.
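
For comparison only, here is a minimal sketch of how a different parser, Python's standard `csv` module, reads the same test file. It is not the parser Refine uses, and the `parse` helper and its `keep_quotes` flag are names invented for this illustration. With default settings the quotation marks are treated as field delimiters and consumed; with `csv.QUOTE_NONE` they are kept as literal characters.

```python
import csv

# The test file from the report, one string per line.
lines = [
    "First column|Second column",
    '"Quotes" with words after|02134',
    '"Quotes on each side"|6789',
]

def parse(lines, keep_quotes):
    """Parse pipe-separated lines; keep_quotes is a hypothetical switch."""
    # QUOTE_NONE tells the reader to do no special processing of quotes.
    quoting = csv.QUOTE_NONE if keep_quotes else csv.QUOTE_MINIMAL
    return list(csv.reader(lines, delimiter="|", quoting=quoting))

default_rows = parse(lines, keep_quotes=False)  # quotes are consumed
literal_rows = parse(lines, keep_quotes=True)   # quotes are preserved

for row in default_rows + literal_rows:
    print(row)
```

Note that this parser's default result for the first data row differs slightly from what Refine shows above, which illustrates that lenient parsers do not agree on how to treat a quote in the middle of a field.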

=== Exporting the file ===

Without making any modification, I select Export, then Tab-separated value. 
Here is the output I get (with tabs replaced by pipes):

First column|Second column
"Quotes"" with words after"|2134
Quotes on each side|6789

As you can see, I now get extra quotes on the first line and no quotes on the 
second.

Original issue reported on code.google.com by bad...@gmail.com on 10 Jun 2011 at 9:34


GoogleCodeExporter commented 8 years ago
The import and export sides sound like two independent problems, but I'll have 
a look at both.

Original comment by tfmorris on 10 Jun 2011 at 10:30

GoogleCodeExporter commented 8 years ago
David told me on the mailing-list that the export problem is actually not a 
problem but a feature of CSV/TSV files. See 
http://en.wikipedia.org/wiki/Comma-separated_values#Basic_rules.
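
As a rough illustration of that rule, the sketch below uses Python's standard `csv` module (not Refine's exporter) to write the two rows as they looked after import. The `export_rows` helper is a name made up for this example, and the pipe stands in for the tab as in the report. A field that contains a quotation mark gets wrapped in quotes with the embedded quote doubled; a field without one is written bare.

```python
import csv
import io

def export_rows(rows, delimiter="|"):
    """Write rows with minimal quoting and return the text (hypothetical helper)."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()

# Cell values as they appeared in Refine after the import step.
rows = [
    ["First column", "Second column"],
    ['Quotes" with words after', "2134"],
    ["Quotes on each side", "6789"],
]

exported = export_rows(rows)
print(exported)
```

The second line of the output is `"Quotes"" with words after"|2134`, which matches the exported file quoted in the report.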

Original comment by bad...@gmail.com on 10 Jun 2011 at 11:45

GoogleCodeExporter commented 8 years ago
Reassigning to Iain for review since he did the original "ignore quotes" 
implementation.  He should be more familiar with the intended behavior.

We're apparently running a private patched version of OpenCSV 2.2.  OpenCSV 2.3 
has been released since then, but didn't include the patch.

Original comment by tfmorris on 11 Jun 2011 at 10:31

GoogleCodeExporter commented 8 years ago
Ignore quotes affects only how the parser treats the separator character; it 
doesn't stop the parser from chomping the quotation marks.

e.g.: For the line:
hello", world"
With Ignore Quotation Marks set to false the parser would return one token:
hello, world
With Ignore Quotation Marks set to true the parser would return two tokens:
hello
world
In both cases the quotations are chomped which is the correct behaviour.
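
The behaviour described above can be sketched with a toy tokenizer. This is a simplified model written for illustration, not OpenCSV's parser: `split_line` and its parameters are invented names, trimming whitespace around each token is an assumption made so the output matches the example, and it does not attempt to reproduce the embedded-quote result from the original report.

```python
def split_line(line, separator=",", quote='"', ignore_quotes=False):
    """Toy model: quotes are always dropped; the flag only affects splitting."""
    tokens, current, in_quotes = [], [], False
    for ch in line:
        if ch == quote:
            # The quote character is consumed in both modes.
            in_quotes = not in_quotes
        elif ch == separator and (ignore_quotes or not in_quotes):
            # Split here unless quotes are honoured and we are inside a pair.
            tokens.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tokens.append("".join(current).strip())
    return tokens

print(split_line('hello", world"', ignore_quotes=False))
print(split_line('hello", world"', ignore_quotes=True))
```

The first call yields one token and the second yields two; in neither case do the quotation marks survive.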

The preservation of quotation marks is a separate feature.

I'm not sure if OpenCSV has a way to preserve quotation marks.  (I'll take a 
look).  If so (or it's something I can add easily), would the preservation of 
quotations be a feature that we would like to see on the importer page?

With regard to our private patched version of OpenCSV 2.2: the patch is now in 
the OpenCSV trunk and will be included in version 2.4 (I'm not sure of the 
release date of that, though):
https://sourceforge.net/tracker/?func=detail&aid=3018599&group_id=148905&atid=773543

Until OpenCSV 2.4 is released, I think it may be better practice to build 
OpenCSV from its current trunk (as it will have all the improvements in the 2.3 
release as well; what those are, I'm not too sure, as I can't find a changelog!) 
and use that, rather than using our branched 2.2.  Any objections to doing this?

Original comment by iainsproat on 13 Jun 2011 at 2:21

GoogleCodeExporter commented 8 years ago
> would the preservation of quotations be a feature that we would like to see 
on the importer page?

I think it'd be better if whatever is generating the files to begin with were 
fixed to conform to the CSV "standard" (pick one). That is, wrap fields in 
quotes and double up quotes in the data.
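
A minimal sketch of that convention, again using Python's standard `csv` module rather than anything in Refine: every field is wrapped in quotes, embedded quotes are doubled, and reading the result back recovers the original values, quotation marks included. The variable names are illustrative only.

```python
import csv
import io

# The values the reporter intended, with the quotation marks as data.
original = [
    ["First column", "Second column"],
    ['"Quotes" with words after', "02134"],
    ['"Quotes on each side"', "6789"],
]

# Write: wrap every field in quotes and double any embedded quote.
buf = io.StringIO()
csv.writer(
    buf, delimiter="|", quoting=csv.QUOTE_ALL, lineterminator="\n"
).writerows(original)
well_formed = buf.getvalue()
print(well_formed)

# Read back with default settings: the quotes in the data survive.
round_tripped = list(csv.reader(io.StringIO(well_formed), delimiter="|"))
print(round_tripped == original)
```

The first data line of the well-formed file is `"""Quotes"" with words after"|"02134"`.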

Original comment by paulm%pa...@gtempaccount.com on 13 Jun 2011 at 2:31

GoogleCodeExporter commented 8 years ago
The more I learn about this, the less I'm inclined to continue down this 
convoluted path of special casing things.  I think we'd be better off just 
adding better documentation about what "ignore quotes" does and pointing people 
at documentation on how to create well-formed CSV files.

As for using the OpenCSV trunk, that seems risky to me because there's no 
telling how stable they keep their trunk.  We could reapply Iain's patch to 
2.3, but unless 2.3 has bug fixes we need, that may be more trouble than it's 
worth.  (Whatever path we choose, we should make sure opencsv-sources.jar 
matches what we use -- I got very confused during debug when it didn't contain 
the constructor we were calling)

Original comment by tfmorris on 13 Jun 2011 at 4:40

GoogleCodeExporter commented 8 years ago
>As for using the OpenCSV trunk, that seems risky to me because there's no 
telling how stable they keep their trunk.

Thinking about it, the patched "2.2" version was (if I recollect correctly) 
built off of the trunk HEAD revision at the time (June 2010)....

Original comment by iainsproat on 13 Jun 2011 at 5:57

GoogleCodeExporter commented 8 years ago
My opinion on this issue is that the import mechanism of Refine should be made 
similar to the export mechanism, i.e. allow importing formats rather than what 
appears to be a self-written splitting method (which I understand it isn't). 
Using formats allows us to point to the documentation of [CTP]SV and require 
the import files to be compatible with the standard.

Original comment by bad...@gmail.com on 14 Jun 2011 at 2:40

GoogleCodeExporter commented 8 years ago
> require the import files to be compatible with the standard.

I think we should expect import files to be in some part non-compatible with 
standards, as part of the definition of "messy data" (obviously the XML or 
JSON parsers will be more likely to choke on non-compatible formats than with 
CSV).

There are some very large changes on the way with the importer UI, which I hope 
will help greatly with importing data.  I like the idea of having small bits of 
inline documentation though, perhaps as a tooltip.  I'm not totally sure what 
David has in his revised importer UI though. (David?)

Original comment by iainsproat on 14 Jun 2011 at 2:53

GoogleCodeExporter commented 8 years ago
Refine is a clean-up tool, built for cleaning even non-standard formats.  That 
includes CSV and its variations outside of the pseudo-standard 
http://tools.ietf.org/html/rfc4180.  There is strict rfc4180 and non-strict, and 
Refine allows for both, any, and all text formats to be dealt with.  The 
handling of quoted strings and of data fields separated in various ways works 
quite well now, actually, in my opinion.  Sometimes the separation and cleanup 
can be handled after the import.  But I do agree with Paul that we should 
adhere to whatever standard is agreed upon for handling double quotes, single 
quotes, or what some programs just call "text-qualified" fields, as noted in all 
the variations of CSV formatting here: http://www.csvreader.com/csv_format.php  
Pick the method of handling quotes, stick to it throughout, and document the 
hell out of it so users are not confused.

Original comment by thadguidry on 14 Jun 2011 at 3:11

GoogleCodeExporter commented 8 years ago
Hi Iain, I've been mostly working on the plumbing and not so much on the 
details like tooltips. Perhaps you could check out the branch new-importer-ui 
and see what hints are missing? From chatting offline with Thad, he and I think 
we have all the importer levers now for at least the formats TSV/CSV/*SV, JSON, 
XML.

I'm attaching some screenshots to show the development so far.

Original comment by dfhu...@gmail.com on 15 Jun 2011 at 5:42


GoogleCodeExporter commented 8 years ago

Original comment by tfmorris on 18 Sep 2012 at 5:49

GoogleCodeExporter commented 8 years ago

Original comment by tfmorris on 18 Sep 2012 at 5:52