thinker007 / google-refine

Automatically exported from code.google.com/p/google-refine

Parsing tab delimited file fails while parsing field ending with a double quote #78

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Simplified input for a problem I found while loading a production file.

What steps will reproduce the problem?
1.  Create new project using attached file
2.  Leave the defaults for 'Split into columns'

What is the expected output? What do you see instead?

Only the first 2 of the 4 records are loaded, and the 2nd record has the 
remainder of the file loaded into its 'address' column.

What version of the product are you using? On what operating system?

Version 1.1-r878 on Windows XP

Please provide any additional information below.

I tried several different placements of the double quote character within a 
field, and the import failed in several cases.

These fail:
test"
"test
"te"st"

These are OK (they return the right record count but possibly do the wrong 
thing with regard to the data):
test
"test"
te"st"
"te"st

Original issue reported on code.google.com by jaywgra...@gmail.com on 17 Jun 2010 at 7:06

GoogleCodeExporter commented 9 years ago
I'll look into this.  We use an external library for TSV parsing, so my guess 
is that it's working as expected, but I'll check it out.

Original comment by iainsproat on 17 Jun 2010 at 8:53

GoogleCodeExporter commented 9 years ago
The importer is working as expected.  The problem is that these examples are 
malformed TSV.

To resolve this I've added an 'ignore quotation marks' option to the importer, 
which imports the file ignoring all quotation marks, and committed the change 
(r1002) to the SVN trunk.  Could you check out and build the latest revision of 
the Gridworks source and verify it works for you?

Please note that correct parsing behaviour using this option relies on there 
being no tabs or newlines within quoted values.  If you have both malformed TSV 
and additional separator or newline characters within quoted values, then it 
won't be possible to deal with it automatically.  You'll have to fix the data 
before or after import into Gridworks.
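
To illustrate the difference between quote-aware parsing and ignoring 
quotation marks, here is a minimal sketch using Python's standard csv module. 
It is not the Gridworks importer or the external library it uses, and other 
parsers may treat some of the listed cases differently.  The sample data and 
the parse_tsv helper are hypothetical, made up for this example (the 
reporter's attached file is not reproduced here).

```python
import csv
import io

# Hypothetical sample: a header plus 3 records. The first record's
# 'address' field starts with a double quote that is never closed.
SAMPLE = 'name\taddress\nalice\t"test\nbob\tplain\ncarol\tother\n'


def parse_tsv(text, ignore_quotes):
    """Parse tab-separated text into a list of rows.

    ignore_quotes=False: a leading double quote opens a quoted field,
                         which may span tabs and newlines.
    ignore_quotes=True:  double quotes are kept as ordinary characters.
    """
    quoting = csv.QUOTE_NONE if ignore_quotes else csv.QUOTE_MINIMAL
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter="\t",
        quoting=quoting,
    )
    return list(reader)


quote_aware = parse_tsv(SAMPLE, ignore_quotes=False)
quotes_ignored = parse_tsv(SAMPLE, ignore_quotes=True)

# Quote-aware: the unclosed quote swallows the rest of the input
# into the 'address' field of the first record.
print(len(quote_aware), quote_aware[-1])

# Quotes ignored: every line becomes its own row.
print(len(quotes_ignored), quotes_ignored[1])
```

With the quote-aware setting, the rest of the input ends up inside one field, 
much like the behaviour reported above.  With quotes ignored, each line is 
split on tabs only, which is also why that mode cannot cope with tabs or 
newlines inside quoted values.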

Original comment by iainsproat on 20 Jun 2010 at 2:51

GoogleCodeExporter commented 9 years ago
That should be r1010

Original comment by iainsproat on 20 Jun 2010 at 2:54

GoogleCodeExporter commented 9 years ago
I don't have the capability to build, so I'll wait for the next release and test then.

Original comment by jaywgra...@gmail.com on 20 Jun 2010 at 5:05

GoogleCodeExporter commented 9 years ago

Original comment by tfmorris on 18 Sep 2012 at 2:58