princevil / google-refine

Automatically exported from code.google.com/p/google-refine
0 stars 0 forks source link

Excel importer ignores "auto detect value types" setting #189

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
When processing a document with a big number column (tax identification 
numbers, they are always 9 digits, numbers only) it is not possible to convert 
that number to a proper string for use in further processing (for example to 
append to URL when fetching further data from web services).

I need an option to convert the long number to a string. (and not in scientific 
notation!)

Original issue reported on code.google.com by shinhan....@gmail.com on 11 Nov 2010 at 1:20

GoogleCodeExporter commented 8 years ago
As a temporary hack, you could try something like this:

with(1000000, div, floor(value / div).toString() + (div + mod(floor, 
div)).toString().substring(1))

Original comment by dfhu...@gmail.com on 11 Nov 2010 at 3:13

GoogleCodeExporter commented 8 years ago
If I copy paste that to transform, I get error "Error: mod expects 2 numbers"

mod(floor,div)

div seems to be a variable (not sure as I cant find documentation on the with 
function), floor doesnt. If second floor is supposed to be floor(value/div), 
then the full expression would be:

with(1000000, div, floor(value / div).toString() + (div + mod(floor(value / 
div), div)).toString().substring(1))

But the results of that are very much different from the input.

Btw, the temporary hack I used was importing this to excel, making a field 
=CONCATENATE("s",numfield), reloading it in google refine and then using 
value.substring(1). So, I know a hack, but this really needs a proper solution.

Original comment by shinhan....@gmail.com on 12 Nov 2010 at 7:17

GoogleCodeExporter commented 8 years ago
Is this a number or just a string of digits?  If you're never going to do 
calculations with it, you are much better off telling Refine not to convert the 
string of digits to a number in the first place (ie turn "auto-detect value 
types" off when creating the project).  Once it's done its conversion, there's 
no guarantee that it will be reversible since it's using floating point 
representation.

Perhaps we should have this off by default since it seems to cause a number of 
people trouble (and it's slowwww).

Issue 243 discusses this same topic.

Original comment by tfmorris on 27 Nov 2010 at 3:02

GoogleCodeExporter commented 8 years ago
I just tried that. It seems "auto-detect value types" checkbox is ignored since 
even when loading a new file, with that checkbox NOT checked, the "numeric" 
fields are green and setting Text Facet on the tax ID number turns them to 
scientific notation. I also tried turning off both checkboxes ("split into 
columns" and "auto-detect value types"), no improvements, tax ID number is 
still green and still parsed as number.

Original comment by shinhan....@gmail.com on 29 Nov 2010 at 7:17

GoogleCodeExporter commented 8 years ago
Weird, I just tried with "auto-detect value types" off and it behaved as 
expected.

Shihan, do you have a sample data set that's giving you trouble and that we can 
test?

Original comment by dfhu...@gmail.com on 2 Dec 2010 at 6:52

GoogleCodeExporter commented 8 years ago
Attached is the file I've been testing this on, all of the data is from the 
public government database.

I just noticed that converting to .csv (if I turn off the auto detect value 
types and add ";" in the split columns by) works fine.

But when I open the .xls directly, even though the "PIB" field is formatted as 
Text and "Auto-detect value types" is off, it parses it numerically.

Original comment by shinhan....@gmail.com on 2 Dec 2010 at 8:58

Attachments:

GoogleCodeExporter commented 8 years ago

Original comment by tfmorris on 4 Feb 2011 at 3:24

GoogleCodeExporter commented 8 years ago
Excel is the one making the determination as to whether the cells contain 
numbers or text.  As far as I can tell Refine is importing exactly what Excel 
gives it.

Attached is a copy of your spreadsheet with the first three numbers in the PIB 
column converted to Text instead of numbers.  These correctly import into 
Refine as text.

Part of the confusion may lie in the way Excel works.  Changing the formatting 
of a cell or column doesn't automatically coerce the contents to the new type.  
Often this won't happen until you edit the cell value.

Please review and let us know whether we can close this.

Original comment by tfmorris on 8 Jun 2011 at 12:13

Attachments:

GoogleCodeExporter commented 8 years ago
So, there is still no solution to importing something like this? Editing every 
value in order to coerce the type change is not a solution. 

Original comment by shinhan....@gmail.com on 9 Jun 2011 at 6:06

GoogleCodeExporter commented 8 years ago
You're really asking an Excel question and I'm not enough of an expert at Excel 
to give you a definitive answer, but this article 
http://office.microsoft.com/en-us/excel-help/three-ways-to-convert-numbers-to-te
xt-HA001136619.aspx has some good suggestions.  In particular that TEXT() 
function looks like it'll do what you want.  This is similar to your previous 
CONCATENATE() solution, except it doesn't require the extra step after 
importing to Refine.

Original comment by tfmorris on 9 Jun 2011 at 8:25