vega / datalib

JavaScript data utility library.
http://vega.github.io/datalib/
BSD 3-Clause "New" or "Revised" License

Parse numbers with , or . #45

Closed domoritz closed 8 years ago

domoritz commented 8 years ago

Datalib can parse some numbers, such as:

$ dl.number('100.00')
100
$ dl.number('100,00')   // common in Europe
NaN

But infer does something different:

$ dl.type.infer('100')
integer
$ dl.type.infer('100.00')
string
$ dl.type.infer('100,00')
string

So datalib can parse something as a number although infer does not infer it to be a number (this is understandable).

I'm aware that parsing . and , is a bit problematic, since 100,000 could be either 100000 or 100, but maybe inferAll can make a smarter prediction.

Messytables makes those predictions very aggressively, and I'm not sure that we want to go that far. However, we should at least be able to parse a CSV like the following correctly:

a
"100.00"
"42.10"
"7.62"
"41.71"

dl.csv({url: 'https://dl.dropboxusercontent.com/u/12770094/simple.csv'}, {parse: 'auto'})

This parses the values as strings unless you force the type:

dl.csv({url: 'https://dl.dropboxusercontent.com/u/12770094/simple.csv'}, {parse: {a: 'number'}})

I'm mostly asking for advice on whether datalib should handle this case at all, or whether I have to change my code to replace , or . by hand, or even write a smart type prediction myself.

jheer commented 8 years ago

dl.type.infer assumes an array as input. If you give it a string, it will test each character separately, hence the behavior you're seeing. If given an array, the inference works as expected:

> dl.type.infer(['100', '3.14', '1e5'])
"number"

However, this also suggests that non-array inputs should either be auto-boxed into arrays or result in an error. I'm sure others will run into this problem!
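
The auto-boxing idea could look roughly like the sketch below. It is not datalib's implementation: `boxValues` and `allNumeric` are hypothetical names, and the numeric check is a simplified stand-in built on the unary-plus cast rather than `dl.type.infer` itself.

```javascript
// Hypothetical helper (not part of datalib): wrap a non-array input in an
// array so that a string is tested as one value, not character by character.
function boxValues(values) {
  return Array.isArray(values) ? values : [values];
}

// Simplified stand-in for number inference: every value must survive
// the unary-plus cast. Empty strings and null/undefined are rejected.
function allNumeric(values) {
  return boxValues(values).every(function (v) {
    return v != null && v !== '' && !isNaN(+v);
  });
}

var single = allNumeric('100.00');               // a bare string is boxed first
var many = allNumeric(['100', '3.14', '1e5']);   // arrays pass through unchanged
var comma = allNumeric('100,00');                // a decimal comma still fails the cast
```

The alternative mentioned above, throwing an error for non-array input, would replace the body of `boxValues` with a type check.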

As for commas, the JS number cast does not support them (+'1,000' -> NaN). We currently use that cast as part of our number checking and parsing routines. Note that parseFloat is not much better (Number.parseFloat('1,213') -> 1), so it does not seem worthwhile as a replacement.
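
For reference, the difference between the two approaches can be seen side by side in plain JavaScript. This snippet uses only built-ins; the variable names are illustrative.

```javascript
// Compare the unary-plus cast with Number.parseFloat on the same inputs.
var inputs = ['100.00', '1,000', '1,213'];

var results = inputs.map(function (s) {
  return {
    input: s,
    cast: +s,                          // NaN as soon as a comma is present
    parseFloat: Number.parseFloat(s)   // stops silently at the first comma
  };
});

console.log(results);
```

The cast at least fails visibly with NaN, whereas parseFloat silently returns the digits before the comma, which is arguably the worse failure mode for type checking.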

domoritz commented 8 years ago

I noticed that I made a mistake in the csv parse example.

dl.csv({url: 'https://dl.dropboxusercontent.com/u/12770094/simple.csv'}, {parse: 'auto'})

does in fact do the right thing and parses numbers. I accidentally left a , in the data.

This sounds like a great compromise: parse arrays of numbers that use . as the decimal separator, such as 100.32, correctly, but fail to parse arrays with numbers like 100,000.42, and require the tool that uses datalib to remove the commas first. For polestar/voyager, this means that, for now, we ask users to clean up their data first.
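
A calling tool could do that cleanup with something like the sketch below. `parseWithSeparators` and its `group` and `decimal` options are hypothetical and not part of datalib. The caller has to state which convention the data uses, because a value like 100,000 is ambiguous on its own.

```javascript
// Hypothetical pre-cleaning step: strip the grouping separator, normalize the
// decimal separator to ".", then fall back to the plain JS number cast.
function parseWithSeparators(str, opts) {
  opts = opts || {};
  var group = opts.group === undefined ? ',' : opts.group; // thousands separator
  var decimal = opts.decimal || '.';                       // decimal separator

  var s = String(str).split(group).join('');    // remove every grouping separator
  if (decimal !== '.') {
    s = s.split(decimal).join('.');             // e.g. "100,00" becomes "100.00"
  }
  return +s;                                    // NaN if the result is still not a number
}

var us = parseWithSeparators('100,000.42');
var eu = parseWithSeparators('100,00', {group: '.', decimal: ','});
var euGrouped = parseWithSeparators('1.234,5', {group: '.', decimal: ','});
```

The cleaned values could then be handed to datalib as usual. Guessing the convention automatically from the data is a separate problem that this sketch does not attempt.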