okfn / messytables

Tools for parsing messy tabular data. This is now superseded by https://github.com/frictionlessdata/tabulator-py
http://messytables.readthedocs.io/

messytables guesses wrong type for decimal number #190

Open wrinklenose opened 4 years ago

wrinklenose commented 4 years ago

Describe the bug: Messytables should guess decimals correctly, respecting the locale configuration. For example, in Germany the comma (,) is used as the decimal separator, but the value 1,200 is guessed as type "text".

This was initially reported as CKAN issue https://github.com/ckan/ckan/issues/5769, which is where I noticed it.

The type guessing seems to happen here: https://github.com/okfn/messytables/blob/51b736892a48e420ab313675f54901c77b446dec/messytables/types.py and appears to be locale-specific. I think the magic happens in line 100: value = locale.atof(value)

Unfortunately, Python seems to recognize only a dot as the decimal point even when a German locale is set, which I could reproduce in my local environment:

>>> locale.getlocale()
('de_DE', 'cp1252')
>>> locale.atof('1,200')

Traceback (most recent call last):
  File "<pyshell#35>", line 1, in <module>
    locale.atof('1,200')
  File "C:\Program Files\Python27\lib\locale.py", line 318, in atof
    return func(string)
ValueError: invalid literal for float(): 1,200
>>> locale.localeconv()
{'mon_decimal_point': '', 'int_frac_digits': 127, 'p_sep_by_space': 127, 'frac_digits': 127, 'thousands_sep': '', 'n_sign_posn': 127, 'decimal_point': '.', 'int_curr_symbol': '', 'n_cs_precedes': 127, 'p_sign_posn': 127, 'mon_thousands_sep': '', 'negative_sign': '', 'currency_symbol': '', 'n_sep_by_space': 127, 'mon_grouping': [], 'p_cs_precedes': 127, 'positive_sign': '', 'grouping': []}
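
One possible reading of the session above: locale.getlocale() called without arguments reports only the LC_CTYPE category, while locale.atof() uses the separators of LC_NUMERIC. The localeconv() output (decimal_point '.', empty thousands_sep, 127 placeholders) looks like the default "C" locale, which would suggest LC_NUMERIC was never switched to German. A minimal sketch for checking this, where numeric_locale_report is a hypothetical helper name and not part of messytables:

```python
import locale


def numeric_locale_report():
    """Collect the settings that locale.atof() actually depends on.

    Hypothetical diagnostic helper, not part of messytables.
    """
    conv = locale.localeconv()
    return {
        # setlocale(category) without a second argument only queries the
        # current setting; it does not change anything.
        "LC_CTYPE": locale.setlocale(locale.LC_CTYPE),
        "LC_NUMERIC": locale.setlocale(locale.LC_NUMERIC),
        "decimal_point": conv["decimal_point"],
        "thousands_sep": conv["thousands_sep"],
    }


if __name__ == "__main__":
    print(numeric_locale_report())
```

If LC_NUMERIC comes back as 'C', calling locale.setlocale(locale.LC_NUMERIC, ...) with a German locale name that exists on the system would be the next thing to try before locale.atof('1,200').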

pazepaze commented 3 years ago

Using locale.atof seems to be system dependent.

On my Ubuntu 20.04 system this seems to work:

>>> locale.setlocale(locale.LC_ALL, 'de_DE.UTF-8')
'de_DE.UTF-8'
>>> locale.atof('1,200')
1.2

It doesn't work when running on an Alpine image in Docker, though; see https://stackoverflow.com/questions/61761085/python-locale-not-working-on-alpine-linux

Is there maybe some other way to do this that is less system dependent?
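
One system-independent option, sketched here as an assumption rather than anything messytables provides, is to pass the separators explicitly instead of reading them from the process locale. The helper below (parse_number is a hypothetical name) follows the same normalization that locale.delocalize performs: remove the thousands separator, then turn the decimal separator into a dot before calling float().

```python
def parse_number(text, decimal_point=",", thousands_sep="."):
    """Parse a number using explicit separators, ignoring the OS locale.

    Hypothetical helper, not part of messytables. The defaults match the
    German convention (comma as decimal separator, dot for thousands).
    Raises ValueError if the cleaned string is not a valid float.
    """
    cleaned = text.strip()
    if thousands_sep:
        # Drop grouping characters first so they cannot be mistaken
        # for the decimal separator.
        cleaned = cleaned.replace(thousands_sep, "")
    if decimal_point != ".":
        cleaned = cleaned.replace(decimal_point, ".")
    return float(cleaned)


if __name__ == "__main__":
    print(parse_number("1,200"))    # 1.2
    print(parse_number("1.234,5"))  # 1234.5
    # English-style separators, passed explicitly:
    print(parse_number("1,200", decimal_point=".", thousands_sep=","))  # 1200.0
```

Because nothing here calls setlocale, the result would be the same on Windows, Ubuntu, and Alpine. The trade-off is that the caller has to know, or guess, which convention the data uses.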