okfn / messytables

Tools for parsing messy tabular data. This is now superseded by https://github.com/frictionlessdata/tabulator-py
http://messytables.readthedocs.io/

Decimal places are always truncated, because Integer has higher default weight #92

Open ThrawnCA opened 11 years ago

ThrawnCA commented 11 years ago

Fields with decimal places can still be parsed as integers, so both Decimal and Integer achieve perfect scores in type_guess. However, Integer has higher default weight, so the decimal places will be dropped.
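The tie-break described above can be illustrated with a toy model. This is only a sketch and not messytables' actual code: the classes `IntegerGuess` and `DecimalGuess`, the function `guess_type`, and the weights 6 and 4 are hypothetical stand-ins, and the sample latitudes are made up. It assumes cell values arrive as Python floats.

```python
from decimal import Decimal


class IntegerGuess:
    """Hypothetical stand-in for an integer type with a high weight."""
    name = "Integer"
    guessing_weight = 6  # assumed value, for illustration only

    def cast(self, value):
        # int() on a float silently drops the fractional part.
        return int(value)


class DecimalGuess:
    """Hypothetical stand-in for a decimal type with a lower weight."""
    name = "Decimal"
    guessing_weight = 4  # assumed value, for illustration only

    def cast(self, value):
        return Decimal(str(value))


def guess_type(values, candidates):
    """Score each candidate by its weight for every cell it can cast."""
    scores = {}
    for candidate in candidates:
        score = 0
        for value in values:
            try:
                candidate.cast(value)
            except (ValueError, ArithmeticError):
                continue  # this cell does not fit the candidate type
            score += candidate.guessing_weight
        scores[candidate.name] = score
    best = max(candidates, key=lambda c: scores[c.name])
    return best.name, scores


# Made-up latitudes, held as floats.
latitudes = [-27.4698, -16.9186, -19.2590]
winner, scores = guess_type(latitudes, [IntegerGuess(), DecimalGuess()])
print(winner, scores)
```

In this sketch both candidates cast every cell successfully, so the only thing separating them is the weight, and the integer candidate wins.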

This is a problem in data such as

https://staging.data.qld.gov.au/storage/f/2013-09-11T04%3A22%3A59.234Z/qscd-datafile.xls

where Latitude and Longitude will be truncated to whole degrees (and thus become almost useless, because fractions of degrees are extremely important).

Should Decimal have higher default weight? Or, to keep Integer meaningful, should there be some way of distinguishing whether a field actually had decimal places or not?
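One possible way to make that distinction, sketched here under assumptions, is an integer cast that refuses any value with a fractional part. The function name `cast_strict_integer` is hypothetical and is not part of messytables.

```python
def cast_strict_integer(value):
    """Cast to int only when no fractional part would be lost."""
    if isinstance(value, str):
        # int() already rejects strings such as '13.223'.
        return int(value.strip())
    number = float(value)
    if not number.is_integer():
        raise ValueError("not an integer: %r" % (value,))
    return int(number)


print(cast_strict_integer(153.0))  # a whole-number float is accepted
print(cast_strict_integer("42"))   # a plain integer string is accepted
try:
    cast_strict_integer(-27.4698)
except ValueError as exc:
    print("rejected:", exc)
```

With a check like this, a column of whole numbers could still be guessed as Integer, while a column containing fractional values would fall through to Decimal.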

ThrawnCA commented 11 years ago

By the way, we have manually patched messytables/types.py on our system to swap the guessing_weight of IntegerType and DecimalType, so the Latitude and Longitude now display correctly. However, the issue remains valid.

domoritz commented 11 years ago

I'm not sure I understand the problem correctly. int('13.223') will raise a ValueError and thus the integer type will not be chosen. Does the problem still appear when you use strict=True?
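For reference, a small sketch of how Python's built-in `int()` treats a decimal string versus a float. Whether the spreadsheet reader hands floats rather than strings to the type guesser in this case is an assumption, not something confirmed in this thread. The helper `try_int` is hypothetical.

```python
def try_int(value):
    """Return int(value), or a short description of the ValueError."""
    try:
        return int(value)
    except ValueError as exc:
        return "ValueError: %s" % exc


text_result = try_int("13.223")   # a string with decimal places is rejected
float_result = try_int(13.223)    # a float is accepted and truncated

print(text_result)
print(float_result)
```

So the string case behaves as described, but a float with the same value would pass through `int()` without an error and lose its decimal places.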

ThrawnCA commented 11 years ago

If you download the resource linked above, and upload it into another CKAN instance with a datastorer running (or link to it from another CKAN), the datastorer will interpret the Latitude and Longitude fields as type Integer, dropping all decimal places.

I believe that 'strict=True' is the default and is being used.

domoritz commented 11 years ago

Hmm, I can't look into the details at the moment, but judging from the source code of the integer type, I would say that such values should be rejected. We should try to find a minimal breaking example that uses only messytables, but I don't have the time right now. @rossjones Could you look into this?